HADOOP-13704. Optimised getContentSummary() #3978
Conversation
steveloughran left a comment
looks great.
added some comments, and i've just scanned the test suites to see where a larger dir tree could be tested.
i propose adding something in ITestS3ADirectoryPerformance.testListOperations()
- get a summary for the test path, then one for root, verify that root numbers >= that of the test dir. and use duration tracker to measure/report duration.
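The proposed check can be illustrated standalone. This is a hedged sketch of the assertion only — summaries are computed over an in-memory list of (path, size) entries rather than a real S3A filesystem; `ITestS3ADirectoryPerformance` and the duration tracker live in the Hadoop test code and are not reproduced here. All names in the sketch are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: a content summary taken at the root must dominate the summary of
// any subdirectory (file count and total size both >=).
public class RootSummaryCheckSketch {

    static final class FileEntry {
        final String path;
        final long size;
        FileEntry(String path, long size) { this.path = path; this.size = size; }
    }

    /** File count and total size of entries under a path prefix. */
    static long[] summarize(List<FileEntry> files, String prefix) {
        long count = 0, bytes = 0;
        for (FileEntry f : files) {
            if (f.path.startsWith(prefix)) {
                count++;
                bytes += f.size;
            }
        }
        return new long[] {count, bytes};
    }

    public static void main(String[] args) {
        List<FileEntry> files = Arrays.asList(
            new FileEntry("/test/a/file1", 100),
            new FileEntry("/test/a/file2", 200),
            new FileEntry("/other/file3", 50));
        long[] root = summarize(files, "/");
        long[] testDir = summarize(files, "/test/");
        // root numbers must be >= those of the test dir
        System.out.println(root[0] >= testDir[0] && root[1] >= testDir[1]); // prints "true"
    }
}
```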
`fs.mkdirs(parent);`
`try {`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use lambda test utils intercept() here
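The `intercept()` the reviewer refers to is Hadoop's `org.apache.hadoop.test.LambdaTestUtils.intercept()`. For illustration, here is a minimal standalone sketch of the pattern — run a callable, expect a given exception type, and return the caught exception for further assertions. This is a simplified stand-in, not the Hadoop implementation.

```java
import java.util.concurrent.Callable;

// Sketch of the intercept() pattern: fail if the callable returns normally
// or throws the wrong type; otherwise hand the exception back to the caller.
public class InterceptSketch {

    static <E extends Throwable> E intercept(Class<E> clazz, Callable<?> eval) {
        Object result;
        try {
            result = eval.call();
        } catch (Throwable t) {
            if (clazz.isInstance(t)) {
                return clazz.cast(t);   // the expected exception: return it
            }
            throw new AssertionError("Unexpected exception type", t);
        }
        // no exception at all: that is also a test failure
        throw new AssertionError(
            "Expected " + clazz.getName() + " but got result: " + result);
    }

    public static void main(String[] args) {
        IllegalArgumentException ex = intercept(IllegalArgumentException.class,
            () -> { throw new IllegalArgumentException("bad path"); });
        System.out.println(ex.getMessage()); // prints "bad path"
    }
}
```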
`import java.util.HashSet;`
`import java.util.Set;`
`import org.apache.hadoop.fs.s3a.S3ALocatedFileStatus;`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: put all org.apache. imports in their own block under the others. note: some fixup from our move off Guava means many of our current files break this rule ... and moving imports around makes cherry-picking harder, so we leave those alone
...tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/GetContentSummaryOperation.java
` * @return an iterator`
` * @throws IOException failure`
` */`
`RemoteIterator<S3AFileStatus> listStatusIterator(Path path)`
is this method obsolete now?
yup, have removed
`import org.apache.hadoop.fs.contract.AbstractContractContentSummaryTest;`
`import org.apache.hadoop.fs.contract.AbstractFSContract;`
`import org.apache.hadoop.fs.s3a.S3AFileSystem;`
`import org.assertj.core.api.Assertions;`
nit: move to their own block above this one
`Path baseDir = methodPath();`
`// Nested folders created separately will return as separate objects in listFiles()`
`fs.mkdirs(new Path(baseDir + "/a"));`
better to use `new Path(basedir, "a");`
@steveloughran thanks for the review :) I've made the suggested changes and also updated
945a9d9 to 3f3e1c4
🎊 +1 overall
This message was automatically generated.
.. not deliberately ignoring this, just falling behind on reviews while i try to get my manifest committer out the door. reviews there welcome, even though it targets abfs and gcs. #2971. i will pull some of this back into the s3a committer afterwards, including stat names and some IO enhancements to get parquet files writing faster (disabling existence/overwrite checks in __magic dirs)
steveloughran left a comment
ok. +1 on this, merging locally with a couple of text changes (javadocs of operation, filesystem.md) into trunk and branch-3.3
one failure in the itest run, unrelated, filed https://issues.apache.org/jira/browse/HADOOP-18168
patch is in, though i realise i forgot to add the pr# to the header, which is automatic when done through the UI. never mind, at least the JIRA is in. closing this PR as done
Description of PR
JIRA: https://issues.apache.org/jira/browse/HADOOP-13704
This PR implements an optimised version of getContentSummary() which uses the result from the listFiles iterator.
Explanation of the new `buildDirectorySet` method added:
Since the listFiles operation can return the directory `a/b/c` as a single object, we need to recurse over the path `a/b/c` to ensure we have counted all directories. We do this by keeping two sets: dirSet (set of all directories under the base path) and pathTraversed (set of paths we have recursed over so far).
Iterating over the directory structure `basePath/a/b/c`, `basePath/a/b/d`, we will first find all the directories in `basePath/a/b/c`. Once this is completed, the pathTraversed set will have `{basePath/a/b}` and dirSet will have `{basePath/a, basePath/a/b, basePath/a/b/c}`. Then for `basePath/a/b/d`, we just add `basePath/a/b/d` to dirSet and don't do any additional work, as the path `basePath/a/b` has already been traversed.
The JIRA ticket mentions that we should add some instrumentation to measure usage. There's already code that does this here, and usage is tested in an integration test here.
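The two-set recursion described above can be sketched standalone. This is a simplified illustration over plain path strings, not the actual `GetContentSummaryOperation` code; the names mirror the description (dirSet, pathTraversed) and the parent-walk bookkeeping is an assumption of how the sets are maintained.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: given the directories returned by a listFiles-style listing,
// collect every directory between the base path and each listed directory,
// skipping parent chains that have already been traversed.
public class BuildDirectorySetSketch {

    static Set<String> buildDirectorySet(Set<String> listedDirs, String basePath) {
        Set<String> dirSet = new HashSet<>();        // all dirs under basePath
        Set<String> pathTraversed = new HashSet<>(); // parent chains already walked
        for (String dir : listedDirs) {
            dirSet.add(dir);
            // walk up the parents until basePath or an already-traversed parent
            String parent = parentOf(dir);
            while (parent != null && !parent.equals(basePath)
                    && !pathTraversed.contains(parent)) {
                dirSet.add(parent);
                pathTraversed.add(parent);
                parent = parentOf(parent);
            }
        }
        return dirSet;
    }

    static String parentOf(String path) {
        int i = path.lastIndexOf('/');
        return i <= 0 ? null : path.substring(0, i);
    }

    public static void main(String[] args) {
        Set<String> listed = new HashSet<>();
        listed.add("base/a/b/c");
        listed.add("base/a/b/d");
        Set<String> dirs = buildDirectorySet(listed, "base");
        System.out.println(dirs.size());               // prints 4 (a, a/b, a/b/c, a/b/d)
        System.out.println(dirs.contains("base/a/b")); // prints "true"
    }
}
```

After processing `base/a/b/c` the parent chain `base/a/b`, `base/a` is recorded, so `base/a/b/d` only adds itself and stops as soon as it sees an already-traversed parent.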
How was this patch tested?
Tested in eu-west-1 by running `mvn -Dparallel-tests -DtestsThreadCount=16 clean verify`