Skip to content

Conversation

@ahmarsuhail
Copy link
Contributor

@ahmarsuhail ahmarsuhail commented Feb 10, 2022

Description of PR

JIRA: https://issues.apache.org/jira/browse/HADOOP-13704

This PR implements an optimised version of getContentSummary which uses the result from the listFiles iterator.

Explanation of new buildDirectorySet method added:

Since the listFiles operation can return the directory a/b/c as a single object, we need to recurse over the path a/b/c to ensure we have counted all directories. We do this by keeping two sets, dirSet (Set of all directories under the base path) and pathTraversed (Set of paths we have recursed over so far).

Iterating over directory structure basePath/a/b/c, basePath/a/b/d, we will first find all the directories in basePath/a/b/c. Once this is completed, the pathTraversed set will have {basePath/a/b} and dirSet will have {basePath/a, basePath/a/b, basePath/a/b/c}.

Then for basePath/a/b/d, just add basePath/a/b/d to the dirSet and don't do any additional work as path basePath/a/b has already been traversed.

The Jira ticket mentions that we should add in some instrumentation to measure usage. T's already code that does this
here and usage is tested in an integration test here .

How was this patch tested?

Tested in eu-west-1 by running

mvn -Dparallel-tests -DtestsThreadCount=16 clean verify

@apache apache deleted a comment from hadoop-yetus Feb 16, 2022
Copy link
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great.

added some comments, and i've just scanned the test suites to see where a larger dir tree could be tested.

i propose adding something in ITestS3ADirectoryPerformance.testListOperations()

  • get a summary for the test path, then one for root, verify that root numbers >= that of the test dir. and use duration tracker to measure/report duration.


fs.mkdirs(parent);

try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use lambda test utils intercept() here

import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.fs.s3a.S3ALocatedFileStatus;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: put all org.apache. imports in their own block under the others. note, some fixup of our move off guava means many of our current files break this rule ... and moving imports around makes cherrypicking harder. so we leave those alone

* @return an iterator
* @throws IOException failure
*/
RemoteIterator<S3AFileStatus> listStatusIterator(Path path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this method obsolete now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, have removed

import org.apache.hadoop.fs.contract.AbstractContractContentSummaryTest;
import org.apache.hadoop.fs.contract.AbstractFSContract;
import org.apache.hadoop.fs.s3a.S3AFileSystem;
import org.assertj.core.api.Assertions;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move to their own block above this one

Path baseDir = methodPath();

// Nested folders created separately will return as separate objects in listFiles()
fs.mkdirs(new Path(baseDir + "/a"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to use

new Path(basedir, "a");

@ahmarsuhail
Copy link
Contributor Author

@steveloughran thanks for the review :) I've made the suggested changes and also updated ITestS3ADirectoryPerformance.testListOperations()

@ahmarsuhail ahmarsuhail force-pushed the HADOOP-13704-optimised-content-summary branch from 945a9d9 to 3f3e1c4 Compare February 21, 2022 10:51
@hadoop-yetus
Copy link

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 47s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 5 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 12m 52s Maven dependency ordering for branch
+1 💚 mvninstall 24m 20s trunk passed
+1 💚 compile 27m 6s trunk passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 compile 24m 53s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 4m 20s trunk passed
+1 💚 mvnsite 2m 51s trunk passed
+1 💚 javadoc 1m 47s trunk passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 2m 27s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 4m 15s trunk passed
+1 💚 shadedclient 24m 11s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 28s Maven dependency ordering for patch
+1 💚 mvninstall 1m 49s the patch passed
+1 💚 compile 26m 59s the patch passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javac 26m 59s the patch passed
+1 💚 compile 24m 25s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 24m 25s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 4m 16s the patch passed
+1 💚 mvnsite 2m 41s the patch passed
+1 💚 javadoc 1m 41s the patch passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 2m 36s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 4m 38s the patch passed
+1 💚 shadedclient 24m 13s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 31m 3s hadoop-common in the patch passed.
+1 💚 unit 2m 47s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 58s The patch does not generate ASF License warnings.
261m 31s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3978/6/artifact/out/Dockerfile
GITHUB PR #3978
Optional Tests dupname asflicense mvnsite codespell markdownlint compile javac javadoc mvninstall unit shadedclient spotbugs checkstyle
uname Linux 1634af7db360 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 3f3e1c4
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3978/6/testReport/
Max. process+thread count 1260 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3978/6/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@apache apache deleted a comment from hadoop-yetus Mar 9, 2022
@steveloughran
Copy link
Contributor

.. not deliberately ignoring this, just falling behind on reviews while i try to get my manifest committer out the door. reviews there welcome, even though it targets abfs and gcs. #2971

i will pull some of this back into the s3a committer afterwards, including stat names and some IO enhancements to get parquet files writing faster (disabling existence/overwrite checks in __magic dirs)

#3289

@apache apache deleted a comment from hadoop-yetus Mar 22, 2022
@apache apache deleted a comment from hadoop-yetus Mar 22, 2022
Copy link
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. +1 on this, merging locally with a couple of text changes (javadocs of operation, filesystem.md) into trunk and branch-3.3

one failure in the itest run, unrelated, files https://issues.apache.org/jira/browse/HADOOP-18168

@steveloughran
Copy link
Contributor

patch is in, though i realise i forgot to add the pr# to the header, which is automatic when done through the UI. never mind, at least the JIRA Is in. closing this PR as done

@apache apache deleted a comment from hadoop-yetus Mar 22, 2022
@ahmarsuhail ahmarsuhail deleted the HADOOP-13704-optimised-content-summary branch October 7, 2022 08:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants