HADOOP-13704. Optimised getContentSummary() #3978
Conversation
steveloughran left a comment
looks great.
added some comments, and i've just scanned the test suites to see where a larger dir tree could be tested.
i propose adding something in ITestS3ADirectoryPerformance.testListOperations()
- get a summary for the test path, then one for root, verify that root numbers >= that of the test dir. and use duration tracker to measure/report duration.
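The proposed check can be illustrated standalone. This is a hedged sketch of the assertion only — summaries are computed over an in-memory list of (path, size) entries rather than a real S3A filesystem; `ITestS3ADirectoryPerformance` and the duration tracker live in the Hadoop test code and are not reproduced here. All names in the sketch are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: a content summary taken at the root must dominate the summary of
// any subdirectory (file count and total size both >=).
public class RootSummaryCheckSketch {

    static final class FileEntry {
        final String path;
        final long size;
        FileEntry(String path, long size) { this.path = path; this.size = size; }
    }

    /** File count and total size of entries under a path prefix. */
    static long[] summarize(List<FileEntry> files, String prefix) {
        long count = 0, bytes = 0;
        for (FileEntry f : files) {
            if (f.path.startsWith(prefix)) {
                count++;
                bytes += f.size;
            }
        }
        return new long[] {count, bytes};
    }

    public static void main(String[] args) {
        List<FileEntry> files = Arrays.asList(
            new FileEntry("/test/a/file1", 100),
            new FileEntry("/test/a/file2", 200),
            new FileEntry("/other/file3", 50));
        long[] root = summarize(files, "/");
        long[] testDir = summarize(files, "/test/");
        // root numbers must be >= those of the test dir
        System.out.println(root[0] >= testDir[0] && root[1] >= testDir[1]); // prints "true"
    }
}
```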
`fs.mkdirs(parent);`
`try {`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use lambda test utils intercept() here
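The `intercept()` the reviewer refers to is Hadoop's `org.apache.hadoop.test.LambdaTestUtils.intercept()`. For illustration, here is a minimal standalone sketch of the pattern — run a callable, expect a given exception type, and return the caught exception for further assertions. This is a simplified stand-in, not the Hadoop implementation.

```java
import java.util.concurrent.Callable;

// Sketch of the intercept() pattern: fail if the callable returns normally
// or throws the wrong type; otherwise hand the exception back to the caller.
public class InterceptSketch {

    static <E extends Throwable> E intercept(Class<E> clazz, Callable<?> eval) {
        Object result;
        try {
            result = eval.call();
        } catch (Throwable t) {
            if (clazz.isInstance(t)) {
                return clazz.cast(t);   // the expected exception: return it
            }
            throw new AssertionError("Unexpected exception type", t);
        }
        // no exception at all: that is also a test failure
        throw new AssertionError(
            "Expected " + clazz.getName() + " but got result: " + result);
    }

    public static void main(String[] args) {
        IllegalArgumentException ex = intercept(IllegalArgumentException.class,
            () -> { throw new IllegalArgumentException("bad path"); });
        System.out.println(ex.getMessage()); // prints "bad path"
    }
}
```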
`import java.util.HashSet;`
`import java.util.Set;`
`import org.apache.hadoop.fs.s3a.S3ALocatedFileStatus;`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: put all org.apache. imports in their own block under the others. note: some fixup from our move off Guava means many of our current files break this rule ... and moving imports around makes cherry-picking harder, so we leave those alone
...tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/GetContentSummaryOperation.java
` * @return an iterator`
` * @throws IOException failure`
` */`
`RemoteIterator<S3AFileStatus> listStatusIterator(Path path)`
is this method obsolete now?
yup, have removed
`import org.apache.hadoop.fs.contract.AbstractContractContentSummaryTest;`
`import org.apache.hadoop.fs.contract.AbstractFSContract;`
`import org.apache.hadoop.fs.s3a.S3AFileSystem;`
`import org.assertj.core.api.Assertions;`
nit: move to their own block above this one
`Path baseDir = methodPath();`
`// Nested folders created separately will return as separate objects in listFiles()`
`fs.mkdirs(new Path(baseDir + "/a"));`
better to use `new Path(basedir, "a");`
@steveloughran thanks for the review :) I've made the suggested changes and also updated
945a9d9 to 3f3e1c4
🎊 +1 overall
This message was automatically generated.
.. not deliberately ignoring this, just falling behind on reviews while i try to get my manifest committer out the door. reviews there welcome, even though it targets abfs and gcs. #2971. i will pull some of this back into the s3a committer afterwards, including stat names and some IO enhancements to get parquet files writing faster (disabling existence/overwrite checks in __magic dirs)
steveloughran left a comment
ok. +1 on this, merging locally with a couple of text changes (javadocs of operation, filesystem.md) into trunk and branch-3.3
one failure in the itest run, unrelated, filed https://issues.apache.org/jira/browse/HADOOP-18168
patch is in, though i realise i forgot to add the pr# to the header, which is automatic when done through the UI. never mind, at least the JIRA is in. closing this PR as done
Description of PR
JIRA: https://issues.apache.org/jira/browse/HADOOP-13704
This PR implements an optimised version of getContentSummary() which uses the result from the listFiles iterator.
Explanation of the new `buildDirectorySet` method added:
Since the listFiles operation can return the directory `a/b/c` as a single object, we need to recurse over the path `a/b/c` to ensure we have counted all directories. We do this by keeping two sets: dirSet (set of all directories under the base path) and pathTraversed (set of paths we have recursed over so far).
Iterating over the directory structure `basePath/a/b/c`, `basePath/a/b/d`, we will first find all the directories in `basePath/a/b/c`. Once this is completed, the pathTraversed set will have `{basePath/a/b}` and dirSet will have `{basePath/a, basePath/a/b, basePath/a/b/c}`. Then for `basePath/a/b/d`, we just add `basePath/a/b/d` to dirSet and don't do any additional work, as the path `basePath/a/b` has already been traversed.
The JIRA ticket mentions that we should add some instrumentation to measure usage. There's already code that does this here, and usage is tested in an integration test here.
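The two-set recursion described above can be sketched standalone. This is a simplified illustration over plain path strings, not the actual `GetContentSummaryOperation` code; the names mirror the description (dirSet, pathTraversed) and the parent-walk bookkeeping is an assumption of how the sets are maintained.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: given the directories returned by a listFiles-style listing,
// collect every directory between the base path and each listed directory,
// skipping parent chains that have already been traversed.
public class BuildDirectorySetSketch {

    static Set<String> buildDirectorySet(Set<String> listedDirs, String basePath) {
        Set<String> dirSet = new HashSet<>();        // all dirs under basePath
        Set<String> pathTraversed = new HashSet<>(); // parent chains already walked
        for (String dir : listedDirs) {
            dirSet.add(dir);
            // walk up the parents until basePath or an already-traversed parent
            String parent = parentOf(dir);
            while (parent != null && !parent.equals(basePath)
                    && !pathTraversed.contains(parent)) {
                dirSet.add(parent);
                pathTraversed.add(parent);
                parent = parentOf(parent);
            }
        }
        return dirSet;
    }

    static String parentOf(String path) {
        int i = path.lastIndexOf('/');
        return i <= 0 ? null : path.substring(0, i);
    }

    public static void main(String[] args) {
        Set<String> listed = new HashSet<>();
        listed.add("base/a/b/c");
        listed.add("base/a/b/d");
        Set<String> dirs = buildDirectorySet(listed, "base");
        System.out.println(dirs.size());               // prints 4 (a, a/b, a/b/c, a/b/d)
        System.out.println(dirs.contains("base/a/b")); // prints "true"
    }
}
```

After processing `base/a/b/c` the parent chain `base/a/b`, `base/a` is recorded, so `base/a/b/d` only adds itself and stops as soon as it sees an already-traversed parent.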
How was this patch tested?
Tested in eu-west-1 by running `mvn -Dparallel-tests -DtestsThreadCount=16 clean verify`