[SPARK-32381][CORE][SQL][FOLLOWUP] More cleanup on HadoopFSUtils #29959
Conversation
Kubernetes integration test starting
Test build #129479 has finished for PR 29959 at commit
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129484 has finished for PR 29959 at commit
Thanks for continuing to work on this, some initial questions.
if (ignoreLocality) {
  fs.listStatus(path)
} else {
  val remoteIter = fs.listLocatedStatus(path)
Is there a chance an FS won't have this implemented, as per the previous code's comment?
Yeah, a FS can choose not to implement it (although all the main ones override this). If not implemented, it will fall back to the default impl in FileSystem, which basically calls listStatus and then getFileBlockLocations on each FileStatus received. The behavior is very similar to what this class is doing later on.
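As a rough illustration (not Hadoop's actual code), the default fallback described above — listStatus followed by a per-entry getFileBlockLocations, exposed as an iterator — can be modeled in plain Scala. The case classes and the in-memory "filesystem" below are hypothetical stand-ins for the Hadoop types:

```scala
// Hypothetical stand-ins for Hadoop's FileStatus / LocatedFileStatus types.
case class MockFileStatus(path: String, isDir: Boolean)
case class MockLocatedStatus(status: MockFileStatus, hosts: Seq[String])

object ListingFallback {
  // An in-memory "filesystem": directory path -> entries, file path -> block hosts.
  val entries = Map(
    "/data" -> Seq(MockFileStatus("/data/a", isDir = false),
                   MockFileStatus("/data/sub", isDir = true)))
  val blockHosts = Map("/data/a" -> Seq("host1", "host2"))

  def listStatus(path: String): Seq[MockFileStatus] =
    entries.getOrElse(path, Seq.empty)

  def getFileBlockLocations(st: MockFileStatus): Seq[String] =
    if (st.isDir) Seq.empty else blockHosts.getOrElse(st.path, Seq.empty)

  // Shape of FileSystem's default listLocatedStatus: list the directory,
  // then fetch locations one entry at a time, exposed as an iterator.
  def listLocatedStatus(path: String): Iterator[MockLocatedStatus] =
    listStatus(path).iterator.map(st => MockLocatedStatus(st, getFileBlockLocations(st)))
}
```

The point of the override in HDFS/S3A is precisely to avoid this per-entry pattern and fetch locations in pages instead.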
HDFS and S3A both do this; ABFS merits the minor optimisation too. Because they return a remote iterator, they can do a paged fetch of data:
- HDFS/webHDFS: paged download for better scalability
- S3A (3.3.1+): async prefetch of the next page of data
ABFS should copy the S3A approach; its listing API is paged too.
Better to rely on the FS to do the work, but make clear you expect the maintainers to do so.
Thanks @steveloughran, yes I also think it's better to rely on the FileSystem-specific listLocatedStatus impl rather than having the logic here. However, the change seems to break a few assumptions in the test cases, so I'll isolate it into another PR.
// listLocatedStatus will fail as a whole because the default impl calls
// getFileBlockLocations
assert(leafFiles.isEmpty)
This seems to indicate the change needs some work.
Yes, this test checks the case where a file was deleted after a listStatus call but before a subsequent getFileBlockLocations call when locality info is needed. With the new impl, we'd call listLocatedStatus instead, which calls getFileBlockLocations internally, and thus the listLocatedStatus call (as a whole) fails with FileNotFoundException.
As explained in the PR description, the behavior will be different when spark.sql.files.ignoreMissingFiles is set, although I think we currently don't give any guarantee when there are missing files during listing, so either is acceptable? Anyway, I'm happy to remove this change if there is any concern.
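The behavioral difference under discussion can be sketched in plain Scala. The names below are hypothetical; the real logic lives in HadoopFSUtils and is driven by spark.sql.files.ignoreMissingFiles:

```scala
import java.io.FileNotFoundException

object MissingFileSketch {
  // Per-file approach: locations are fetched file by file, so a single
  // deleted file can be skipped and the rest of the listing survives.
  def listPerFile(files: Seq[String], locate: String => Seq[String],
                  ignoreMissing: Boolean): Seq[(String, Seq[String])] =
    files.flatMap { f =>
      try Some(f -> locate(f))
      catch { case _: FileNotFoundException if ignoreMissing => None }
    }

  // listLocatedStatus-style approach: the listing fails as a whole, so the
  // entire directory comes back empty when a file disappears mid-listing.
  def listWhole(listAll: () => Seq[(String, Seq[String])],
                ignoreMissing: Boolean): Seq[(String, Seq[String])] =
    try listAll()
    catch { case _: FileNotFoundException if ignoreMissing => Seq.empty }
}
```

With ignoreMissingFiles enabled, the first variant keeps the surviving files while the second drops the whole directory — which is the difference the test above is sensitive to.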
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #129563 has finished for PR 29959 at commit
Thanks @holdenk for the review. Yes this PR still needs a bit more work. Will update.
Force-pushed from 5e299a2 to f582b17
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129571 has finished for PR 29959 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129592 has finished for PR 29959 at commit
Jenkins, add to whitelist
retest this please
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129604 has finished for PR 29959 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129605 has finished for PR 29959 at commit
Test build #129606 has finished for PR 29959 at commit
Force-pushed from be1517e to cb76047
@holdenk sure - it's done.
Kubernetes integration test starting
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test status failure
Test build #131175 has finished for PR 29959 at commit
Test build #131176 has finished for PR 29959 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #131237 has finished for PR 29959 at commit
K8s failures are unrelated; this does not change any of the decommissioning logic. I'll work on a follow-up to the decommissioning failures.
Thanks @holdenk for the review & merge!
Could you check the Hadoop 2.7 failure?
Thanks @dongjoon-hyun, let me take a look.
Found a potential issue and opened #30447
…leFileStatus and SerializableBlockLocation for Hadoop 2.7

### What changes were proposed in this pull request?
Revert the change in #29959 and don't remove `SerializableFileStatus` and `SerializableBlockLocation`.

### Why are the changes needed?
In Hadoop 2.7, `FileStatus` and `BlockLocation` are not serializable, so we still need the two wrapper classes.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

Closes #30447 from sunchao/SPARK-32381-followup.
Authored-by: Chao Sun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
- val filteredStatuses = doFilter(statuses)
  val allLeafStatuses = {
-   val (dirs, topLevelFiles) = filteredStatuses.partition(_.isDirectory)
+   val (dirs, topLevelFiles) = statuses.partition(_.isDirectory)
@sunchao the dirs here may contain hidden directories. We still need to filter them before listing leaf files.
@gengliangwang you're right. Thanks for catching this! And sorry for introducing this regression.
…ition inference

### What changes were proposed in this pull request?
Fix a regression from #29959. In Spark, the following file paths are considered hidden paths, and they are ignored on file reads:
1. starts with "_" and doesn't contain "="
2. starts with "."

However, after the refactoring PR #29959, the hidden paths are not filtered out on partition inference: https://github.com/apache/spark/pull/29959/files#r556432426

This PR is to fix the bug. To achieve the goal, the method `InMemoryFileIndex.shouldFilterOut` is refactored as `HadoopFSUtils.shouldFilterOutPathName`.

### Why are the changes needed?
Bugfix

### Does this PR introduce _any_ user-facing change?
Yes, it fixes a bug for reading file paths with partitions.

### How was this patch tested?
Unit test

Closes #31169 from gengliangwang/fileListingBug.
Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
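The two hidden-path rules stated in the fix description translate directly into a small predicate. This is a sketch of the rule exactly as described here, not necessarily the full body of `HadoopFSUtils.shouldFilterOutPathName` (the real method may carve out exceptions, e.g. for Parquet summary files), and the function name is ours:

```scala
// Sketch of the hidden-path rule described in the fix: a path name is
// filtered out if it starts with "_" and contains no "=", or starts with ".".
// (Approximation of HadoopFSUtils.shouldFilterOutPathName, not its exact body.)
def isHiddenPathName(name: String): Boolean =
  (name.startsWith("_") && !name.contains("=")) || name.startsWith(".")
```

Note the "=" carve-out: a name like `_col=1` is a partition directory, not a hidden path, which is why rule 1 excludes names containing "=".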
(Same fix cherry-picked from commit 467d758. Signed-off-by: HyukjinKwon <[email protected]>)
What changes were proposed in this pull request?

This PR is a follow-up of #29471 and does the following improvements for `HadoopFSUtils`:
- Removes `filterFun` from the listing API and combines it with the `filter`.
- Removes `SerializableBlockLocation` and `SerializableFileStatus` given that `BlockLocation` and `FileStatus` are already serializable.
- Removes the `isRootLevel` flag from the top-level API.

Why are the changes needed?

Main purpose is to simplify the logic within `HadoopFSUtils` as well as clean up the API.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests (e.g., `FileIndexSuite`)