Skip to content

Conversation

sunchao
Copy link
Member

@sunchao sunchao commented Oct 6, 2020

What changes were proposed in this pull request?

This PR is a follow-up of #29471 and does the following improvements for HadoopFSUtils:

  1. Removes the extra filterFun from the listing API and combines it with the filter.
  2. Removes SerializableBlockLocation and SerializableFileStatus given that BlockLocation and FileStatus are already serializable.
  3. Hides the isRootLevel flag from the top-level API.

Why are the changes needed?

Main purpose is to simplify the logic within HadoopFSUtils as well as cleanup the API.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests (e.g., FileIndexSuite)

@SparkQA
Copy link

SparkQA commented Oct 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34085/

@SparkQA
Copy link

SparkQA commented Oct 7, 2020

Test build #129479 has finished for PR 29959 at commit 6f7dc79.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 7, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34085/

@SparkQA
Copy link

SparkQA commented Oct 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34090/

@SparkQA
Copy link

SparkQA commented Oct 7, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34090/

@SparkQA
Copy link

SparkQA commented Oct 7, 2020

Test build #129484 has finished for PR 29959 at commit 8c8aa81.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for continuing to work on this, some initial questions.

if (ignoreLocality) {
fs.listStatus(path)
} else {
val remoteIter = fs.listLocatedStatus(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a chance a FS won't have this implemented? as per the previous code's comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah a FS can choose not to implement it (although all the main ones override this). If not implemented it will fall back to the default impl in FileSystem, which basically calls listStatus and then getFileBlockLocations on each FileStatus received. The behavior is very similar to what this class is doing later on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HDFS and S3A both do this; ABFS merits minor optimisation too. Because they return a remote iterator they can do paged fetch of data

  • HDFS/webHDFS: paged download for better scalability
  • S3A (3.3.1+): async prefetch of next page of data

ABFS should copy the S3A approach; it's listing API is paged too.

Better to rely on the FS to do the work but make clear you expect the maintainers to do so

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @steveloughran , yes I also think it's better to rely on the FileSystem-specific listLocatedStatus impl rather than having the logic here. However, the change seems to break a few assumptions in the test cases so I'll isolate it into another PR.

Comment on lines 217 to 219
// listLocatedStatus will fail as a whole because the default impl calls
// getFileBlockLocations
assert(leafFiles.isEmpty)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to indicate the change needs some work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this test checks the case where a file was deleted after a listStatus call but before a subsequent getFileBlockLocations when locality info is needed. With the new impl, we'd call listLocatedStatus instead which will call getFileBlockLocations internally, and thus the listLocatedStatus call (as a whole) fails with FileNotFoundException.

As explained in the PR description, the behavior will be different when spark.sql.files.ignoreMissingFiles is set, although I think we currently don't give any guarantee when there is missing files during listing, so either is acceptable? anyway, I'm happy to remove this change if there is any concern.

@SparkQA
Copy link

SparkQA commented Oct 8, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34169/

@SparkQA
Copy link

SparkQA commented Oct 8, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34169/

@SparkQA
Copy link

SparkQA commented Oct 8, 2020

Test build #129563 has finished for PR 29959 at commit 5e299a2.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao
Copy link
Member Author

sunchao commented Oct 8, 2020

Thanks @holdenk for the review. Yes this PR still needs a bit more work. Will update.

@sunchao sunchao force-pushed the hadoop-fs-utils-followup branch from 5e299a2 to f582b17 Compare October 9, 2020 02:24
@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34177/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34177/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Test build #129571 has finished for PR 29959 at commit f582b17.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34195/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34195/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Test build #129592 has finished for PR 29959 at commit e10e59f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Oct 9, 2020

Jenkins, add to whitelist

@sunchao
Copy link
Member Author

sunchao commented Oct 9, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34207/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34207/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34208/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34208/

@SparkQA
Copy link

SparkQA commented Oct 10, 2020

Test build #129604 has finished for PR 29959 at commit e10e59f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34209/

@SparkQA
Copy link

SparkQA commented Oct 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34209/

@SparkQA
Copy link

SparkQA commented Oct 10, 2020

Test build #129605 has finished for PR 29959 at commit e10e59f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 10, 2020

Test build #129606 has finished for PR 29959 at commit 1b4bfbe.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao sunchao changed the title [WIP][SPARK-32381][CORE][SQL][FOLLOWUP] More cleanup on HadoopFSUtils [SPARK-32381][CORE][SQL][FOLLOWUP] More cleanup on HadoopFSUtils Oct 10, 2020
@sunchao sunchao force-pushed the hadoop-fs-utils-followup branch from be1517e to cb76047 Compare November 16, 2020 20:13
@github-actions github-actions bot added the CORE label Nov 16, 2020
@sunchao
Copy link
Member Author

sunchao commented Nov 16, 2020

@holdenk sure - it's done.

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35775/

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35776/

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35775/

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35776/

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Test build #131175 has finished for PR 29959 at commit be1517e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Test build #131176 has finished for PR 29959 at commit cb76047.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35841/

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35841/

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Test build #131237 has finished for PR 29959 at commit e9d399d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Nov 18, 2020

K8s failures are unrelated, this does not change any of the decommissioning logic. I'll work on a follow up to the decommissioning failures.

@asfgit asfgit closed this in 27cd945 Nov 18, 2020
@sunchao
Copy link
Member Author

sunchao commented Nov 18, 2020

Thanks @holdenk for the review & merge!

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Nov 20, 2020

Hi, @holdenk and @sunchao .

Could you check Hadoop 2.7 failure?

[info] - SPARK-24626 parallel file listing in Stats computation *** FAILED *** (2 seconds, 408 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: task 0.0 in stage 21.0 (TID 19) had a not serializable result: org.apache.hadoop.fs.Path
[info] Serialization stack:

@sunchao
Copy link
Member Author

sunchao commented Nov 20, 2020

thanks @dongjoon-hyun , let me take a look.

@sunchao
Copy link
Member Author

sunchao commented Nov 20, 2020

found potential issue and opened #30447

dongjoon-hyun pushed a commit that referenced this pull request Nov 21, 2020
…leFileStatus and SerializableBlockLocation for Hadoop 2.7

### What changes were proposed in this pull request?

Revert the change in #29959 and don't remove `SerializableFileStatus` and `SerializableBlockLocation`.

### Why are the changes needed?

In Hadoop 2.7 `FileStatus` and `BlockLocation` are not serializable, so we still need the two wrapper classes.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #30447 from sunchao/SPARK-32381-followup.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
val filteredStatuses = doFilter(statuses)
val allLeafStatuses = {
val (dirs, topLevelFiles) = filteredStatuses.partition(_.isDirectory)
val (dirs, topLevelFiles) = statuses.partition(_.isDirectory)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sunchao the dirs here may contain hidden directories. We still need to filter them before listing leaf files.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gengliangwang you're right. Thanks for catching this! and sorry for introducing this regression.

HyukjinKwon pushed a commit that referenced this pull request Jan 14, 2021
…ition inference

### What changes were proposed in this pull request?

Fix a regression from #29959.

In Spark, the following file paths are considered as hidden paths and they are ignored on file reads:
1. starts with "_" and doesn't contain "="
2. starts with "."

However, after the refactoring PR #29959, the hidden paths are not filtered out on partition inference: https://github.com/apache/spark/pull/29959/files#r556432426

This PR is to fix the bug. To archive the goal, the method `InMemoryFileIndex.shouldFilterOut` is refactored as `HadoopFSUtils.shouldFilterOutPathName`

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

Yes, it fixes a bug for reading file paths with partitions.

### How was this patch tested?

Unit test

Closes #31169 from gengliangwang/fileListingBug.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Jan 14, 2021
…ition inference

### What changes were proposed in this pull request?

Fix a regression from #29959.

In Spark, the following file paths are considered as hidden paths and they are ignored on file reads:
1. starts with "_" and doesn't contain "="
2. starts with "."

However, after the refactoring PR #29959, the hidden paths are not filtered out on partition inference: https://github.com/apache/spark/pull/29959/files#r556432426

This PR is to fix the bug. To archive the goal, the method `InMemoryFileIndex.shouldFilterOut` is refactored as `HadoopFSUtils.shouldFilterOutPathName`

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

Yes, it fixes a bug for reading file paths with partitions.

### How was this patch tested?

Unit test

Closes #31169 from gengliangwang/fileListingBug.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 467d758)
Signed-off-by: HyukjinKwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants