[SPARK-32381][CORE][SQL][FOLLOWUP] More cleanup on HadoopFSUtils #29959

sunchao · 2020-10-06T23:48:08Z

What changes were proposed in this pull request?

This PR is a follow-up to #29471 and makes the following improvements to HadoopFSUtils:

Removes the extra filterFun from the listing API and combines it with the filter.
Removes SerializableBlockLocation and SerializableFileStatus given that BlockLocation and FileStatus are already serializable.
Hides the isRootLevel flag from the top-level API.

Why are the changes needed?

The main purpose is to simplify the logic within HadoopFSUtils and to clean up its API.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests (e.g., FileIndexSuite)

SparkQA · 2020-10-07T00:32:48Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34085/

SparkQA · 2020-10-07T00:40:01Z

Test build #129479 has finished for PR 29959 at commit 6f7dc79.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-07T00:49:31Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34085/

SparkQA · 2020-10-07T02:48:48Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34090/

SparkQA · 2020-10-07T03:06:53Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34090/

SparkQA · 2020-10-07T04:22:28Z

Test build #129484 has finished for PR 29959 at commit 8c8aa81.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk

Thanks for continuing to work on this, some initial questions.

core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala

holdenk · 2020-10-08T18:28:43Z

core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala

+      if (ignoreLocality) {
+        fs.listStatus(path)
+      } else {
+        val remoteIter = fs.listLocatedStatus(path)


Is there a chance an FS won't have this implemented, as per the previous code's comment?

Yeah, an FS can choose not to implement it (although all the main ones override it). If it is not implemented, the call falls back to the default impl in FileSystem, which basically calls listStatus and then getFileBlockLocations on each FileStatus received. The behavior is very similar to what this class does later on.

HDFS and S3A both do this; ABFS merits a minor optimisation too. Because they return a remote iterator, they can do a paged fetch of data:

HDFS/webHDFS: paged download for better scalability

S3A (3.3.1+): async prefetch of next page of data

ABFS should copy the S3A approach; its listing API is paged too.

Better to rely on the FS to do the work, but make clear that you expect the maintainers to do so.

Thanks @steveloughran , yes I also think it's better to rely on the FileSystem-specific listLocatedStatus impl rather than having the logic here. However, the change seems to break a few assumptions in the test cases so I'll isolate it into another PR.
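The fallback discussed in this thread (list first, then look up block locations one file at a time) can be sketched roughly as follows. This is only an illustration: `Status`, `Located`, `SimpleFS` and `InMemoryFS` are hypothetical stand-in types, not Hadoop classes, and the real default in Hadoop's `FileSystem` may differ in detail.

```scala
object ListingSketch {
  // Stand-in types for illustration only; these are NOT Hadoop's classes.
  final case class Status(path: String, isDirectory: Boolean, length: Long)
  final case class Located(status: Status, hosts: Seq[String])

  trait SimpleFS {
    def listStatus(dir: String): Seq[Status]
    def blockHosts(status: Status): Seq[String]

    // Default behaviour: list once, then fetch locations file by file.
    // A concrete file system could override this with one paged remote call.
    def listLocatedStatus(dir: String): Iterator[Located] =
      listStatus(dir).iterator.map { s =>
        val hosts = if (s.isDirectory) Seq.empty[String] else blockHosts(s)
        Located(s, hosts)
      }
  }

  // Tiny in-memory file system that relies on the default implementation.
  object InMemoryFS extends SimpleFS {
    private val entries = Seq(
      Status("/data/a.parquet", isDirectory = false, length = 10L),
      Status("/data/sub", isDirectory = true, length = 0L))

    def listStatus(dir: String): Seq[Status] =
      entries.filter(_.path.startsWith(dir + "/"))

    def blockHosts(status: Status): Seq[String] = Seq("host1", "host2")
  }

  val located: List[Located] = InMemoryFS.listLocatedStatus("/data").toList
}
```

Because the result is an iterator, an overriding implementation is free to fetch entries page by page, which is the scalability point made above.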

holdenk · 2020-10-08T19:01:16Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala

+              // listLocatedStatus will fail as a whole because the default impl calls
+              // getFileBlockLocations
+              assert(leafFiles.isEmpty)


This seems to indicate the change needs some work.

Yes, this test checks the case where a file is deleted after a listStatus call but before a subsequent getFileBlockLocations call, when locality info is needed. With the new impl, we'd call listLocatedStatus instead, which calls getFileBlockLocations internally, and thus the listLocatedStatus call (as a whole) fails with FileNotFoundException.

As explained in the PR description, the behavior will be different when spark.sql.files.ignoreMissingFiles is set, although I think we currently don't give any guarantee when there are missing files during listing, so either is acceptable? Anyway, I'm happy to remove this change if there is any concern.
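The behavioral difference described here can be illustrated with a small, self-contained sketch. It does not use Spark or Hadoop; `listed`, `locationsOf`, `perFile` and `combined` are hypothetical names, and the `ignoreMissing` flag only loosely stands in for spark.sql.files.ignoreMissingFiles.

```scala
import java.io.FileNotFoundException

object MissingFileSketch {
  // Files returned by the listing; "b" is deleted before its locations are fetched.
  val listed: Seq[String] = Seq("a", "b", "c")
  private val deletedAfterListing = Set("b")

  def locationsOf(file: String): Seq[String] =
    if (deletedAfterListing(file)) throw new FileNotFoundException(file)
    else Seq("host1")

  // Separate per-file lookup: a vanished file can be skipped on its own.
  def perFile(ignoreMissing: Boolean): Seq[(String, Seq[String])] =
    listed.flatMap { f =>
      try Some(f -> locationsOf(f))
      catch { case _: FileNotFoundException if ignoreMissing => None }
    }

  // Combined call: one vanished file fails the listing as a whole, so a
  // caller that ignores missing files ends up with nothing for the directory.
  def combined(ignoreMissing: Boolean): Seq[(String, Seq[String])] =
    try listed.map(f => f -> locationsOf(f))
    catch { case _: FileNotFoundException if ignoreMissing => Seq.empty }
}
```

With `ignoreMissing = true`, `perFile` keeps the surviving files while `combined` returns an empty result, which corresponds to the `assert(leafFiles.isEmpty)` expectation in the updated test.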

SparkQA · 2020-10-08T19:12:25Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34169/

SparkQA · 2020-10-08T19:33:35Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34169/

SparkQA · 2020-10-08T20:57:54Z

Test build #129563 has finished for PR 29959 at commit 5e299a2.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

sunchao · 2020-10-08T21:06:01Z

Thanks @holdenk for the review. Yes, this PR still needs a bit more work. Will update.

SparkQA · 2020-10-09T03:18:09Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34177/

SparkQA · 2020-10-09T03:34:40Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34177/

SparkQA · 2020-10-09T05:08:04Z

Test build #129571 has finished for PR 29959 at commit f582b17.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-09T18:52:17Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34195/

SparkQA · 2020-10-09T19:10:42Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34195/

SparkQA · 2020-10-09T20:38:25Z

Test build #129592 has finished for PR 29959 at commit e10e59f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-10-09T22:04:02Z

Jenkins, add to whitelist

sunchao · 2020-10-09T22:40:33Z

retest this please

SparkQA · 2020-10-09T23:03:18Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34207/

SparkQA · 2020-10-09T23:21:41Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34207/

SparkQA · 2020-10-09T23:28:07Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34208/

SparkQA · 2020-10-09T23:45:12Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34208/

SparkQA · 2020-10-10T00:20:09Z

Test build #129604 has finished for PR 29959 at commit e10e59f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-10T00:36:30Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34209/

SparkQA · 2020-10-10T00:52:28Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34209/

SparkQA · 2020-10-10T01:07:49Z

Test build #129605 has finished for PR 29959 at commit e10e59f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-10T02:24:50Z

Test build #129606 has finished for PR 29959 at commit 1b4bfbe.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

sunchao · 2020-11-16T20:14:18Z

@holdenk sure - it's done.

SparkQA · 2020-11-16T20:53:02Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35775/

SparkQA · 2020-11-16T21:02:14Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35776/

SparkQA · 2020-11-16T21:16:10Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35775/

SparkQA · 2020-11-16T21:28:27Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35776/

SparkQA · 2020-11-16T22:04:12Z

Test build #131175 has finished for PR 29959 at commit be1517e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-16T22:45:49Z

Test build #131176 has finished for PR 29959 at commit cb76047.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-18T01:24:18Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35841/

SparkQA · 2020-11-18T01:52:40Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35841/

SparkQA · 2020-11-18T03:12:02Z

Test build #131237 has finished for PR 29959 at commit e9d399d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2020-11-18T20:37:39Z

The K8s failures are unrelated; this does not change any of the decommissioning logic. I'll work on a follow-up for the decommissioning failures.

sunchao · 2020-11-18T21:17:52Z

Thanks @holdenk for the review & merge!

dongjoon-hyun · 2020-11-20T16:56:30Z

Hi, @holdenk and @sunchao .

Could you check the Hadoop 2.7 failure?

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/1609/

[info] - SPARK-24626 parallel file listing in Stats computation *** FAILED *** (2 seconds, 408 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: task 0.0 in stage 21.0 (TID 19) had a not serializable result: org.apache.hadoop.fs.Path
[info] Serialization stack:

sunchao · 2020-11-20T16:59:14Z

thanks @dongjoon-hyun , let me take a look.

sunchao · 2020-11-20T17:10:34Z

Found a potential issue and opened #30447.

…leFileStatus and SerializableBlockLocation for Hadoop 2.7

### What changes were proposed in this pull request?

Revert the change in #29959 and don't remove `SerializableFileStatus` and `SerializableBlockLocation`.

### Why are the changes needed?

In Hadoop 2.7 `FileStatus` and `BlockLocation` are not serializable, so we still need the two wrapper classes.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #30447 from sunchao/SPARK-32381-followup.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
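Why a wrapper class helps when a library type is not serializable can be shown with a minimal sketch. `PlainStatus` is a hypothetical stand-in for a non-serializable class, and `SerializableStatus` is an illustrative wrapper; neither is the actual Spark or Hadoop class, and the fields are assumptions chosen for the example.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object SerializationSketch {
  // Stand-in for a library class that does not implement java.io.Serializable.
  class PlainStatus(val path: String, val length: Long, val isDir: Boolean)

  // Wrapper that copies only plain fields; Scala case classes are Serializable.
  final case class SerializableStatus(path: String, length: Long, isDir: Boolean) {
    def toPlain: PlainStatus = new PlainStatus(path, length, isDir)
  }

  object SerializableStatus {
    def from(s: PlainStatus): SerializableStatus =
      SerializableStatus(s.path, s.length, s.isDir)
  }

  // Java-serializes a value to bytes and reads it back.
  // Throws java.io.NotSerializableException for non-serializable values.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

Converting to the wrapper before a result crosses a serialization boundary, and back with `toPlain` afterwards, avoids the "not serializable result" failure quoted above at the cost of an extra copy.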

gengliangwang · 2021-01-13T10:54:34Z

core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala

-    val filteredStatuses = doFilter(statuses)
    val allLeafStatuses = {
-      val (dirs, topLevelFiles) = filteredStatuses.partition(_.isDirectory)
+      val (dirs, topLevelFiles) = statuses.partition(_.isDirectory)


@sunchao the dirs here may contain hidden directories. We still need to filter them before listing leaf files.

@gengliangwang you're right. Thanks for catching this, and sorry for introducing this regression.
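The point about filtering directories before descending into them can be sketched with an in-memory tree. `Node`, `FileNode`, `DirNode`, `listLeafFiles` and the `keep` predicate are hypothetical names for illustration and do not mirror the HadoopFSUtils code.

```scala
object LeafListingSketch {
  sealed trait Node { def name: String }
  final case class FileNode(name: String) extends Node
  final case class DirNode(name: String, children: Seq[Node]) extends Node

  // Returns the paths of leaf files under `dir`. The `keep` predicate is applied
  // to every child, files AND directories, before any recursion happens, so
  // nothing below a rejected directory is ever listed.
  def listLeafFiles(dir: DirNode, keep: String => Boolean, prefix: String = ""): Seq[String] = {
    val base = s"$prefix/${dir.name}"
    dir.children.filter(n => keep(n.name)).flatMap {
      case FileNode(n) => Seq(s"$base/$n")
      case d: DirNode => listLeafFiles(d, keep, base)
    }
  }

  val table: DirNode = DirNode("tbl", Seq(
    DirNode("year=2020", Seq(FileNode("part-0"))),
    DirNode(".staging", Seq(FileNode("part-1"))),
    FileNode("_SUCCESS")))

  // Simplified predicate for the example: drop names starting with "." or "_".
  val visible: Seq[String] =
    listLeafFiles(table, name => !name.startsWith(".") && !name.startsWith("_"))
}
```

If the predicate were applied only to files, `part-1` under `.staging` would be returned as well, which is the kind of regression reported in this comment.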

…ition inference

### What changes were proposed in this pull request?

Fix a regression from #29959.

In Spark, the following file paths are considered hidden paths and are ignored on file reads:

1. starts with "_" and doesn't contain "="
2. starts with "."

However, after the refactoring PR #29959, the hidden paths are not filtered out on partition inference: https://github.com/apache/spark/pull/29959/files#r556432426

This PR fixes the bug. To achieve the goal, the method `InMemoryFileIndex.shouldFilterOut` is refactored as `HadoopFSUtils.shouldFilterOutPathName`.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

Yes, it fixes a bug for reading file paths with partitions.

### How was this patch tested?

Unit test

Closes #31169 from gengliangwang/fileListingBug.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

(cherry picked from commit 467d758)
Signed-off-by: HyukjinKwon <[email protected]>
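The two hidden-path rules quoted in the commit message can be written as a small predicate. This is a sketch of just those two rules; `isHiddenName` is a hypothetical helper, and the actual `HadoopFSUtils.shouldFilterOutPathName` may handle additional special cases not covered here.

```scala
object HiddenPathSketch {
  // Rule 1: starts with "_" and does not contain "=".
  // Rule 2: starts with ".".
  def isHiddenName(name: String): Boolean =
    (name.startsWith("_") && !name.contains("=")) || name.startsWith(".")
}
```

The "=" exception in rule 1 keeps partition directories such as `_col=1` visible, since their names encode a partition column and value.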

holdenk reviewed Oct 8, 2020


sunchao force-pushed the hadoop-fs-utils-followup branch from 5e299a2 to f582b17 Compare October 9, 2020 02:24

sunchao changed the title ~~[WIP][SPARK-32381][CORE][SQL][FOLLOWUP] More cleanup on HadoopFSUtils~~ [SPARK-32381][CORE][SQL][FOLLOWUP] More cleanup on HadoopFSUtils Oct 10, 2020

sunchao added 5 commits November 16, 2020 12:13

Remove extra path filter

64d3645

Remove SerializableBlockLocation and SerializableFileStatus

c679d33

Hide isRootLevel parameter

0df061c

Fix lint

76f5e23

Fix a silly mistake

cb76047

sunchao force-pushed the hadoop-fs-utils-followup branch from be1517e to cb76047 Compare November 16, 2020 20:13

github-actions bot added the CORE label Nov 16, 2020

Empty commit to trigger CI again

e9d399d

asfgit closed this in 27cd945 Nov 18, 2020

sunchao mentioned this pull request Nov 20, 2020

[SPARK-32381][CORE][FOLLOWUP][test-hadoop2.7] Don't remove SerializableFileStatus and SerializableBlockLocation for Hadoop 2.7 #30447

Closed

gengliangwang mentioned this pull request Jan 13, 2021

[SPARK-34075][SQL][CORE] Hidden directories are being listed for partition inference #31169

Closed

gengliangwang reviewed Jan 13, 2021

