[SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source #30411
Conversation
Kubernetes integration test starting
cc @maropu @gengliangwang @cloud-fan @bart-samwel @zsxwing @dongjoon-hyun @HyukjinKwon I've cc-ed all reviewers who left at least one review comment in the original PR (#28841). Just FYI to @cchighman, as he's the author of the original PR.
The first commit squashes the commits in the original PR. The remaining commits are my own, addressing my own review comments. So if you're done with the original PR and it looked good to you, you probably only need to look at the remaining commits.
Test build #131284 has finished for PR 30411 at commit
Kubernetes integration test status success
Test build #131281 has finished for PR 30411 at commit
Test build #131285 has finished for PR 30411 at commit
Test build #131287 has finished for PR 30411 at commit
test("Option pathGlobFilter: filter files correctly") { |
Note to reviewers: two tests were moved from FileBasedDataSourceSuite, as this PR adds a dedicated suite for the filtering options.
I left minor comments and the other parts look fine. cc: @HyukjinKwon @dongjoon-hyun @viirya
// $example on:load_with_modified_time_filter$
val beforeFilterDF = spark.read.format("parquet")
  // Files modified before 07/01/2020 at 05:30 are allowed
  .option("modifiedBefore", "2020-07-01T05:30:00")
nit: two indents to follow the other examples.
// +-------------+
val afterFilterDF = spark.read.format("parquet")
  // Files modified after 06/01/2020 at 05:30 are allowed
  .option("modifiedAfter", "2020-06-01T05:30:00")
ditto
`modifiedBefore` and `modifiedAfter` are options that can be
applied together or separately in order to achieve greater
granularity over which files may load during a Spark batch query.
(Structured Streaming file source doesn't support these options.)
nit: (Structured Streaming file source doesn't support these options.)
-> Note that Structured Streaming file sources don't support these options.
?
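For reference, the two options can also be combined to bound the modification window on both sides. A minimal sketch in the style of the examples above (the load path is a placeholder):

// Files modified between 06/01/2020 and 07/01/2020 at 05:30 are allowed
val betweenFilterDF = spark.read.format("parquet")
  .option("modifiedAfter", "2020-06-01T05:30:00")
  .option("modifiedBefore", "2020-07-01T05:30:00")
  .load("examples/src/main/resources/dir1")  // placeholder path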
python/pyspark/sql/readwriter.py
properties = dict()
jprop = JavaClass("java.util.Properties", self._spark._sc._gateway._gateway_client)()
jprop = JavaClass("java.util.Properties",
                  self._spark._sc._gateway._gateway_client)()
an unnecessary change?
python/pyspark/sql/readwriter.py
gateway = self._spark._sc._gateway
jpredicates = utils.toJArray(gateway, gateway.jvm.java.lang.String, predicates)
jpredicates = utils.toJArray(
    gateway, gateway.jvm.java.lang.String, predicates)
ditto (I have the same comments on the changes below, too).
protected def leafDirToChildrenFiles: Map[Path, Array[FileStatus]]

private val caseInsensitiveMap = CaseInsensitiveMap(parameters)
protected val pathFilters = PathFilterFactory.create(caseInsensitiveMap)
protected
-> private
?
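For context, a rough sketch of how a factory like this could assemble multiple filters. This is a hypothetical simplification: the names mirror the snippet above, but the bodies are illustrative rather than the PR's actual code, and it assumes the options map already has lower-cased keys:

import org.apache.hadoop.fs.FileStatus

// Hypothetical simplification: each filter decides per file whether it
// should survive the listing phase.
trait PathFilterStrategy {
  def accept(fileStatus: FileStatus): Boolean
}

class ModifiedAfterFilter(thresholdMs: Long) extends PathFilterStrategy {
  // Keep files whose modification time is after the threshold.
  def accept(fileStatus: FileStatus): Boolean =
    fileStatus.getModificationTime > thresholdMs
}

class ModifiedBeforeFilter(thresholdMs: Long) extends PathFilterStrategy {
  // Keep files whose modification time is before the threshold.
  def accept(fileStatus: FileStatus): Boolean =
    fileStatus.getModificationTime < thresholdMs
}

object PathFilterFactory {
  private def toMillis(ts: String): Long =
    java.sql.Timestamp.valueOf(ts.replace('T', ' ')).getTime

  // Builds one filter per recognized option; a file must pass all of them.
  def create(parameters: Map[String, String]): Seq[PathFilterStrategy] = {
    val after = parameters.get("modifiedafter").map(ts => new ModifiedAfterFilter(toMillis(ts)))
    val before = parameters.get("modifiedbefore").map(ts => new ModifiedBeforeFilter(toMillis(ts)))
    Seq(after, before).flatten
  }
}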
private def checkDisallowedOptions(options: Map[String, String]): Unit = {
  Seq(ModifiedBeforeFilter.PARAM_NAME, ModifiedAfterFilter.PARAM_NAME).foreach { param =>
    if (parameters.contains(param)) {
      throw new IllegalArgumentException(s"option '$param' is not allowed in file stream source")
nit: file stream source
-> file stream sources
?
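For illustration, this guard means a streaming read with either option should fail up front. A hypothetical usage sketch (the path and schema are placeholders, and the exact point at which the error surfaces may differ):

import org.apache.spark.sql.types.{StringType, StructType}

val schema = new StructType().add("value", StringType)
try {
  spark.readStream
    .format("csv")
    .schema(schema)                                  // streaming reads need an explicit schema
    .option("modifiedAfter", "2020-06-15T05:00:00")  // disallowed for streaming sources
    .load("/data/events")                            // placeholder path
} catch {
  case e: IllegalArgumentException =>
    println(e.getMessage)  // expected to name the disallowed option
}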
Kubernetes integration test starting
Kubernetes integration test status failure
Thanks for the review, @maropu! The original PR has been open for months, and I only refactored a bit and fixed the doc. I'll merge this early next week if there are no further comments.
Yea, it looks fine to me. Thanks for the take-over, @HeartSaVioR, and thanks a lot for the valuable contribution, @cchighman!
Test build #131392 has finished for PR 30411 at commit
The build failure is not related. I'll let it go and check again just before the merge.
retest this, please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #131506 has finished for PR 30411 at commit
OK. No further comments, and the tests passed. Merging to master.
Thanks all for reviewing, and thanks again @cchighman for providing great stuff!
I am supportive of this change FWIW. I think it's good to have this feature. |
Does it still need to be merged? |
No, this is merged and will be available in the next minor version, release 3.1.1.
What changes were proposed in this pull request?
Two new options, modifiedBefore and modifiedAfter, are provided, each expecting a value in 'YYYY-MM-DDTHH:mm:ss' format. PartitioningAwareFileIndex considers these options while listing files, just before applying other PathFilters such as pathGlobFilter.
In order to filter file results, a new PathFilter class was derived for this purpose. General housekeeping was performed around the classes extending PathFilter for neatness. It became apparent that support was needed for handling multiple potential path filters, so logic was introduced for this purpose and the associated tests were written.
Why are the changes needed?
When loading files from a data source, there can oftentimes be thousands of files within a given path. In many cases I've seen, we want to start loading from a folder path and ideally begin loading only the files whose modification dates are past a certain point. Out of thousands of potential files, only the ones with modification dates greater than the specified timestamp would be considered. This saves a ton of time automatically and removes significant complexity that would otherwise have to be managed in code.
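To make the saving concrete, a before-and-after sketch (the directory path is a placeholder):

// Without the option, Spark lists and considers every file under the path.
val allDF = spark.read.format("csv").load("/data/events")  // placeholder path

// With modifiedAfter, older files are pruned during listing,
// before any file contents are read.
val recentDF = spark.read.format("csv")
  .option("modifiedAfter", "2020-06-15T05:00:00")
  .load("/data/events")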
Does this PR introduce any user-facing change?
This PR introduces two options that can be used with batch-based Spark file data sources. A documentation update was made to reflect an example and the usage of the new data source options.
Example Usages
Load all CSV files modified after a date:
spark.read.format("csv").option("modifiedAfter", "2020-06-15T05:00:00").load()
Load all CSV files modified before a date:
spark.read.format("csv").option("modifiedBefore", "2020-06-15T05:00:00").load()
Load all CSV files modified between two dates:
spark.read.format("csv").option("modifiedAfter", "2019-01-15T05:00:00").option("modifiedBefore", "2020-06-15T05:00:00").load()
How was this patch tested?
A handful of unit tests were added to support the positive, negative, and edge case code paths.
It's also live in a handful of our Databricks dev environments. (quoted from @cchighman)