Conversation

@dbtsai
Member

@dbtsai dbtsai commented Feb 28, 2020

What changes were proposed in this pull request?

  1. `DataSourceStrategy.scala` is extended to create `org.apache.spark.sql.sources.Filter` from nested expressions.
  2. Translation from nested `org.apache.spark.sql.sources.Filter` to `org.apache.parquet.filter2.predicate.FilterPredicate` is implemented to support nested predicate pushdown for Parquet.
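As a rough illustration of the two bullets above (toy types, not Spark's actual classes), the core idea is that a nested field reference is flattened into a dot-joined path, with any name part that itself contains a dot back-quoted so it is not mistaken for a nesting separator:

```scala
// Toy model (hypothetical names; Spark's real classes live in
// org.apache.spark.sql.sources): a nested reference such as
// person.address.city becomes a source Filter whose attribute is the
// dot-joined field path.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter

// A nested field reference is a sequence of name parts.
case class FieldRef(parts: Seq[String]) {
  // Back-quote any part that itself contains a dot so it is not
  // mistaken for a nesting separator.
  def quoted: String =
    parts.map(p => if (p.contains(".")) s"`$p`" else p).mkString(".")
}

// e.g. EqualTo(FieldRef(Seq("person", "address", "city")).quoted, "SF")
// carries the attribute "person.address.city".
```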

Why are the changes needed?

Better performance for handling nested predicate pushdown.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New tests are added.

@SparkQA

SparkQA commented Feb 28, 2020

Test build #119057 has finished for PR 27728 at commit 79c2b29.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class NestedFilterApi

@dbtsai
Member Author

dbtsai commented Feb 28, 2020

@HyukjinKwon this PR also fixes the old problem that we didn't allow pushing down a column name containing a dot.

@dongjoon-hyun
Member

Thank you, @dbtsai !

@SparkQA

SparkQA commented Feb 28, 2020

Test build #119061 has finished for PR 27728 at commit 4b27c82.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 28, 2020

Test build #119059 has finished for PR 27728 at commit 400aebe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 28, 2020

Test build #119060 has finished for PR 27728 at commit 3a2fb39.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@holdenk holdenk left a comment

Two minor comments from a quick skim, thanks for working on this :)

Contributor

Minor: would this not be 3.1.0?

Member

+1, seems to be 3.1.0

Member Author

Thanks. Going to fix it.

Contributor

I like this simplification 👍

@SparkQA

SparkQA commented Feb 28, 2020

Test build #119102 has finished for PR 27728 at commit 0ae261d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

What if there is a column name containing `.`?

Member

How about passing the schema or top-level output attributes as a parameter, and checking whether the filter is on a nested column?

Member Author

This is a quick implementation. I plan to write a parser for nested column names.

For example, a.b is a nested field, while

`a.b`

is a single field name containing a dot. WDYT?
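A toy splitter illustrating the distinction (hypothetical helper; Spark ultimately reuses `CatalystSqlParser.parseMultipartIdentifier` rather than anything like this, and this sketch ignores escaped backticks inside quoted names):

```scala
import scala.collection.mutable.ArrayBuffer

// Splits "a.b" into Seq("a", "b") but keeps "`a.b`" as the single part Seq("a.b").
object MultipartIdent {
  def parse(name: String): Seq[String] = {
    val parts = ArrayBuffer.empty[String]
    val sb = new StringBuilder
    var inQuotes = false
    for (c <- name) c match {
      case '`'              => inQuotes = !inQuotes // toggle back-quoted section
      case '.' if !inQuotes =>                      // unquoted dot separates parts
        parts += sb.toString; sb.clear()
      case other            => sb += other
    }
    parts += sb.toString
    parts.toSeq
  }
}
```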

Member

SGTM

Member

Nit: I think we can add a test case with more than two levels of nested columns here.

Member Author

I added a test case for a three-level nested field.

@dbtsai dbtsai changed the title [SPARK-17636][SQL] Nested Field Predicate Pushdown for Parquet [SPARK-17636][SQL] Nested Column Predicate Pushdown for Parquet Feb 28, 2020


Member

So within each segment of the path, a dot is allowed?

Member Author

Yes. But this is a private API in Parquet.

Member

Based on the previous comment, do we still need to prevent `a.b`.c.d?

Member Author

No, we don't have to. It should work.

@SparkQA

SparkQA commented Feb 29, 2020

Test build #119107 has finished for PR 27728 at commit 2da3737.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 29, 2020

Test build #119104 has finished for PR 27728 at commit 6972b54.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 29, 2020

Test build #119106 has finished for PR 27728 at commit a1b0c32.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class NestedFilterApi

@SparkQA

SparkQA commented Feb 29, 2020

Test build #119109 has finished for PR 27728 at commit c706fe5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 29, 2020

Test build #119108 has finished for PR 27728 at commit 784d372.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class NestedFilterApi

Contributor

Typo: should be unquote

Member Author

Fixed. Thanks.

Contributor

You can copy how this is done in LogicalExpressions.

val parser = new CatalystSqlParser(SQLConf.get)
val nameParts = parser.parseMultipartIdentifier(name)

You can also use the parseReference method if you want to get a FieldReference.

Member Author

Thanks for the tip. I was planning to implement this myself.

Contributor

@rdblue rdblue Mar 2, 2020

This shouldn't create a new parser each time it is used. Can you create a private field for the parser?

Contributor

Also, is there a better place for this? If there is a common need for a parser for names, then maybe the SparkSqlParser object should expose methods to parse names and have an internal instance?

Member Author

How about

  private lazy val catalystSqlParser = new CatalystSqlParser(SQLConf.get)

  def unquote(name: String): Seq[String] = {
    catalystSqlParser.parseMultipartIdentifier(name)
  }

for now until we find a good place to have an internal instance of catalystSqlParser?

The concern is that if we keep it in a singleton, what happens if the SQLConf changes?
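The trade-off being discussed can be sketched with stand-in types (hypothetical `Conf`/`Parser`, not Spark's): a `lazy val` captures the conf at first access, so a singleton would pin whatever conf was active then, while re-creating the parser per call always observes the current conf:

```scala
// Stand-ins for SQLConf and CatalystSqlParser (hypothetical, for illustration;
// the toy parser splits on dots only and ignores backticks).
final case class Conf(caseSensitive: Boolean)
final class Parser(val conf: Conf) {
  def parseMultipartIdentifier(name: String): Seq[String] = name.split('.').toSeq
}

object ParserHolder {
  var currentConf: Conf = Conf(caseSensitive = false) // stand-in for SQLConf.get

  // Captured once, at first access: later conf changes are not observed.
  lazy val cached: Parser = new Parser(currentConf)

  // Re-created per call: always sees the current conf, at some allocation cost.
  def fresh(): Parser = new Parser(currentConf)
}
```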

Contributor

Minor: other places (e.g. Collection#contains) normally use contains instead of contain.

@rdblue
Contributor

rdblue commented Mar 2, 2020

Overall, this looks good to me. Only a couple minor points.

@SparkQA

SparkQA commented Mar 26, 2020

Test build #120422 has finished for PR 27728 at commit 5fd97c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in cb0db21 Mar 27, 2020
cloud-fan pushed a commit that referenced this pull request Mar 27, 2020
…2] Nested Column Predicate Pushdown for Parquet

### What changes were proposed in this pull request?
1. `DataSourceStrategy.scala` is extended to create `org.apache.spark.sql.sources.Filter` from nested expressions.
2. Translation from nested `org.apache.spark.sql.sources.Filter` to `org.apache.parquet.filter2.predicate.FilterPredicate` is implemented to support nested predicate pushdown for Parquet.

### Why are the changes needed?
Better performance for handling nested predicate pushdown.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New tests are added.

Closes #27728 from dbtsai/SPARK-17636.

Authored-by: DB Tsai <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit cb0db21)
Signed-off-by: Wenchen Fan <[email protected]>
@dongjoon-hyun
Member

dongjoon-hyun commented Mar 27, 2020

Hi, @cloud-fan .
Could you fix the build failure in branch-3.0? It's a conf version issue that has occurred multiple times before.

[ERROR] [Error] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:1867:
value version is not a member of org.apache.spark.internal.config.ConfigBuilder

@dongjoon-hyun
Member

@HyukjinKwon
Member

Thanks @dongjoon-hyun.

@dbtsai
Member Author

dbtsai commented Mar 28, 2020

Thank you, @cloud-fan @HyukjinKwon @rdblue @viirya @dongjoon-hyun and @holdenk, etc for reviewing!!!

@dbtsai
Member Author

dbtsai commented Mar 28, 2020

Ping @MaxGekk, are you interested in working on the ORC version of this, since you've been working in this area a bit recently?

@gatorsmile
Member

@dbtsai Great work! Could you share some perf numbers?

@dbtsai
Member Author

dbtsai commented Apr 1, 2020

It really depends on the data and whether we can skip most of the row groups given a predicate on a nested field. For our production jobs, we see a 20x speed-up in wall-clock time and 8x less data being read. I will create a synthetic benchmark for this feature. Thanks!

@gatorsmile
Member

Do you know the perf numbers for the worst case (e.g., when no rows can be filtered out)?

@HyukjinKwon
Member

HyukjinKwon commented Apr 1, 2020

It's true that it depends on the nature of the data and workloads, of course, but I think it makes sense to at least have a benchmark, including (roughly) the best and worst cases.

@dbtsai
Member Author

dbtsai commented Apr 1, 2020

For the worst case, we don't see extra overhead on the Parquet side. I agree that we should have a benchmark suite covering the best case (filtering out everything in a row group) and the worst case (nothing filtered out in a row group).

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
case _ => None
}
helper(e)
helper(e).map(_.quoted)
Member

Sorry for my late review. This sounds like an API-breaking change.

Member

Can we limit this change to some specific data sources? For example, parquet only?

Member

I am afraid it might break released third-party connectors that may not be able to handle quoted column names.

Member

@HyukjinKwon HyukjinKwon Apr 20, 2020

Yes, this was pointed out and discussed at #27728 (comment) and #27728 (comment).
@HeartSaVioR is working on it at this JIRA SPARK-31365. I turned the JIRA to be a blocker for Spark 3.0.

Contributor

I'm sorry, I've been working on my own (prioritized) task and I'm afraid I can't pick this up soon. Please take it over if anyone has an idea for implementing this.

Member

@viirya Are you interested in this follow up?

Member

I can try looking at this this week. If anyone picks it up before me, that's fine too.

Member

Thank you for fixing this 3.0 blocker.

import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.MultipartIdentifierHelper
def helper(e: Expression): Option[Seq[String]] = e match {
case a: Attribute =>
if (nestedPredicatePushdownEnabled || !a.name.contains(".")) {
Member

Could you explain what this condition means?

cloud-fan pushed a commit that referenced this pull request May 6, 2020
### What changes were proposed in this pull request?

This patch proposes to replace `NESTED_PREDICATE_PUSHDOWN_ENABLED` with `NESTED_PREDICATE_PUSHDOWN_V1_SOURCE_LIST` which can configure which v1 data sources are enabled with nested predicate pushdown.

### Why are the changes needed?

We added a nested predicate pushdown feature configured by `NESTED_PREDICATE_PUSHDOWN_ENABLED`. However, this config is all-or-nothing and applies to all data sources.

In order not to introduce an API-breaking change after enabling nested predicate pushdown, we'd like to enable nested predicate pushdown per data source. Please also refer to the comments #27728 (comment).

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added/Modified unit tests.

Closes #28366 from viirya/SPARK-31365.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
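For context, the per-source switch introduced by this follow-up can be used roughly as below. The conf key shown is an assumption based on SPARK-31365 (the SQL conf that `NESTED_PREDICATE_PUSHDOWN_V1_SOURCE_LIST` maps to); verify it against SQLConf in your Spark version:

```scala
// Hedged usage sketch (assumes a running SparkSession named `spark`, and that
// NESTED_PREDICATE_PUSHDOWN_V1_SOURCE_LIST maps to the key below).
spark.conf.set(
  "spark.sql.optimizer.nestedPredicatePushdown.supportedFileSources",
  "parquet") // enable nested predicate pushdown for the Parquet v1 source only
```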
cloud-fan pushed a commit that referenced this pull request May 6, 2020
(cherry picked from commit 4952f1a)
Signed-off-by: Wenchen Fan <[email protected]>
HyukjinKwon added a commit that referenced this pull request Jul 1, 2020
…potential conflicts in dev

### What changes were proposed in this pull request?

This PR proposes to partially revert the tests and some code at #27728, without touching any behaviours.

Most of the changes in tests go back to before #27728, by combining `withNestedDataFrame` and `withParquetDataFrame`.

Basically, it addresses the comments #27728 (comment), and my own comment in another PR at #28761 (comment)

### Why are the changes needed?

For maintenance purposes and to avoid potential conflicts during backports, and also in case other code matches this.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested.

Closes #28955 from HyukjinKwon/SPARK-25556-followup.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Jul 1, 2020
(cherry picked from commit 8194d9e)
Signed-off-by: HyukjinKwon <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Aug 7, 2020
### What changes were proposed in this pull request?

We added nested column predicate pushdown for Parquet in #27728. This patch extends the feature support to ORC.

### Why are the changes needed?

Extending the feature to ORC for feature parity. Better performance for handling nested predicate pushdown.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #28761 from viirya/SPARK-25557.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
LuciferYang added a commit that referenced this pull request Aug 2, 2023
… `Filter`

### What changes were proposed in this pull request?
This PR aims to remove the `private[sql]` function `containsNestedColumn` from `org.apache.spark.sql.sources.Filter`.
This function was introduced by #27728 to avoid nested predicate pushdown for ORC.
After #28761, ORC also supports nested column predicate pushdown, so this function has become unused.

### Why are the changes needed?
Remove the unused `private[sql]` function `containsNestedColumn`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #42239 from LuciferYang/SPARK-44607.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: yangjie01 <[email protected]>