[SPARK-17091] Add rule to convert IN predicate to equivalent Parquet filter. #18424
Conversation
Force-pushed from e98fbd8 to 8c443a7
Indeed it appears to be. The resolution from my previous PR was that per @HyukjinKwon's benchmarks, performing the disjunction in Spark was slightly more performant than pushing it down to Parquet. I haven't been following Spark closely these past 12 months so things may have changed. @ptkool did you do any profiling that would lead you to believe pushing the filter to Parquet leads to perf improvements?
@a10y Yes. Please have a look at my comments in https://issues.apache.org/jira/browse/SPARK-21218.
(I would like to suggest fixing the JIRA ID in the PR title so that it points to SPARK-17091.)
Have you done actual benchmarks to validate that this is a perf improvement? |
Always push down? Should we also consider the number of elements in `values`? What is the performance impact when the number of values is around 10 or more?
You can eliminate the `var` by using `reduceLeft`.
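As a rough sketch of that suggestion (not the PR's actual code; the column name, element type, and helper below are hypothetical), the OR chain over the IN values can be built with `map` plus `reduceLeft` instead of a mutable `var`:

```scala
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}

// Hypothetical helper: map each IN value to an equality predicate and fold
// the results with reduceLeft(or), so no mutable accumulator is needed.
def inToParquetFilter(columnName: String, values: Seq[Int]): FilterPredicate = {
  require(values.nonEmpty, "IN with no values cannot be converted")
  val column = FilterApi.intColumn(columnName)
  values
    .map(v => FilterApi.eq(column, Int.box(v)): FilterPredicate)
    .reduceLeft((left, right) => FilterApi.or(left, right))
}
```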
ok to test
Now we will have this for row group filtering in most cases after #15049. I believe it makes sense in this case.
Test build #83844 has finished for PR 18424 at commit
I believe I need to cc @jiangxb1987 and @viirya too, who actively reviewed my PR.
cc @liancheng too, who I know is insightful in this area.
Please rebase this PR to the latest master. Thanks!
I guess this is inactive now.
@ptkool are you still tracking this at all?
@a10y Yes, I'm still tracking this. |
Force-pushed from 8c443a7 to 62f273b
ok to test
Test build #85573 has finished for PR 18424 at commit
ok to test
Test build #91614 has finished for PR 18424 at commit
@ptkool Are you still working on this?
@wangyum, can you take over this? It seems it has been inactive for a long time.
…quet filter

## What changes were proposed in this pull request?

The original PR is: apache#18424

Add a new optimizer rule to convert an IN predicate to an equivalent Parquet filter and add `spark.sql.parquet.pushdown.inFilterThreshold` to control limit thresholds. Different data types have different limit thresholds; this is a copy of the data for reference:

Type | limit threshold
-- | --
string | 370
int | 210
long | 285
double | 270
float | 220
decimal | Won't provide better performance before [SPARK-24549](https://issues.apache.org/jira/browse/SPARK-24549)

## How was this patch tested?

unit tests and manual tests

Author: Yuming Wang <[email protected]>

Closes apache#21603 from wangyum/SPARK-17091.
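As a hedged illustration of how that threshold might be exercised, the snippet below sets the property key exactly as quoted in the commit message above (the name that finally shipped in Spark may differ); the path and query are purely placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the property key is copied verbatim from the commit message
// above and may not match the released config name; the Parquet path and the
// IN query are hypothetical placeholders.
val spark = SparkSession.builder().appName("in-pushdown-demo").getOrCreate()
spark.conf.set("spark.sql.parquet.pushdown.inFilterThreshold", "10")
spark.read.parquet("/path/to/parquet/table")
  .where("id IN (1, 2, 3)")
  .show()
```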
What changes were proposed in this pull request?
Add a new optimizer rule to convert an IN predicate to an equivalent Parquet filter.
How was this patch tested?
Tested using unit tests, integration tests, and manual tests.
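To make the one-line description above concrete, here is a minimal, hypothetical sketch of what such an optimizer rule could look like (illustrative names and guards only, not the rule actually added by this PR): rewrite `a IN (v1, ..., vn)` into `a = v1 OR ... OR a = vn`, a form the existing Parquet filter conversion already knows how to push down.

```scala
import org.apache.spark.sql.catalyst.expressions.{EqualTo, Expression, In, Or}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule name and guard conditions; a sketch of the idea only.
object RewriteInAsOrChain extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // Only rewrite when every element of the IN list is foldable
    // (i.e. a literal-like value that a data source filter could accept).
    case In(value, list) if list.nonEmpty && list.forall(_.foldable) =>
      list.map(v => EqualTo(value, v): Expression).reduceLeft((l, r) => Or(l, r))
  }
}
```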