Conversation

@ptkool
Contributor

@ptkool ptkool commented Jun 26, 2017

What changes were proposed in this pull request?

Add a new optimizer rule to convert an IN predicate to an equivalent Parquet filter.
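For illustration, a minimal sketch of the kind of rewrite being proposed, assuming the data source `Filter` API; the function name `rewriteIn` is illustrative and not taken from this PR's code:

```scala
import org.apache.spark.sql.sources.{EqualTo, Filter, In, Or}

// Sketch only: expand an IN predicate into a disjunction of equality filters,
// which the Parquet data source already knows how to push down.
def rewriteIn(filter: Filter): Filter = filter match {
  case In(attribute, values) if values.nonEmpty =>
    // a IN (v1, v2, v3)  becomes  a = v1 OR a = v2 OR a = v3
    values.map(v => EqualTo(attribute, v): Filter).reduceLeft(Or(_, _))
  case other => other
}
```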

How was this patch tested?

Tested using unit tests, integration tests, and manual tests.

@ptkool ptkool force-pushed the convert_in_predicate_for_parquet branch from e98fbd8 to 8c443a7 Compare June 26, 2017 19:06
@HyukjinKwon
Member

Is it a duplicate of #14671? cc @a10y.

@a10y

a10y commented Jun 26, 2017

Indeed it appears to be. The resolution from my previous PR was that per @HyukjinKwon's benchmarks, performing the disjunction in Spark was slightly more performant than pushing it down to Parquet.

I haven't been following Spark closely these past 12 months so things may have changed.

@ptkool did you do any profiling that would lead you to believe pushing the filter to Parquet leads to perf improvements?

@ptkool
Contributor Author

ptkool commented Jun 27, 2017

@a10y Yes. Please have a look at my comments in https://issues.apache.org/jira/browse/SPARK-21218.

@HyukjinKwon
Member

(I would like to suggest fixing the JIRA in the PR title to point to SPARK-17091.)

@ptkool ptkool changed the title [SPARK-21218] Add rule to convert IN predicate to equivalent Parquet filter. [SPARK-17091] Add rule to convert IN predicate to equivalent Parquet filter. Jun 28, 2017
@rxin
Contributor

rxin commented Jun 30, 2017

Have you done actual benchmarks to validate that this is a perf improvement?

@ptkool
Contributor Author

ptkool commented Jul 5, 2017

Member

Always push down? Should we also consider the number of elements in `values`? What is the performance impact when the number of values is around 10 or more?


You can eliminate the var by using `reduceLeft`.
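A hedged sketch of the suggested change (the helper names and signatures below are illustrative, not the PR's actual code):

```scala
import org.apache.spark.sql.sources.{EqualTo, Filter, Or}

// Before: accumulate the OR chain in a var (assumes values is non-empty).
def toOrChainWithVar(attr: String, values: Seq[Any]): Filter = {
  var result: Filter = EqualTo(attr, values.head)
  values.tail.foreach(v => result = Or(result, EqualTo(attr, v)))
  result
}

// After: fold the equality filters with reduceLeft and drop the var.
def toOrChainWithReduce(attr: String, values: Seq[Any]): Filter =
  values.map(v => EqualTo(attr, v): Filter).reduceLeft(Or(_, _))
```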

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

Now we will have this for row group filtering in most cases after #15049. I believe it makes sense in this case.

@SparkQA

SparkQA commented Nov 14, 2017

Test build #83844 has finished for PR 18424 at commit 8c443a7.

  • This patch fails some tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

I believe I need to cc @jiangxb1987 and @viirya too, who actively reviewed my PR.

@HyukjinKwon
Member

cc @liancheng too, who I know is insightful in this area.

@jiangxb1987
Contributor

Please rebase this PR to the latest master. Thanks!

@viirya
Member

viirya commented Nov 16, 2017

I guess this is inactive now.

@a10y

a10y commented Dec 1, 2017

@ptkool are you still tracking this at all?

@ptkool
Contributor Author

ptkool commented Dec 4, 2017

@a10y Yes, I'm still tracking this.

@ptkool ptkool force-pushed the convert_in_predicate_for_parquet branch from 8c443a7 to 62f273b Compare December 4, 2017 13:25
@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Jan 1, 2018

Test build #85573 has finished for PR 18424 at commit 62f273b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Jun 9, 2018

Test build #91614 has finished for PR 18424 at commit 62f273b.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@wangyum
Member

wangyum commented Jun 19, 2018

@ptkool Are you still working on this?

@HyukjinKwon
Member

@wangyum, can you take over this? It seems it's been inactive for a long time.

HyukjinKwon pushed a commit to HyukjinKwon/spark that referenced this pull request Jul 14, 2018
…quet filter

## What changes were proposed in this pull request?

The original PR is apache#18424.

Add a new optimizer rule to convert an IN predicate to an equivalent Parquet filter, and add `spark.sql.parquet.pushdown.inFilterThreshold` to control the limit thresholds. Different data types have different limit thresholds; the table below is a copy of the data for reference (a usage sketch follows the table):

Type | limit threshold
-- | --
string | 370
int | 210
long | 285
double | 270
float | 220
decimal | Won't provide better performance before [SPARK-24549](https://issues.apache.org/jira/browse/SPARK-24549)
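A minimal usage sketch for the threshold config described above (the config name is as given in this commit message; the app name, path, and IN list are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("in-filter-pushdown-sketch")
  // Push IN predicates down to Parquet only when the value list has at most
  // 10 elements; larger lists keep the filter evaluation on the Spark side.
  .config("spark.sql.parquet.pushdown.inFilterThreshold", "10")
  .getOrCreate()

spark.read.parquet("/path/to/parquet/table")
  .where("id IN (1, 2, 3)")
  .show()
```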

## How was this patch tested?
unit tests and manual tests

Author: Yuming Wang <[email protected]>

Closes apache#21603 from wangyum/SPARK-17091.
@asfgit asfgit closed this in b8788b3 Aug 21, 2018
@ptkool ptkool deleted the convert_in_predicate_for_parquet branch January 18, 2020 12:14