[SPARK-17091] Add rule to convert IN predicate to equivalent Parquet filter. #18424
Conversation
Force-pushed from e98fbd8 to 8c443a7
Indeed it appears to be. The resolution from my previous PR was that per @HyukjinKwon's benchmarks, performing the disjunction in Spark was slightly more performant than pushing it down to Parquet. I haven't been following Spark closely these past 12 months so things may have changed. @ptkool did you do any profiling that would lead you to believe pushing the filter to Parquet leads to perf improvements?
@a10y Yes. Please have a look at my comments in https://issues.apache.org/jira/browse/SPARK-21218.
(I would like to suggest fixing the JIRA ID in the PR title so that it points to SPARK-17091.)
Have you done actual benchmarks to validate that this is a perf improvement? |
Always push down? Should we also consider the number of elements in `values`? What is the performance impact when the number of values is around 10 or more?
You can eliminate the `var` by using `reduceLeft`.
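As a rough sketch of that suggestion (not the PR's actual code; the column name, element type, and helper below are hypothetical), the OR chain over the IN values can be built with `map` plus `reduceLeft` instead of a mutable `var`:

```scala
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}

// Hypothetical helper: map each IN value to an equality predicate and fold
// the results with reduceLeft(or), so no mutable accumulator is needed.
def inToParquetFilter(columnName: String, values: Seq[Int]): FilterPredicate = {
  require(values.nonEmpty, "IN with no values cannot be converted")
  val column = FilterApi.intColumn(columnName)
  values
    .map(v => FilterApi.eq(column, Int.box(v)): FilterPredicate)
    .reduceLeft((left, right) => FilterApi.or(left, right))
}
```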
ok to test
Now we will have this for row group filtering in most cases after #15049. I believe it makes sense in this case.
Test build #83844 has finished for PR 18424 at commit
I believe I need to cc @jiangxb1987 and @viirya too, who actively reviewed my PR.
cc @liancheng too, who I know is insightful in this area.
Please rebase this PR to the latest master. Thanks!
I guess this is inactive now.
@ptkool are you still tracking this at all?
@a10y Yes, I'm still tracking this. |
Force-pushed from 8c443a7 to 62f273b
ok to test
Test build #85573 has finished for PR 18424 at commit
ok to test
Test build #91614 has finished for PR 18424 at commit
@ptkool Are you still working on this?
@wangyum, can you take over this? It seems it has been inactive for a long time.
…quet filter

## What changes were proposed in this pull request?

The original PR is: apache#18424

Add a new optimizer rule to convert an IN predicate to an equivalent Parquet filter and add `spark.sql.parquet.pushdown.inFilterThreshold` to control limit thresholds. Different data types have different limit thresholds; this is a copy of the data for reference:

Type | limit threshold
-- | --
string | 370
int | 210
long | 285
double | 270
float | 220
decimal | Won't provide better performance before [SPARK-24549](https://issues.apache.org/jira/browse/SPARK-24549)

## How was this patch tested?

unit tests and manual tests

Author: Yuming Wang <[email protected]>

Closes apache#21603 from wangyum/SPARK-17091.
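As a hedged illustration of how that threshold might be exercised, the snippet below sets the property key exactly as quoted in the commit message above (the name that finally shipped in Spark may differ); the path and query are purely placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the property key is copied verbatim from the commit message
// above and may not match the released config name; the Parquet path and the
// IN query are hypothetical placeholders.
val spark = SparkSession.builder().appName("in-pushdown-demo").getOrCreate()
spark.conf.set("spark.sql.parquet.pushdown.inFilterThreshold", "10")
spark.read.parquet("/path/to/parquet/table")
  .where("id IN (1, 2, 3)")
  .show()
```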
What changes were proposed in this pull request?
Add a new optimizer rule to convert an IN predicate to an equivalent Parquet filter.
How was this patch tested?
Tested using unit tests, integration tests, and manual tests.
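To make the one-line description above concrete, here is a minimal, hypothetical sketch of what such an optimizer rule could look like (illustrative names and guards only, not the rule actually added by this PR): rewrite `a IN (v1, ..., vn)` into `a = v1 OR ... OR a = vn`, a form the existing Parquet filter conversion already knows how to push down.

```scala
import org.apache.spark.sql.catalyst.expressions.{EqualTo, Expression, In, Or}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule name and guard conditions; a sketch of the idea only.
object RewriteInAsOrChain extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // Only rewrite when every element of the IN list is foldable
    // (i.e. a literal-like value that a data source filter could accept).
    case In(value, list) if list.nonEmpty && list.forall(_.foldable) =>
      list.map(v => EqualTo(value, v): Expression).reduceLeft((l, r) => Or(l, r))
  }
}
```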