[SPARK-37518][SQL] Inject an early scan pushdown rule #34779
Conversation
Kubernetes integration test starting
Kubernetes integration test status failure
LuciferYang left a comment:
LGTM(non-binding)
Generally speaking, it is useful. @beliefer Can you give a simple example to show an actual use case?
Test build #145854 has finished for PR 34779 at commit
sunchao left a comment:
Looks reasonable, cc @cloud-fan. Also cc @aokolnychyi @RussellSpitzer, I remember this might be useful for Iceberg too?
| test("SPARK-37518: inject a early scan push down rule") { |
nit: I think we only require JIRA id for bug fixes and regressions?
Just referencing the existing test:
spark/sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala (line 92 at a3bc014)
test("SPARK-33621: inject a pre CBO rule") {
 * The injected rules will be executed once after the operator optimization batch and
 * after any push down optimization rules.
 */
def injectEarlyScanPushDownRules(builder: RuleBuilder): Unit = {
nit: injectEarlyScanPushDownRules -> injectEarlyScanPushDownRule
Spark SQL supports aggregate pushdown only for standard aggregate functions, but some databases have non-standard aggregate functions. This PR opens a door for flexible customization. Some databases are also a good fit for other pushdown types.
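For illustration, wiring a custom rule into the proposed extension point might look like this. A minimal sketch, assuming the API shape from this PR; `NonStandardAggPushDown` and its body are hypothetical:

```scala
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule that pushes a vendor-specific (non-standard) aggregate
// function down to the data source after the built-in pushdown has run.
case class NonStandardAggPushDown(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    // Real logic would match an Aggregate over a DSv2 relation and translate
    // the non-standard function; left as identity for the sketch.
    plan
  }
}

class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    // Method name as proposed in this PR; the case class companion satisfies
    // the RuleBuilder type (SparkSession => Rule[LogicalPlan]).
    extensions.injectEarlyScanPushDownRule(NonStandardAggPushDown)
  }
}
```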
Kubernetes integration test starting
Thanks for your explanation @beliefer
Kubernetes integration test status failure
Test build #145870 has finished for PR 34779 at commit
cc @cloud-fan and @maryannxue FYI
 * The injected rules will be executed once after the operator optimization batch and
 * after any push down optimization rules.
 */
def injectEarlyScanPushDownRule(builder: RuleBuilder): Unit = {
Let's update the classdoc, which currently only mentions:
* This current provides the following extension points:
*
* <ul>
* <li>Analyzer Rules.</li>
* <li>Check Analysis Rules.</li>
* <li>Optimizer Rules.</li>
* <li>Pre CBO Rules.</li>
* <li>Planning Strategies.</li>
* <li>Customized Parser.</li>
* <li>(External) Catalog listeners.</li>
* <li>Columnar Rules.</li>
* <li>Adaptive Query Stage Preparation Rules.</li>
* </ul>
We also need to clarify how this is different from Pre CBO Rules.
Since we already have this extension point in
OK
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #145986 has finished for PR 34779 at commit
| * <li>Check Analysis Rules.</li> | ||
| * <li>Optimizer Rules.</li> | ||
| * <li>Pre CBO Rules.</li> | ||
| * <li>Early Scan Push-Down</li> |
After so many discussions in #30808, I'm really worried about the naming of this new extension point.
In general, this new extension point allows people to inject custom data source operator pushdown rules, which run after the built-in ones. But then the existing Pre CBO Rules becomes a confusing name, as the pushdown rules are also pre-CBO.
We may need more time to think about the naming, or to think about whether we really need custom data source pushdown rules.
I feel that the name Pre-CBO Rules is too broad, and its meaning in the comment is not the same. In my opinion, we should not limit the flexibility because of this name.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
OK. Let me close it.
@beliefer @cloud-fan we found it useful to inject custom early pushdown rules in practice, such as to rewrite some transform expression that's not yet identified by Spark. Do you think it's possible to resume this PR?
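A rule of the kind described above might look roughly like this. A hypothetical sketch: the detection and rewriting helpers are placeholders, not real Spark APIs:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical: rewrite transform expressions that Spark's built-in pushdown
// does not recognize into an equivalent form the data source can accept.
case class MyTransformRewrite(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
    case e: Expression if isUnrecognizedTransform(e) => rewriteToPushableForm(e)
  }

  // Placeholders for source-specific detection and rewriting logic.
  private def isUnrecognizedTransform(e: Expression): Boolean = false
  private def rewriteToPushableForm(e: Expression): Expression = e
}
```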
What changes were proposed in this pull request?
Currently, Spark supports pushing down filters, aggregates, and limits. All of this work is done by V2ScanRelationPushDown, but V2ScanRelationPushDown has many limitations. Users want to apply a custom pushdown rule after V2ScanRelationPushDown has failed.

Why are the changes needed?
Makes it easy for users to apply custom pushdown rules.
Does this PR introduce any user-facing change?
'Yes'.
Users can inject custom early scan pushdown rules.
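As with other session extensions, such rules would typically be enabled through the standard extensions config. A usage sketch, where `com.example.MyExtensions` is a placeholder class implementing `SparkSessionExtensions => Unit`:

```scala
import org.apache.spark.sql.SparkSession

// Load the extensions class through the existing spark.sql.extensions config.
val spark = SparkSession.builder()
  .config("spark.sql.extensions", "com.example.MyExtensions")
  .getOrCreate()
```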
How was this patch tested?
New tests.