[SPARK-37518][SQL] Inject an early scan pushdown rule #34779

beliefer · 2021-12-02T11:01:55Z

What changes were proposed in this pull request?

Currently, Spark supports pushing down filters, aggregates, and limits. All of this work is done by V2ScanRelationPushDown.
But V2ScanRelationPushDown has many limitations.
Users want to apply custom pushdown rules for the cases that V2ScanRelationPushDown fails to handle.

Why are the changes needed?

They make it easy for users to apply custom pushdown rules.

Does this PR introduce any user-facing change?

Yes.
Users can inject custom early scan pushdown rules.

How was this patch tested?

New tests.
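
As an illustration of what injecting a custom rule could look like, here is a minimal sketch. It assumes Spark 3.1 or later on the classpath. `MyPushDownRule`, `EarlyPushDownExtensions`, and `EarlyPushDownDemo` are hypothetical names, and the rule is a no-op placeholder. Because `injectEarlyScanPushDownRule` is only proposed by this PR, the sketch registers the rule through the existing `injectPreCBORule` (which also takes a `RuleBuilder` but runs at a different point in the optimizer) and shows the proposed call as a comment.

```scala
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical placeholder rule: a real rule would rewrite scan nodes here.
case class MyPushDownRule(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Hypothetical extensions class that registers the rule with the session.
class EarlyPushDownExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    // Existing extension point, used here only as a stand-in.
    extensions.injectPreCBORule(session => MyPushDownRule(session))
    // Proposed by this PR (assumption: same RuleBuilder parameter type):
    // extensions.injectEarlyScanPushDownRule(session => MyPushDownRule(session))
  }
}

object EarlyPushDownDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .withExtensions(new EarlyPushDownExtensions)
      .getOrCreate()
    try {
      // Any query goes through the optimizer, so the injected rule is applied.
      println(spark.range(10).filter("id > 5").count())
    } finally {
      spark.stop()
    }
  }
}
```

The same class could also be loaded by setting the `spark.sql.extensions` configuration to its fully qualified name instead of calling `withExtensions`.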

SparkQA · 2021-12-02T12:15:31Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50329/

SparkQA · 2021-12-02T13:11:38Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50329/

LuciferYang

LGTM(non-binding)

Generally speaking, it is useful. @beliefer Can you give a simple example to show the actual use case?

SparkQA · 2021-12-02T16:13:44Z

Test build #145854 has finished for PR 34779 at commit a3bc014.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sunchao

Looks reasonable, cc @cloud-fan. Also cc @aokolnychyi @RussellSpitzer, I remember this might be useful for Iceberg too?

sunchao · 2021-12-02T21:18:49Z

sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala

    }
  }

+  test("SPARK-37518: inject a early scan push down rule") {


nit: I think we only require JIRA id for bug fixes and regressions?

beliefer

I just followed an existing test as a reference:

spark/sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala

Line 92 in a3bc014

test("SPARK-33621: inject a pre CBO rule") {

sunchao · 2021-12-02T21:19:28Z

sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala

+   * The injected rules will be executed once after the operator optimization batch and
+   * after any push down optimization rules.
+   */
+  def injectEarlyScanPushDownRules(builder: RuleBuilder): Unit = {


nit: injectEarlyScanPushDownRules -> injectEarlyScanPushDownRule

beliefer · 2021-12-03T00:49:19Z

LGTM(non-binding)

Generally speaking, it is useful. @beliefer Can you give a simple example to show the actual use case?

Spark SQL supports aggregate pushdown only for standard aggregate functions. But some databases have non-standard aggregate functions, and this PR opens the door to flexible customization for them.

Some databases are also well suited to other kinds of pushdown.

SparkQA · 2021-12-03T02:23:50Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50345/

LuciferYang · 2021-12-03T02:43:58Z

Thanks for your explanation @beliefer

SparkQA · 2021-12-03T03:18:27Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50345/

SparkQA · 2021-12-03T07:06:11Z

Test build #145870 has finished for PR 34779 at commit 746278a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2021-12-03T10:15:03Z

cc @cloud-fan and @maryannxue FYI

cloud-fan · 2021-12-06T15:53:14Z

sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala

+   * The injected rules will be executed once after the operator optimization batch and
+   * after any push down optimization rules.
+   */
+  def injectEarlyScanPushDownRule(builder: RuleBuilder): Unit = {


Let's update the classdoc, which only mentions

 * This current provides the following extension points:
 *
 * <ul>
 * <li>Analyzer Rules.</li>
 * <li>Check Analysis Rules.</li>
 * <li>Optimizer Rules.</li>
 * <li>Pre CBO Rules.</li>
 * <li>Planning Strategies.</li>
 * <li>Customized Parser.</li>
 * <li>(External) Catalog listeners.</li>
 * <li>Columnar Rules.</li>
 * <li>Adaptive Query Stage Preparation Rules.</li>
 * </ul>

We also need to clarify how this is different from Pre CBO Rules.

cloud-fan · 2021-12-06T15:55:01Z

Since we already have this extension point in BaseSessionStateBuilder, I'm OK to expose it in the developer API. My only concern is that we should document it clearly.

beliefer · 2021-12-07T10:18:09Z

Since we already have this extension point in BaseSessionStateBuilder, I'm OK to expose it in the developer API. My only concern is that we should document it clearly.

OK

SparkQA · 2021-12-07T11:46:47Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50462/

SparkQA · 2021-12-07T12:32:36Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50462/

SparkQA · 2021-12-07T15:28:39Z

Test build #145986 has finished for PR 34779 at commit 9596820.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-12-07T16:18:27Z

sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala

 * <li>Check Analysis Rules.</li>
 * <li>Optimizer Rules.</li>
 * <li>Pre CBO Rules.</li>
+ * <li>Early Scan Push-Down</li>


After so many discussions in #30808, I'm really worried about the naming of this new extension point.

In general, this new extension point allows people to inject custom data source operator pushdown rules, which run after the built-in ones. But then the existing "Pre CBO Rules" becomes a confusing name, as the pushdown rules are also pre-CBO.

We may need more time to think about the naming, or to think about whether we really need custom data source pushdown rules.

beliefer

I feel that the name Pre-CBO Rules is too broad and does not match the meaning described in its comment. In my opinion, we should not limit flexibility because of this name.
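
The ordering discussed in this thread (custom pushdown rules run after the built-in ones, and both run before cost-based optimization) can be pictured with a toy model. This is plain Scala, not Spark code: `Rule`, `Batch`, `PushDownOrderSketch`, and the batch names are simplified stand-ins chosen for illustration, and each rule only records its own name instead of rewriting a plan.

```scala
object PushDownOrderSketch {
  // Toy "plan": the ordered list of rule names applied so far.
  type Plan = Vector[String]

  final case class Rule(name: String) {
    def apply(plan: Plan): Plan = plan :+ name
  }

  final case class Batch(name: String, rules: Seq[Rule])

  // Stand-in for the built-in pushdown rules.
  val builtInPushDown: Seq[Rule] = Seq(Rule("V2ScanRelationPushDown"))

  // Custom rules are appended after the built-in ones, as described in the thread.
  def earlyScanPushDownBatch(custom: Seq[Rule]): Batch =
    Batch("Early Scan Push-Down", builtInPushDown ++ custom)

  // Runs every batch in order, and every rule of a batch in order.
  def execute(batches: Seq[Batch], plan: Plan = Vector.empty): Plan =
    batches.foldLeft(plan)((p, batch) => batch.rules.foldLeft(p)((q, rule) => rule(q)))

  def demoBatches: Seq[Batch] = Seq(
    Batch("Operator Optimization", Seq(Rule("OperatorOptimization"))),
    earlyScanPushDownBatch(Seq(Rule("MyCustomPushDown"))),
    Batch("Cost-Based Optimization", Seq(Rule("CostBasedJoinReorder")))
  )

  def main(args: Array[String]): Unit = {
    // prints: OperatorOptimization -> V2ScanRelationPushDown -> MyCustomPushDown -> CostBasedJoinReorder
    println(execute(demoBatches).mkString(" -> "))
  }
}
```

In this picture both the built-in and the custom pushdown rules are "pre-CBO" in the literal sense, which is the naming overlap raised above.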

github-actions · 2022-03-25T00:16:00Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

beliefer · 2022-03-25T00:26:18Z

OK. Let me close it.

advancedxy · 2023-12-04T06:45:57Z

@beliefer @cloud-fan we found it useful to inject custom early pushdown rules in practice, such as to rewrite some transform expression that's not yet identified by Spark. Do you think it's possible to resume this PR?

[SPARK-37518] inject a early scan push down rule

a3bc014

github-actions bot added the SQL label Dec 2, 2021

HyukjinKwon changed the title ~~[SPARK-37518] inject a early scan pushdown rule~~ [SPARK-37518][SQL] Inject a early scan pushdown rule Dec 2, 2021

LuciferYang approved these changes Dec 2, 2021

sunchao reviewed Dec 2, 2021

Update code

746278a

sunchao approved these changes Dec 3, 2021

cloud-fan reviewed Dec 6, 2021

Update code

9596820

cloud-fan reviewed Dec 7, 2021

beliefer changed the title ~~[SPARK-37518][SQL] Inject a early scan pushdown rule~~ [SPARK-37518][SQL] Inject an early scan pushdown rule Dec 14, 2021

github-actions bot added the Stale label Mar 25, 2022

beliefer closed this Mar 25, 2022
