Skip to content

Conversation

@beliefer
Copy link
Contributor

@beliefer beliefer commented Dec 2, 2021

What changes were proposed in this pull request?

Currently, Spark supports push down filters, aggregates and limit. All the job is completed by V2ScanRelationPushDown.
But V2ScanRelationPushDown have a lot limit.
Users want apply custom rule for push down after V2ScanRelationPushDown failed.

Why are the changes needed?

Easy for users to apply custom pushdown rules.

Does this PR introduce any user-facing change?

'Yes'.
Users can inject custom early scan pushdown rules.

How was this patch tested?

New tests.

@github-actions github-actions bot added the SQL label Dec 2, 2021
@HyukjinKwon HyukjinKwon changed the title [SPARK-37518] inject a early scan pushdown rule [SPARK-37518][SQL] Inject a early scan pushdown rule Dec 2, 2021
@SparkQA
Copy link

SparkQA commented Dec 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50329/

@SparkQA
Copy link

SparkQA commented Dec 2, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50329/

Copy link
Contributor

@LuciferYang LuciferYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM(non-binding)

Generally speaking, it is useful. @beliefer Can you give a simple example to show the actual user case?

@SparkQA
Copy link

SparkQA commented Dec 2, 2021

Test build #145854 has finished for PR 34779 at commit a3bc014.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable, cc @cloud-fan . Also cc @aokolnychyi @RussellSpitzer I remember this might be useful for Iceberg too?

}
}

test("SPARK-37518: inject a early scan push down rule") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we only require JIRA id for bug fixes and regressions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just references

* The injected rules will be executed once after the operator optimization batch and
* after any push down optimization rules.
*/
def injectEarlyScanPushDownRules(builder: RuleBuilder): Unit = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: injectEarlyScanPushDownRules -> injectEarlyScanPushDownRule

@beliefer
Copy link
Contributor Author

beliefer commented Dec 3, 2021

LGTM(non-binding)

Generally speaking, it is useful. @beliefer Can you give a simple example to show the actual user case?

Spark SQL supports aggregate pushdown only for standard aggregate function. But some databases have some non-standard aggregate function, This PR open a door for flexible customization.

Some databases if good for the other different pushdown type.

@SparkQA
Copy link

SparkQA commented Dec 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50345/

@LuciferYang
Copy link
Contributor

Thanks for your explanation @beliefer

@SparkQA
Copy link

SparkQA commented Dec 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50345/

@SparkQA
Copy link

SparkQA commented Dec 3, 2021

Test build #145870 has finished for PR 34779 at commit 746278a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

cc @cloud-fan and @maryannxue FYI

* The injected rules will be executed once after the operator optimization batch and
* after any push down optimization rules.
*/
def injectEarlyScanPushDownRule(builder: RuleBuilder): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's update the classdoc, which only mentions

 * This current provides the following extension points:
 *
 * <ul>
 * <li>Analyzer Rules.</li>
 * <li>Check Analysis Rules.</li>
 * <li>Optimizer Rules.</li>
 * <li>Pre CBO Rules.</li>
 * <li>Planning Strategies.</li>
 * <li>Customized Parser.</li>
 * <li>(External) Catalog listeners.</li>
 * <li>Columnar Rules.</li>
 * <li>Adaptive Query Stage Preparation Rules.</li>
 * </ul>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to clarify how is this different from Pre CBO Rules

@cloud-fan
Copy link
Contributor

Since we already have this extension point inBaseSessionStateBuilder, I'm OK to expose it in the developer API. My only concern is we should document it clearly.

@beliefer
Copy link
Contributor Author

beliefer commented Dec 7, 2021

Since we already have this extension point inBaseSessionStateBuilder, I'm OK to expose it in the developer API. My only concern is we should document it clearly.

OK

@SparkQA
Copy link

SparkQA commented Dec 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50462/

@SparkQA
Copy link

SparkQA commented Dec 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50462/

@SparkQA
Copy link

SparkQA commented Dec 7, 2021

Test build #145986 has finished for PR 34779 at commit 9596820.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* <li>Check Analysis Rules.</li>
* <li>Optimizer Rules.</li>
* <li>Pre CBO Rules.</li>
* <li>Early Scan Push-Down</li>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After so many discussions in #30808 , I'm really worried about the naming of this new extension point.

In general, this new extension point allows people to inject custom data source operator pushdown rules, which run after the built-in ones. But then the existing Pre CBO rules becomes a confusing name, as the pushdown rules are also pre-CBO.

We may need more time to think about the naming, or think if really need custom data source pushdown rules.

Copy link
Contributor Author

@beliefer beliefer Dec 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that the name of Pre-CBO Rules is too wide and the meaning in the comment is not the same. In my opinion we should not limit the flexibility because of this name.

@beliefer beliefer changed the title [SPARK-37518][SQL] Inject a early scan pushdown rule [SPARK-37518][SQL] Inject an early scan pushdown rule Dec 14, 2021
@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 25, 2022
@beliefer
Copy link
Contributor Author

OK. Let me close it.

@beliefer beliefer closed this Mar 25, 2022
@advancedxy
Copy link
Contributor

@beliefer @cloud-fan we found it useful to inject custom early pushdown rules in practice, such as to rewrite some transform expression that's not yet identified by Spark. Do you think it's possible to resume this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants