
Conversation

carsonwang
Contributor

What changes were proposed in this pull request?

This is joint work with @yucai, @gczsjdy, @chenghao-intel, and @xuanyuanking.

We'd like to introduce a new approach to adaptive execution in Spark SQL. The idea is described at https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing

How was this patch tested?

Updated ExchangeCoordinatorSuite.
We also tested this with all queries in TPC-DS.

@carsonwang
Contributor Author

cc @cloud-fan, @gatorsmile, @yhuai

@SparkQA

SparkQA commented Jan 18, 2018

Test build #86305 has finished for PR 20303 at commit e0b98fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

// 3. Codegen and update the UI
child = CollapseCodegenStages(sqlContext.conf).apply(child)
Member

Change this line to:

    child = child match {
      case s: WholeStageCodegenExec => s
      case other => CollapseCodegenStages(sqlContext.conf).apply(other)
    }

?

@gczsjdy

It seems child won't be a WholeStageCodegenExec.

Contributor

Yes, @gczsjdy is correct.
In adaptive execution, there is no whole-stage codegen rule in QueryExecution.adaptivePreparations, so child cannot be a WholeStageCodegenExec.

@SparkQA

SparkQA commented Jan 26, 2018

Test build #86697 has finished for PR 20303 at commit 9a1301f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@carsonwang
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jan 29, 2018

Test build #86757 has finished for PR 20303 at commit 9a1301f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 9, 2018

Test build #87236 has finished for PR 20303 at commit 603c6d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aaron-aa

aaron-aa commented Oct 4, 2018

@carsonwang what's your plan for merging this into master or a release? Thanks a lot!

@carsonwang
Contributor Author

@aaron-aa, the committers agreed to start reviewing the code after the 2.4 release.

@aaron-aa

@carsonwang Thanks

@carsonwang
Contributor Author

@cloud-fan @gatorsmile, are you ready to start reviewing this? I can bring it up to date.

Contributor

@cloud-fan left a comment

Looks pretty good. One thing I'm unclear about is how whole-stage codegen is applied to query stages recursively. Can you explain it a little more?

Contributor

add a comment that this rule must be run after EnsureRequirements.

Contributor Author

Sure. Will add it and rebase the code.


Also note that this should be applied last, as it actually divides the tree into multiple sub-trees?

Contributor Author

Yes, this is noted in a comment in QueryExecution where this rule is used. Let me also add it here.

Contributor

When can mapOutputStatistics be null?

Contributor Author

If the child stage's RDD has 0 partitions, we will not submit that stage; see ShuffleExchangeExec.eagerExecute. In that case, mapOutputStatistics will be null, so we filter it out.
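For illustration, a rough sketch of that behavior (hypothetical method name; SparkContext.submitMapStage and MapOutputStatistics are Spark-internal APIs, so this is a sketch of the idea rather than the PR's exact code):

    import scala.concurrent.Await
    import scala.concurrent.duration.Duration
    import org.apache.spark.{MapOutputStatistics, ShuffleDependency, SparkContext}

    // Sketch: submit the map stage only when the child RDD has partitions.
    // A zero-partition child yields null statistics, which the caller filters
    // out before estimating the post-shuffle partition start indices.
    def eagerExecuteSketch(
        sc: SparkContext,
        dep: ShuffleDependency[Int, Any, Any]): MapOutputStatistics = {
      if (dep.rdd.partitions.nonEmpty) {
        val statsFuture = sc.submitMapStage(dep) // the map stage runs as its own job
        Await.result(statsFuture, Duration.Inf)
      } else {
        null // nothing was submitted, so there are no statistics
      }
    }

    // Caller side: drop the nulls before using the statistics.
    // val validStats = childMapOutputStatistics.filter(_ != null)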

Contributor

Why is it a var?

Contributor Author

This is a var so that we can update the plan at runtime by directly assigning a new child to the ShuffleQueryStage. This won't affect other query stages.
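As a rough sketch of that design choice (hypothetical class shape, not the PR's code), the mutable field is what allows swapping in a re-optimized sub-plan later:

    import org.apache.spark.sql.execution.SparkPlan

    // Sketch: a query stage whose sub-plan can be replaced once runtime
    // statistics are available; only this stage's field is reassigned, so
    // sibling query stages are unaffected.
    class ShuffleQueryStageSketch(var child: SparkPlan) {
      def updateChild(newChild: SparkPlan): Unit = {
        child = newChild
      }
    }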

@carsonwang
Contributor Author

@cloud-fan, we don't apply whole-stage codegen to query stages recursively. In QueryStage.prepareExecuteStage, we first execute the child stages and wait for their completion. Based on the child stage statistics, we can potentially update the plan and the reducer number in the current query stage. After that, we apply whole-stage codegen only to the plan in the current query stage. Note that QueryStageInput is a leaf node, so whole-stage codegen won't apply to child stages.
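A condensed sketch of that sequence (hypothetical helper signatures, simplified from the description above; not the PR's actual QueryStage code):

    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import org.apache.spark.sql.execution.SparkPlan

    // Sketch of the prepareExecuteStage flow described above:
    // 1) materialize child stages and wait, 2) re-optimize this stage's plan
    // using their statistics, 3) apply whole-stage codegen to this stage only.
    // Because QueryStageInput is a leaf node, codegen never descends into
    // child stages.
    def prepareExecuteStageSketch(
        childStages: Seq[() => Unit],              // stand-ins for child stage execution
        optimizeWithStats: SparkPlan => SparkPlan, // e.g. adjust the reducer number
        collapseCodegen: SparkPlan => SparkPlan,   // whole-stage codegen for this stage
        plan: SparkPlan)(implicit ec: ExecutionContext): SparkPlan = {
      // 1. Execute child stages in parallel and wait for all of them to finish.
      val futures = childStages.map(run => Future(run()))
      futures.foreach(f => Await.result(f, Duration.Inf))

      // 2. Re-optimize the current stage's plan with the collected statistics.
      val optimized = optimizeWithStats(plan)

      // 3. Codegen is applied only within the current query stage.
      collapseCodegen(optimized)
    }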

carsonwang and others added 10 commits January 11, 2019 16:06
…es when call executeCollect, executeToIterator and executeTake action multi-times (apache#70)

* Avoid QueryStage#prepareExecuteStage being executed multiple times when executeCollect, executeToIterator and executeTake are called multiple times

* Only add the check in the prepareExecuteStage method to avoid duplicate checks in other methods

* small fix
@SparkQA

SparkQA commented Jan 15, 2019

Test build #101268 has finished for PR 20303 at commit 5819826.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • .doc(\"Comma separated list of filter class names to apply to the Spark Web UI.\")

@SparkQA

SparkQA commented Jan 15, 2019

Test build #101267 has finished for PR 20303 at commit 2c55985.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

* do not re-implement exchange reuse

* simplify QueryStage

* add comments

* new idea

* polish

* address comments

* improve QueryStageTrigger
@SparkQA

SparkQA commented Jan 22, 2019

Test build #101520 has finished for PR 20303 at commit ea93dbf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Feb 28, 2019

Test build #102848 has finished for PR 20303 at commit bef8ab8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 28, 2019

Test build #102855 has finished for PR 20303 at commit 2d6f110.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


withCoordinator
}
private def defaultNumPreShufflePartitions: Int =
@fangshil Mar 4, 2019

@carsonwang With AE being the new mode, spark.sql.adaptive.maxNumPostShufflePartitions will replace spark.sql.shuffle.partitions to determine the initial shuffle parallelism. This seems to be a significant user-facing change, especially if we want to enable AE as the cluster default in the future. E.g., if a user has set spark.sql.shuffle.partitions to 10K for a large join, with AE enabled he has to set maxNumPostShufflePartitions to 10K as well, otherwise he will only get 500. I had to make a small patch when testing AE in production jobs, setting maxNumPostShufflePartitions = spark.sql.shuffle.partitions whenever spark.sql.shuffle.partitions != the default of 200. What do you think?

Contributor Author

If the user is familiar with the AE configuration, can he just set maxNumPostShufflePartitions for the large join when he enables AE? If he sets spark.sql.shuffle.partitions to 10K in non-AE mode, I expect he will set a higher value for maxNumPostShufflePartitions, like 20K, in AE mode. Then AE can find a good reducer number between 1 and 20K, usually a better value than 10K.

@fangshil Mar 7, 2019

@carsonwang, I am afraid we are making a risky assumption that the user needs to be familiar with the AE config. It is not a concern for me while AE is an on-demand feature. However, the current version of this PR sets spark.sql.adaptive.enabled = true, which means we plan to enable AE mode by default; then, when we roll out the next version of Spark in our cluster, we could break a lot of prod jobs with custom spark.sql.shuffle.partitions. I proposed a change to adjust maxNumPostShufflePartitions based on spark.sql.shuffle.partitions, which I think is safer if the upstream plan is also to make AE the cluster default mode in the future.

Member

 I proposed a change to adjust maxNumPostShufflePartitions based on spark.sql.shuffle.partitions which I think is safer.

I think it makes sense. Actually, in our internal practice, we set the default value of maxNumPostShufflePartitions = 1.5 * spark.sql.shuffle.partitions. But for the common code here, a magic number like 1.5 raises doubts; maybe we need a discussion about the strategy for setting an appropriate default value of maxNumPostShufflePartitions based on spark.sql.shuffle.partitions.

@fangshil Mar 10, 2019

Thanks @xuanyuanking for the input! Instead of setting maxNumPostShufflePartitions based on a magic number like 1.5x or 500, I would propose adding a conf to replace maxNumPostShufflePartitions that is a ratio of spark.sql.shuffle.partitions. The default value could be 1.0, so the behavior of AE's initial partition number is consistent with spark.sql.shuffle.partitions in non-AE mode. With 1.5 or 2, one of my concerns is that it could potentially increase the shuffle service load when we enable AE as the cluster default, as we have seen shuffle service scalability issues in our cluster when handling very large shuffle workloads.
cc @carsonwang @cloud-fan

Contributor Author

It is a good point that we should not break anything when AE is enabled by default at the cluster level. Currently it is enabled only for test purposes, but it is possible we enable it by default in the future. What if we set maxNumPostShufflePartitions to spark.sql.shuffle.partitions by default? We currently use maxNumPostShufflePartitions as the initial partition number, but in the future, if we can find a better initial partition number between minNumPostShufflePartitions and maxNumPostShufflePartitions at runtime, maxNumPostShufflePartitions will be treated as an upper limit.
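One possible shape for that default (a sketch only, assuming SQLConf's buildConf and fallbackConf helpers are in scope; this is not what the PR currently does):

    // Sketch: default spark.sql.adaptive.maxNumPostShufflePartitions to
    // spark.sql.shuffle.partitions instead of a fixed 500, so enabling AE by
    // default does not change the effective parallelism of existing jobs.
    val MAX_NUM_POSTSHUFFLE_PARTITIONS =
      buildConf("spark.sql.adaptive.maxNumPostShufflePartitions")
        .doc("The advisory maximum number of post-shuffle partitions used in " +
          "adaptive execution. Defaults to spark.sql.shuffle.partitions.")
        .fallbackConf(SHUFFLE_PARTITIONS)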

@fangshil Mar 15, 2019

@carsonwang, great, I support setting maxNumPostShufflePartitions = spark.sql.shuffle.partitions by default. This is exactly what I did internally when testing AE. Based on @xuanyuanking's scenario, setting maxNumPostShufflePartitions as a configurable ratio of spark.sql.shuffle.partitions would also make sense to me.

@fangshil

fangshil commented Mar 4, 2019

Excited to see AE making progress upstream :) We have used the new AE framework to add SQL optimization rules and the results look very promising. We have a few comments on this patch in general:

  1. The current patch handles shuffle parallelism on the reducer side: it starts with a relatively large number of mapper partitions (500) and merges them into fewer reducer partitions by allowing each reducer to read multiple mappers. At large data scale, setting spark.sql.shuffle.partitions to 10K in non-AE mode vs. maxNumPostShufflePartitions in AE should give the same results, since the reducer number won't change when the data is large. I think with this patch we haven't yet reached optimal performance, since we only save the overhead of launching a certain number of reduce tasks. A better approach would be to dynamically estimate the initial/mapper parallelism between 0 and maxNumPostShufflePartitions. This should be possible with AE as well, and this patch should be a solid foundation for future improvements. Hope we can merge it soon!

  2. This patch uses the submitMapStage API, which submits each stage as a new job, so AE breaks Spark's vanilla definition of a job. This is an issue we inherit from the original AE, not one introduced by this new AE.

@carsonwang
Contributor Author

@fangshil, thanks for the suggestions and feedback! When measuring the performance, please also include the patch #19788, which has a big impact on AE performance. As you suggested, dynamically estimating the initial parallelism will be helpful and will make AE much easier for users to use. We also did some work on that and can contribute it in future PRs.

@justinuang

justinuang commented Mar 13, 2019

@carsonwang What happens when we call df.repartition(500) on a 10MB dataset with AQE turned on? AQE will still ignore the explicit repartition, right? This might be unintuitive to users.

Perhaps we can provide an option to let people decide whether they want AQE to apply to the repartition call?

.doc("The advisory minimum number of post-shuffle partitions used in adaptive execution.")
.intConf
.createWithDefault(-1)
.checkValue(numPartitions => numPartitions > 0, "The minimum shuffle partition number " +
Contributor

super nit: we can simply write _ > 0

buildConf("spark.sql.adaptive.maxNumPostShufflePartitions")
.doc("The advisory maximum number of post-shuffle partitions used in adaptive execution.")
.intConf
.checkValue(numPartitions => numPartitions > 0, "The maximum shuffle partition number " +
Contributor

ditto

* There are 2 kinds of query stages:
* 1. Shuffle query stage. This stage materializes its output to shuffle files, and Spark launches
* another job to execute the further operators.
* 2. Broadcast stage. This stage materializes its output to an array in driver JVM. Spark
Contributor

nit: Broadcast query stage

spark.sql("SET spark.sql.exchange.reuse=true")
val df = spark.range(1).selectExpr("id AS key", "id AS value")

// test case 1: a fragment has 3 child fragments but they are the same fragment.
Contributor

nit: fragment -> query stage

@SparkQA

SparkQA commented Mar 15, 2019

Test build #103545 has finished for PR 20303 at commit 028b0ac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val simpleNodeName = "Exchange"
s"$simpleNodeName$extraInfo"
"Exchange"

:nit No need {}

*
* When one query stage finishes materialization, a list of adaptive optimizer rules will be
* executed, trying to optimize the query plan with the data statistics collected from the the
* materialized data. Then we travers the query plan again and try to insert more query stages.

:nit traverse

override def apply(plan: SparkPlan): SparkPlan = plan.transformUp {
case shuffle @ ShuffleExchangeExec(upper: HashPartitioning, child) =>
child.outputPartitioning match {
case lower: HashPartitioning if upper.semanticEquals(lower) => child

Will there be any difference if we judge by lower.satisfies(fatherOperator.requiredDistribution)?

Contributor

This is copied from EnsureRequirements, but I think there is a difference: the number of partitions matters in semanticEquals. That said, lower.satisfies(fatherOperator.requiredDistribution) is more aggressive and may remove user-specified shuffle via something like df.partitionBy
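To illustrate the difference (a hypothetical example, not from the PR's tests):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().master("local[2]").getOrCreate()
    val base = spark.range(1000).toDF("id").repartition(200, col("id"))

    // The user explicitly re-shuffles the same key down to 10 partitions.
    val ten = base.repartition(10, col("id"))

    // semanticEquals: hash(id, 10) != hash(id, 200) because numPartitions
    // differs, so the second exchange is kept and the user really gets 10
    // partitions.
    // satisfies: a parent that only requires ClusteredDistribution(id) is
    // already satisfied by hash(id, 200), so a satisfies-based check could
    // drop the user's second repartition.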

}
}

private def createQueryStage(e: Exchange): QueryStageExec = {

The three createQueryStage methods bring some confusion...

// number of partitions, they will have the same number of pre-shuffle partitions
// (i.e. map output partitions).
assert(
distinctNumPreShufflePartitions.length == 1,

In most cases this has to be 1, but when the child QueryStages are not in one whole-stage code generation block, we can still do adaptive execution even if the distinct value is larger than 1. For example, if the root operator is a Union (which doesn't support codegen) and the two children are both ShuffleExchanges, the two ShuffleExchanges don't have to share the same number of pre-shuffle partitions; they can reduce post-shuffle partitions separately. I am not sure if I have this right. cc @cloud-fan

Contributor

I think this is a long-standing issue and I don't have a good idea for dealing with Union and Join differently.

@SparkQA

SparkQA commented Mar 22, 2019

Test build #103807 has finished for PR 20303 at commit 2e08778.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@carsonwang
Contributor Author

@justinuang, sorry for the late reply. For df.repartition(500), AE does use the specified number for the repartition. That is, each stage will write its map output with 500 partitions. However, in the following stage, AE may launch fewer than 500 tasks, as one task can process multiple contiguous blocks.
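As a rough illustration of that coalescing (a simplified sketch of the idea, not the coordinator's exact algorithm), contiguous pre-shuffle partitions are merged until a target size is reached, and each merged range becomes one reduce task:

    import scala.collection.mutable.ArrayBuffer

    // Sketch: merge contiguous pre-shuffle partitions into post-shuffle partitions.
    // bytesByPartition(i) is the total size of pre-shuffle partition i across all
    // map outputs; the result holds the start index of each post-shuffle partition.
    def estimateStartIndicesSketch(
        bytesByPartition: Array[Long],
        targetSize: Long): Array[Int] = {
      val startIndices = ArrayBuffer(0)
      var currentSize = 0L
      bytesByPartition.zipWithIndex.foreach { case (bytes, i) =>
        // Start a new post-shuffle partition once the current one is big enough.
        if (i > 0 && currentSize + bytes > targetSize) {
          startIndices += i
          currentSize = 0L
        }
        currentSize += bytes
      }
      startIndices.toArray
    }

    // 500 tiny map-output partitions with a 64MB target collapse into a single
    // reduce task:
    // estimateStartIndicesSketch(Array.fill(500)(1024L), 64L * 1024 * 1024) => Array(0)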

@justinuang

@carsonwang we found a bug in production when AQE is turned on:

Here is a case where the ShuffleQueryStageInputs to a Union node will have differing numbers of partitions if we explicitly repartition them.

Here is a repro:

      val sparkSession = SparkSession.builder()
        .master("local[2]")
        .config("spark.sql.autoBroadcastJoinThreshold", "-1")
        .config("spark.sql.adaptive.enabled", "true")
        .getOrCreate();

      val dataset1 = sparkSession.range(1000);
      val dataset2 = sparkSession.range(1001);

      val compute = dataset1.repartition(505, dataset1.col("id"))
        .union(dataset2.repartition(105, dataset2.col("id")))

      compute.show()
      compute.explain()
== Parsed Logical Plan ==
Union
:- AnalysisBarrier RepartitionByExpression [id#152L], 505
+- AnalysisBarrier RepartitionByExpression [id#155L], 105

== Analyzed Logical Plan ==
id: bigint
Union
:- RepartitionByExpression [id#152L], 505
:  +- Range (0, 1000, step=1, splits=Some(2))
+- RepartitionByExpression [id#155L], 105
   +- Range (0, 1001, step=1, splits=Some(2))

== Optimized Logical Plan ==
Union
:- RepartitionByExpression [id#152L], 505
:  +- Range (0, 1000, step=1, splits=Some(2))
+- RepartitionByExpression [id#155L], 105
   +- Range (0, 1001, step=1, splits=Some(2))

== Physical Plan ==
*Union
:- *Exchange hashpartitioning(id#152L, 505)
:  +- *Range (0, 1000, step=1, splits=2)
+- *Exchange hashpartitioning(id#155L, 105)
   +- *Range (0, 1001, step=1, splits=2)

assertion failed: There should be only one distinct value of the number pre-shuffle partitions among registered Exchange operator.
java.lang.AssertionError: assertion failed: There should be only one distinct value of the number pre-shuffle partitions among registered Exchange operator.
	at scala.Predef$.assert(Predef.scala:170)
	at org.apache.spark.sql.execution.exchange.ExchangeCoordinator.estimatePartitionStartIndices(ExchangeCoordinator.scala:119)
	at org.apache.spark.sql.execution.adaptive.QueryStage.prepareExecuteStage(QueryStage.scala:104)
	at org.apache.spark.sql.execution.adaptive.QueryStage.executeCollect(QueryStage.scala:138)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3262)
...

My immediate thought was to get rid of the assert and instead skip automatic repartitioning if the numbers of input partitions are different. There might be a better way to fix this, though; I haven't given it much thought yet.
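A minimal sketch of that guard (hypothetical helper, relying on the Spark-internal MapOutputStatistics class; not a patch against this PR): coalesce reducers only when every registered exchange reports the same pre-shuffle partition count, and otherwise leave the partitioning untouched:

    import org.apache.spark.MapOutputStatistics

    // Sketch: replace the assertion with a graceful fallback. If the children
    // (e.g. the two sides of a Union) report different pre-shuffle partition
    // counts, return None and skip reducer coalescing for these exchanges.
    def maybeStartIndices(
        stats: Seq[MapOutputStatistics],
        estimate: Seq[MapOutputStatistics] => Array[Int]): Option[Array[Int]] = {
      val distinctCounts = stats.map(_.bytesByPartitionId.length).distinct
      if (distinctCounts.length == 1) Some(estimate(stats)) // safe to coalesce
      else None                                             // keep the original partitioning
    }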

@justinuang

Ping =) Any thoughts on the post above? Unfortunately this meant that we had to revert AQE in our fork.

@carsonwang
Contributor Author

@justinuang, in theory we can handle Union separately and remove that limitation. But for now I think we can skip changing the reducer number, just as you mentioned.

@justinuang

justinuang commented May 9, 2019

@carsonwang are there plans to continue work on this PR?

@carsonwang
Contributor Author

@justinuang, we are still waiting for some updates from @cloud-fan.

@jerrychenhf

@carsonwang, are we still working on this pull request? @cloud-fan, @justinuang
I saw another pull request, #24706, that @maryannxue is working on. Does that one supersede this one, so that we should work and discuss on #24706 instead?

@gatorsmile
Member

@jerrychenhf #24706 implements a framework for adaptive query execution based on this PR. Please review that PR: #24706

@tgravescs
Contributor

assume this can be closed?

@carsonwang
Contributor Author

yes, closing this in favor of #24706

@carsonwang closed this Jul 9, 2019