Conversation

@aokolnychyi
Contributor

@aokolnychyi aokolnychyi commented Jul 13, 2020

What changes were proposed in this pull request?

This PR proposes to remove redundant sorts before repartition nodes whenever the data is ordered after the repartitioning.

Why are the changes needed?

It looks like our EliminateSorts rule can be extended further to remove sorts before repartition nodes that don't affect the final output ordering. It seems safe to perform the following rewrites:

  • Sort -> Repartition -> Sort -> Scan as Sort -> Repartition -> Scan
  • Sort -> Repartition -> Project -> Sort -> Scan as Sort -> Repartition -> Project -> Scan
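For illustration, a minimal DataFrame sketch of the shape being targeted (assuming a spark-shell session where spark is in scope; the intermediate sort is the redundant one):

import org.apache.spark.sql.functions.col

val df = spark.range(100).toDF("id")
val result = df
  .orderBy(col("id").desc)  // redundant: the shuffle below destroys this ordering
  .repartition(10)          // full shuffle across 10 partitions
  .orderBy(col("id").asc)   // the final output ordering comes from this sort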

Does this PR introduce any user-facing change?

No.

How was this patch tested?

More test cases.

      j.copy(left = recursiveRemoveSort(originLeft), right = recursiveRemoveSort(originRight))
    case g @ Aggregate(_, aggs, originChild) if isOrderIrrelevantAggs(aggs) =>
      g.copy(child = recursiveRemoveSort(originChild))
    case r: RepartitionByExpression =>
Contributor Author

These two branches can be replaced with one:

case r: RepartitionOperation =>
  r.withNewChildren(r.children.map(recursiveRemoveSort))

This would mean any repartition nodes we add in the future are also taken into account. It seems safe, but I want to hear what everybody thinks.

Contributor

What about adding RepartitionOperation.preservesOrder? Then we could collapse these cases while also excluding Coalesce and making this explicit for future repartition operators.
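For instance, something like this (a hypothetical sketch of the suggested API, not what the PR ultimately does):

import org.apache.spark.sql.catalyst.plans.logical.UnaryNode

// Hypothetical: each repartition-like node declares whether it keeps the
// relative order of its input rows.
abstract class RepartitionOperation extends UnaryNode {
  def preservesOrder: Boolean
}

// The rule could then look through only order-destroying repartitions:
//   case r: RepartitionOperation if !r.preservesOrder =>
//     r.withNewChildren(r.children.map(recursiveRemoveSort))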

Contributor Author

I am +1 on having a repartition node that preserves ordering. In fact, we have such a node internally. Coalesce is not really order preserving, though: it has custom logic in DefaultPartitionCoalescer that gets applied if the parent RDD has locality info. We would also need to report outputOrdering correctly (which is not done for Coalesce now).

Contributor Author

We won't be able to squash all cases into one as we need to check if repartition expressions are deterministic. However, I'd consider extending repartition nodes with order preserving repartition in a follow-up if there is enough support for that.
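A sketch of the two branches this implies (names follow the surrounding rule):

// RepartitionByExpression is safe to look through only when its partition
// expressions are deterministic; plain Repartition has no expressions.
case r: RepartitionByExpression if r.partitionExpressions.forall(_.deterministic) =>
  r.copy(child = recursiveRemoveSort(r.child))
case r: Repartition =>
  r.copy(child = recursiveRemoveSort(r.child))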


val a_i = List[Int](1, -1, 2, -2, 2147483647, -2147483648)
val b_i = List[Option[Int]](Some(1), None, None, Some(-2), None, Some(-2147483648))
// we order values by $"a_i".desc manually as sortBy before coalesce is ignored
Contributor Author
@aokolnychyi aokolnychyi Jul 13, 2020

This is the case I mentioned in the PR description. Here, DefaultPartitionCoalescer does preserve the ordering and the test relied on that, even though there is no guarantee it will happen. We could apply the new optimization only if the repartition operation requires a shuffle. That way, we keep the existing behavior.
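A sketch of that guard (assuming the shuffle flag carried by Repartition):

// Coalesce is Repartition(shuffle = false); skip it so code relying on
// DefaultPartitionCoalescer keeping order is unaffected.
case r: Repartition if r.shuffle =>
  r.copy(child = recursiveRemoveSort(r.child))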

Contributor

I think we should apply the optimization only if the repartition requires a shuffle as you suggest. I know that there are users that depend on this behavior.

Contributor Author

I lean towards that as well.

Contributor Author

Updated.

@aokolnychyi
Contributor Author

cc @dongjoon-hyun @dbtsai @cloud-fan @viirya @gengliangwang for feedback

@dongjoon-hyun dongjoon-hyun changed the title [SQL][SPARK-32276] Remove redundant sorts before repartition nodes [SPARK-32276][SQL] Remove redundant sorts before repartition nodes Jul 13, 2020
@dongjoon-hyun
Member

Thank you for pinging me, @aokolnychyi .

@SparkQA

SparkQA commented Jul 13, 2020

Test build #125781 has finished for PR 29089 at commit d71b9e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

comparePlans(optimized, correctAnswer)
}

testRepartitionOptimization(
Contributor

I prefer not to generate tests this way because it makes them difficult to run individually (at least in my IntelliJ environment).

It also tends to reduce readability. Here, you're passing a function to testRepartitionOptimization that itself gets passed a function that modifies the logical plan. I think it would be easier to read if these were separate suites, with a suite-level repartition function:

class EliminateSortsInRepartitionSuite extends ... {
  def repartition(plan: LogicalPlan): LogicalPlan = plan.repartition(10)

  test("remove sortBy") {
    val plan = testRelation.select('a, 'b).sortBy('a.asc, 'b.desc)
    val planWithRepartition = repartition(plan)
    ...
  }
}

class EliminateSortsInRepartitionByExpressionSuite extends EliminateSortsInRepartitionSuite {
  override def repartition(plan: LogicalPlan): LogicalPlan = plan.distribute('a, 'b)(10)
}

Contributor Author

I think this pattern is common in the codebase, but I agree having separate suites makes more sense here. Updated.

@aokolnychyi
Contributor Author

@dongjoon-hyun @viirya @rdblue this PR is ready for another review round.

@rdblue
Contributor

rdblue commented Jul 14, 2020

LGTM. I just had one question in the tests.

def repartition(plan: LogicalPlan): LogicalPlan = plan.repartition(10)
def isOptimized: Boolean = true

test(s"sortBy") {
Member

nit. s" -> ".

Contributor Author

Good catch, these are leftovers from the old version of tests.

Contributor Author

Fixed.

comparePlans(optimizedPlan, analyzer.execute(correctPlan))
}

test(s"sortBy with projection") {
Member

ditto.

Contributor Author

Fixed.

@SparkQA

SparkQA commented Jul 14, 2020

Test build #125847 has finished for PR 29089 at commit 83791b7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Jul 14, 2020

Some tests in EliminateSortsSuite failed?

@aokolnychyi
Contributor Author

Those failures seem to come from the old commit, before this change removed the outdated tests. The coalesce tests that failed are no longer valid.

@SparkQA

SparkQA commented Jul 14, 2020

Test build #125849 has finished for PR 29089 at commit 0ff3092.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 15, 2020

Test build #125848 has finished for PR 29089 at commit c58ad12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 15, 2020

Test build #125858 has finished for PR 29089 at commit ba6a1bb.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member
@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @aokolnychyi , @rdblue , @viirya , @maropu .
Merged to master for Apache Spark 3.1.0, planned for December 2020.
The tests already passed; the last two remaining comments are only about code comments and test case names.

@dongjoon-hyun
Member

Also, cc @gatorsmile and @cloud-fan

@aokolnychyi
Contributor Author

Thanks, everyone!

@dongjoon-hyun
Member

Oops. Sorry, guys. It seems that I missed something during testing. For the following case, we should not remove Sort.

BEFORE THIS PR

scala> Seq((1,10),(1,20),(2,30),(2,40)).toDF("a", "b").repartition(2).createOrReplaceTempView("t")

scala> sql("select * from (select * from t order by b desc) distribute by a").show()
+---+---+
|  a|  b|
+---+---+
|  1| 20|
|  1| 10|
|  2| 40|
|  2| 30|
+---+---+

AFTER THIS PR

scala> Seq((1,10),(1,20),(2,30),(2,40)).toDF("a", "b").repartition(2).createOrReplaceTempView("t")

scala> sql("select * from (select * from t order by b desc) distribute by a").show()
+---+---+
|  a|  b|
+---+---+
|  1| 10|
|  1| 20|
|  2| 30|
|  2| 40|
+---+---+

@dongjoon-hyun
Member

dongjoon-hyun commented Jul 15, 2020

To generate small final Parquet/ORC files, we do the above tricks, don't we? This PR may cause a regression in the size of the output files.

@aokolnychyi
Contributor Author

The same question applies to local sort, too:

sql("select * from (select * from (select * from t order by b desc) distribute by a) sort by b asc")
Sort [b#6 ASC NULLS FIRST], false
+- RepartitionByExpression [a#5], 4
   +- Sort [b#6 DESC NULLS LAST], true
      +- Repartition 2, true
         +- LocalRelation [a#5, b#6]

@SparkQA

SparkQA commented Jul 15, 2020

Test build #125886 has finished for PR 29089 at commit ba6a1bb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

In general, I think we can remove a sort if it doesn't affect the final output ordering. The case caught by @dongjoon-hyun is a good example: the final output ordering changes and affects the file size.

@rdblue
Contributor

rdblue commented Jul 15, 2020

> To generate small final Parquet/ORC files, we do the above tricks, don't we?

We don't rely on this. Our recommendation to users is to add a global sort to distribute the data, which adds the local sort in the final stage that won't be removed. I can understand people relying on this behavior, though.

For now, I think it makes sense to remove a sort before a repartition if the data will be sorted later, like what I think @aokolnychyi is suggesting. That's really what we will need for tables that require a sort order -- that will be the final sort and we should be able to remove other sorts.

We may also want to choose whether this is a guarantee and document it.

@aokolnychyi
Contributor Author

Yes, my proposal is to optimize cases where we sort the data after the repartition, like in the examples I gave above. In those cases, the sorts below seem to be redundant.
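Concretely, reusing the temp view from the example above, a shape like this remains safe to optimize because the outer sort determines the final ordering (sketch):

// The inner ORDER BY is redundant: DISTRIBUTE BY reshuffles the rows and
// the outer ORDER BY re-sorts them afterwards.
sql("""
  select * from (
    select * from (select * from t order by b desc) distribute by a
  ) order by b asc
""").show()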

@aokolnychyi
Contributor Author

@dongjoon-hyun @viirya @hvanhovell @maropu, what do you think?

@viirya
Member

viirya commented Jul 15, 2020

It looks reasonable to me to remove a sort before a repartition if we know the data will be sorted later, e.g. @aokolnychyi's examples above.

@aokolnychyi
Contributor Author

I've updated the PR to show what I meant. I'll check for additional edge cases in the morning but the change is ready for review.

@dongjoon-hyun
Member

Thank you for the quick update, @aokolnychyi. Also, thank you all for your opinions.

@SparkQA

SparkQA commented Jul 16, 2020

Test build #125931 has finished for PR 29089 at commit 21a84ad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 16, 2020

Test build #125986 has finished for PR 29089 at commit 0545b09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aokolnychyi
Contributor Author

I gave it a bit of thought and did not find a case where the updated logic would break.

* 4) if the Sort operator is within Join separated by 0...n Project/Filter/Repartition
* operators only, and the Join condition is deterministic
* 5) if the Sort operator is within GroupBy separated by 0...n Project/Filter/Repartition
* operators only, and the aggregate function is order irrelevant
Member

This documentation update seems to focus on case _: Repartition => true only. Could you revise more to cover case r: RepartitionByExpression => r.partitionExpressions.forall(_.deterministic), please?

Contributor Author

Done.

}

/**
* Removes Sort operation. This can happen:
Member

Shall we revise this line?

- Removes Sort operation. This can happen:
+ Removes Sort operation if it doesn't affect the final output ordering.
+ Note that changes in the final output ordering may affect the file size (SPARK-32318).
+ This optimizer handles the following cases:

Contributor Author

Done.

@SparkQA

SparkQA commented Jul 17, 2020

Test build #126060 has finished for PR 29089 at commit 2157a71.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

Member
@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you so much, @aokolnychyi and all.
The contribution of this PR is not only improving the optimizer but also adding more test coverage.
Merged to master.

@dongjoon-hyun
Member

cc @cloud-fan and @gatorsmile once more.

@hvanhovell
Contributor

@dongjoon-hyun I am a bit late with my response but here goes :)

> However, the following is not reasonable. There is nothing wrong in the file formats. They are just consumers and showing a better performance in a sorted input sequence because they are columnar vectorized format. I guess you assume that this is only a behavior at ORC. But, I'm sure that you can find your customers are relying on this in Parquet, too.

That is making the argument for explicitly organizing the data before the write, right? You are currently just lucky that the system accidentally produces a nice layout for you; 99% of our users won't be as lucky. The only way you can be sure is to add these things yourself.

> This is not an implicit system behavior in Apache Spark. Apache Spark has been working in the procedural ways as you see in the above. If we start to ignore the valid working pattern in the production, it becomes a huge regression.
> In short, saving to a file is a totally different and valid story. To optimize the final output files, the above pattern has been used in production among Apache Spark users for a long time. If some optimizer rule ignores the existing usage, this obviously ends up as a large regression in terms of cost (for example, S3).

If you generalize the procedural argument, then we also should not do things like join reordering or swapping window operators. The whole point of a declarative system like Spark SQL is that you don't care about how the system executes a query, and that it has the freedom to move operations around to make execution more optimal.

Have you considered that your regression is someone else's speed-up? Sorting is not free, and if we can avoid it we should. There might be a large group of users that are adversely affected by spurious sorts in their queries (e.g. an ORDER BY in a view).

Finally, I do want to point out that there is no mechanism that catches this regression if it pops up again.

@SparkQA

SparkQA commented Jul 19, 2020

Test build #126134 has finished for PR 29089 at commit 2157a71.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

dongjoon-hyun commented Jul 20, 2020

@hvanhovell . Thank you for your feedback. The following looks a little wrong to me because the above optimization was one of the recommendations we gave many Hortonworks customers to save HDFS usage. I know of many production usages like that. I had almost forgotten, but it suddenly came back to me during this PR. (Sadly, after I merged it.)

> You are currently just lucky that the system accidentally produces a nice layout for you; 99% of our users won't be as lucky. The only way you can be sure is to add these things yourself.

I understand your point of view fully. However, I'm wondering if you can persuade customers to waste their storage by generating 160x bigger files (the example from SPARK-32318). Do you think you can?

-rw-r--r--   1 dongjoon  wheel  939 Jul 14 22:12 part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
-rw-r--r--   1 dongjoon  wheel  150741 Jul 14 22:08 part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc

For the following, SPARK-32318 added test coverage on master/3.0/2.4. Are you suggesting that's not enough? If so, we can add more.

> Finally, I do want to point out that there is no mechanism that catches this regression if it pops up again.

@dongjoon-hyun
Member

For the following, I'd like to ask for your help if you are interested. I believe we all want to build a better Apache Spark together as a community.

> If you generalize the procedural argument, then we also should not do things like join reordering or swapping window operators.
