[SPARK-23778][CORE] Avoid unneeded shuffle when union gets an empty RDD #21333

mgaido91 · 2018-05-15T14:46:02Z

What changes were proposed in this pull request?

When a union is invoked on several RDDs and one of them is an empty RDD, the result of the operation is a UnionRDD. This causes an unneeded extra shuffle when all the other RDDs have the same partitioning, because the UnionRDD does not retain their common partitioner.

The PR ignores incoming empty RDDs in the union method.
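To make the idea concrete, here is a minimal plain-Scala sketch of the selection logic described above. It is a simplified model, not Spark's actual code: `RddInfo`, `UnionKind`, and `UnionChoice.choose` are hypothetical names introduced for illustration, partitioners are modelled as plain strings, and treating "empty" as "has no partitions" is an assumption of this sketch.

```scala
// Hypothetical stand-in for an RDD: only the two properties the decision depends on.
final case class RddInfo(numPartitions: Int, partitioner: Option[String])

// Hypothetical labels for the two kinds of union result.
sealed trait UnionKind
case object PartitionerAwareUnion extends UnionKind // keeps the shared partitioner
case object PlainUnion extends UnionKind            // drops partitioner information

object UnionChoice {
  // Assumption: an "empty" input is one with zero partitions.
  def choose(rdds: Seq[RddInfo]): UnionKind = {
    // Ignore empty inputs before inspecting partitioners.
    val nonEmpty = rdds.filter(_.numPartitions > 0)
    val partitioners = nonEmpty.flatMap(_.partitioner).toSet
    // Partitioning is preserved only if every remaining input has a
    // partitioner and they all share the same one.
    if (nonEmpty.forall(_.partitioner.isDefined) && partitioners.size == 1) {
      PartitionerAwareUnion
    } else {
      PlainUnion
    }
  }
}
```

In this model, two inputs sharing a partitioner plus one empty input yield `PartitionerAwareUnion`. Without the filtering step the empty input, which has no partitioner, would force `PlainUnion`. When every input is empty, the set of partitioners is empty, so the model falls back to `PlainUnion`.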

How was this patch tested?

Added a unit test.

SparkQA · 2018-05-15T19:18:29Z

Test build #90647 has finished for PR 21333 at commit f67a88d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

varunvish910

Nice change! I tested this out as well and verified that the shuffle doesn't happen. I did notice that this change wasn't reflected in the dataset API. Is that something that should be addressed in this change?

jiangxb1987 · 2018-05-31T06:33:41Z

core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala

    }
  }

+  test("SPARK-23778: empty RDD in union should not produce a UnionRDD") {


jiangxb1987 · May 31, 2018

Have we tested the case where all input RDDs are empty?

mgaido91 · May 31, 2018

When all RDDs are empty, we return a UnionRDD. That is not a big issue in this case, though, since shuffling an empty RDD is harmless.

cloud-fan · Jun 19, 2018

Can we add a test? Just to make sure we are safe when UnionRDD.rdds is Nil.

mgaido91 · Jun 19, 2018

Added, thanks.
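The two cases discussed in this thread can be checked locally with a sketch like the one below. It is not the unit test added in this PR. It assumes a Spark build that includes this change and Spark on the classpath, and the object name `UnionEmptyRddCheck`, the local master setting, and the sample data are all illustrative choices.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Illustrative check, not the test from the PR.
object UnionEmptyRddCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("union-empty-rdd-check")
    val sc = new SparkContext(conf)
    try {
      val partitioner = new HashPartitioner(4)
      val left = sc.parallelize(1 to 10).map(i => (i, i)).partitionBy(partitioner)
      val right = sc.parallelize(11 to 20).map(i => (i, i)).partitionBy(partitioner)
      val empty = sc.emptyRDD[(Int, Int)]

      // Case 1: one empty RDD among co-partitioned RDDs.
      // With this change applied, the shared partitioner should survive the union.
      val mixed = sc.union(Seq(left, right, empty))
      assert(mixed.partitioner.contains(partitioner))
      assert(mixed.count() == 20L)

      // Case 2: every input is empty. The union should still be safe to evaluate.
      val allEmpty = sc.union(Seq(sc.emptyRDD[Int], sc.emptyRDD[Int]))
      assert(allEmpty.collect().isEmpty)
    } finally {
      sc.stop()
    }
  }
}
```

On a build without this change, the first partitioner assertion would be expected to fail, since the empty RDD causes the union to lose the shared partitioner.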

mgaido91 · 2018-06-12T14:13:18Z

@varunvish910 the Dataset API uses sparkContext.union under the hood, so it is addressed by the current change as well.

mgaido91 · 2018-06-19T12:51:08Z

cc @cloud-fan @JoshRosen

cloud-fan · 2018-06-19T20:00:32Z

LGTM

SparkQA · 2018-06-19T23:22:00Z

Test build #92098 has finished for PR 21333 at commit 7f16ea0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-06-20T05:29:11Z

thanks, merging to master!

[SPARK-23778][CORE] Avoid unneeded shuffle when union gets an empty RDD

f67a88d

varunvish910 approved these changes May 30, 2018

jiangxb1987 reviewed May 31, 2018

add test for all empty RDDs

7f16ea0

asfgit closed this in bc11146 Jun 20, 2018
