Skip to content

Conversation

mgaido91
Copy link
Contributor

What changes were proposed in this pull request?

When a union is invoked on several RDDs of which one is an empty RDD, the result of the operation is a UnionRDD. This causes an unneeded extra-shuffle when all the other RDDs have the same partitioning.

The PR ignores incoming empty RDDs in the union method.

How was this patch tested?

added UT

@SparkQA
Copy link

SparkQA commented May 15, 2018

Test build #90647 has finished for PR 21333 at commit f67a88d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link

@varunvish910 varunvish910 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice change! I tested this out as well and verified that the shuffle doesn't happen. I did notice that this change wasn't reflected in the dataset API. Is that something that should be addressed in this change?

}
}

test("SPARK-23778: empty RDD in union should not produce a UnionRDD") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we tested when all input RDDs are empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When all RDDs are empty we are returning a UnionRDD. Though in this case it is not a big issue, since a shuffle of an empty RDD is not an issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a test? just make sure we are safe when UnionRDD.rdds is Nil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added, thanks.

@mgaido91
Copy link
Contributor Author

@varuvish the Dataset API uses sparkContext.union under the hood, so it is addressed as well by the current change.

@mgaido91
Copy link
Contributor Author

cc @cloud-fan @JoshRosen

@cloud-fan
Copy link
Contributor

LGTM

@SparkQA
Copy link

SparkQA commented Jun 19, 2018

Test build #92098 has finished for PR 21333 at commit 7f16ea0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in bc11146 Jun 20, 2018
zzcclp added a commit to zzcclp/spark that referenced this pull request Dec 6, 2018
apache#21333

When a union is invoked on several RDDs of which one is an empty RDD, the result of the operation is a UnionRDD. This causes an unneeded extra-shuffle when all the other RDDs have the same partitioning.

The PR ignores incoming empty RDDs in the union method.
zzcclp added a commit to zzcclp/spark that referenced this pull request Sep 20, 2019
…pty RDD apache#21333

When a union is invoked on several RDDs of which one is an empty RDD, the result of the operation is a UnionRDD. This causes an unneeded extra-shuffle when all the other RDDs have the same partitioning.

The PR ignores incoming empty RDDs in the union method.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants