-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-23778][CORE] Avoid unneeded shuffle when union gets an empty RDD #21333
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #90647 has finished for PR 21333 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice change! I tested this out as well and verified that the shuffle doesn't happen. I did notice that this change wasn't reflected in the dataset API. Is that something that should be addressed in this change?
} | ||
} | ||
|
||
test("SPARK-23778: empty RDD in union should not produce a UnionRDD") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we tested when all input RDDs are empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When all RDDs are empty we are returning a UnionRDD
. Though in this case it is not a big issue, since a shuffle of an empty RDD is not an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a test? just make sure we are safe when UnionRDD.rdds
is Nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added, thanks.
@varuvish the Dataset API uses |
LGTM |
Test build #92098 has finished for PR 21333 at commit
|
thanks, merging to master! |
apache#21333 When a union is invoked on several RDDs of which one is an empty RDD, the result of the operation is a UnionRDD. This causes an unneeded extra-shuffle when all the other RDDs have the same partitioning. The PR ignores incoming empty RDDs in the union method.
…pty RDD apache#21333 When a union is invoked on several RDDs of which one is an empty RDD, the result of the operation is a UnionRDD. This causes an unneeded extra-shuffle when all the other RDDs have the same partitioning. The PR ignores incoming empty RDDs in the union method.
What changes were proposed in this pull request?
When a
union
is invoked on several RDDs of which one is an empty RDD, the result of the operation is aUnionRDD
. This causes an unneeded extra-shuffle when all the other RDDs have the same partitioning.The PR ignores incoming empty RDDs in the union method.
How was this patch tested?
added UT