-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-40354][SQL] Support eliminate dynamic partition for datasource v1 writes #37831
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
030c209 to
bf77d3a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test is used to validate partial static partition, so use a column instead of literal to recover it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have not supported hive write yet
bf77d3a to
bd13abf
Compare
| null | ||
| } else { | ||
| assert(part.isInstanceOf[UTF8String]) | ||
| ExternalCatalogUtils.getPartitionSpecString(part.asInstanceOf[UTF8String].toString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null and empty partition are treated as null, and the final partition path will be __HIVE_DEFAULT_PARTITION__
|
cc @cloud-fan @viirya thank you |
| InsertIntoHadoopFsRelationCommand( | ||
| outputPath = outputPath, | ||
| staticPartitions = Map.empty, | ||
| staticPartitions = staticPartitions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain more about the difference between write with dynamic partition columns and without? The code can be simplified quite a lot if we just need to remove sort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the main benefit is we can save a local sort. yes, I agree, the current implementation is over complex. The reason is I guess some downstream projects or extensions may depend on the static partitions(e.g. add repartition for dynamic partition writes), so I merge the infered static partitions into the original. I'm fine to simplify the code to just remove a sort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the downstream projects can detect real dynamic partition columns by themselves. I'd prefer a simple solution here to just remove local sort.
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Add a new optimizer rule:
EliminateV1DynamicPartitionWrites. This new rule supports convert dynamic partition writes to statis partition writes if the dynamic column is Literal which has been optimzied byConstantFoldingfor datasource table.Why are the changes needed?
If the partition column is actually foldable, it's is same with static partition. So there is no needed to do an extra local sort to ensure the same partition values are continuous.
Besides, the dataframe write api does not support specify the static partition spec so users always do dynamic partition writes, e.g.
Does this PR introduce any user-facing change?
no, improve the performance
How was this patch tested?
add tests