Skip to content

Conversation

@ulysses-you
Copy link
Contributor

What changes were proposed in this pull request?

Add a new optimizer rule: EliminateV1DynamicPartitionWrites. This new rule supports convert dynamic partition writes to statis partition writes if the dynamic column is Literal which has been optimzied by ConstantFolding for datasource table.

Why are the changes needed?

If the partition column is actually foldable, it's is same with static partition. So there is no needed to do an extra local sort to ensure the same partition values are continuous.

Besides, the dataframe write api does not support specify the static partition spec so users always do dynamic partition writes, e.g.

df.selectExpr("*", "x" as p).write.partitionBy("p").save

Does this PR introduce any user-facing change?

no, improve the performance

How was this patch tested?

add tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test is used to validate partial static partition, so use a column instead of literal to recover it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have not supported hive write yet

null
} else {
assert(part.isInstanceOf[UTF8String])
ExternalCatalogUtils.getPartitionSpecString(part.asInstanceOf[UTF8String].toString)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null and empty partition are treated as null, and the final partition path will be __HIVE_DEFAULT_PARTITION__

@ulysses-you
Copy link
Contributor Author

cc @cloud-fan @viirya thank you

InsertIntoHadoopFsRelationCommand(
outputPath = outputPath,
staticPartitions = Map.empty,
staticPartitions = staticPartitions,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain more about the difference between write with dynamic partition columns and without? The code can be simplified quite a lot if we just need to remove sort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main benefit is we can save a local sort. yes, I agree, the current implementation is over complex. The reason is I guess some downstream projects or extensions may depend on the static partitions(e.g. add repartition for dynamic partition writes), so I merge the infered static partitions into the original. I'm fine to simplify the code to just remove a sort.

Copy link
Contributor

@cloud-fan cloud-fan Sep 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the downstream projects can detect real dynamic partition columns by themselves. I'd prefer a simple solution here to just remove local sort.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 19, 2022
@github-actions github-actions bot closed this Dec 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants