[SPARK-32481][CORE][SQL] Support truncate table to move data to trash #29552
Conversation
Instead of deleting the data, we can move the data to trash. Based on the configuration provided by the user, it will then be deleted permanently from the trash. Instead of directly deleting the data, this provides the flexibility to move data to the trash and delete it permanently later. Yes: after TRUNCATE TABLE, the data is not permanently deleted right away; it is first moved to the trash and then, after the given time, deleted permanently. New UTs added.

Closes apache#29387 from Udbhav30/tuncateTrash.
Authored-by: Udbhav30 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun, I have now used the API that is supported in Hadoop 2.7 as well. Could you please review it?
cc @sunchao
/**
 * Move data to trash if 'spark.sql.truncate.trash.enabled' is true
 */
def moveToTrashIfEnabled(
    fs: FileSystem,
    partitionPath: Path,
    isTrashEnabled: Boolean,
    hadoopConf: Configuration): Boolean = {
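The body of the method isn't shown in this excerpt, but the review discussion below makes its semantics clear: when the feature is enabled the path is moved to trash, otherwise it is deleted outright. A minimal Python sketch of that control flow (illustrative only — the real implementation uses Hadoop's `FileSystem` and `Trash` APIs, and these helper names are mine, not Spark's):

```python
import shutil
from pathlib import Path

def move_to_trash_if_enabled(path: Path, trash_root: Path, trash_enabled: bool) -> bool:
    """Illustrative semantics of the PR: move `path` into a trash
    directory when the feature is enabled, otherwise delete it outright.
    Returns True on success, mirroring the Boolean return in the diff."""
    if trash_enabled:
        trash_root.mkdir(parents=True, exist_ok=True)
        # Mimic Hadoop's .Trash layout: keep the original directory name
        # so the data can later be found (and restored or expunged).
        shutil.move(str(path), str(trash_root / path.name))
    else:
        # Feature disabled: permanent deletion, the pre-PR behavior.
        shutil.rmtree(path)
    return True
```

Either way the original path is gone afterwards, which is exactly the reviewer's point below about the method name being inaccurate.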
This method name and doc are actually not accurate: it will delete the path, not just move the data to trash. The method is short, so this is easy to see from the code, but it would be nicer to give it a correct name and doc.
I have updated the description; please let me know if it is okay now.
ok to test
copying to the right PR -
Test build #127935 has finished for PR 29552 at commit
This was reviewed and merged previously. I only have a few comments about the method name and doc.
This code is only used in UTs, though, and I don't think we are using an EZ there?
Test build #127934 has finished for PR 29552 at commit
Test build #127939 has finished for PR 29552 at commit
Test build #127937 has finished for PR 29552 at commit
+1, LGTM except a few minor comments.
cc @HyukjinKwon and @gatorsmile for further advice, too.
Test build #128037 has finished for PR 29552 at commit
Thank you all again. Merged to master for Apache Spark 3.1.0 in December.
I like the concept of Trash, but I think this PR might just resolve a very specific issue by introducing a mechanism without a proper design doc, which could make usage more complex. I think we need to consider the big picture. The trash directory is an important concept; if we decide to introduce it, we should consider all the code paths of Spark SQL that could delete data, not just Truncate. We also need to consider the current behavior when the underlying file system does not provide the API. I think we should not merge the PR without a design doc and implementation plan. The above just lists some questions I have; other community members might have more relevant questions/issues we need to resolve.
…to trash"

### What changes were proposed in this pull request?
This reverts commit 065f173, which is not part of any released version. That is, this is an unreleased feature.

### Why are the changes needed?
I like the concept of Trash, but I think this PR might just resolve a very specific issue by introducing a mechanism without a proper design doc. This could make the usage more complex. I think we need to consider the big picture. Trash directory is an important concept. If we decide to introduce it, we should consider all the code paths of Spark SQL that could delete the data, instead of Truncate only. We also need to consider what is the current behavior if the underlying file system does not provide the API `Trash.moveToAppropriateTrash`. Is the exception good? How about the performance when users are using the object store instead of HDFS? Will it impact the GDPR compliance? In sum, I think we should not merge the PR #29552 without the design doc and implementation plan. That is why I reverted it before the code freeze of Spark 3.1.

### Does this PR introduce _any_ user-facing change?
Reverted the original commit.

### How was this patch tested?
The existing tests.

Closes #30463 from gatorsmile/revertSpark-32481.
Authored-by: Xiao Li <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
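The retention behavior at the heart of this debate comes from Hadoop's trash mechanism: data moved via `Trash.moveToAppropriateTrash` survives until a periodic expunge pass deletes anything older than the configured interval (`fs.trash.interval`, in minutes). A small Python model of that retention policy (names and signature are mine, for illustration; Hadoop's actual implementation also involves checkpoint directories):

```python
import time

def expunge(trash_entries, interval_minutes, now=None):
    """Return the trashed paths whose age exceeds the retention interval,
    i.e. what an expunge pass would permanently delete.
    `trash_entries` maps a trashed path to the epoch second it was moved."""
    now = time.time() if now is None else now
    cutoff_seconds = interval_minutes * 60
    return [p for p, moved_at in trash_entries.items()
            if now - moved_at > cutoff_seconds]
```

Note that with `fs.trash.interval` set to 0 trash is disabled in Hadoop, which is part of why the revert asks what happens on file systems and object stores where the trash API is not meaningful.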
What changes were proposed in this pull request?
Instead of deleting the data, we can move the data to trash.
Based on the configuration provided by the user it will be deleted permanently from the trash.
Why are the changes needed?
Instead of directly deleting the data, we can provide flexibility to move data to the trash and then delete it permanently.
Does this PR introduce any user-facing change?
Yes. After TRUNCATE TABLE the data is not permanently deleted right away;
it is first moved to the trash and then, after the given time, deleted permanently.
How was this patch tested?
New UTs added.
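The user-facing benefit described above is recoverability: until the retention window lapses, data removed by a mistaken TRUNCATE can be moved back out of the trash. A hypothetical sketch of that recovery step in Python (not part of the PR, which leaves recovery to standard Hadoop trash tooling):

```python
import shutil
from pathlib import Path

def restore_from_trash(trash_root: Path, table_name: str, dest: Path) -> bool:
    """Move a trashed table directory back to `dest` if it has not yet
    been expunged. Returns True when the data was restored."""
    trashed = trash_root / table_name
    if not trashed.exists():
        return False  # already expunged; the data is gone for good
    shutil.move(str(trashed), str(dest))
    return True
```

Once the retention interval passes and an expunge runs, the trashed copy is deleted permanently and this recovery window closes.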