Conversation

Udbhav30
Contributor

What changes were proposed in this pull request?

Instead of deleting the data immediately, we can move it to the trash.
Based on the configuration provided by the user, it will later be deleted permanently from the trash.

Why are the changes needed?

Instead of deleting the data directly, we can give users the flexibility to move it to the trash first and delete it permanently later.

Does this PR introduce any user-facing change?

Yes. After TRUNCATE TABLE, the data is no longer deleted permanently right away.
It is first moved to the trash and then deleted permanently after the configured time.

How was this patch tested?

New unit tests added.
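The truncate-to-trash flow described above can be sketched with a small stdlib-only Python illustration. The trash location and retention window below are made up for the sketch; the actual PR relies on Hadoop's trash facility rather than paths like these.

```python
import shutil
import time
from pathlib import Path

# Hypothetical locations and retention window, for illustration only.
TRASH_DIR = Path("/tmp/spark-trash-demo")
RETENTION_SECONDS = 3600  # analogous to a user-provided retention config

def truncate_to_trash(table_path: Path) -> Path:
    """Move a table's data directory into the trash instead of deleting it."""
    TRASH_DIR.mkdir(parents=True, exist_ok=True)
    target = TRASH_DIR / f"{table_path.name}.{int(time.time())}"
    shutil.move(str(table_path), str(target))  # data survives until purged
    return target

def purge_expired(now=None):
    """Permanently delete trash entries older than the retention window."""
    now = time.time() if now is None else now
    purged = []
    if TRASH_DIR.exists():
        for entry in TRASH_DIR.iterdir():
            # each entry name ends with the move timestamp appended above
            ts = float(entry.name.rsplit(".", 1)[-1])
            if now - ts > RETENTION_SECONDS:
                shutil.rmtree(entry)
                purged.append(entry)
    return purged
```

Until the retention window elapses, a "truncated" table can still be recovered from the trash directory; a periodic purge then makes the deletion permanent.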


Closes apache#29387 from Udbhav30/tuncateTrash.

Authored-by: Udbhav30 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@Udbhav30
Contributor Author

@dongjoon-hyun, I have now used an API that is also supported in Hadoop 2.7. Could you please review it?

@Udbhav30
Contributor Author

cc @sunchao

Comment on lines 273 to 279
```scala
 * Move data to trash if 'spark.sql.truncate.trash.enabled' is true
 */
def moveToTrashIfEnabled(
    fs: FileSystem,
    partitionPath: Path,
    isTrashEnabled: Boolean,
    hadoopConf: Configuration): Boolean = {
```
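For reference, a minimal Python analog of a flag-guarded trash move under assumed semantics (hypothetical paths; the real method works through Hadoop's `FileSystem` and `Configuration`). It returns whether the move happened so the caller can fall back to a plain delete when trash is disabled.

```python
import shutil
import time
from pathlib import Path

def move_to_trash_if_enabled(path: Path, trash_root: Path,
                             trash_enabled: bool) -> bool:
    """Return True if the path was moved into the trash; False if trash is
    disabled, in which case the caller is expected to delete it directly."""
    if not trash_enabled:
        return False
    trash_root.mkdir(parents=True, exist_ok=True)
    # timestamp suffix avoids collisions between successive truncates
    shutil.move(str(path), str(trash_root / f"{path.name}.{int(time.time())}"))
    return True
```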
Member

This method name and doc are actually not accurate: the method deletes the path, it does not just move the data to trash. Granted, the method is short and this is easy to see by looking at the code, but it is nicer to give it a correct name and doc.

Contributor Author

I have updated the description; please let me know if it is okay now.

@viirya viirya changed the title [SPARK-32481][CORE][SQL] Support truncate table to move data to trash [SPARK-32481][CORE][SQL][test-hadoop2.7][test-hive1.2] Support truncate table to move data to trash Aug 26, 2020
@viirya
Member

viirya commented Aug 26, 2020

ok to test

@Tagar

Tagar commented Aug 26, 2020

copying to the right PR -
@Udbhav30 generally, one user can have multiple trash directories.
There is the default one, fs.getHomeDirectory() + ".Trash" as you mentioned, and there can be multiple non-default ones, one per encryption zone.
That way each encryption zone's trash directory is encrypted with the same key, and files can be moved to trash without re-encryption.
For GDPR/CCPA use cases we had some tables with PII created in an HDFS encryption zone, and those couldn't use the default trash location.
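The multiple-trash-directories point can be sketched as follows. The zone map and home-trash path below are made-up stand-ins for what HDFS actually derives from the NameNode's encryption-zone list:

```python
from pathlib import PurePosixPath

# Hypothetical encryption-zone roots mapped to their per-zone trash dirs.
ENCRYPTION_ZONES = {
    "/data/pii": "/data/pii/.Trash",
}
DEFAULT_TRASH = "/user/alice/.Trash"  # i.e. fs.getHomeDirectory() + ".Trash"

def trash_root_for(path: str) -> str:
    """Pick the trash dir inside the file's encryption zone, if any, so a
    trash move never crosses a zone boundary and needs no re-encryption."""
    p = PurePosixPath(path)
    for zone, trash in ENCRYPTION_ZONES.items():
        if p == PurePosixPath(zone) or PurePosixPath(zone) in p.parents:
            return trash
    return DEFAULT_TRASH
```

Files outside any encryption zone fall through to the user's default home trash; files inside a zone stay within that zone's own `.Trash` directory.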

@SparkQA

SparkQA commented Aug 26, 2020

Test build #127935 has finished for PR 29552 at commit 84f7e95.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya left a comment

This was reviewed and merged previously. I only have a few comments about the method name and doc.

@sunchao
Member

sunchao commented Aug 26, 2020

For GDPR/ CCPA use cases we had some tables with PII created in an HDFS encryption zone and those couldn't use default trash location.

This code is only used in UTs, though, and I don't think we are using encryption zones there?

@SparkQA

SparkQA commented Aug 26, 2020

Test build #127934 has finished for PR 29552 at commit 6cf355a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2020

Test build #127939 has finished for PR 29552 at commit 1062cb9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2020

Test build #127937 has finished for PR 29552 at commit 47075de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Udbhav30
Contributor Author

cc @viirya @dongjoon-hyun

@dongjoon-hyun
Member

Thank you, @Udbhav30 , @sunchao , @viirya , and @Tagar .

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM except a few minor comments.

cc @HyukjinKwon and @gatorsmile for further advice, too.

@SparkQA

SparkQA commented Aug 30, 2020

Test build #128037 has finished for PR 29552 at commit 2dd78a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Thank you all again. Merged to master for Apache Spark 3.1.0, due in December.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-32481][CORE][SQL][test-hadoop2.7][test-hive1.2] Support truncate table to move data to trash [SPARK-32481][CORE][SQL] Support truncate table to move data to trash Aug 30, 2020
@gatorsmile
Member

gatorsmile commented Oct 12, 2020

I like the concept of Trash, but I think this PR might just resolve a very specific issue by introducing a mechanism without a proper design doc. This could make the usage more complex.

I think we need to consider the big picture. The trash directory is an important concept. If we decide to introduce it, we should consider all the code paths of Spark SQL that could delete data, not just TRUNCATE. We also need to consider the current behavior if the underlying file system does not provide the `Trash.moveToAppropriateTrash` API. Is the exception good? How about the performance when users are using an object store instead of HDFS? Will it impact GDPR compliance?

I think we should not merge the PR without a design doc and implementation plan. The above lists just some of the questions I have; other community members might have more relevant questions/issues we need to resolve.
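One of the questions above, what to do when the store offers no trash at all, can be sketched as a fallback policy. This is illustrative only; `TrashUnsupportedError` and the `move_to_trash` callback are made up for the sketch and are not part of any Hadoop or Spark API:

```python
import shutil
from pathlib import Path

class TrashUnsupportedError(Exception):
    """Raised by stores with no trash facility (e.g. many object stores)."""

def remove(path: Path, move_to_trash) -> str:
    """Try to trash the path; fall back to permanent deletion when the
    underlying store has no trash, instead of surfacing an exception."""
    try:
        move_to_trash(path)
        return "trashed"
    except TrashUnsupportedError:
        shutil.rmtree(path)  # permanent delete, object-store style
        return "deleted"
```

Whether silently falling back like this (versus failing loudly) is the right behavior for compliance-sensitive data is exactly the kind of question a design doc would need to settle.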

HyukjinKwon pushed a commit that referenced this pull request Nov 23, 2020
…to trash"

### What changes were proposed in this pull request?

This reverts commit 065f173, which is not part of any released version. That is, this is an unreleased feature.

### Why are the changes needed?

I like the concept of Trash, but I think this PR might just resolve a very specific issue by introducing a mechanism without a proper design doc. This could make the usage more complex.

I think we need to consider the big picture. Trash directory is an important concept. If we decide to introduce it, we should consider all the code paths of Spark SQL that could delete the data, instead of Truncate only. We also need to consider what is the current behavior if the underlying file system does not provide the API `Trash.moveToAppropriateTrash`. Is the exception good? How about the performance when users are using the object store instead of HDFS? Will it impact the GDPR compliance?

In sum, I think we should not merge PR #29552 without a design doc and implementation plan. That is why I reverted it before the code freeze of Spark 3.1.

### Does this PR introduce _any_ user-facing change?
Reverted the original commit

### How was this patch tested?
The existing tests.

Closes #30463 from gatorsmile/revertSpark-32481.

Authored-by: Xiao Li <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>