Conversation

Udbhav30
Contributor

What changes were proposed in this pull request?

Instead of deleting the data immediately, we can move it to the trash.
Based on the configuration provided by the user, it will later be deleted permanently from the trash.

Why are the changes needed?

Instead of deleting the data directly, we can give users the flexibility to move it to the trash first and delete it permanently later.

Does this PR introduce any user-facing change?

Yes. After TRUNCATE TABLE, the data is no longer deleted permanently right away.
It is first moved to the trash and then deleted permanently after the configured time.

How was this patch tested?

New unit tests added.
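The truncate-to-trash flow described above can be sketched with a small stdlib-only Python illustration. The trash location and retention window below are made up for the sketch; the actual PR relies on Hadoop's trash facility rather than paths like these.

```python
import shutil
import time
from pathlib import Path

# Hypothetical locations and retention window, for illustration only.
TRASH_DIR = Path("/tmp/spark-trash-demo")
RETENTION_SECONDS = 3600  # analogous to a user-provided retention config

def truncate_to_trash(table_path: Path) -> Path:
    """Move a table's data directory into the trash instead of deleting it."""
    TRASH_DIR.mkdir(parents=True, exist_ok=True)
    target = TRASH_DIR / f"{table_path.name}.{int(time.time())}"
    shutil.move(str(table_path), str(target))  # data survives until purged
    return target

def purge_expired(now=None):
    """Permanently delete trash entries older than the retention window."""
    now = time.time() if now is None else now
    purged = []
    if TRASH_DIR.exists():
        for entry in TRASH_DIR.iterdir():
            # each entry name ends with the move timestamp appended above
            ts = float(entry.name.rsplit(".", 1)[-1])
            if now - ts > RETENTION_SECONDS:
                shutil.rmtree(entry)
                purged.append(entry)
    return purged
```

Until the retention window elapses, a "truncated" table can still be recovered from the trash directory; a periodic purge then makes the deletion permanent.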


Closes apache#29387 from Udbhav30/tuncateTrash.

Authored-by: Udbhav30 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@Udbhav30
Contributor Author

@dongjoon-hyun, I have now used an API that is also supported in Hadoop 2.7. Could you please review it?

@Udbhav30
Contributor Author

cc @sunchao

Comment on lines 273 to 279
```scala
 * Move data to trash if 'spark.sql.truncate.trash.enabled' is true
 */
def moveToTrashIfEnabled(
    fs: FileSystem,
    partitionPath: Path,
    isTrashEnabled: Boolean,
    hadoopConf: Configuration): Boolean = {
```
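For reference, a minimal Python analog of a flag-guarded trash move under assumed semantics (hypothetical paths; the real method works through Hadoop's `FileSystem` and `Configuration`). It returns whether the move happened so the caller can fall back to a plain delete when trash is disabled.

```python
import shutil
import time
from pathlib import Path

def move_to_trash_if_enabled(path: Path, trash_root: Path,
                             trash_enabled: bool) -> bool:
    """Return True if the path was moved into the trash; False if trash is
    disabled, in which case the caller is expected to delete it directly."""
    if not trash_enabled:
        return False
    trash_root.mkdir(parents=True, exist_ok=True)
    # timestamp suffix avoids collisions between successive truncates
    shutil.move(str(path), str(trash_root / f"{path.name}.{int(time.time())}"))
    return True
```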
Member

This method name and doc are actually not accurate: the method deletes the path, it does not just move the data to trash. Granted, the method is short and this is easy to see by looking at the code, but it is nicer to give it a correct name and doc.

Contributor Author

I have updated the description; please let me know if it is okay now.

@viirya viirya changed the title [SPARK-32481][CORE][SQL] Support truncate table to move data to trash [SPARK-32481][CORE][SQL][test-hadoop2.7][test-hive1.2] Support truncate table to move data to trash Aug 26, 2020
@viirya
Member

viirya commented Aug 26, 2020

ok to test

@Tagar

Tagar commented Aug 26, 2020

copying to the right PR -
@Udbhav30 generally, one user can have multiple trash directories.
There is the default one, fs.getHomeDirectory() + ".Trash" as you mentioned, and there can be multiple non-default ones, one per encryption zone.
That way each encryption zone's trash directory is encrypted with the same key, and files can be moved to trash without re-encryption.
For GDPR/CCPA use cases we had some tables with PII created in an HDFS encryption zone, and those couldn't use the default trash location.
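The multiple-trash-directories point can be sketched as follows. The zone map and home-trash path below are made-up stand-ins for what HDFS actually derives from the NameNode's encryption-zone list:

```python
from pathlib import PurePosixPath

# Hypothetical encryption-zone roots mapped to their per-zone trash dirs.
ENCRYPTION_ZONES = {
    "/data/pii": "/data/pii/.Trash",
}
DEFAULT_TRASH = "/user/alice/.Trash"  # i.e. fs.getHomeDirectory() + ".Trash"

def trash_root_for(path: str) -> str:
    """Pick the trash dir inside the file's encryption zone, if any, so a
    trash move never crosses a zone boundary and needs no re-encryption."""
    p = PurePosixPath(path)
    for zone, trash in ENCRYPTION_ZONES.items():
        if p == PurePosixPath(zone) or PurePosixPath(zone) in p.parents:
            return trash
    return DEFAULT_TRASH
```

Files outside any encryption zone fall through to the user's default home trash; files inside a zone stay within that zone's own `.Trash` directory.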

@SparkQA

SparkQA commented Aug 26, 2020

Test build #127935 has finished for PR 29552 at commit 84f7e95.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya left a comment

This was reviewed and merged previously. I only have a few comments about the method name and doc.

@sunchao
Member

sunchao commented Aug 26, 2020

For GDPR/ CCPA use cases we had some tables with PII created in an HDFS encryption zone and those couldn't use default trash location.

This code is only used in UTs, though, and I don't think we are using encryption zones there?

@SparkQA

SparkQA commented Aug 26, 2020

Test build #127934 has finished for PR 29552 at commit 6cf355a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2020

Test build #127939 has finished for PR 29552 at commit 1062cb9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2020

Test build #127937 has finished for PR 29552 at commit 47075de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Udbhav30
Contributor Author

cc @viirya @dongjoon-hyun

@dongjoon-hyun
Member

Thank you, @Udbhav30 , @sunchao , @viirya , and @Tagar .

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM except a few minor comments.

cc @HyukjinKwon and @gatorsmile for further advice, too.

@SparkQA

SparkQA commented Aug 30, 2020

Test build #128037 has finished for PR 29552 at commit 2dd78a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Thank you all again. Merged to master for Apache Spark 3.1.0, due in December.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-32481][CORE][SQL][test-hadoop2.7][test-hive1.2] Support truncate table to move data to trash [SPARK-32481][CORE][SQL] Support truncate table to move data to trash Aug 30, 2020
@gatorsmile
Member

gatorsmile commented Oct 12, 2020

I like the concept of Trash, but I think this PR might just resolve a very specific issue by introducing a mechanism without a proper design doc. This could make the usage more complex.

I think we need to consider the big picture. The trash directory is an important concept. If we decide to introduce it, we should consider all the code paths of Spark SQL that could delete data, not just TRUNCATE. We also need to consider the current behavior if the underlying file system does not provide the `Trash.moveToAppropriateTrash` API. Is the exception good? How about the performance when users are using an object store instead of HDFS? Will it impact GDPR compliance?

I think we should not merge the PR without a design doc and implementation plan. The above lists just some of the questions I have; other community members might have more relevant questions/issues we need to resolve.
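One of the questions above, what to do when the store offers no trash at all, can be sketched as a fallback policy. This is illustrative only; `TrashUnsupportedError` and the `move_to_trash` callback are made up for the sketch and are not part of any Hadoop or Spark API:

```python
import shutil
from pathlib import Path

class TrashUnsupportedError(Exception):
    """Raised by stores with no trash facility (e.g. many object stores)."""

def remove(path: Path, move_to_trash) -> str:
    """Try to trash the path; fall back to permanent deletion when the
    underlying store has no trash, instead of surfacing an exception."""
    try:
        move_to_trash(path)
        return "trashed"
    except TrashUnsupportedError:
        shutil.rmtree(path)  # permanent delete, object-store style
        return "deleted"
```

Whether silently falling back like this (versus failing loudly) is the right behavior for compliance-sensitive data is exactly the kind of question a design doc would need to settle.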

HyukjinKwon pushed a commit that referenced this pull request Nov 23, 2020
…to trash"

### What changes were proposed in this pull request?

This reverts commit 065f173, which is not part of any released version. That is, this is an unreleased feature.

### Why are the changes needed?

I like the concept of Trash, but I think this PR might just resolve a very specific issue by introducing a mechanism without a proper design doc. This could make the usage more complex.

I think we need to consider the big picture. Trash directory is an important concept. If we decide to introduce it, we should consider all the code paths of Spark SQL that could delete the data, instead of Truncate only. We also need to consider what is the current behavior if the underlying file system does not provide the API `Trash.moveToAppropriateTrash`. Is the exception good? How about the performance when users are using the object store instead of HDFS? Will it impact the GDPR compliance?

In sum, I think we should not merge PR #29552 without a design doc and implementation plan. That is why I reverted it before the code freeze of Spark 3.1.

### Does this PR introduce _any_ user-facing change?
Reverted the original commit

### How was this patch tested?
The existing tests.

Closes #30463 from gatorsmile/revertSpark-32481.

Authored-by: Xiao Li <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>