Conversation

@sandeep-katta
Contributor

@sandeep-katta sandeep-katta commented Sep 19, 2018

Modification content: If the database is created with a location URI, then the location should be empty or should not exist.

What changes were proposed in this pull request?

If the user specifies the location while creating the database, then the location should be empty or should not exist.

For example: create database db1 location '/user/hive/warehouse';
This command will succeed only if '/user/hive/warehouse' is empty or does not exist.
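The proposed check could be sketched roughly as follows, using the local filesystem via java.nio.file for illustration only (the actual patch goes through Hadoop's FileSystem API on HDFS/S3; the class and helper names here are hypothetical):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LocationCheck {
    // Hypothetical helper: a CREATE DATABASE ... LOCATION path is acceptable
    // only if it does not exist yet, or exists as an empty directory.
    static boolean isValidDatabaseLocation(String location) throws IOException {
        Path path = Paths.get(location);
        if (!Files.exists(path)) {
            return true;  // path does not exist yet; it will be created
        }
        if (Files.isDirectory(path)) {
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(path)) {
                return !entries.iterator().hasNext();  // OK only if the directory is empty
            }
        }
        return false;  // path exists but is a regular file; reject
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("dbloc");
        System.out.println(isValidDatabaseLocation(tmp.toString()));                    // empty directory
        System.out.println(isValidDatabaseLocation(tmp.resolve("missing").toString())); // nonexistent path
        Files.createFile(tmp.resolve("data.txt"));
        System.out.println(isValidDatabaseLocation(tmp.toString()));                    // non-empty directory
    }
}
```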

How was this patch tested?
Added test cases and also verified manually on the cluster.

@sujith71955
Contributor

@sandeep-katta can you please update the title? As per your description, it seems the data will be dropped if the location of the newly created database points to an existing database's location.

@sandeep-katta sandeep-katta changed the title [SPARK-25464][SQL]When database is dropped all the data related to it is deleted [SPARK-25464][SQL]On dropping the Database it will drop the contents of other database if it is created using location URI(path of the DB2) Sep 21, 2018
@sandeep-katta sandeep-katta changed the title [SPARK-25464][SQL]On dropping the Database it will drop the contents of other database if it is created using location URI(path of the DB2) [SPARK-25464][SQL]On dropping the Database it will drop the contents of other database if it is created using the location of other database Sep 21, 2018
@sujith71955
Contributor

@srowen @HyukjinKwon I think this can be a risk: if the location of the newly created database points to an existing one and the user drops the DB, the data of both databases will be lost.

Contributor

nit: the user shall take care of deleting the data from the external location.

Contributor

nit: if the location is specified by the user, then we need to consider the DB as an external DB.

Contributor

please pull the finally up one level, inline with the closing brace of the try

Contributor

please pull the catch up one level, inline with the closing brace of the try

@srowen
Member

srowen commented Sep 21, 2018

See JIRA, I don't think this should be merged.

@sandeep-katta
Contributor Author

Yes, I agree that two databases should not point to the same path; currently this is a loophole in Spark that needs to be fixed. If this solution is not okay, then we can append dbname.db to the location given by the user.
For example:
create database db1 location /user/hive/warehouse
then the location of the DB should be /user/hive/warehouse/db1.db
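The fallback proposal could look roughly like this (a hedged sketch; the class and method names are hypothetical, and the real code would build a Hadoop Path rather than a local one):

```java
import java.nio.file.Paths;

public class DatabaseDir {
    // Hypothetical sketch of the fallback: instead of using the user-supplied
    // location directly, resolve the database directory as <location>/<dbName>.db,
    // mirroring the default-warehouse layout.
    static String databaseDir(String userLocation, String dbName) {
        return Paths.get(userLocation, dbName + ".db").toString();
    }

    public static void main(String[] args) {
        // create database db1 location /user/hive/warehouse
        System.out.println(databaseDir("/user/hive/warehouse", "db1"));
    }
}
```

With this scheme, two databases created at the same user-supplied location would still get distinct directories, so dropping one could not delete the other's data.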

@sandeep-katta
Contributor Author

See JIRA, I don't think this should be merged.

I have referred to the Databricks doc https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-database.html and implemented it accordingly. Let me know if you have any suggestions.

@srowen
Member

srowen commented Sep 24, 2018

We should look at the Spark documentation, and Hive's, if any, to figure out what the right behavior is here. Spark generally follows Hive. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable I think this is, further, conflating what LOCATION and EXTERNAL do. I agree that external DB files shouldn't be deleted, but not simply those specified by LOCATION. At least that is my understanding.
@yhuai or @cloud-fan or @clockfly might know more.

@cloud-fan
Contributor

I'm not sure if there is a concept called "external database" in Hive...

@srowen
Member

srowen commented Sep 24, 2018

There is ... see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ManagedandExternalTables

I think Spark conflates the two. It's rare (?) but possible to specify a custom location of a managed table, but, typically occurs for EXTERNAL tables. So maybe this is OK.

  private def createTable(tableIdent: TableIdentifier): Unit = {
    val storage = DataSource.buildStorageFormatFromOptions(extraOptions.toMap)
    val tableType = if (storage.locationUri.isDefined) {
      CatalogTableType.EXTERNAL
    } else {
      CatalogTableType.MANAGED
    }

And in SqlParser:

    // If location is defined, we'll assume this is an external table.
    // Otherwise, we may accidentally delete existing data.
    val tableType = if (external || location.isDefined) {
      CatalogTableType.EXTERNAL
    } else {
      CatalogTableType.MANAGED
    }

So if LOCATION implies EXTERNAL in Spark, then I get this. EXTERNAL tables shouldn't be deleted.

I agree that the Hive impl doesn't seem to take this into account, on the code paths that call dropDatabase. CC @andrewor14 in case he is available to comment on the original implementation. WDYT @cloud-fan

@cloud-fan
Contributor

yea, in Spark we conflate the two and treat a table as external if a location is specified.

However, Hive doesn't have external database, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/Alter/UseDatabase

I don't want to introduce an unnecessary behavior difference from Hive, and I feel it's not very useful to have external databases. Although your table files can be in an existing folder, LIST TABLES will not work.

@srowen
Member

srowen commented Sep 25, 2018

That link says Hive does support EXTERNAL. What am I missing? Well, in any event we aren't contemplating a behavior change here.

If you delete a table with LOCATION specified, what should happen? Hive would delete it I guess... unless it's EXTERNAL. Looks like Spark would delete it even when it's in an 'external' location.

If these are conflated I guess we weigh the surprise in not deleting 'local' locations for data when a table is dropped, vs surprise in deleting 'external' locations for data.

@cloud-fan
Contributor

That link says Hive does support EXTERNAL. What am I missing?

Hive supports EXTERNAL only for tables, not databases.
The CREATE TABLE syntax:

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name ...

The CREATE DATABASE syntax:

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name ...

@srowen
Member

srowen commented Sep 25, 2018

Owp, I've been misreading that several times. Right. Well, by analogy, if a database has a non-default LOCATION then so do its tables, and they are treated like EXTERNAL tables. Dropping the DB means dropping the tables, and dropping those tables doesn't delete data. So should the same happen for DBs? Seems sensible, because the DB directory might not even be empty.

Still I feel like I'm missing something if it only comes up in the case that two DBs have the same location, which is going to cause a bunch of other problems.

But is it the right change simply because it's consistent with how dropping tables works?

@cloud-fan
Contributor

This is a behavior change and makes us different from Hive. However I can't find a strong reason to do it. It's like importing a database, but we can't automatically create table entries in the metastore when creating a database with an existing location.

To me a more reasonable behavior is, fail earlier when creating a database with an existing and non-empty location.

@sandeep-katta
Contributor Author

@cloud-fan's comment seems valid, as it will not result in any behavior change. I will update the PR accordingly. WDYT @srowen

@sandeep-katta sandeep-katta force-pushed the dropDatabse branch 2 times, most recently from 74629ad to 597f8e6 Compare September 26, 2018 05:06
@sandeep-katta
Contributor Author

cc @cloud-fan @srowen I have updated the code, please review.

@cloud-fan
Contributor

ok to test

@SparkQA

SparkQA commented Sep 26, 2018

Test build #96601 has finished for PR 22466 at commit 597f8e6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Sep 26, 2018

Test build #96619 has finished for PR 22466 at commit 597f8e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

not sure if we should do it here or in the external catalog, cc @gatorsmile

Contributor Author

The reason I did it here is that the external catalog always has a path (default or user-specified), so we would end up checking the path every time. This may be a costly operation if the file system is S3.

Member

Just want to confirm: what is the behavior of Hive (3.x) when we try to create a database in a non-empty file path?

Member

Please try the following cases:

  • the directory does not exist and the parent directory does not exist either
  • the directory exists but is empty
  • the directory exists and is non-empty
  • the directory is not accessible

Member

I think we should only block the last two cases. I also prefer doing it in the external catalog. In the future, if we support catalog federation, different external catalogs can define their own behavior.

Listing files in S3 is expensive, but creating a new database is not frequent. I think our users can accept this cost.
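A rough classification of the four cases, blocking only the last two, might look like this (a local-filesystem sketch with hypothetical names; the real check would live in the external catalog and go through Hadoop's FileSystem API):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LocationClassifier {
    enum Status { CREATABLE, EMPTY_DIR, NON_EMPTY_DIR, NOT_ACCESSIBLE }

    // Only NON_EMPTY_DIR and NOT_ACCESSIBLE would make CREATE DATABASE fail;
    // a missing path (even with a missing parent) and an empty directory are allowed.
    static Status classify(String location) {
        Path path = Paths.get(location);
        if (!Files.exists(path)) {
            return Status.CREATABLE;  // covers a missing parent directory too
        }
        if (!Files.isDirectory(path) || !Files.isReadable(path)) {
            return Status.NOT_ACCESSIBLE;
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(path)) {
            return entries.iterator().hasNext() ? Status.NON_EMPTY_DIR : Status.EMPTY_DIR;
        } catch (IOException e) {
            return Status.NOT_ACCESSIBLE;  // e.g. permission denied while listing
        }
    }
}
```

One pro of putting this in the external catalog, as suggested, is that each catalog implementation can override the classification for its own storage semantics.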

@sandeep-katta sandeep-katta Sep 27, 2018
Contributor Author

Just want to confirm: what is the behavior of Hive (3.x) when we try to create a database in a non-empty file path?

Users can create a database on a non-empty file path; the results are below:

host1:/opt/tools/hadoop3.1/hadoop-3.1.1/bin # ./hadoop fs -ls /user/hive/
Found 2 items
drwxr-xr-x - root supergroup 0 2018-09-27 18:05 /user/hive/dropTest
drwxr-xr-x - root supergroup 0 2018-09-27 17:19 /user/hive/warehouse

hive> create database testjira location '/user/hive/';
OK
Time taken: 0.187 seconds
hive> desc database testjira;
OK
testjira hdfs://localhost:9000/user/hive root USER
Time taken: 0.045 seconds, Fetched: 1 row(s)
hive> drop database testjira;
OK
Time taken: 0.194 seconds
hive>

host1:/opt/tools/hadoop3.1/hadoop-3.1.1/bin # ./hadoop fs -ls /user
host1:/opt/tools/hadoop3.1/hadoop-3.1.1/bin #

@sandeep-katta sandeep-katta Sep 27, 2018
Contributor Author

Please try the following cases:

  • the directory does not exist and the parent directory does not exist either
    spark-sql> create database noparent location '/user/dropTest/noparent/dblocation';
    Time taken: 2.246 seconds
  • the directory exists but is empty
    spark-sql> create database emptydir location '/user/dropTest/emptydir';
    Time taken: 0.262 seconds
  • the directory exists and is non-empty
    spark-sql> create database dirExistsAndNotempty location '/user/dropTest/';
    Error in query: Cannot create Database to the location which is not empty.;
  • the directory is not accessible
    spark-sql> create database userNoPermission location '/user/hive/dropTest';
    Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.security.AccessControlException: Permission denied: user=sandeep, access=WRITE, inode="/user/hive":hive:hive:drwxr-x---
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:399)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:261)

@gatorsmile
Member

Regarding the potentially high cost of file listing: CREATE DATABASE is not a frequent operation, and the cost is high only if the target directory is non-empty with many files. Since we are blocking users from creating such a database anyway, the cost is not a big deal, I think.

We need to list this behavior change in the SQL migration guide.

@sandeep-katta
Contributor Author

@gatorsmile I have updated the SQL migration guide

@SparkQA

SparkQA commented Mar 7, 2019

Test build #103124 has finished for PR 22466 at commit 6532466.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2019

Test build #103122 has finished for PR 22466 at commit 6f4c9ce.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor

retest this please

@SparkQA

SparkQA commented Mar 7, 2019

Test build #103130 has finished for PR 22466 at commit 6532466.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sandeep-katta
Contributor Author

retest this please

1 similar comment
@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Mar 9, 2019

Test build #103252 has finished for PR 22466 at commit 6532466.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sandeep-katta
Contributor Author

cc @srowen @cloud-fan @gatorsmile

@sandeep-katta
Contributor Author

ping @srowen @gatorsmile @ueshin @cloud-fan. Please check whether this can now be merged to the master branch for the Spark 3.0 release. I am happy to make any further code/documentation changes if required.

Member

This says 'path not empty' but that suggests 'path is not the empty string'. This means 'the path exists already' right?

Contributor Author

yes, it means the path already exists; I will update the message accordingly

Contributor Author

If I change the message, it will lead to confusion.
For example:
The user can create a database at the location as long as the path does not exist or is empty.

If I say 'path already exists' it will confuse users, so I wanted to highlight that the path is not empty.

Member

Are you trying to describe a case where the path is a directory that exists, but contains files? or a path that does not exist at all? either way I think that type of description is better. An empty path implies something different from empty directory.

Contributor Author

Yes, I wanted to describe a case where the path is a directory that exists but contains files. If the above description is correct, then I will update it.

Modification content: Updated the sql-migration-guide to notify the user about the behaviour change
@SparkQA

SparkQA commented Aug 12, 2019

Test build #108984 has finished for PR 22466 at commit 0a52bdc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon pushed a commit that referenced this pull request Sep 2, 2019
…rnalCatalogSuite

## What changes were proposed in this pull request?

Drop the table after the test `query builtin functions don't call the external catalog` has executed.

This is required for [SPARK-25464](#22466)

## How was this patch tested?

existing UT

Closes #25427 from sandeep-katta/cleanuptable.

Authored-by: sandeep katta <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

github-actions bot commented Jan 6, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 6, 2020
@srowen srowen closed this Jan 6, 2020