Conversation

@sandeep-katta
Contributor

@sandeep-katta sandeep-katta commented Sep 19, 2018

Modification content: If the database is created with a location URI, then the location should be empty or should not exist.

What changes were proposed in this pull request?

If the user specifies the location while creating the database, then the location should be empty or should not exist.

For example: create database db1 location '/user/hive/warehouse';
This command will succeed only if '/user/hive/warehouse' is empty or does not exist.
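The proposed check could be sketched roughly as follows, using the local filesystem via java.nio.file for illustration only (the actual patch goes through Hadoop's FileSystem API on HDFS/S3; the class and helper names here are hypothetical):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LocationCheck {
    // Hypothetical helper: a CREATE DATABASE ... LOCATION path is acceptable
    // only if it does not exist yet, or exists as an empty directory.
    static boolean isValidDatabaseLocation(String location) throws IOException {
        Path path = Paths.get(location);
        if (!Files.exists(path)) {
            return true;  // path does not exist yet; it will be created
        }
        if (Files.isDirectory(path)) {
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(path)) {
                return !entries.iterator().hasNext();  // OK only if the directory is empty
            }
        }
        return false;  // path exists but is a regular file; reject
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("dbloc");
        System.out.println(isValidDatabaseLocation(tmp.toString()));                    // empty directory
        System.out.println(isValidDatabaseLocation(tmp.resolve("missing").toString())); // nonexistent path
        Files.createFile(tmp.resolve("data.txt"));
        System.out.println(isValidDatabaseLocation(tmp.toString()));                    // non-empty directory
    }
}
```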

How was this patch tested?
Added test cases and also verified manually on the cluster.

@sujith71955
Contributor

@sandeep-katta can you please update the title? As per your description, it seems the data will be dropped if the location of the newly created database points to an existing database's location.

@sandeep-katta sandeep-katta changed the title [SPARK-25464][SQL]When database is dropped all the data related to it is deleted [SPARK-25464][SQL]On dropping the Database it will drop the contents of other database if it is created using location URI(path of the DB2) Sep 21, 2018
@sandeep-katta sandeep-katta changed the title [SPARK-25464][SQL]On dropping the Database it will drop the contents of other database if it is created using location URI(path of the DB2) [SPARK-25464][SQL]On dropping the Database it will drop the contents of other database if it is created using the location of other database Sep 21, 2018
@sujith71955
Contributor

@srowen @HyukjinKwon I think this can be a risk: if the location of the newly created database points to an existing one and the user drops the DB, the data of both databases will be lost.

Contributor

nit: the user shall take care of deleting the data from the external location.

Contributor

nit: if the location is specified by the user, then we need to consider the DB as an external DB.

Contributor

please pull the finally up one level, inline with the closing brace of the try

Contributor

please pull the catch up one level, inline with the closing brace of the try

@srowen
Member

srowen commented Sep 21, 2018

See JIRA, I don't think this should be merged.

@sandeep-katta
Contributor Author

Yes, I agree that two databases should not point to the same path; currently this is a loophole in Spark that needs to be fixed. If this solution is not okay, then we can append dbname.db to the location given by the user.
For example:
create database db1 location /user/hive/warehouse
then the location of the DB should be /user/hive/warehouse/db1.db
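The fallback proposal could look roughly like this (a hedged sketch; the class and method names are hypothetical, and the real code would build a Hadoop Path rather than a local one):

```java
import java.nio.file.Paths;

public class DatabaseDir {
    // Hypothetical sketch of the fallback: instead of using the user-supplied
    // location directly, resolve the database directory as <location>/<dbName>.db,
    // mirroring the default-warehouse layout.
    static String databaseDir(String userLocation, String dbName) {
        return Paths.get(userLocation, dbName + ".db").toString();
    }

    public static void main(String[] args) {
        // create database db1 location /user/hive/warehouse
        System.out.println(databaseDir("/user/hive/warehouse", "db1"));
    }
}
```

With this scheme, two databases created at the same user-supplied location would still get distinct directories, so dropping one could not delete the other's data.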

@sandeep-katta
Contributor Author

See JIRA, I don't think this should be merged.

I have referred to the Databricks doc https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-database.html and implemented it accordingly. Let me know if you have any suggestions.

@srowen
Member

srowen commented Sep 24, 2018

We should look at the Spark documentation, and Hive's, if any, to figure out what the right behavior is here. Spark generally follows Hive. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable I think this is, further, conflating what LOCATION and EXTERNAL do. I agree that external DB files shouldn't be deleted, but not simply those specified by LOCATION. At least that is my understanding.
@yhuai or @cloud-fan or @clockfly might know more.

@cloud-fan
Contributor

I'm not sure if there is a concept called "external database" in Hive...

@srowen
Member

srowen commented Sep 24, 2018

There is ... see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ManagedandExternalTables

I think Spark conflates the two. It's rare (?) but possible to specify a custom location of a managed table, but, typically occurs for EXTERNAL tables. So maybe this is OK.

  private def createTable(tableIdent: TableIdentifier): Unit = {
    val storage = DataSource.buildStorageFormatFromOptions(extraOptions.toMap)
    val tableType = if (storage.locationUri.isDefined) {
      CatalogTableType.EXTERNAL
    } else {
      CatalogTableType.MANAGED
    }

And in SqlParser:

    // If location is defined, we'll assume this is an external table.
    // Otherwise, we may accidentally delete existing data.
    val tableType = if (external || location.isDefined) {
      CatalogTableType.EXTERNAL
    } else {
      CatalogTableType.MANAGED
    }

So if LOCATION implies EXTERNAL in Spark, then I get this. EXTERNAL tables shouldn't be deleted.

I agree that the Hive impl doesn't seem to take this into account, on the code paths that call dropDatabase. CC @andrewor14 in case he is available to comment on the original implementation. WDYT @cloud-fan

@cloud-fan
Contributor

yea, in Spark we conflate the two and treat a table as external if a location is specified.

However, Hive doesn't have external database, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/Alter/UseDatabase

I don't want to introduce an unnecessary behavior difference from Hive, and I feel it's not very useful to have external databases. Although your table files can be in an existing folder, LIST TABLES will not work.

@srowen
Member

srowen commented Sep 25, 2018

That link says Hive does support EXTERNAL. What am I missing? Well, in any event we aren't contemplating a behavior change here.

If you delete a table with LOCATION specified, what should happen? Hive would delete it I guess... unless it's EXTERNAL. Looks like Spark would delete it even when it's in an 'external' location.

If these are conflated I guess we weigh the surprise in not deleting 'local' locations for data when a table is dropped, vs surprise in deleting 'external' locations for data.

@cloud-fan
Contributor

That link says Hive does support EXTERNAL. What am I missing?

Hive supports EXTERNAL only for tables, not databases.
The CREATE TABLE syntax:

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name ...

The CREATE DATABASE syntax:

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name ...

@srowen
Member

srowen commented Sep 25, 2018

Owp, I've been misreading that several times. Right. Well, by analogy, if a database has a non-default LOCATION then so do its tables, and they are treated like EXTERNAL tables. Dropping the DB means dropping the tables, and dropping those tables doesn't delete data. So should the same happen for DBs? Seems sensible, because the DB directory might not even be empty.

Still I feel like I'm missing something if it only comes up in the case that two DBs have the same location, which is going to cause a bunch of other problems.

But is it the right change simply because it's consistent with how dropping tables works?

@cloud-fan
Contributor

This is a behavior change and makes us different from Hive. However I can't find a strong reason to do it. It's like importing a database, but we can't automatically create table entries in the metastore when creating a database with an existing location.

To me a more reasonable behavior is, fail earlier when creating a database with an existing and non-empty location.

@sandeep-katta
Contributor Author

@cloud-fan's comment seems valid, as it will not result in any behavior change. I will update the PR accordingly. WDYT @srowen

@sandeep-katta sandeep-katta force-pushed the dropDatabse branch 2 times, most recently from 74629ad to 597f8e6 Compare September 26, 2018 05:06
@sandeep-katta
Contributor Author

cc @cloud-fan @srowen I have updated the code, please review.

@cloud-fan
Contributor

ok to test

@SparkQA

SparkQA commented Sep 26, 2018

Test build #96601 has finished for PR 22466 at commit 597f8e6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Sep 26, 2018

Test build #96619 has finished for PR 22466 at commit 597f8e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

not sure if we should do it here or in the external catalog, cc @gatorsmile

Contributor Author

The reason I did it here is that the external catalog always has a path (default or user-specified), so we would end up checking the path every time. This may be a costly operation if the file system is S3.

Member

Just want to confirm: what is the behavior of Hive (3.x) when we try to create a database in a non-empty file path?

Member

Please try the following cases:

  • the directory does not exist and the parent directory does not exist either
  • the directory exists but is empty
  • the directory exists and is non-empty
  • the directory is not accessible

Member

I think we should only block the last two cases. I also prefer doing it in the external catalog. In the future, if we support catalog federation, different external catalogs can define their own behavior.

Listing files in S3 is expensive, but creating a new database is not frequent. I think our users can accept this cost.
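A rough classification of the four cases, blocking only the last two, might look like this (a local-filesystem sketch with hypothetical names; the real check would live in the external catalog and go through Hadoop's FileSystem API):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LocationClassifier {
    enum Status { CREATABLE, EMPTY_DIR, NON_EMPTY_DIR, NOT_ACCESSIBLE }

    // Only NON_EMPTY_DIR and NOT_ACCESSIBLE would make CREATE DATABASE fail;
    // a missing path (even with a missing parent) and an empty directory are allowed.
    static Status classify(String location) {
        Path path = Paths.get(location);
        if (!Files.exists(path)) {
            return Status.CREATABLE;  // covers a missing parent directory too
        }
        if (!Files.isDirectory(path) || !Files.isReadable(path)) {
            return Status.NOT_ACCESSIBLE;
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(path)) {
            return entries.iterator().hasNext() ? Status.NON_EMPTY_DIR : Status.EMPTY_DIR;
        } catch (IOException e) {
            return Status.NOT_ACCESSIBLE;  // e.g. permission denied while listing
        }
    }
}
```

One pro of putting this in the external catalog, as suggested, is that each catalog implementation can override the classification for its own storage semantics.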

@sandeep-katta sandeep-katta Sep 27, 2018
Contributor Author

Just want to confirm: what is the behavior of Hive (3.x) when we try to create a database in a non-empty file path?

Users can create a database on a non-empty file path; the results are below:

host1:/opt/tools/hadoop3.1/hadoop-3.1.1/bin # ./hadoop fs -ls /user/hive/
Found 2 items
drwxr-xr-x - root supergroup 0 2018-09-27 18:05 /user/hive/dropTest
drwxr-xr-x - root supergroup 0 2018-09-27 17:19 /user/hive/warehouse

hive> create database testjira location '/user/hive/';
OK
Time taken: 0.187 seconds
hive> desc database testjira;
OK
testjira hdfs://localhost:9000/user/hive root USER
Time taken: 0.045 seconds, Fetched: 1 row(s)
hive> drop database testjira;
OK
Time taken: 0.194 seconds
hive>

host1:/opt/tools/hadoop3.1/hadoop-3.1.1/bin # ./hadoop fs -ls /user
host1:/opt/tools/hadoop3.1/hadoop-3.1.1/bin #

@sandeep-katta sandeep-katta Sep 27, 2018
Contributor Author

Please try the following cases:

  • the directory does not exist and the parent directory does not exist either
    spark-sql> create database noparent location '/user/dropTest/noparent/dblocation';
    Time taken: 2.246 seconds
  • the directory exists but is empty
    spark-sql> create database emptydir location '/user/dropTest/emptydir';
    Time taken: 0.262 seconds
  • the directory exists and is non-empty
    spark-sql> create database dirExistsAndNotempty location '/user/dropTest/';
    Error in query: Cannot create Database to the location which is not empty.;
  • the directory is not accessible
    spark-sql> create database userNoPermission location '/user/hive/dropTest';
    Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.security.AccessControlException: Permission denied: user=sandeep, access=WRITE, inode="/user/hive":hive:hive:drwxr-x---
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:399)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:261)

@gatorsmile
Member

Regarding the potentially high cost of file listing: CREATE DATABASE is not a frequent operation, and the cost is high only if the target directory is non-empty with many files. Since we are blocking users from creating such a database anyway, the cost is not a big deal, I think.

We need to list this behavior change in the SQL migration guide.

@sandeep-katta
Contributor Author

@gatorsmile I have updated the SQL migration guide

@SparkQA

SparkQA commented Mar 7, 2019

Test build #103124 has finished for PR 22466 at commit 6532466.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2019

Test build #103122 has finished for PR 22466 at commit 6f4c9ce.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor

retest this please

@SparkQA

SparkQA commented Mar 7, 2019

Test build #103130 has finished for PR 22466 at commit 6532466.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sandeep-katta
Contributor Author

retest this please

1 similar comment
@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Mar 9, 2019

Test build #103252 has finished for PR 22466 at commit 6532466.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sandeep-katta
Contributor Author

cc @srowen @cloud-fan @gatorsmile

@sandeep-katta
Contributor Author

ping @srowen @gatorsmile @ueshin @cloud-fan. Please check whether this can now be merged to the master branch for the Spark 3.0 release. I am happy to make any further code/documentation changes if required.

Member

This says 'path not empty' but that suggests 'path is not the empty string'. This means 'the path exists already' right?

Contributor Author

yes, it means the path already exists; I will update the message accordingly

Contributor Author

If I change the message, it will lead to confusion.
For example:
The user can create a database at the location as long as the path does not exist or is empty.

If I say 'path already exists' it will confuse users, so I wanted to highlight that the path is not empty.

Member

Are you trying to describe a case where the path is a directory that exists, but contains files? or a path that does not exist at all? either way I think that type of description is better. An empty path implies something different from empty directory.

Contributor Author

Yes, I wanted to describe a case where the path is a directory that exists but contains files. If the above description is correct, then I will update it.

Modification content: Updated the sql-migration-guide to notify the user about the behaviour change
@SparkQA

SparkQA commented Aug 12, 2019

Test build #108984 has finished for PR 22466 at commit 0a52bdc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon pushed a commit that referenced this pull request Sep 2, 2019
…rnalCatalogSuite

## What changes were proposed in this pull request?

Drop the table after the test `query builtin functions don't call the external catalog` has executed.

This is required for [SPARK-25464](#22466)

## How was this patch tested?

existing UT

Closes #25427 from sandeep-katta/cleanuptable.

Authored-by: sandeep katta <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

github-actions bot commented Jan 6, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 6, 2020
@srowen srowen closed this Jan 6, 2020