-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-25464][SQL] Create Database to the location,only if it is empty or does not exists. #22466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c2eec35 to
e327105
Compare
|
@sandeep-katta can you please update the tile, i think as per your description it seems to be the data will be dropped if the location of the newly created database points to some existing database location. |
|
@srowen @HyukjinKwon I think this can be a risk if the location of the newly created database points to an existing one, if user drop the db both the db data will be lost . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: user shall take care of deleting the data from external location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: If the location is specified by the user then need to consider DB as external DB,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please pull finally to 1 level up inline with the close brace of try
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please pull catch to 1 level up inline with the close brace of try
|
See JIRA, I don't think this should be merged. |
|
Yes I agree 2 database should not point to same path,currently this is the loop hole in spark which is required to fix.If this solution is not okay ,then we can append the dbname.db to the location given by the user |
I have referred Databricks doc https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-database.html and implemented accordingly.Let me know if any suggesstion |
|
We should look at Spark documentation, and Hive, if any, to figure out what the right behavior is here. Spark generally follows Hive. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable I think this is, further, conflating what |
|
I'm not sure if there is a concept called "external database" in Hive... |
|
There is ... see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ManagedandExternalTables I think Spark conflates the two. It's rare (?) but possible to specify a custom location of a managed table, but, typically occurs for And in So if I agree that the Hive impl doesn't seem to take this into account, on the code paths that call |
|
yea, in Spark we conflate the two and treat a table as external if location is specified. However, Hive doesn't have external database, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/Alter/UseDatabase I don't want to introduce unnecessary behavior difference from hive, and I feel it's not very useful to have external database. Although your table files can be in an existing folder, but LIST TABLES will not work. |
|
That link says Hive does support EXTERNAL. What am I missing? Well, in any event we aren't contemplating a behavior change here. If you delete a table with LOCATION specified, what should happen? Hive would delete it I guess... unless it's EXTERNAL. Looks like Spark would delete it even when it's in an 'external' location. If these are conflated I guess we weigh the surprise in not deleting 'local' locations for data when a table is dropped, vs surprise in deleting 'external' locations for data. |
Hive supports The CREATE DATABASE syntax: |
|
Owp, I've been misreading that several times. Right. Well by analogy, if a database has a non default LOCATION then so do it's tables, and they are treated like EXTERNAL tables. Dropping the DB means dropping the tables, and dropping those tables doesn't delete data. So should the same happen for DBs? Seems sensible, because the DB directory might not even be empty. Still I feel like I'm missing something if it only comes up in the case that two DBs have the same location, which is going to cause a bunch of other problems. But is it the right change simply because it's consistent with how dropping tables works? |
|
This is a behavior change and makes us different from Hive. However I can't find a strong reason to do it. It's like importing a database, but we can't automatically create table entries in the metastore when creating a database with an existing location. To me a more reasonable behavior is, fail earlier when creating a database with an existing and non-empty location. |
e327105 to
eec8b39
Compare
|
seems @cloud-fan comments are valid as it will not result in any behavior change, I will update the PR accordingly WDYT @srowen |
74629ad to
597f8e6
Compare
|
cc @cloud-fan @srowen I have updated the code,Please review |
|
ok to test |
|
Test build #96601 has finished for PR 22466 at commit
|
|
retest this please |
|
Test build #96619 has finished for PR 22466 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if we should do it here or in the external catalog, cc @gatorsmile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I did here is, external catalog always have path (default or user specified),so we end up in checking the path always.This may be the costly operation if the file system is S3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to confirm it what is the behavior of Hive (3.x) when we try to create a database in a non-empty file path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please try the following cases:
- the directory does not exist and the parent directory does not exist too.
- the directory exists but empty
- the directory exists and non-empty
- the directory is not accessible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should only block the last two cases. I also prefer to doing it in the external catalog. In the future, if we support catalog federation, different external catalog can define their own behavior.
Listing files in S3 is expensive, but create a new database is not frequent. I think our users can accept this cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to confirm it what is the behavior of Hive (3.x) when we try to create a database in a non-empty file path?
User can create the Database on the non-empty file path,results are below
host1:/opt/tools/hadoop3.1/hadoop-3.1.1/bin # ./hadoop fs -ls /user/hive/
Found 2 items
drwxr-xr-x - root supergroup 0 2018-09-27 18:05 /user/hive/dropTest
drwxr-xr-x - root supergroup 0 2018-09-27 17:19 /user/hive/warehouse
hive> create database testjira location '/user/hive/';
OK
Time taken: 0.187 seconds
hive> desc database testjira;
OK
testjira hdfs://localhost:9000/user/hive root USER
Time taken: 0.045 seconds, Fetched: 1 row(s)
hive> drop database testjira;
OK
Time taken: 0.194 seconds
hive>
host1:/opt/tools/hadoop3.1/hadoop-3.1.1/bin # ./hadoop fs -ls /user
host1:/opt/tools/hadoop3.1/hadoop-3.1.1/bin #
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please try the following cases:
- the directory does not exist and the parent directory does not exist too.
spark-sql> create database noparent location '/user/dropTest/noparent/dblocation';
Time taken: 2.246 seconds
- the directory exists but empty
spark-sql> create database emptydir location '/user/dropTest/emptydir';
Time taken: 0.262 seconds
- the directory exists and non-empty
spark-sql> create database dirExistsAndNotempty location '/user/dropTest/';
Error in query: Cannot create Database to the location which is not empty.;
- the directory is not accessible
spark-sql> create database userNoPermission location '/user/hive/dropTest';
Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.security.AccessControlException: Permission denied: user=sandeep, access=WRITE, inode="/user/hive":hive:hive:drwxr-x---
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:399)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:261)
|
Regarding the potentially high cost of file listing, We need to list this behavior change in the SQL migration guide. |
570c5b7 to
6f4c9ce
Compare
|
@gatorsmile I have updated the SQL migration guide |
|
Test build #103124 has finished for PR 22466 at commit
|
|
Test build #103122 has finished for PR 22466 at commit
|
|
retest this please |
|
Test build #103130 has finished for PR 22466 at commit
|
|
retest this please |
1 similar comment
|
retest this please |
|
Test build #103252 has finished for PR 22466 at commit
|
|
ping @srowen @gatorsmile @ueshin @cloud-fan . Please check now this can be merged to master branch for spark-3.0 release . I am happy to do any more code/documentation changes if required. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This says 'path not empty' but that suggests 'path is not the empty string'. This means 'the path exists already' right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes , it means path already exists, I will update the message accordingly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I change the message it will lead to confusion.
for e.g.
User can be able to create database to the location as long as the path does not exists or it should be empty.
If I say path already exits it will confuse, so I wanted to highlight the path is not empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you trying to describe a case where the path is a directory that exists, but contains files? or a path that does not exist at all? either way I think that type of description is better. An empty path implies something different from empty directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I wanted to describe a case where the path is a directory that exists but contains files. if the above description is correct then I will update it
Modification content:Handled Review Comments
Modification content:Updated the sql-migration-guide to notify the user about behaviour change
6532466 to
0a52bdc
Compare
|
Test build #108984 has finished for PR 22466 at commit
|
…rnalCatalogSuite ## What changes were proposed in this pull request? drop the table after the test `query builtin functions don't call the external catalog` executed This is required for [SPARK-25464](#22466) ## How was this patch tested? existing UT Closes #25427 from sandeep-katta/cleanuptable. Authored-by: sandeep katta <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
|
Can one of the admins verify this patch? |
|
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it! |
Modification content:If the database is created with location URI then should be empty/should not exists
What changes were proposed in this pull request?
If the user specifies the location while creating the Database then location should be empty or should not exists
for e.g create database db1 location '/user/hive/warehouse';
this command will be passed only if 'user/hive/warehouse' is empty or should not exists.
How was this patch tested?
Added testcases and also manually verified in the cluster