[SPARK-19261][SQL] Alter add columns for Hive serde and some datasource tables #16626
Conversation
Test build #71535 has finished for PR 16626 at commit
@gatorsmile @cloud-fan Thanks for reviewing!
Test build #71728 has finished for PR 16626 at commit
Please check the Hive syntax. At least, we can support the column comment.
This is better. Please copy this to SparkSqlParser.scala
Test build #71749 has finished for PR 16626 at commit
What is the reason why data source tables are not supported?
I am thinking that there are different ways to create a datasource table, such as df.write.saveAsTable, or the "CREATE TABLE" DDL with or without a schema. Plus, JDBC datasource tables may not be supported. I just want to spend more time trying different scenarios to see if there are any holes before claiming support. I will submit another PR once I am sure it is handled correctly.
Currently, the code paths for managing Hive serde tables and data source tables have been combined, so this can easily be handled together.
We are not supporting partitioned tables, right?
We support partitioned tables. The test cases added include this case.
However, we don't support ALTER ADD COLUMNS on a particular partition, as Hive can do today, e.g. ALTER TABLE T1 PARTITION(c3=1) ADD COLUMNS .... This is another potential feature to add if we maintain a schema per partition.
Can we add a TODO? I think the newer Parquet can handle this issue. Once we upgrade the Parquet version, we won't need this.
Yes, we can. Thanks!
We already upgraded Parquet, so we don't need this now.
Do we need the following to check cache status? I think uncacheTable is a no-op if the table is not cached.
AlterTableRenameCommand does the uncaching in a similar way. I thought there might be a reason it exists there, so I did the same. But looking at the code, it seems you are right. Thanks!
The current way is right. The implementation should not rely on the internal behavior of another function.
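The point above (an implementation should not rely on the internal behavior of another function) can be illustrated with a small sketch. This is not Spark's actual API, just a hypothetical cache with an explicit status check, so the ALTER command's correctness does not depend on uncache() happening to tolerate uncached tables:

```python
# Illustrative sketch (hypothetical API, not Spark's): guard the uncache call
# with an explicit cache-status check instead of relying on uncache() being
# a no-op for tables that were never cached.

class TableCache:
    def __init__(self):
        self._cached = set()

    def cache(self, name):
        self._cached.add(name)

    def is_cached(self, name):
        return name in self._cached

    def uncache(self, name):
        # Currently tolerant of uncached tables, but that is an internal
        # detail callers should not depend on.
        self._cached.discard(name)

def alter_table_add_columns(cache, name):
    # Explicit guard: correct even if uncache() later becomes strict.
    if cache.is_cached(name):
        cache.uncache(name)
    # ... proceed with the schema change ...

cache = TableCache()
cache.cache("t1")
alter_table_add_columns(cache, "t1")
assert not cache.is_cached("t1")
alter_table_add_columns(cache, "t2")  # safe: t2 was never cached
```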
Hmm, I think this branch is for datasource tables, but it looks like you don't support datasource tables yet in this change?
I think the variable name needs to change, since now both Hive tables and datasource tables populate the table properties with the schema, and both cases go through this path. I temporarily blocked ALTER ADD COLUMNS for datasource tables because I am not yet confident there are no holes. But according to @gatorsmile, it may be safe to support datasource tables too, so I am adding more test cases to confirm. I may remove the condition in this PR.
Columns are not optional for this case.
Force-pushed from 73b0243 to 26e0940
Test build #72320 has finished for PR 16626 at commit
Force-pushed from 26e0940 to e2c53a2
Test build #72341 has finished for PR 16626 at commit
Force-pushed from e2c53a2 to 88c2f48
Test build #72406 has started for PR 16626 at commit
retest this please
Test build #72447 has finished for PR 16626 at commit
This function should be a private function of AlterTableAddColumnsCommand, right?
Oh, this is a DDL util function.
Since this check is only used in AlterTableAddColumnsCommand, we do not need to move it here.
OK. I will move it to the AlterTableAddColumnsCommand class.
The provider could be empty if the table is a VIEW. Thus, please do not modify the utility function here. Add a private function in AlterTableAddColumnsCommand
I see. I will find another way. Thanks!
Call getTempViewOrPermanentTableMetadata instead of getTableMetadata. Then, you do not need the above check for temporary views. In addition, it also covers the cases for global temp views.
I see. Will do. Thanks!
When we store the metadata in the catalog, we unify different representations to orc, right? Can you find any case that breaks it?
I will double check this case. If orc is the only representation in CatalogTable.provider, I will reduce the logic here.
FileFormat only covers a few cases. It does not cover other external data sources. How about using a whitelist here in this function?
OK. I will use a whitelist of allowed FileFormat implementations.
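The whitelist idea discussed above can be sketched as follows. This is an illustrative Python sketch, not the PR's actual Scala code, and the provider names in ALLOWED_PROVIDERS are assumptions based on the formats this PR says are supported:

```python
# Hypothetical sketch of a provider whitelist for ALTER TABLE ADD COLUMNS:
# only providers whose schema handling is known to be safe are allowed.
# The set below is illustrative, not the exact list used in the PR.

ALLOWED_PROVIDERS = {"parquet", "json", "csv", "hive"}

def check_alter_add_columns_supported(provider):
    """Raise if the table's provider is not in the whitelist."""
    if provider is None or provider.lower() not in ALLOWED_PROVIDERS:
        raise ValueError(
            "ALTER TABLE ADD COLUMNS is not supported for provider: %s"
            % provider)

check_alter_add_columns_supported("parquet")   # accepted
try:
    # RelationProvider-based sources such as JDBC are rejected
    check_alter_add_columns_supported("jdbc")
except ValueError as e:
    print(e)
```

A whitelist fails closed: a newly added data source is unsupported until someone verifies its schema handling and adds it, which matches the cautious rollout discussed in this thread.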
Force-pushed from 88c2f48 to 0b7c0b1
Test build #72530 has finished for PR 16626 at commit
Test build #72540 has finished for PR 16626 at commit
Force-pushed from d67042f to 193c0c3
Force-pushed from a28fc42 to 1eb7cd3
Test build #74810 has started for PR 16626 at commit
retest this please
Test build #74816 has finished for PR 16626 at commit
// make sure partition columns are at the end
CatalogTable.partitionSchema will throw an exception if partition columns are not at the end, so we can just call partitionSchema; there is no need to do the reordering.
@cloud-fan Thanks! My understanding is that the caller may pass in a new schema that does not keep partition columns at the end. So I want to ensure that before passing it to externalCatalog.alterTableSchema.
How about I change the contract to require the caller to ensure the column ordering in newSchema before calling this function?
yea
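The ordering contract agreed above can be sketched in a few lines. This is an illustrative Python sketch with assumed function names, not Spark's actual implementation; it models a schema as a list of column names:

```python
# Sketch of the contract: the schema passed to alterTableSchema must keep all
# partition columns at the end. A caller can reorder before the call, and the
# catalog side can assert the invariant (as CatalogTable.partitionSchema does
# by raising when partition columns are not at the end).

def reorder_schema(new_schema, partition_cols):
    """Put data columns first, then partition columns in declared order."""
    partition_set = set(partition_cols)
    data_cols = [c for c in new_schema if c not in partition_set]
    return data_cols + list(partition_cols)

def assert_partitions_last(schema, partition_cols):
    """Fail if partition columns do not form the tail of the schema."""
    n = len(partition_cols)
    assert n == 0 or schema[-n:] == list(partition_cols), \
        "partition columns must be at the end of the schema"

# A newly added column lands before the partition column after reordering.
schema = reorder_schema(["c1", "c2", "c3_new"], partition_cols=["c2"])
assert schema == ["c1", "c3_new", "c2"]
assert_partitions_last(schema, ["c2"])
```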
test("alter table add columns to table referenced by a view") {
This is not needed; the view test already covers the case where the referenced table changes its schema.
OK. I will remove this test.
} else {
  if (isUsingHiveMetastore) {
    // hive catalog will still complains that c1 is duplicate column name because hive
    // identifiers are case insensitive.
actually we can fix this, as we store the schema in table properties.
@cloud-fan I just tested the data source table case, like create table t1 (c1 int, C1 int) using parquet with spark.sql.caseSensitive = true. Spark SQL does not complain; it just bounces back the exception from Hive, logged as a WARN message, and the table is created successfully. I am able to insert and select. But if I create a Hive serde table with create table t2 (c1 int, C1 int) stored as parquet, Hive will complain and fail to create the table. So for the data source case, should we fix anything regarding the WARN message? Thanks!
Ah right, for Hive we can only make it case-preserving, not case-sensitive. I was wrong.
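The duplicate-column check this thread keeps returning to can be sketched as follows. This is an illustrative Python sketch, not Spark's implementation; the flag models the spark.sql.caseSensitive setting:

```python
# Sketch of a duplicate-column-name check that honors a case-sensitivity flag
# (modeling spark.sql.caseSensitive). Under case-insensitive resolution,
# "c1" and "C1" collide; under case-sensitive resolution, they are distinct.

def find_duplicates(columns, case_sensitive):
    """Return column names that duplicate an earlier column."""
    seen, dups = set(), []
    for col in columns:
        key = col if case_sensitive else col.lower()
        if key in seen:
            dups.append(col)
        seen.add(key)
    return dups

# case-insensitive (the default): c1 and C1 collide
assert find_duplicates(["c1", "C1"], case_sensitive=False) == ["C1"]
# case-sensitive: they are treated as distinct columns
assert find_duplicates(["c1", "C1"], case_sensitive=True) == []
```

Note that, as discussed above, even when Spark's check passes under case-sensitive mode, the Hive metastore itself remains case-insensitive, so Hive may still reject or warn about such schemas.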
test("ALTER TABLE ADD COLUMNS") {
Are we going to test all the unsupported data sources? That's a lot, and unnecessary. I think the text format test is enough; let's remove the others.
Oh, you mean remove the tests from JDBCSuite and TableScanSuite?
test("ALTER TABLE ADD COLUMNS does not support RelationProvider") {
let's remove it
ok. Will do.
Seq("orc", "ORC", "org.apache.spark.sql.hive.orc",
let's remove it
OK. will do.
withSQLConf(SQLConf.CASE_SENSITIVE.key -> caseSensitive) {
  withTable("tab") {
    sql("CREATE TABLE tab (c1 int) PARTITIONED BY (c2 int) STORED AS PARQUET")
    if (caseSensitive == "false") {
if (!caseSensitive)
@cloud-fan @gatorsmile Thanks again! I updated the code based on @cloud-fan's review.
// assuming the newSchema has all partition columns at the end as required
externalCatalog.alterTableSchema(db, table, StructType(newSchema))
}
private?
externalCatalog.alterTableSchema(db, table, StructType(reorderedSchema))
// assuming the newSchema has all partition columns at the end as required
externalCatalog.alterTableSchema(db, table, StructType(newSchema))
StructType(newSchema) -> newSchema
sessionCatalog.alterTableSchema(
  TableIdentifier("t1", Some("default")), oldTab.schema.add("c3", IntegerType))
  TableIdentifier("t1", Some("default")),
  StructType(oldTab.dataSchema.add("c3", IntegerType) ++ partitionSchema))
partitionSchema -> oldTab.partitionSchema
withBasicCatalog { sessionCatalog =>
  sessionCatalog.createTable(newTable("t1", "default"), ignoreIfExists = false)
  val oldTab = sessionCatalog.externalCatalog.getTable("default", "t1")
  val partitionSchema = oldTab.partitionSchema
Remove this line
LGTM pending Jenkins.
Test build #74929 has finished for PR 16626 at commit
Test build #74937 has finished for PR 16626 at commit
retest this please
Test build #74943 has finished for PR 16626 at commit
retest this please
Test build #74952 has finished for PR 16626 at commit
Thanks! Merging to master.
Merged commit: [SPARK-19261][SQL] Alter add columns for Hive serde and some datasource tables. Author: Xin Wu <[email protected]>. Closes apache#16626 from xwu0226/alter_add_columns. (cherry picked from commit 4c0ff5f) (cherry picked from commit 297cfc9a604f9d098307e8c04e0aaafda87f5eff)
What changes were proposed in this pull request?
Support ALTER TABLE ADD COLUMNS (...) syntax for Hive serde and some datasource tables. In this PR, we consider a few aspects:
1. Views are not supported for ALTER ADD COLUMNS.
2. Since tables created in SparkSQL with Hive DDL syntax populate the table properties with schema information, we need to make sure the schema is consistent before and after the ALTER operation for future use.
3. For embedded-schema formats such as parquet, we need to make sure that predicates on the newly added columns can be evaluated or pushed down properly. When a data file does not contain the newly added columns, such predicates should behave as if the column values are NULLs.
4. For datasource tables, this feature does not support the following:
4.1 TEXT format, since only one default column, value, is inferred for text format data.
4.2 ORC format, since the SparkSQL native ORC reader does not support the difference between a user-specified schema and the schema inferred from ORC files.
4.3 Third-party datasource types that implement RelationProvider, including the built-in JDBC format, since different vendor implementations may deal with schema differently.
4.4 Other datasource types, such as parquet, json, csv, and hive, are supported.
5. Column names being added cannot duplicate any existing data column or partition column names. Case sensitivity is taken into account according to the SQL configuration.
6. This feature also supports the In-Memory catalog when Hive support is turned off.
How was this patch tested?
Add new test cases.