
Conversation

@xwu0226
Contributor

@xwu0226 xwu0226 commented Jan 17, 2017

What changes were proposed in this pull request?

Support `ALTER TABLE ADD COLUMNS (...)` syntax for Hive serde and some datasource tables.
In this PR, we consider a few aspects:

  1. Views are not supported for `ALTER TABLE ADD COLUMNS`.

  2. Since tables created in SparkSQL with Hive DDL syntax populate the table properties with schema information, we need to make sure the schema stays consistent before and after the ALTER operation for future use.

  3. For embedded-schema formats, such as `parquet`, we need to make sure that predicates on the newly-added columns can be evaluated or pushed down properly. If a data file does not contain the newly-added columns, such predicates should evaluate as if the column values were NULL.

  4. For datasource tables, this feature does not support the following:
    4.1 TEXT format, since only one default column `value` is inferred for text format data.
    4.2 ORC format, since the SparkSQL native ORC reader does not handle the difference between a user-specified schema and the schema inferred from ORC files.
    4.3 Third-party datasource types that implement RelationProvider, including the built-in JDBC format, since different vendor implementations may deal with schemas in different ways.
    4.4 Other datasource types, such as `parquet`, `json`, `csv`, and `hive`, are supported.

  5. Column names being added cannot duplicate any existing data column or partition column name. Case sensitivity is taken into consideration according to the SQL configuration.

  6. This feature also supports the In-Memory catalog, when Hive support is turned off.
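As an illustration of the duplicate-name check in point 5, here is a minimal Python sketch (the helper name is hypothetical; the PR's actual implementation is in Scala) showing how the case-sensitivity setting decides which additions are rejected:

```python
def check_column_name_duplication(existing_names, new_names, case_sensitive):
    """Reject added column names that duplicate an existing data or
    partition column, honoring the case-sensitivity configuration."""
    normalize = (lambda n: n) if case_sensitive else (lambda n: n.lower())
    seen = {normalize(n) for n in existing_names}
    for name in new_names:
        if normalize(name) in seen:
            raise ValueError(f"Found duplicate column name: {name}")
        seen.add(normalize(name))
```

With case sensitivity off, adding `C1` to a table that already has `c1` is rejected; with it on, the addition is allowed.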

How was this patch tested?

Add new test cases

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71535 has finished for PR 16626 at commit 96fb677.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xwu0226
Contributor Author

xwu0226 commented Jan 20, 2017

@gatorsmile @cloud-fan Thanks for reviewing!

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71728 has finished for PR 16626 at commit ffb8b55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Please check the Hive syntax. At a minimum, we can support column comments.

Member

This is better. Please copy this to SparkSqlParser.scala

@SparkQA

SparkQA commented Jan 21, 2017

Test build #71749 has finished for PR 16626 at commit 73b0243.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

What is the reason why data source tables are not supported?

Contributor Author

I am thinking that there are different ways to create a datasource table, such as df.write.saveAsTable, or a "create table" DDL with or without a schema. Plus, JDBC datasource tables may not be supported. I just want to spend more time trying different scenarios to see if there are any holes before claiming support. I will submit another PR once I am sure it is handled correctly.

Member

Currently, the code paths for managing Hive serde tables and data source tables have been combined. Thus, they can easily be handled together.

Member

We are not supporting partitioned tables, right?

Contributor Author

@xwu0226 xwu0226 Jan 21, 2017

We support partitioned tables; the added test cases include this case.
However, we don't support `ALTER TABLE ... ADD COLUMNS` on a particular partition, as Hive can do today, e.g. `ALTER TABLE T1 PARTITION(c3=1) ADD COLUMNS ...`. This is another potential feature to add if we maintain a schema per partition.

Member

Can we add a TODO? I think the newer Parquet can handle this issue. Once we upgrade Parquet version, we don't need this.

Contributor Author

Yes. we can. Thanks!

Member

We already upgraded Parquet, so we don't need this now.

Member

Do we need the following to check cache status? I think uncacheTable is no-op if the table is not cached.

Contributor Author

AlterTableRenameCommand does the uncaching in a similar way. I thought there might be a reason it exists there, so I did the same. But looking at the code, it seems you are right. Thanks!

Member

The current way is right. The implementation should not rely on the internal behavior of another function.

Member

Hmm, I think this branch is for datasource tables, but it looks like you don't support datasource tables yet in this change?

Contributor Author

I think the variable name needs to change, since now both Hive tables and datasource tables populate the table properties with the schema, and both cases go through this path. I temporarily blocked ALTER ADD COLUMNS for datasource tables because I am not yet confident there are no holes. But according to @gatorsmile, it may be safe to support datasource tables too, so I am adding more test cases to confirm. I may remove the condition in this PR.

Member

@viirya viirya Jan 22, 2017

columns are not optional for this case.

@xwu0226 xwu0226 changed the title [SPARK-19261][SQL] Alter add columns for Hive tables [SPARK-19261][SQL] Alter add columns for Hive serde and some datasource tables Feb 3, 2017
@SparkQA

SparkQA commented Feb 3, 2017

Test build #72320 has finished for PR 16626 at commit 26e0940.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72341 has finished for PR 16626 at commit e2c53a2.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 5, 2017

Test build #72406 has started for PR 16626 at commit 88c2f48.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Feb 6, 2017

Test build #72447 has finished for PR 16626 at commit 88c2f48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

This function should be a private function of AlterTableAddColumnsCommand, right?

Contributor Author

Oh, this is a DDL util function.

Member

Since this check is only used in AlterTableAddColumnsCommand, we do not need to move it here.

Contributor Author

OK. I will move it to the AlterTableAddColumnsCommand class.

Member

@gatorsmile gatorsmile Feb 6, 2017

The provider could be empty if the table is a VIEW. Thus, please do not modify the utility function here. Add a private function in AlterTableAddColumnsCommand

Contributor Author

I see. I will find another way. Thanks!

Member

@gatorsmile gatorsmile Feb 6, 2017

Call getTempViewOrPermanentTableMetadata instead of getTableMetadata. Then, you do not need the above check for temporary views. In addition, it also covers the cases for global temp views.

Contributor Author

I see. Will do. Thanks!

Member

When we store the metadata in the catalog, we unify different representations to orc, right? Can you find any case to break it?

Contributor Author

I will double check this case. If `orc` is the only representation in CatalogTable.provider, I will reduce the logic here.

Member

FileFormat only covers a few cases. It does not cover the other external data sources. How about using a whitelist here in this function?

Contributor Author

OK. I will use a whitelist of allowed FileFormat implementations.
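A provider whitelist along these lines could be sketched in Python (illustrative only; the function and constant names are hypothetical, and the PR's actual check is in Scala). Only whitelisted file formats accept ALTER TABLE ADD COLUMNS; TEXT, ORC, and RelationProvider-based sources such as JDBC are excluded:

```python
# Providers whose FileFormat implementations are allowed to take
# ALTER TABLE ADD COLUMNS, per the PR description.
SUPPORTED_PROVIDERS = {"parquet", "json", "csv", "hive"}

def supports_alter_add_columns(provider):
    """Return True only for whitelisted providers (case-insensitive)."""
    return provider.lower() in SUPPORTED_PROVIDERS
```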

@SparkQA

SparkQA commented Feb 7, 2017

Test build #72530 has finished for PR 16626 at commit 0b7c0b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2017

Test build #72540 has finished for PR 16626 at commit d67042f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xwu0226 xwu0226 force-pushed the alter_add_columns branch from a28fc42 to 1eb7cd3 on March 19, 2017 06:17
@SparkQA

SparkQA commented Mar 19, 2017

Test build #74810 has started for PR 16626 at commit 1eb7cd3.

@xwu0226
Contributor Author

xwu0226 commented Mar 19, 2017

retest this please

@SparkQA

SparkQA commented Mar 19, 2017

Test build #74816 has finished for PR 16626 at commit 1eb7cd3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

""".stripMargin)
}

// make sure partition columns are at the end
Contributor

CatalogTable.partitionSchema will throw an exception if the partition columns are not at the end, so we can just call partitionSchema; no need to do the reordering.

Contributor Author

@cloud-fan Thanks! My understanding is that the caller may pass in a new schema that does not have the partition columns at the end. So I want to ensure that before passing it to externalCatalog.alterTableSchema.

How about I change the definition to ask the caller to ensure the column ordering in the newSchema before calling this function?

Contributor

yea
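The ordering contract agreed on above can be sketched as a small Python check (hypothetical names, mirroring what CatalogTable.partitionSchema enforces): the partition columns must form the tail of the new schema, in order.

```python
def assert_partition_columns_last(new_schema, partition_columns):
    """Raise if the partition columns are not, in order, the tail of
    the full schema -- the ordering alterTableSchema callers must ensure."""
    n = len(partition_columns)
    tail = new_schema[len(new_schema) - n:] if n else []
    if tail != partition_columns:
        raise ValueError("partition columns must be at the end of the schema")
```

For example, appending a new data column c3 before the partition column c2 (giving `["c1", "c3", "c2"]`) passes, while leaving c2 in the middle fails.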

}
}

test("alter table add columns to table referenced by a view") {
Contributor

this is not needed; the view test already covers the case where the referenced table changes its schema

Contributor Author

ok. I will remove this test.

} else {
if (isUsingHiveMetastore) {
// hive catalog will still complain that c1 is a duplicate column name because hive
// identifiers are case insensitive.
Contributor

actually we can fix this, as we store the schema in table properties.

Contributor Author

@cloud-fan I just tested the data source table case, like `create table t1 (c1 int, C1 int) using parquet` with spark.sql.caseSensitive = true: Spark SQL does not complain. It just bounces back the exception from Hive, logged as a WARN message, and the table is created successfully; I am able to insert and select. But if I create a Hive serde table with `create table t2 (c1 int, C1 int) stored as parquet`, Hive will complain and fail to create the table. So for the data source case, should we fix anything regarding the WARN message? Thanks!

Contributor

ah right, for hive, we can only make it case-preserving, not case-sensitive, I was wrong

}
}

test("ALTER TABLE ADD COLUMNS") {
Contributor

are we going to test all the unsupported data sources? That's a lot, and unnecessary. I think the text format test is enough; let's remove the others.

Contributor Author

Oh, you mean remove the tests from JDBCSuite and TableScanSuite?

assert(comments === "SN,SA,NO_COMMENT")
}

test("ALTER TABLE ADD COLUMNS does not support RelationProvider") {
Contributor

let's remove it

Contributor Author

ok. Will do.

}
}

Seq("orc", "ORC", "org.apache.spark.sql.hive.orc",
Contributor

let's remove it

Contributor Author

OK. will do.

withSQLConf(SQLConf.CASE_SENSITIVE.key -> caseSensitive) {
withTable("tab") {
sql("CREATE TABLE tab (c1 int) PARTITIONED BY (c2 int) STORED AS PARQUET")
if (caseSensitive == "false") {
Contributor

if (!caseSensitive)

@xwu0226
Contributor Author

xwu0226 commented Mar 21, 2017

@cloud-fan @gatorsmile Thanks again! I updated the code based on @cloud-fan's review.

// assuming the newSchema has all partition columns at the end as required
externalCatalog.alterTableSchema(db, table, StructType(newSchema))
}

Member

private?


externalCatalog.alterTableSchema(db, table, StructType(reorderedSchema))
// assuming the newSchema has all partition columns at the end as required
externalCatalog.alterTableSchema(db, table, StructType(newSchema))
Member

StructType(newSchema) -> newSchema

sessionCatalog.alterTableSchema(
TableIdentifier("t1", Some("default")), oldTab.schema.add("c3", IntegerType))
TableIdentifier("t1", Some("default")),
StructType(oldTab.dataSchema.add("c3", IntegerType) ++ partitionSchema))
Member

partitionSchema -> oldTab.partitionSchema

withBasicCatalog { sessionCatalog =>
sessionCatalog.createTable(newTable("t1", "default"), ignoreIfExists = false)
val oldTab = sessionCatalog.externalCatalog.getTable("default", "t1")
val partitionSchema = oldTab.partitionSchema
Member

Remove this line

@gatorsmile
Member

LGTM pending Jenkins.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74929 has finished for PR 16626 at commit 04ce8f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74937 has finished for PR 16626 at commit 7d8437d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xwu0226
Contributor Author

xwu0226 commented Mar 21, 2017

retest this please

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74943 has finished for PR 16626 at commit 7d8437d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74952 has finished for PR 16626 at commit 7d8437d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 4c0ff5f Mar 21, 2017
@gatorsmile
Member

Thanks! Merging to master.

jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018

Author: Xin Wu <[email protected]>

Closes apache#16626 from xwu0226/alter_add_columns.

(cherry picked from commit 4c0ff5f)
(cherry picked from commit 297cfc9a604f9d098307e8c04e0aaafda87f5eff)