[SPARK-19261][SQL] Alter add columns for Hive serde and some datasource tables #16626
Conversation
Test build #71535 has finished for PR 16626 at commit
@gatorsmile @cloud-fan Thanks for reviewing!
Test build #71728 has finished for PR 16626 at commit
Please check the Hive syntax. At least, we can support the column comment.
This is better. Please copy this to SparkSqlParser.scala
Test build #71749 has finished for PR 16626 at commit
What is the reason why data source tables are not supported?
I am thinking that there are different ways to create a datasource table, such as df.write.saveAsTable, or the "CREATE TABLE" DDL with or without a schema. Plus, JDBC datasource tables may not be supported. I just want to spend more time trying different scenarios to see if there are any holes before claiming support. I will submit another PR once I am sure it is handled correctly.
Currently, the code paths for managing Hive serde tables and data source tables have been combined, so this can easily be handled together.
We are not supporting partitioned tables, right?
We support partitioned tables. The test cases added include this case.
However, we don't support ALTER ADD COLUMNS on a particular partition, as Hive can do today, e.g. ALTER TABLE T1 PARTITION(c3=1) ADD COLUMNS .... This is another potential feature to add if we maintain a schema per partition.
Can we add a TODO? I think the newer Parquet can handle this issue. Once we upgrade the Parquet version, we won't need this.
Yes, we can. Thanks!
We already upgraded Parquet, so we don't need this now.
Do we need the following to check cache status? I think uncacheTable is a no-op if the table is not cached.
AlterTableRenameCommand does the uncaching in a similar way. I thought there might be a reason it exists there, so I did the same. But looking at the code, it seems you are right. Thanks!
The current way is right. The implementation should not rely on the internal behavior of another function.
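The point above (an implementation should not rely on the internal behavior of another function) can be illustrated with a small sketch. This is not Spark's actual API, just a hypothetical cache with an explicit status check, so the ALTER command's correctness does not depend on uncache() happening to tolerate uncached tables:

```python
# Illustrative sketch (hypothetical API, not Spark's): guard the uncache call
# with an explicit cache-status check instead of relying on uncache() being
# a no-op for tables that were never cached.

class TableCache:
    def __init__(self):
        self._cached = set()

    def cache(self, name):
        self._cached.add(name)

    def is_cached(self, name):
        return name in self._cached

    def uncache(self, name):
        # Currently tolerant of uncached tables, but that is an internal
        # detail callers should not depend on.
        self._cached.discard(name)

def alter_table_add_columns(cache, name):
    # Explicit guard: correct even if uncache() later becomes strict.
    if cache.is_cached(name):
        cache.uncache(name)
    # ... proceed with the schema change ...

cache = TableCache()
cache.cache("t1")
alter_table_add_columns(cache, "t1")
assert not cache.is_cached("t1")
alter_table_add_columns(cache, "t2")  # safe: t2 was never cached
```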
Hmm, I think this branch is for datasource tables, but it looks like you don't support datasource tables yet in this change?
I think the variable name needs to change, since now both Hive tables and datasource tables populate the table properties with the schema, and both cases go through this path. I temporarily blocked ALTER ADD COLUMNS for datasource tables because I am not yet confident there are no holes. But according to @gatorsmile, it may be safe to support datasource tables too, so I am adding more test cases to confirm. I may remove the condition in this PR.
Columns are not optional for this case.
Force-pushed from 73b0243 to 26e0940
Test build #72320 has finished for PR 16626 at commit
Force-pushed from 26e0940 to e2c53a2
Test build #72341 has finished for PR 16626 at commit
Force-pushed from e2c53a2 to 88c2f48
Test build #72406 has started for PR 16626 at commit
retest this please
Test build #72447 has finished for PR 16626 at commit
This function should be a private function of AlterTableAddColumnsCommand, right?
Oh, this is a DDL util function.
Since this check is only used in AlterTableAddColumnsCommand, we do not need to move it here.
OK. I will move it to the AlterTableAddColumnsCommand class.
The provider could be empty if the table is a VIEW. Thus, please do not modify the utility function here. Add a private function in AlterTableAddColumnsCommand
I see. I will find another way. Thanks!
Call getTempViewOrPermanentTableMetadata instead of getTableMetadata. Then, you do not need the above check for temporary views. In addition, it also covers the cases for global temp views.
I see. Will do. Thanks!
When we store the metadata in the catalog, we unify different representations to orc, right? Can you find any case that breaks it?
I will double check this case. If orc is the only representation in CatalogTable.provider, I will reduce the logic here.
FileFormat only covers a few cases. It does not cover other external data sources. How about using a whitelist here in this function?
OK. I will use a whitelist of allowed FileFormat implementations.
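The whitelist idea discussed above can be sketched as follows. This is an illustrative Python sketch, not the PR's actual Scala code, and the provider names in ALLOWED_PROVIDERS are assumptions based on the formats this PR says are supported:

```python
# Hypothetical sketch of a provider whitelist for ALTER TABLE ADD COLUMNS:
# only providers whose schema handling is known to be safe are allowed.
# The set below is illustrative, not the exact list used in the PR.

ALLOWED_PROVIDERS = {"parquet", "json", "csv", "hive"}

def check_alter_add_columns_supported(provider):
    """Raise if the table's provider is not in the whitelist."""
    if provider is None or provider.lower() not in ALLOWED_PROVIDERS:
        raise ValueError(
            "ALTER TABLE ADD COLUMNS is not supported for provider: %s"
            % provider)

check_alter_add_columns_supported("parquet")   # accepted
try:
    # RelationProvider-based sources such as JDBC are rejected
    check_alter_add_columns_supported("jdbc")
except ValueError as e:
    print(e)
```

A whitelist fails closed: a newly added data source is unsupported until someone verifies its schema handling and adds it, which matches the cautious rollout discussed in this thread.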
Force-pushed from 88c2f48 to 0b7c0b1
Test build #72530 has finished for PR 16626 at commit
Test build #72540 has finished for PR 16626 at commit
Force-pushed from d67042f to 193c0c3
Force-pushed from a28fc42 to 1eb7cd3
Test build #74810 has started for PR 16626 at commit
retest this please
Test build #74816 has finished for PR 16626 at commit
// make sure partition columns are at the end
CatalogTable.partitionSchema will throw an exception if partition columns are not at the end, so we can just call partitionSchema; there is no need to do the reordering.
@cloud-fan Thanks! My understanding is that the caller may pass in a new schema that does not keep partition columns at the end. So I want to ensure that before passing it to externalCatalog.alterTableSchema.
How about I change the contract to require the caller to ensure the column ordering in newSchema before calling this function?
yea
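The ordering contract agreed above can be sketched in a few lines. This is an illustrative Python sketch with assumed function names, not Spark's actual implementation; it models a schema as a list of column names:

```python
# Sketch of the contract: the schema passed to alterTableSchema must keep all
# partition columns at the end. A caller can reorder before the call, and the
# catalog side can assert the invariant (as CatalogTable.partitionSchema does
# by raising when partition columns are not at the end).

def reorder_schema(new_schema, partition_cols):
    """Put data columns first, then partition columns in declared order."""
    partition_set = set(partition_cols)
    data_cols = [c for c in new_schema if c not in partition_set]
    return data_cols + list(partition_cols)

def assert_partitions_last(schema, partition_cols):
    """Fail if partition columns do not form the tail of the schema."""
    n = len(partition_cols)
    assert n == 0 or schema[-n:] == list(partition_cols), \
        "partition columns must be at the end of the schema"

# A newly added column lands before the partition column after reordering.
schema = reorder_schema(["c1", "c2", "c3_new"], partition_cols=["c2"])
assert schema == ["c1", "c3_new", "c2"]
assert_partitions_last(schema, ["c2"])
```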
test("alter table add columns to table referenced by a view") {
This is not needed; the view test already covers the case where the referenced table changes its schema.
OK. I will remove this test.
} else {
  if (isUsingHiveMetastore) {
    // hive catalog will still complains that c1 is duplicate column name because hive
    // identifiers are case insensitive.
actually we can fix this, as we store the schema in table properties.
@cloud-fan I just tested the data source table case, like create table t1 (c1 int, C1 int) using parquet with spark.sql.caseSensitive = true. Spark SQL does not complain; it just bounces back the exception from Hive, logged as a WARN message, and the table is created successfully. I am able to insert and select. But if I create a Hive serde table with create table t2 (c1 int, C1 int) stored as parquet, Hive will complain and fail to create the table. So for the data source case, should we fix anything regarding the WARN message? Thanks!
Ah right, for Hive we can only make it case-preserving, not case-sensitive. I was wrong.
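The duplicate-column check this thread keeps returning to can be sketched as follows. This is an illustrative Python sketch, not Spark's implementation; the flag models the spark.sql.caseSensitive setting:

```python
# Sketch of a duplicate-column-name check that honors a case-sensitivity flag
# (modeling spark.sql.caseSensitive). Under case-insensitive resolution,
# "c1" and "C1" collide; under case-sensitive resolution, they are distinct.

def find_duplicates(columns, case_sensitive):
    """Return column names that duplicate an earlier column."""
    seen, dups = set(), []
    for col in columns:
        key = col if case_sensitive else col.lower()
        if key in seen:
            dups.append(col)
        seen.add(key)
    return dups

# case-insensitive (the default): c1 and C1 collide
assert find_duplicates(["c1", "C1"], case_sensitive=False) == ["C1"]
# case-sensitive: they are treated as distinct columns
assert find_duplicates(["c1", "C1"], case_sensitive=True) == []
```

Note that, as discussed above, even when Spark's check passes under case-sensitive mode, the Hive metastore itself remains case-insensitive, so Hive may still reject or warn about such schemas.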
test("ALTER TABLE ADD COLUMNS") {
Are we going to test all the unsupported data sources? That's a lot, and unnecessary. I think the text format test is enough; let's remove the others.
Oh, you mean remove the tests from JDBCSuite and TableScanSuite?
test("ALTER TABLE ADD COLUMNS does not support RelationProvider") {
let's remove it
ok. Will do.
Seq("orc", "ORC", "org.apache.spark.sql.hive.orc",
let's remove it
OK. will do.
withSQLConf(SQLConf.CASE_SENSITIVE.key -> caseSensitive) {
  withTable("tab") {
    sql("CREATE TABLE tab (c1 int) PARTITIONED BY (c2 int) STORED AS PARQUET")
    if (caseSensitive == "false") {
if (!caseSensitive)
@cloud-fan @gatorsmile Thanks again! I updated the code based on @cloud-fan's review.
// assuming the newSchema has all partition columns at the end as required
externalCatalog.alterTableSchema(db, table, StructType(newSchema))
}
private?
externalCatalog.alterTableSchema(db, table, StructType(reorderedSchema))
// assuming the newSchema has all partition columns at the end as required
externalCatalog.alterTableSchema(db, table, StructType(newSchema))
StructType(newSchema) -> newSchema
sessionCatalog.alterTableSchema(
  TableIdentifier("t1", Some("default")), oldTab.schema.add("c3", IntegerType))
  TableIdentifier("t1", Some("default")),
  StructType(oldTab.dataSchema.add("c3", IntegerType) ++ partitionSchema))
partitionSchema -> oldTab.partitionSchema
withBasicCatalog { sessionCatalog =>
  sessionCatalog.createTable(newTable("t1", "default"), ignoreIfExists = false)
  val oldTab = sessionCatalog.externalCatalog.getTable("default", "t1")
  val partitionSchema = oldTab.partitionSchema
Remove this line
LGTM pending Jenkins.
Test build #74929 has finished for PR 16626 at commit
Test build #74937 has finished for PR 16626 at commit
retest this please
Test build #74943 has finished for PR 16626 at commit
retest this please
Test build #74952 has finished for PR 16626 at commit
Thanks! Merging to master.
Merged commit: [SPARK-19261][SQL] Alter add columns for Hive serde and some datasource tables. Author: Xin Wu <[email protected]>. Closes apache#16626 from xwu0226/alter_add_columns. (cherry picked from commit 4c0ff5f) (cherry picked from commit 297cfc9a604f9d098307e8c04e0aaafda87f5eff)
What changes were proposed in this pull request?
Support ALTER TABLE ADD COLUMNS (...) syntax for Hive serde and some datasource tables. In this PR, we consider a few aspects:
1. Views are not supported for ALTER ADD COLUMNS.
2. Since tables created in SparkSQL with Hive DDL syntax populate the table properties with schema information, we need to make sure the schema is consistent before and after the ALTER operation for future use.
3. For embedded-schema formats such as parquet, we need to make sure that predicates on the newly added columns can be evaluated or pushed down properly. When a data file does not contain the newly added columns, such predicates should behave as if the column values are NULLs.
4. For datasource tables, this feature does not support the following:
4.1 TEXT format, since only one default column, value, is inferred for text format data.
4.2 ORC format, since the SparkSQL native ORC reader does not support the difference between a user-specified schema and the schema inferred from ORC files.
4.3 Third-party datasource types that implement RelationProvider, including the built-in JDBC format, since different vendor implementations may deal with schema differently.
4.4 Other datasource types, such as parquet, json, csv, and hive, are supported.
5. Column names being added cannot duplicate any existing data column or partition column names. Case sensitivity is taken into account according to the SQL configuration.
6. This feature also supports the In-Memory catalog when Hive support is turned off.
How was this patch tested?
Add new test cases.