[SPARK-41713][SQL] Make CTAS hold a nested execution for data writing #39220
Conversation
```diff
-    val options = table.storage.properties
-    V1WritesUtils.getSortOrder(outputColumns, partitionColumns, table.bucketSpec, options)
-  }
+  extends LeafRunnableCommand {
```
This is the key change: CTAS is no longer a v1 write command.

cc @cloud-fan
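For readers skimming the diff, a minimal hand-written sketch of the idea (not the PR's actual code; `insertPlan` is a placeholder): the CTAS command is now a plain `LeafRunnableCommand`, and its `run` executes the write as a nested command, so the V1Writes machinery applies to the insert rather than to the CTAS itself.

```scala
// A hand-written sketch, NOT the PR's actual code; `insertPlan` is a placeholder.
package org.apache.spark.sql.execution.command // needed for private[sql] APIs

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

case class CtasSketch(query: LogicalPlan) extends LeafRunnableCommand {
  override def run(sparkSession: SparkSession): Seq[Row] = {
    // 1. Create the table in the catalog (elided in this sketch).
    // 2. Build the insertion plan for the new table; per the PR description it
    //    ends up as InsertIntoHadoopFsRelationCommand or InsertIntoHiveTable.
    val insertPlan: LogicalPlan = ???
    // 3. Execute it as its own command: a nested execution with its own
    //    execution ID, instead of CTAS carrying V1 write metadata itself.
    sparkSession.sessionState.executePlan(insertPlan).assertCommandExecuted()
    Seq.empty
  }
}
```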
sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala (resolved)
```diff
 import org.apache.spark.util.Utils

-trait CreateHiveTableAsSelectBase extends V1WriteCommand with V1WritesHiveUtils {
+trait CreateHiveTableAsSelectBase extends LeafRunnableCommand {
```
Do we still need `OptimizedCreateHiveTableAsSelectCommand`? The nested `InsertIntoHadoopFsRelationCommand` should be optimized instead.
Good question, but if we do not have `OptimizedCreateHiveTableAsSelectCommand`, how can we get `InsertIntoHadoopFsRelationCommand`? The current pipeline is: `CreateHiveTableAsSelectCommand` -> `OptimizedCreateHiveTableAsSelectCommand` -> `InsertIntoHadoopFsRelationCommand`.
The new pipeline can be: `CreateHiveTableAsSelectCommand` -> Hive insertion command -> `InsertIntoHadoopFsRelationCommand`.
How do we deal with the config `spark.sql.hive.convertMetastoreCtas`? We do not know whether a Hive insertion comes from a CTAS.
Or we could just deprecate this config.
Removing `OptimizedCreateHiveTableAsSelectCommand` is out of the scope of this PR; how about doing it in a separate PR?
sure
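For context on the config being discussed, a small illustration of its pre-existing behavior (a summary, not something this PR changes):

```scala
// spark.sql.hive.convertMetastoreCtas already controls whether a Hive-serde
// CTAS over Parquet/ORC is converted to Spark's native file-source write path.
spark.conf.set("spark.sql.hive.convertMetastoreCtas", "true")
spark.sql("CREATE TABLE t STORED AS PARQUET AS SELECT 1 AS a")
// With conversion on, the write is planned as InsertIntoHadoopFsRelationCommand;
// with it off, as InsertIntoHiveTable.
```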
```diff
 assert(planInfo.children.size == 1)
 assert(planInfo.children.head.nodeName ==
-  "Execute CreateDataSourceTableAsSelectCommand")
+  "Execute InsertIntoHadoopFsRelationCommand")
```
Shall we check two items? One is `CreateDataSourceTableAsSelectCommand` and the other is `InsertIntoHadoopFsRelationCommand`.
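A sketch of what the two checks could look like (`planInfos` and the exact node names are assumptions based on the surrounding diff, not the PR's final test code):

```scala
// Sketch only: with CTAS triggering a nested execution, a listener should
// observe both the CTAS command and the nested file-write command.
val nodeNames = planInfos.map(_.nodeName)
assert(nodeNames.contains("Execute CreateDataSourceTableAsSelectCommand"))
assert(nodeNames.contains("Execute InsertIntoHadoopFsRelationCommand"))
```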
```diff
-assert(commands(4)._1 == "command")
-assert(commands(4)._2.isInstanceOf[CreateDataSourceTableAsSelectCommand])
-assert(commands(4)._2.asInstanceOf[CreateDataSourceTableAsSelectCommand]
+assert(commands.length == 6)
```
Shall we add a comment to explain it?
| "InsertIntoHiveTable", | ||
| "Limit", | ||
| "src") | ||
| "== Physical Plan ==") |
Shall we at least check something?
sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala (resolved)
```scala
override def fileFormatProvider: Boolean = {
  table.provider.forall { provider =>
    classOf[FileFormat].isAssignableFrom(DataSource.providingClass(provider, conf))
```
Shall we revert the change in `DataSource.scala` as well?
@cloud-fan addressed all comments
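For reference, the intent of the `fileFormatProvider` check above can be sketched with the long-standing `DataSource.lookupDataSource` helper (the `providingClass(provider, conf)` helper in the diff is this PR's addition, so this is an approximation, not the PR's code):

```scala
import org.apache.spark.sql.execution.datasources.{DataSource, FileFormat}
import org.apache.spark.sql.internal.SQLConf

// Sketch: a CTAS should only plan a file-based insertion when the table's
// provider resolves to a FileFormat data source.
def isFileFormatProvider(provider: String, conf: SQLConf): Boolean =
  classOf[FileFormat].isAssignableFrom(DataSource.lookupDataSource(provider, conf))
```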
```scala
  extends LeafRunnableCommand {

  assert(query.resolved)
  override def innerChildren: Seq[LogicalPlan] = query :: Nil
```
Can we put the EXPLAIN result of a CTAS in the PR description as an example?
Sure, updated.
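For readers following along, an illustrative shape of that EXPLAIN output (a hand-written approximation, not the PR's verbatim result): because the command overrides `innerChildren`, the SELECT query is rendered nested under the CTAS node.

```scala
spark.sql("EXPLAIN CREATE TABLE t USING PARQUET AS SELECT 'a' AS a, 1 AS b")
  .show(truncate = false)
// == Physical Plan ==
// Execute CreateDataSourceTableAsSelectCommand
//    +- CreateDataSourceTableAsSelectCommand `default`.`t`, [a, b]
//          +- Project ['a' AS a, 1 AS b]
//             +- OneRowRelation
```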
Thanks, merging to master!
### What changes were proposed in this pull request?
This PR proposes to group all sub-executions together in the SQL UI if they belong to the same root execution. This feature is controlled by the conf `spark.ui.sql.groupSubExecutionEnabled`, and the default value is set to `true`.

We can have some follow-up improvements after this PR:
1. Add links to the SQL page and Job page to indicate the root execution ID.
2. Better handling for the case where the root execution is missing (e.g. eviction due to the retaining limit); in this PR, such sub-executions will be displayed ungrouped.

### Why are the changes needed?
Better user experience. In PR #39220, a CTAS query will trigger a sub-execution to perform the data insertion, but the current UI will display the two executions separately, which may confuse users. In addition, this change should also help structured streaming cases.

### Does this PR introduce _any_ user-facing change?
Yes; screenshots of the UI change are shown below.

SQL query:
```
CREATE TABLE t USING PARQUET AS SELECT 'a' as a, 1 as b
```

UI before this PR

<img width="1074" alt="Screen Shot 2022-12-28 at 4 42 08 PM" src="https://user-images.githubusercontent.com/67896261/209889679-83909bc9-0e15-4ff1-9aeb-3118e4bab524.png">

UI after this PR with sub-executions collapsed

<img width="1072" alt="Screen Shot 2022-12-28 at 4 44 32 PM" src="https://user-images.githubusercontent.com/67896261/209889688-973a4ec9-a5dc-4a8b-8618-c0800733fffa.png">

UI after this PR with sub-execution expanded

<img width="1069" alt="Screen Shot 2022-12-28 at 4 44 41 PM" src="https://user-images.githubusercontent.com/67896261/209889718-0e24be12-23d6-4f81-a508-15eac62ec231.png">

### How was this patch tested?
UT

Closes #39268 from linhongliu-db/SPARK-41752.

Authored-by: Linhong Liu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
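A minimal usage sketch of the conf above (assuming that, like other `spark.ui.*` settings, it is set at application startup rather than per-session; `true` is the default per the description):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: enable (or disable) sub-execution grouping in the SQL UI.
val spark = SparkSession.builder()
  .config("spark.ui.sql.groupSubExecutionEnabled", "true") // default: true
  .getOrCreate()
```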
…rom `DataSource`

### What changes were proposed in this pull request?
`resolvePartitionColumns` was introduced by SPARK-37287 (#37099) and became unused after SPARK-41713 (#39220), so this PR removes it from `DataSource`.

### Why are the changes needed?
Clean up unused code.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GitHub Actions.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43779 from LuciferYang/SPARK-45902.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
This PR aims to make CTAS use a nested execution instead of running a data writing command. So we can clean up CTAS itself to remove the unnecessary v1 write information. Now the v1 writes have only two implementations: `InsertIntoHadoopFsRelationCommand` and `InsertIntoHiveTable`.

### Why are the changes needed?
Make the v1 writes code clearer.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Improved existing tests.
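Putting it together, an illustration of the resulting behavior (the comments summarize the descriptions and test assertions above; they are not captured output):

```scala
// A CTAS now produces two linked executions rather than one v1 write command.
spark.sql("CREATE TABLE t USING PARQUET AS SELECT 'a' AS a, 1 AS b")
// - root execution:   Execute CreateDataSourceTableAsSelectCommand
// - nested execution: Execute InsertIntoHadoopFsRelationCommand
//   (grouped under the root in the SQL UI by SPARK-41752)
```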