Conversation

Contributor

@ulysses-you commented Dec 26, 2022

What changes were proposed in this pull request?

This PR aims to make CTAS use a nested execution instead of running the data writing command directly.

This lets us clean up CTAS itself and remove the now-unnecessary V1 write information. After this change, V1Writes has only two implementations: InsertIntoHadoopFsRelationCommand and InsertIntoHiveTable.
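The nested-execution pattern can be sketched roughly as follows. This is plain Scala with hypothetical class names, not the actual Spark commands: a CTAS command creates the table and then delegates the data writing to a nested insert command, instead of being a V1 write command itself.

```scala
// Illustrative model only -- not the real Spark classes.
trait Command { def run(): Seq[String] }

// The remaining V1 write implementations are insert commands like this one.
case class InsertCommand(table: String, numRows: Int) extends Command {
  def run(): Seq[String] = Seq(s"insert $numRows rows into $table")
}

// CTAS creates the table, then triggers a nested execution of the insert
// command, so CTAS itself no longer carries any V1 write information.
case class CtasCommand(table: String, numRows: Int) extends Command {
  def run(): Seq[String] =
    s"create table $table" +: InsertCommand(table, numRows).run()
}
```

Under this sketch, `CtasCommand("t2", 2).run()` yields the create step followed by the nested insert step, mirroring the two-level plan shown in the EXPLAIN output below.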

Why are the changes needed?

Make the V1Writes code clearer.

```
EXPLAIN FORMATTED CREATE TABLE t2 USING PARQUET AS SELECT * FROM t;
```

```
== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand (1)
   +- CreateDataSourceTableAsSelectCommand (2)
         +- Project (5)
            +- SubqueryAlias (4)
               +- LogicalRelation (3)

(1) Execute CreateDataSourceTableAsSelectCommand
Output: []

(2) CreateDataSourceTableAsSelectCommand
Arguments: `spark_catalog`.`default`.`t2`, ErrorIfExists, [c1, c2]

(3) LogicalRelation
Arguments: parquet, [c1#11, c2#12], `spark_catalog`.`default`.`t`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, false

(4) SubqueryAlias
Arguments: spark_catalog.default.t

(5) Project
Arguments: [c1#11, c2#12]
```

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Improved existing tests.

github-actions bot added the SQL label Dec 26, 2022
val options = table.storage.properties
V1WritesUtils.getSortOrder(outputColumns, partitionColumns, table.bucketSpec, options)
}
extends LeafRunnableCommand {
Contributor Author

This is the key change: now CTAS is not a V1 write command.

@ulysses-you
Contributor Author

cc @cloud-fan

import org.apache.spark.util.Utils

trait CreateHiveTableAsSelectBase extends V1WriteCommand with V1WritesHiveUtils {
trait CreateHiveTableAsSelectBase extends LeafRunnableCommand {
Contributor

Do we still need OptimizedCreateHiveTableAsSelectCommand? The nested InsertIntoHadoopFsRelationCommand should be optimized instead.

Contributor Author

Good question, but if we do not have OptimizedCreateHiveTableAsSelectCommand, how can we get InsertIntoHadoopFsRelationCommand?

The pipeline is: CreateHiveTableAsSelectCommand -> OptimizedCreateHiveTableAsSelectCommand -> InsertIntoHadoopFsRelationCommand

Contributor

The new pipeline can be CreateHiveTableAsSelectCommand -> hive insertion command -> InsertIntoHadoopFsRelationCommand

Contributor Author

How do we deal with the config `spark.sql.hive.convertMetastoreCtas`? We do not know whether the Hive insertion comes from a CTAS.

Contributor Author

Or we could just deprecate this config.

Contributor Author

Removing OptimizedCreateHiveTableAsSelectCommand is out of the scope of this PR; how about doing it in a separate PR?

Contributor

Sure.

assert(planInfo.children.size == 1)
assert(planInfo.children.head.nodeName ==
"Execute CreateDataSourceTableAsSelectCommand")
"Execute InsertIntoHadoopFsRelationCommand")
Contributor

Shall we check two items? One is CreateDataSourceTableAsSelectCommand and the other is InsertIntoHadoopFsRelationCommand.

assert(commands(4)._1 == "command")
assert(commands(4)._2.isInstanceOf[CreateDataSourceTableAsSelectCommand])
assert(commands(4)._2.asInstanceOf[CreateDataSourceTableAsSelectCommand]
assert(commands.length == 6)
Contributor

Shall we add a comment to explain it?

"InsertIntoHiveTable",
"Limit",
"src")
"== Physical Plan ==")
Contributor

Shall we at least check something?


override def fileFormatProvider: Boolean = {
table.provider.forall { provider =>
classOf[FileFormat].isAssignableFrom(DataSource.providingClass(provider, conf))
Contributor

Shall we revert the change in DataSource.scala as well?

@ulysses-you
Contributor Author

@cloud-fan Addressed all comments.

}
extends LeafRunnableCommand {
assert(query.resolved)
override def innerChildren: Seq[LogicalPlan] = query :: Nil
Contributor

Can we put the EXPLAIN result of a CTAS in the PR description as an example?

Contributor Author

Sure, updated.

@cloud-fan
Contributor

Thanks, merging to master!

@cloud-fan cloud-fan closed this in 4b40920 Dec 28, 2022
@ulysses-you ulysses-you deleted the SPARK-41713 branch December 28, 2022 10:09
cloud-fan pushed a commit that referenced this pull request Jan 10, 2023
### What changes were proposed in this pull request?
This PR proposes to group all sub-executions together in SQL UI if they belong to the same root execution.

This feature is controlled by conf `spark.ui.sql.groupSubExecutionEnabled` and the default value is set to `true`

We can have some follow-up improvements after this PR:
1. Add links to SQL page and Job page to indicate the root execution ID.
2. Better handling for the root execution missing case (e.g. eviction due to retaining limit). In this PR, the sub-executions will be displayed ungrouped.
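The grouping logic, including the fallback for a missing root execution, can be sketched roughly as follows. This is plain Scala with hypothetical types, not the actual Spark UI code: sub-executions are attached to their root execution when it is present, and shown ungrouped when the root has been evicted.

```scala
// Illustrative model only -- not the real Spark UI store classes.
// An execution whose rootId equals its own id is a root execution.
case class ExecutionData(id: Long, rootId: Long)

def groupByRoot(execs: Seq[ExecutionData]): Seq[(ExecutionData, Seq[ExecutionData])] = {
  val byId = execs.map(e => e.id -> e).toMap
  val (roots, subs) = execs.partition(e => e.rootId == e.id)
  val grouped = subs.groupBy(_.rootId)
  // Each root execution is listed with its sub-executions collapsed under it.
  val withChildren = roots.map(r => r -> grouped.getOrElse(r.id, Seq.empty[ExecutionData]))
  // Sub-executions whose root is missing (e.g. evicted due to the retaining
  // limit) fall back to being displayed ungrouped, as described above.
  val orphans = subs.filterNot(s => byId.contains(s.rootId)).map(o => o -> Seq.empty[ExecutionData])
  withChildren ++ orphans
}
```

For example, a CTAS root execution with one insertion sub-execution would collapse into a single group, while a sub-execution whose root was evicted would still appear as its own top-level row.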

### Why are the changes needed?
Better user experience.

In PR #39220, the CTAS query triggers a sub-execution to perform the data insertion, but the current UI displays the two executions separately, which may confuse users.
In addition, this change should also help structured streaming cases.

### Does this PR introduce _any_ user-facing change?
Yes, the screenshot of the UI change is shown below
SQL Query:
```
CREATE TABLE t USING PARQUET AS SELECT 'a' as a, 1 as b
```
UI before this PR
<img width="1074" alt="Screen Shot 2022-12-28 at 4 42 08 PM" src="https://user-images.githubusercontent.com/67896261/209889679-83909bc9-0e15-4ff1-9aeb-3118e4bab524.png">

UI after this PR with sub executions collapsed
<img width="1072" alt="Screen Shot 2022-12-28 at 4 44 32 PM" src="https://user-images.githubusercontent.com/67896261/209889688-973a4ec9-a5dc-4a8b-8618-c0800733fffa.png">

UI after this PR with sub execution expanded
<img width="1069" alt="Screen Shot 2022-12-28 at 4 44 41 PM" src="https://user-images.githubusercontent.com/67896261/209889718-0e24be12-23d6-4f81-a508-15eac62ec231.png">

### How was this patch tested?
Unit tests.

Closes #39268 from linhongliu-db/SPARK-41752.

Authored-by: Linhong Liu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Nov 13, 2023
…rom `DataSource`

### What changes were proposed in this pull request?
`resolvePartitionColumns` was introduced by SPARK-37287 (#37099) and became unused after SPARK-41713 (#39220), so this PR removes it from `DataSource`.

### Why are the changes needed?
Clean up unused code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43779 from LuciferYang/SPARK-45902.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>