[SPARK-4131] Support "Writing data into the filesystem from queries" #18975
Conversation
ok to test
```scala
 * operation to the logical plan.
 */
protected override def withInsertInto(ctx: InsertIntoContext,
    query: LogicalPlan): LogicalPlan = withOrigin(ctx) {
```
Indents
fixed
@janewangfb Thank you for working on it! The implementation in the current PR is very specific to Hive tables. To support such a command, could you also support data source tables?
Test build #80804 has finished for PR 18975 at commit
@gatorsmile Originally, because we have a lot of Hive SQL queries that we want to support in Spark, I implemented the Hive syntax for this command. But now I see that in SparkSqlParser.scala we have both visitCreateTable and visitCreateHiveTable.
Since our native data source tables perform faster than Hive serde tables, we expect that Spark users might prefer using data source tables. Thanks for your work!
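For context, here is a rough sketch of the two syntax variants under discussion, written against a Hive-enabled session; the `src` table and both output paths are made-up placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Assumes a Hive-enabled session; `src` and the output paths are placeholders.
val spark = SparkSession.builder()
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// Hive-serde variant: the output format follows Hive's STORED AS syntax.
spark.sql(
  """INSERT OVERWRITE DIRECTORY '/tmp/hive_serde_out'
    |STORED AS PARQUET
    |SELECT * FROM src""".stripMargin)

// Data source variant: the format is named with USING, as for data source tables.
spark.sql(
  """INSERT OVERWRITE DIRECTORY '/tmp/data_source_out'
    |USING parquet
    |SELECT * FROM src""".stripMargin)
```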
Test build #80874 has finished for PR 18975 at commit
Still need to implement the data source table portion.
Test build #80917 has finished for PR 18975 at commit
Added support for writing out in the data source format.
```scala
    ctx: InsertOverwriteDirContext): InsertDirParams = withOrigin(ctx) {
  if (ctx.LOCAL != null) {
    // LOCAL is accepted by the grammar but rejected here with a targeted error.
    throw new ParseException(
      "LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source", ctx)
```
If we don't support LOCAL for data source, should we remove it from the parsing rule?
Originally, LOCAL was not added. @gatorsmile commented that without it in the rule the parser might throw a confusing exception, so he requested that it be added.
```scala
      tmpFile => fs.rename(tmpFile.getPath, writeToPath)
    }

    deleteExternalTmpPath(hadoopConf)
```
We should also try to remove the external tmp path when an exception happens.
good point. updated.
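For illustration, a minimal sketch of that cleanup pattern with the Hadoop FileSystem API; `writeViaTmpPath` is a hypothetical helper (the PR's actual code calls `deleteExternalTmpPath` inside the command), but the try/finally shape is the point:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: run the write against a tmp path, move the results
// into the destination, and remove the tmp path even if something throws.
def writeViaTmpPath(hadoopConf: Configuration, tmpPath: Path, destPath: Path)
                   (write: Path => Unit): Unit = {
  val fs: FileSystem = tmpPath.getFileSystem(hadoopConf)
  try {
    write(tmpPath)                               // produce files under tmpPath
    fs.listStatus(tmpPath).foreach { tmpFile =>  // move the results into place
      fs.rename(tmpFile.getPath, destPath)
    }
  } finally {
    fs.delete(tmpPath, true)                     // always clean up the tmp dir
  }
}
```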
Looks pretty good; left a few minor comments. Thanks for working on this.
Test build #81576 has finished for PR 18975 at commit
LGTM pending Jenkins. Thanks again!
Test build #81593 has finished for PR 18975 at commit
Test build #81594 has finished for PR 18975 at commit
Thanks! Merged to master.
```scala
    isLocal: Boolean,
    storage: CatalogStorageFormat,
    query: LogicalPlan,
    overwrite: Boolean) extends SaveAsHiveFile with HiveTmpPath {
```
Why do we separate SaveAsHiveFile and HiveTmpPath when we always use them together?
Sure, will submit a follow-up PR soon.
@cloud-fan and @gatorsmile, I will merge them together and submit a PR.
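Roughly, the shape of that follow-up, with method bodies reduced to toy stubs; only the trait names `SaveAsHiveFile` and `HiveTmpPath` come from this thread:

```scala
// Before: the write helper and the tmp-path helpers lived in separate traits
// that were always mixed in together.
trait HiveTmpPath {
  def getExternalTmpPath(): String = "/tmp/hive-staging" // stub
}
trait SaveAsHiveFileOld { self: HiveTmpPath =>
  def saveAsHiveFile(): Unit = println(s"writing to ${getExternalTmpPath()}") // stub
}

// After the follow-up: a single trait carries both responsibilities.
trait SaveAsHiveFile {
  def getExternalTmpPath(): String = "/tmp/hive-staging" // stub
  def saveAsHiveFile(): Unit = println(s"writing to ${getExternalTmpPath()}") // stub
}
```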
## What changes were proposed in this pull request?

The code is already merged to master: apache#18975. This is a follow-up PR to merge HiveTmpFile.scala into SaveAsHiveFile.

## How was this patch tested?

Built successfully.

Author: Jane Wang <[email protected]>

Closes apache#19221 from janewangfb/merge_savehivefile_hivetmpfile.
…m queries"

## What changes were proposed in this pull request?

This PR cleans up the code in apache#18975.

## How was this patch tested?

N/A

Author: gatorsmile <[email protected]>

Closes apache#19225 from gatorsmile/refactorSPARK-4131.
```scala
    val saveMode = if (overwrite) SaveMode.Overwrite else SaveMode.ErrorIfExists
    try {
      sparkSession.sessionState.executePlan(dataSource.planForWriting(saveMode, query))
      dataSource.writeAndRead(saveMode, query)
```
The implementation here confused me; I just want to leave a question: why should we call both writeAndRead and planForWriting?
@janewangfb @gatorsmile @cloud-fan
Yes, we should get rid of dataSource.writeAndRead. @xuanyuanking Could you submit a PR to fix the issue?
@gatorsmile Thanks for your reply, I'll try to fix this.
## What changes were proposed in this pull request?

As discussed in #16481 and #18975 (comment), the BaseRelation returned by `dataSource.writeAndRead` is currently only used in `CreateDataSourceTableAsSelect`, and planForWriting and writeAndRead share some common code paths. In this patch I removed the writeAndRead function and added a getRelation function that is used only in `CreateDataSourceTableAsSelectCommand` while saving data to a non-existing table.

## How was this patch tested?

Existing UT.

Author: Yuanjian Li <[email protected]>

Closes #19941 from xuanyuanking/SPARK-22753.
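A rough sketch of the corrected call site after that follow-up: the write is planned once and executed, with no second `writeAndRead` call. `DataSource` is internal to the spark-sql package, so this is illustrative rather than something user code can compile; `runInsertOverwriteDir` is a hypothetical wrapper whose parameter names mirror the diff above:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.DataSource

// Sketch: plan the write once and execute it; no redundant second write path.
def runInsertOverwriteDir(
    sparkSession: SparkSession,
    dataSource: DataSource,
    query: LogicalPlan,
    overwrite: Boolean): Unit = {
  val saveMode = if (overwrite) SaveMode.Overwrite else SaveMode.ErrorIfExists
  sparkSession.sessionState.executePlan(
    dataSource.planForWriting(saveMode, query)).toRdd
}
```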
@gatorsmile @janewangfb I have a question: I see that `insert overwrite directory 'file:///opt/table2' using parquet select * from table1;` still works.
…a source

### What changes were proposed in this pull request?

`INSERT OVERWRITE LOCAL DIRECTORY` is supported by ensuring the provided path always uses `file://` as its scheme and by removing the check that throws an exception when an insert overwrite names a directory with the `LOCAL` syntax.

### Why are the changes needed?

Without the modification in this PR,

```
insert overwrite local directory <location> using
```

throws

```
Error: org.apache.spark.sql.catalyst.parser.ParseException: LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source(line 1, pos 0)
```

which was introduced in #18975. This restriction is not needed, hence it is dropped, keeping behaviour consistent for local and remote file systems in `INSERT OVERWRITE DIRECTORY`.

### Does this PR introduce any user-facing change?

Yes; after this change `INSERT OVERWRITE LOCAL DIRECTORY` will not throw an exception.

### How was this patch tested?

Added UT.

Closes #27039 from ajithme/insertoverwrite2.
Authored-by: Ajith <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
What changes were proposed in this pull request?
This PR implements the SQL feature:
```sql
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format]
SELECT ... FROM ...
```
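For example, a hypothetical invocation of the feature from a Hive-enabled `SparkSession` (the `people` table, its columns, and the output directory are illustrative):

```scala
// Illustrative only: `people` and the output directory are made-up names.
spark.sql(
  """INSERT OVERWRITE LOCAL DIRECTORY '/tmp/people_csv'
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |STORED AS TEXTFILE
    |SELECT name, age FROM people""".stripMargin)
```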
How was this patch tested?
Added new unit tests and also pulled the code into fb-spark so that we could test writing to an HDFS directory.