Conversation

@janewangfb
Contributor

What changes were proposed in this pull request?

This PR implements the SQL feature:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format]
SELECT ... FROM ...
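
For example, the command can be issued through the Scala API like this (an illustrative sketch; the output path, delimiter, and source table are made-up names):

```scala
// Illustrative sketch of the new syntax; the path and table name are
// made up. `spark` is the usual SparkSession.
spark.sql("""
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/destination'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  SELECT id, name FROM source_table
""")
```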

How was this patch tested?

Added new unittests and also pulled the code to fb-spark so that we could test writing to hdfs directory.

@gatorsmile
Member

ok to test

* operation to the logical plan.
*/
protected override def withInsertInto(ctx: InsertIntoContext,
query: LogicalPlan): LogicalPlan = withOrigin(ctx) {
Member

Indents

Contributor Author

fixed

@gatorsmile
Member

@janewangfb Thank you for working on it! The implementation in the current PR is very specific to Hive tables. To support such a command, could you also support data source tables?

@SparkQA

SparkQA commented Aug 17, 2017

Test build #80804 has finished for PR 18975 at commit b9db02e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@janewangfb
Contributor Author

@gatorsmile Originally, because we have a lot of Hive SQL queries that we wanted to support in Spark, I implemented the Hive syntax for this command:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries

But now I see that in SparkSqlParser.scala, we have both visitCreateTable and visitCreateHiveTable.
I think we could implement both for this command.

@gatorsmile
Member

Since our native data source tables are faster than Hive serde tables, we expect that Spark users might prefer using data source tables.

Thanks for your work!

@SparkQA

SparkQA commented Aug 19, 2017

Test build #80874 has finished for PR 18975 at commit 7f5664d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@janewangfb janewangfb closed this Aug 21, 2017
@janewangfb
Contributor Author

Still need to implement the data source table portion.

@SparkQA

SparkQA commented Aug 21, 2017

Test build #80917 has finished for PR 18975 at commit 068662a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class AnalyzePartitionCommand(

@janewangfb
Contributor Author

Added support for writing out data in data source formats.
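
For example, the data source variant can be exercised like this (a sketch; the path and table name are illustrative, and the `USING` clause matches the form shown working later in this thread):

```scala
// Illustrative sketch of the data source variant of the command;
// '/tmp/ds_out' and source_table are made-up names.
spark.sql("""
  INSERT OVERWRITE DIRECTORY '/tmp/ds_out'
  USING parquet
  SELECT * FROM source_table
""")
```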

@janewangfb janewangfb reopened this Aug 21, 2017
      ctx: InsertOverwriteDirContext): InsertDirParams = withOrigin(ctx) {
    if (ctx.LOCAL != null) {
      throw new ParseException(
        "LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source", ctx)
Member

If we don't support LOCAL for data source, should we remove it from the parsing rule?

Contributor Author

Originally, LOCAL was not in the rule. @gatorsmile commented that leaving it out might cause the parser to throw a confusing exception, so he requested that it be added.

        tmpFile => fs.rename(tmpFile.getPath, writeToPath)
      }

      deleteExternalTmpPath(hadoopConf)
Member

We should also try to remove the external tmp path when an exception happens.

Contributor Author

Good point. Updated.
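
A minimal sketch of the resulting pattern (writeWithCleanup is a hypothetical helper, not the method in this PR; the actual write-and-rename logic is elided):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch: remove the external tmp path on both the success and the
// failure path. writeWithCleanup is a hypothetical helper; the write
// body itself is elided.
def writeWithCleanup(hadoopConf: Configuration, tmpPath: Path)(write: => Unit): Unit = {
  try {
    write // e.g. write files under tmpPath, then rename them into the target
  } finally {
    val fs = tmpPath.getFileSystem(hadoopConf)
    fs.delete(tmpPath, true) // recursive delete of the staging directory
  }
}
```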

@viirya
Member

viirya commented Sep 9, 2017

Looks pretty good; left a few minor comments. Thanks for working on this.

@SparkQA

SparkQA commented Sep 9, 2017

Test build #81576 has finished for PR 18975 at commit 7919041.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM pending Jenkins

Thanks again!

@SparkQA

SparkQA commented Sep 9, 2017

Test build #81593 has finished for PR 18975 at commit 81382df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 9, 2017

Test build #81594 has finished for PR 18975 at commit f93d57a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in f767905 Sep 9, 2017
@gatorsmile
Member

Thanks! Merged to master.

    isLocal: Boolean,
    storage: CatalogStorageFormat,
    query: LogicalPlan,
    overwrite: Boolean) extends SaveAsHiveFile with HiveTmpPath {
Contributor

Why do we separate SaveAsHiveFile and HiveTmpPath when we always use them together?

Member

Sure, will submit a follow-up PR soon.

Contributor Author

@cloud-fan and @gatorsmile, I will merge them together and submit a PR.

ghost pushed a commit to dbtsai/spark that referenced this pull request Sep 13, 2017
## What changes were proposed in this pull request?

The code is already merged to master:
apache#18975

This is a follow-up PR to merge HiveTmpFile.scala into SaveAsHiveFile.

## How was this patch tested?

Builds successfully.

Author: Jane Wang <[email protected]>

Closes apache#19221 from janewangfb/merge_savehivefile_hivetmpfile.
ghost pushed a commit to dbtsai/spark that referenced this pull request Sep 14, 2017
…m queries"

## What changes were proposed in this pull request?
This PR cleans up the code in apache#18975

## How was this patch tested?
N/A

Author: gatorsmile <[email protected]>

Closes apache#19225 from gatorsmile/refactorSPARK-4131.
    val saveMode = if (overwrite) SaveMode.Overwrite else SaveMode.ErrorIfExists
    try {
      sparkSession.sessionState.executePlan(dataSource.planForWriting(saveMode, query))
      dataSource.writeAndRead(saveMode, query)
Member

The implementation here confuses me; I just want to leave a question: why do we call both writeAndRead and planForWriting?
@janewangfb @gatorsmile @cloud-fan

Member

Yes, we should get rid of dataSource.writeAndRead. @xuanyuanking, could you submit a PR to fix the issue?

Member

@gatorsmile Thanks for your reply; I'll try to fix this.
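
A sketch of the direction discussed here (an assumption about the fix, not the merged code: execute the planned write once, instead of also calling dataSource.writeAndRead, which would write a second time):

```scala
// Sketch reusing the names from the snippet above (sparkSession, dataSource,
// overwrite, query). executePlan alone only builds a QueryExecution; forcing
// toRdd runs the write, so the writeAndRead call is no longer needed here.
val saveMode = if (overwrite) SaveMode.Overwrite else SaveMode.ErrorIfExists
sparkSession.sessionState
  .executePlan(dataSource.planForWriting(saveMode, query))
  .toRdd
```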

asfgit pushed a commit that referenced this pull request Dec 15, 2017
## What changes were proposed in this pull request?

As discussed in #16481 and #18975 (comment), the BaseRelation returned by `dataSource.writeAndRead` is currently only used in `CreateDataSourceTableAsSelect`, and `planForWriting` and `writeAndRead` share some common code paths.
In this patch I removed the `writeAndRead` function and added a `getRelation` function that is only used in `CreateDataSourceTableAsSelectCommand` when saving data to a non-existing table.

## How was this patch tested?

Existing UT

Author: Yuanjian Li <[email protected]>

Closes #19941 from xuanyuanking/SPARK-22753.
@ajithme
Contributor

ajithme commented Aug 21, 2019

#18975 (comment)

@gatorsmile @janewangfb I have a question: is there any particular reason why LOCAL was excluded?
`LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source`

As I see it, `insert overwrite directory 'file:///opt/table2' using parquet select * from table1;` still works.

HyukjinKwon pushed a commit that referenced this pull request Feb 18, 2020
…a source

### What changes were proposed in this pull request?
`INSERT OVERWRITE LOCAL DIRECTORY` is now supported by ensuring the provided path always uses `file://` as its scheme and by removing the check that threw an exception when an insert overwrite named a directory with the `LOCAL` syntax.

### Why are the changes needed?
Without the modification in this PR, `insert overwrite local directory <location> using ...` throws an exception:

```
Error: org.apache.spark.sql.catalyst.parser.ParseException:

LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source(line 1, pos 0)
```
This check was introduced in #18975, but the restriction is not needed, so it is dropped here. This keeps behaviour consistent between local and remote file systems in `INSERT OVERWRITE DIRECTORY`.

### Does this PR introduce any user-facing change?
Yes, after this change `INSERT OVERWRITE LOCAL DIRECTORY` will not throw an exception.

### How was this patch tested?
Added UT

Closes #27039 from ajithme/insertoverwrite2.

Authored-by: Ajith <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020