[SPARK-4131] Support "Writing data into the filesystem from queries" #18975
Conversation
ok to test
```scala
 * operation to the logical plan.
 */
protected override def withInsertInto(ctx: InsertIntoContext,
    query: LogicalPlan): LogicalPlan = withOrigin(ctx) {
```
Indents
fixed
@janewangfb Thank you for working on it! The implementation in the current PR is very specific to Hive tables. To support such a command, could you also support data source tables?
Test build #80804 has finished for PR 18975 at commit
@gatorsmile Originally, because we have a lot of Hive SQL queries that we want to support in Spark, I implemented the Hive syntax for this command. But now I see that in SparkSqlParser.scala we have both visitCreateTable and visitCreateHiveTable.
Since our native data source tables perform faster than Hive serde tables, we expect that Spark users might prefer using data source tables. Thanks for your work!
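For context, here is a rough sketch of the two syntax variants under discussion, written against a Hive-enabled session; the `src` table and both output paths are made-up placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Assumes a Hive-enabled session; `src` and the output paths are placeholders.
val spark = SparkSession.builder()
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// Hive-serde variant: the output format follows Hive's STORED AS syntax.
spark.sql(
  """INSERT OVERWRITE DIRECTORY '/tmp/hive_serde_out'
    |STORED AS PARQUET
    |SELECT * FROM src""".stripMargin)

// Data source variant: the format is named with USING, as for data source tables.
spark.sql(
  """INSERT OVERWRITE DIRECTORY '/tmp/data_source_out'
    |USING parquet
    |SELECT * FROM src""".stripMargin)
```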
Test build #80874 has finished for PR 18975 at commit
Still need to implement the data source table portion.
Test build #80917 has finished for PR 18975 at commit
Added support for writing out in the data source format.
```scala
    ctx: InsertOverwriteDirContext): InsertDirParams = withOrigin(ctx) {
  if (ctx.LOCAL != null) {
    // LOCAL is accepted by the grammar but rejected here with a targeted error.
    throw new ParseException(
      "LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source", ctx)
```
If we don't support LOCAL for data source, should we remove it from the parsing rule?
Originally, LOCAL was not added. @gatorsmile commented that without it in the rule the parser might throw a confusing exception, so he requested that it be added.
```scala
      tmpFile => fs.rename(tmpFile.getPath, writeToPath)
    }

    deleteExternalTmpPath(hadoopConf)
```
We should also try to remove the external tmp path when an exception happens.
good point. updated.
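For illustration, a minimal sketch of that cleanup pattern with the Hadoop FileSystem API; `writeViaTmpPath` is a hypothetical helper (the PR's actual code calls `deleteExternalTmpPath` inside the command), but the try/finally shape is the point:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: run the write against a tmp path, move the results
// into the destination, and remove the tmp path even if something throws.
def writeViaTmpPath(hadoopConf: Configuration, tmpPath: Path, destPath: Path)
                   (write: Path => Unit): Unit = {
  val fs: FileSystem = tmpPath.getFileSystem(hadoopConf)
  try {
    write(tmpPath)                               // produce files under tmpPath
    fs.listStatus(tmpPath).foreach { tmpFile =>  // move the results into place
      fs.rename(tmpFile.getPath, destPath)
    }
  } finally {
    fs.delete(tmpPath, true)                     // always clean up the tmp dir
  }
}
```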
Looks pretty good; left a few minor comments. Thanks for working on this.
Test build #81576 has finished for PR 18975 at commit
LGTM pending Jenkins. Thanks again!
Test build #81593 has finished for PR 18975 at commit
Test build #81594 has finished for PR 18975 at commit
Thanks! Merged to master.
```scala
    isLocal: Boolean,
    storage: CatalogStorageFormat,
    query: LogicalPlan,
    overwrite: Boolean) extends SaveAsHiveFile with HiveTmpPath {
```
Why do we separate SaveAsHiveFile and HiveTmpPath when we always use them together?
Sure, will submit a follow-up PR soon.
@cloud-fan and @gatorsmile, I will merge them together and submit a PR.
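Roughly, the shape of that follow-up, with method bodies reduced to toy stubs; only the trait names `SaveAsHiveFile` and `HiveTmpPath` come from this thread:

```scala
// Before: the write helper and the tmp-path helpers lived in separate traits
// that were always mixed in together.
trait HiveTmpPath {
  def getExternalTmpPath(): String = "/tmp/hive-staging" // stub
}
trait SaveAsHiveFileOld { self: HiveTmpPath =>
  def saveAsHiveFile(): Unit = println(s"writing to ${getExternalTmpPath()}") // stub
}

// After the follow-up: a single trait carries both responsibilities.
trait SaveAsHiveFile {
  def getExternalTmpPath(): String = "/tmp/hive-staging" // stub
  def saveAsHiveFile(): Unit = println(s"writing to ${getExternalTmpPath()}") // stub
}
```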
## What changes were proposed in this pull request?

The code is already merged to master: apache#18975. This is a follow-up PR to merge HiveTmpFile.scala into SaveAsHiveFile.

## How was this patch tested?

Built successfully.

Author: Jane Wang <[email protected]>

Closes apache#19221 from janewangfb/merge_savehivefile_hivetmpfile.
…m queries"

## What changes were proposed in this pull request?

This PR cleans up the code in apache#18975.

## How was this patch tested?

N/A

Author: gatorsmile <[email protected]>

Closes apache#19225 from gatorsmile/refactorSPARK-4131.
```scala
    val saveMode = if (overwrite) SaveMode.Overwrite else SaveMode.ErrorIfExists
    try {
      sparkSession.sessionState.executePlan(dataSource.planForWriting(saveMode, query))
      dataSource.writeAndRead(saveMode, query)
```
The implementation here confused me; I just want to leave a question: why should we call both writeAndRead and planForWriting?
@janewangfb @gatorsmile @cloud-fan
Yes, we should get rid of dataSource.writeAndRead. @xuanyuanking Could you submit a PR to fix the issue?
@gatorsmile Thanks for your reply, I'll try to fix this.
## What changes were proposed in this pull request?

As discussed in #16481 and #18975 (comment), the BaseRelation returned by `dataSource.writeAndRead` is currently only used in `CreateDataSourceTableAsSelect`, and planForWriting and writeAndRead share some common code paths. In this patch I removed the writeAndRead function and added a getRelation function that is used only in `CreateDataSourceTableAsSelectCommand` while saving data to a non-existing table.

## How was this patch tested?

Existing UT.

Author: Yuanjian Li <[email protected]>

Closes #19941 from xuanyuanking/SPARK-22753.
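A rough sketch of the corrected call site after that follow-up: the write is planned once and executed, with no second `writeAndRead` call. `DataSource` is internal to the spark-sql package, so this is illustrative rather than something user code can compile; `runInsertOverwriteDir` is a hypothetical wrapper whose parameter names mirror the diff above:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.DataSource

// Sketch: plan the write once and execute it; no redundant second write path.
def runInsertOverwriteDir(
    sparkSession: SparkSession,
    dataSource: DataSource,
    query: LogicalPlan,
    overwrite: Boolean): Unit = {
  val saveMode = if (overwrite) SaveMode.Overwrite else SaveMode.ErrorIfExists
  sparkSession.sessionState.executePlan(
    dataSource.planForWriting(saveMode, query)).toRdd
}
```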
@gatorsmile @janewangfb I have a question: I see that `insert overwrite directory 'file:///opt/table2' using parquet select * from table1;` still works.
…a source

### What changes were proposed in this pull request?

`INSERT OVERWRITE LOCAL DIRECTORY` is supported by ensuring the provided path always uses `file://` as its scheme and by removing the check that throws an exception when an insert overwrite names a directory with the `LOCAL` syntax.

### Why are the changes needed?

Without the modification in this PR,

```
insert overwrite local directory <location> using
```

throws

```
Error: org.apache.spark.sql.catalyst.parser.ParseException: LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source(line 1, pos 0)
```

which was introduced in #18975. This restriction is not needed, hence it is dropped, keeping behaviour consistent for local and remote file systems in `INSERT OVERWRITE DIRECTORY`.

### Does this PR introduce any user-facing change?

Yes; after this change `INSERT OVERWRITE LOCAL DIRECTORY` will not throw an exception.

### How was this patch tested?

Added UT.

Closes #27039 from ajithme/insertoverwrite2.
Authored-by: Ajith <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
What changes were proposed in this pull request?
This PR implements the SQL feature:
```sql
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format]
SELECT ... FROM ...
```
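For example, a hypothetical invocation of the feature from a Hive-enabled `SparkSession` (the `people` table, its columns, and the output directory are illustrative):

```scala
// Illustrative only: `people` and the output directory are made-up names.
spark.sql(
  """INSERT OVERWRITE LOCAL DIRECTORY '/tmp/people_csv'
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |STORED AS TEXTFILE
    |SELECT name, age FROM people""".stripMargin)
```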
How was this patch tested?
Added new unit tests and also pulled the code into fb-spark so that we could test writing to an HDFS directory.