[SPARK-18917][SQL] Add Skip Partition Check Flag to avoid list all leaf files in append mode #16339

alunarbeach · 2016-12-19T18:31:15Z

What changes were proposed in this pull request?

Currently, saving a DataFrame in append mode lists all leaf files in the save directory. When the directory is in an object store (S3 / Google Storage) and has many subfolders due to partitioning, writes take a long time or result in read timeouts.
This pull request introduces a skip flag that is false by default and can be enabled by users to skip partition checking.

How was this patch tested?

This patch was tested using manual tests and regression tests.

AmplabJenkins · 2016-12-19T18:32:16Z

Can one of the admins verify this patch?

alunarbeach · 2016-12-19T18:32:23Z

@dongjoon-hyun @rxin @cloud-fan @tdas will you be able to review this?

Gauravshah · 2017-01-16T09:55:45Z

This should help us save 20 minutes on each iteration spent scanning directories.

cloud-fan · 2017-01-16T11:18:27Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala


        // If we are appending to a table that already exists, make sure the partitioning matches
        // up.  If we fail to load the table for whatever reason, ignore the check.
-        if (mode == SaveMode.Append) {


Shall we just remove this check? It's too expensive. cc @marmbrus

rxin · 2017-01-16

It seems fine to remove in the case of files. Can we keep the check for catalog tables?

cloud-fan · 2017-01-17

Yes, for catalog tables we always do the check, as it's cheap: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L92-L94

gatorsmile · 2017-01-17

Then we can remove the parameter justPartitioning from the function getOrInferFileFormatSchema.
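The outcome of this review thread can be summarized as a small decision table. The sketch below is a plain-Scala model of it, not Spark code: `AppendCheckSketch`, `Target`, `checksBefore`, and `checksAfter` are hypothetical names introduced only for illustration, and the local `SaveMode` merely mirrors the names of Spark's enum. It models only the append-related check discussed here.

```scala
// Hypothetical model of the append-mode check discussed above.
// None of these definitions are Spark APIs.
object AppendCheckSketch {
  sealed trait SaveMode
  case object Append extends SaveMode
  case object Overwrite extends SaveMode

  sealed trait Target
  case object FilePath extends Target     // e.g. a partitioned directory on S3
  case object CatalogTable extends Target // table whose metadata lives in the catalog

  // Before the change: every append runs the partitioning check, which for
  // file paths means listing all leaf files under the save directory.
  def checksBefore(mode: SaveMode, target: Target): Boolean = mode match {
    case Append => true
    case _      => false
  }

  // As agreed in review: skip the check for plain file paths and keep it
  // for catalog tables, where it is cheap.
  def checksAfter(mode: SaveMode, target: Target): Boolean = (mode, target) match {
    case (Append, CatalogTable) => true
    case _                      => false
  }
}
```

Under this model, an append to a file path no longer pays the listing cost, while an append to a catalog table is still validated.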

cloud-fan · 2017-01-17T02:44:35Z

cc @alunarbeach Can you update this PR? Basically, we don't need to add a flag; we can just remove that check.

rxin · 2017-01-17T19:02:42Z

I submitted a pr here #16622

alunarbeach · 2017-01-18T14:55:35Z

Thanks Team. Deleting the branch.

In append mode, we check whether the schema of the write is compatible with the schema of the existing data. It can be a significant performance issue in cloud environment to find the existing schema for files. This patch removes the check. Note that for catalog tables, we always do the check, as discussed in apache#16339 (comment)

backport apache#16622 to our internal 2.1 branch.

Author: Reynold Xin <[email protected]>

Closes apache#178 from cloud-fan/backport.

## What changes were proposed in this pull request?

In append mode, we check whether the schema of the write is compatible with the schema of the existing data. It can be a significant performance issue in cloud environment to find the existing schema for files. This patch removes the check. Note that for catalog tables, we always do the check, as discussed in apache#16339 (comment)

## How was this patch tested?

N/A

Closes apache#16339.

Author: Reynold Xin <[email protected]>

Closes apache#16622 from rxin/SPARK-18917.

alunarbeach added 2 commits December 19, 2016 13:12

add skip flag to skip partition check

885b721

change description

43e599e

merlintang mentioned this pull request Dec 29, 2016

[SPARK-18372][SQL][Branch-1.6].Staging directory fail to be removed #15819

Closed

cloud-fan reviewed Jan 16, 2017


rxin mentioned this pull request Jan 17, 2017

[SPARK-18917][SQL] Remove schema check in appending data #16622

Closed

asfgit closed this in 83dff87 Jan 17, 2017

alunarbeach deleted the spark-18917 branch January 18, 2017 14:55


6 participants
