@alunarbeach alunarbeach commented Dec 19, 2016

What changes were proposed in this pull request?

Currently, saving a DataFrame in append mode lists all leaf files in the save directory. When the directory is in an object store (S3 / Google Storage) and has many subfolders due to partitioning, writes take a long time or fail with a read timeout.
This pull request introduces a skip flag, false by default, that users can enable to skip the partition check.
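
For readers who want to see the scenario, here is a minimal sketch of the kind of write described above: a partitioned DataFrame appended to an existing directory. The object name `AppendDemo`, the column names, and the output path are assumptions made for illustration, and the sketch runs against a local path rather than an object store. It does not show the proposed flag, since the description does not give its name.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Illustrative only: names and data are hypothetical, not taken from the PR.
object AppendDemo {
  // Appends a small two-row batch for one day to a Parquet directory partitioned by "day".
  // Each distinct day creates its own subfolder (day=...), which is what makes
  // listing every leaf file expensive once many partitions exist.
  def appendBatch(spark: SparkSession, outputPath: String, day: String): Unit = {
    import spark.implicits._

    val batch = Seq((day, "a", 1L), (day, "b", 2L)).toDF("day", "key", "value")

    batch.write
      .mode(SaveMode.Append) // append to whatever is already under outputPath
      .partitionBy("day")    // one subdirectory per distinct day value
      .parquet(outputPath)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("append-partitioned-write")
      .master("local[*]") // assumption: local run for demonstration
      .getOrCreate()

    // Assumption: a local scratch path; on S3 this would be an s3a:// URI.
    val outputPath = args.headOption.getOrElse("/tmp/events_by_day")
    appendBatch(spark, outputPath, "2016-12-19")
    spark.stop()
  }
}
```

With an `s3a://` path and thousands of `day=` subfolders already in place, the second and later appends are the ones affected by the listing described above.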

How was this patch tested?

This patch was tested using manual tests and regression tests.

@AmplabJenkins

Can one of the admins verify this patch?

@alunarbeach
Author

alunarbeach commented Dec 19, 2016

@dongjoon-hyun @rxin @cloud-fan @tdas will you be able to review this?

@Gauravshah
Contributor

This should save us 20 minutes of directory scanning on each iteration.


// If we are appending to a table that already exists, make sure the partitioning matches
// up. If we fail to load the table for whatever reason, ignore the check.
if (mode == SaveMode.Append) {

shall we just remove this check? it's too expensive. cc @marmbrus

It seems fine to remove in the case of files. Can we keep the check for catalog tables?

Then we can remove the parameter justPartitioning from the function getOrInferFileFormatSchema.

@cloud-fan
Contributor

cc @alunarbeach can you update this PR? basically we don't need to add a flag but just remove that check.

@rxin
Contributor

rxin commented Jan 17, 2017

I submitted a PR here: #16622

@asfgit asfgit closed this in 83dff87 Jan 17, 2017
@alunarbeach
Author

Thanks Team. Deleting the branch.

@alunarbeach alunarbeach deleted the spark-18917 branch January 18, 2017 14:55
liancheng pushed a commit to liancheng/spark that referenced this pull request Jan 25, 2017
In append mode, we check whether the schema of the write is compatible with the schema of the existing data. Finding the existing schema for files can be a significant performance issue in cloud environments. This patch removes the check.

Note that for catalog tables, we always do the check, as discussed in apache#16339 (comment)

backport apache#16622 to our internal 2.1 branch.

Author: Reynold Xin

Closes apache#178 from cloud-fan/backport.
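
The behavior described in this commit message (skip the check for plain file paths, keep it for catalog tables) can be sketched as a small self-contained model. This is not Spark's internal code: `WriteMode`, `PathTarget`, `CatalogTarget`, and `AppendCheck` are hypothetical names, and comparing partition columns is an assumption about what the retained check verifies, based on the code comment quoted in the review thread.

```scala
// Hypothetical model of the post-patch decision, not Spark's implementation.
sealed trait WriteMode
case object Append extends WriteMode
case object Overwrite extends WriteMode

sealed trait Target
// A plain directory of files: its existing layout is only discoverable by listing files.
final case class PathTarget(path: String) extends Target
// A catalog table: its partition columns are recorded in metadata, so no listing is needed.
final case class CatalogTarget(name: String, partitionColumns: Seq[String]) extends Target

object AppendCheck {
  // Returns Left(message) when the write must be rejected, Right(()) otherwise.
  def validate(
      mode: WriteMode,
      target: Target,
      writePartitionColumns: Seq[String]): Either[String, Unit] =
    (mode, target) match {
      // Catalog tables: still compare against the recorded partitioning.
      case (Append, CatalogTarget(name, existing)) if existing != writePartitionColumns =>
        Left(
          s"Partition columns of the write (${writePartitionColumns.mkString(", ")}) " +
            s"do not match table $name (${existing.mkString(", ")})")
      // Path targets, matching catalog tables, and non-append modes: no check, no listing.
      case _ => Right(())
    }
}
```

In this model an append to `PathTarget("s3a://bucket/events")` always passes without inspecting existing files, which is the trade-off the patch accepts in exchange for avoiding the directory scan.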
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?
In append mode, we check whether the schema of the write is compatible with the schema of the existing data. Finding the existing schema for files can be a significant performance issue in cloud environments. This patch removes the check.

Note that for catalog tables, we always do the check, as discussed in apache#16339 (comment)

## How was this patch tested?
N/A

Closes apache#16339.

Author: Reynold Xin

Closes apache#16622 from rxin/SPARK-18917.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017