-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-18917][SQL] Add Skip Partition Check Flag to avoid list all leaf files in append mode #16339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Can one of the admins verify this patch? |
|
@dongjoon-hyun @rxin @cloud-fan @tdas will you be able to review this? |
|
should help us save 20 mins on each iteration scanning directories. |
|
|
||
| // If we are appending to a table that already exists, make sure the partitioning matches | ||
| // up. If we fail to load the table for whatever reason, ignore the check. | ||
| if (mode == SaveMode.Append) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we just remove this check? it's too expensive. cc @marmbrus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems fine to remove in the case of files. Can we keep the track for catalog tables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, for catalog tables, we always do the check, as it's cheap: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L92-L94
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then, we can remove the parm justPartitioning from the function getOrInferFileFormatSchema
|
cc @alunarbeach can you update this PR? basically we don't need to add a flag but just remove that check. |
|
I submitted a pr here #16622 |
|
Thanks Team. Deleting the branch. |
In append mode, we check whether the schema of the write is compatible with the schema of the existing data. It can be a significant performance issue in cloud environment to find the existing schema for files. This patch removes the check. Note that for catalog tables, we always do the check, as discussed in apache#16339 (comment) backport apache#16622 to our internal 2.1 branch. Author: Reynold Xin <[email protected]> Closes apache#178 from cloud-fan/backport.
## What changes were proposed in this pull request? In append mode, we check whether the schema of the write is compatible with the schema of the existing data. It can be a significant performance issue in cloud environment to find the existing schema for files. This patch removes the check. Note that for catalog tables, we always do the check, as discussed in apache#16339 (comment) ## How was this patch tested? N/A Closes apache#16339. Author: Reynold Xin <[email protected]> Closes apache#16622 from rxin/SPARK-18917.
## What changes were proposed in this pull request? In append mode, we check whether the schema of the write is compatible with the schema of the existing data. It can be a significant performance issue in cloud environment to find the existing schema for files. This patch removes the check. Note that for catalog tables, we always do the check, as discussed in apache#16339 (comment) ## How was this patch tested? N/A Closes apache#16339. Author: Reynold Xin <[email protected]> Closes apache#16622 from rxin/SPARK-18917.
What changes were proposed in this pull request?
Currently saving a dataframe in append mode lists all leaf files in save directory. When the directory is in object stores object stores (S3 / Google Storage) and has many subfolders due to partitioning, the writes are taking a long time to write or they result in read time out.
This pull request introduces a skip flag that is false by default and can be enabled by users to skip partition checking.
How was this patch tested?
This patch was tested using manual tests and regression tests.