-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-18635] [SQL] Partition name/values not escaped correctly in some cases #16071
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| // spark.sql(s""" | ||
| // |alter table test partition (A=0, B='%') | ||
| // |rename to partition (A=100, B='%')""".stripMargin) | ||
| // assert(spark.sql("select * from test where a = 100").count() == 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan this seems to crash in some of the post-processing code inHiveExternalCatalog:renamePartitions. Any ideas there?
|
|
||
| // TODO(ekl) fix overwrite table | ||
| // spark.sql("show partitions test").show(false) | ||
| // spark.sql("insert overwrite table test partition (a, b) select id, id, '%' from range(1)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This crashes in the hive client. Not sure why, it might be a hive bug when deleting partitions that have been overwritten.
|
Test build #69370 has finished for PR 16071 at commit
|
| spark.sql(s""" | ||
| |alter table test partition (A=0, B='%') | ||
| |set location '${dir.getAbsolutePath}'""".stripMargin) | ||
| assert(spark.sql("select * from test").count() == 29) // missing 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this 29 instead of 30?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the partition was moved, there are no files in the new location. Hence -1 file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. I see. So Hive/Spark doesn't move partition data when the partition location is changed. Might be clearer to call that out in your comment rather than just "missing 1".
|
I can't vouch for how |
|
Ok I've figured out issues with the other test cases, they seem to be due to a separate bug: https://issues.apache.org/jira/browse/SPARK-18659 |
|
Test build #69431 has finished for PR 16071 at commit
|
…e cases ## What changes were proposed in this pull request? Due to confusion between URI vs paths, in certain cases we escape partition values too many times, which causes some Hive client operations to fail or write data to the wrong location. This PR fixes at least some of these cases. To my understanding this is how values, filesystem paths, and URIs interact. - Hive stores raw (unescaped) partition values that are returned to you directly when you call listPartitions. - Internally, we convert these raw values to filesystem paths via `ExternalCatalogUtils.[un]escapePathName`. - In some circumstances we store URIs instead of filesystem paths. When a path is converted to a URI via `path.toURI`, the escaped partition values are further URI-encoded. This means that to get a path back from a URI, you must call `new Path(new URI(uriTxt))` in order to decode the URI-encoded string. - In `CatalogStorageFormat` we store URIs as strings. This makes it easy to forget to URI-decode the value before converting it into a path. - Finally, the Hive client itself uses mostly Paths for representing locations, and only URIs occasionally. In the future we should probably clean this up, perhaps by dropping use of URIs when unnecessary. We should also try fixing escaping for partition names as well as values, though names are unlikely to contain special characters. cc mallman cloud-fan yhuai ## How was this patch tested? Unit tests. Author: Eric Liang <[email protected]> Closes #16071 from ericl/spark-18635. (cherry picked from commit 88f559f) Signed-off-by: Wenchen Fan <[email protected]>
|
LGTM, merging to master/2.1! |
…e cases ## What changes were proposed in this pull request? Due to confusion between URI vs paths, in certain cases we escape partition values too many times, which causes some Hive client operations to fail or write data to the wrong location. This PR fixes at least some of these cases. To my understanding this is how values, filesystem paths, and URIs interact. - Hive stores raw (unescaped) partition values that are returned to you directly when you call listPartitions. - Internally, we convert these raw values to filesystem paths via `ExternalCatalogUtils.[un]escapePathName`. - In some circumstances we store URIs instead of filesystem paths. When a path is converted to a URI via `path.toURI`, the escaped partition values are further URI-encoded. This means that to get a path back from a URI, you must call `new Path(new URI(uriTxt))` in order to decode the URI-encoded string. - In `CatalogStorageFormat` we store URIs as strings. This makes it easy to forget to URI-decode the value before converting it into a path. - Finally, the Hive client itself uses mostly Paths for representing locations, and only URIs occasionally. In the future we should probably clean this up, perhaps by dropping use of URIs when unnecessary. We should also try fixing escaping for partition names as well as values, though names are unlikely to contain special characters. cc mallman cloud-fan yhuai ## How was this patch tested? Unit tests. Author: Eric Liang <[email protected]> Closes apache#16071 from ericl/spark-18635.
…e cases ## What changes were proposed in this pull request? Due to confusion between URI vs paths, in certain cases we escape partition values too many times, which causes some Hive client operations to fail or write data to the wrong location. This PR fixes at least some of these cases. To my understanding this is how values, filesystem paths, and URIs interact. - Hive stores raw (unescaped) partition values that are returned to you directly when you call listPartitions. - Internally, we convert these raw values to filesystem paths via `ExternalCatalogUtils.[un]escapePathName`. - In some circumstances we store URIs instead of filesystem paths. When a path is converted to a URI via `path.toURI`, the escaped partition values are further URI-encoded. This means that to get a path back from a URI, you must call `new Path(new URI(uriTxt))` in order to decode the URI-encoded string. - In `CatalogStorageFormat` we store URIs as strings. This makes it easy to forget to URI-decode the value before converting it into a path. - Finally, the Hive client itself uses mostly Paths for representing locations, and only URIs occasionally. In the future we should probably clean this up, perhaps by dropping use of URIs when unnecessary. We should also try fixing escaping for partition names as well as values, though names are unlikely to contain special characters. cc mallman cloud-fan yhuai ## How was this patch tested? Unit tests. Author: Eric Liang <[email protected]> Closes apache#16071 from ericl/spark-18635.
What changes were proposed in this pull request?
Due to confusion between URI vs paths, in certain cases we escape partition values too many times, which causes some Hive client operations to fail or write data to the wrong location. This PR fixes at least some of these cases.
To my understanding this is how values, filesystem paths, and URIs interact.
ExternalCatalogUtils.[un]escapePathName.path.toURI, the escaped partition values are further URI-encoded. This means that to get a path back from a URI, you must callnew Path(new URI(uriTxt))in order to decode the URI-encoded string.CatalogStorageFormatwe store URIs as strings. This makes it easy to forget to URI-decode the value before converting it into a path.In the future we should probably clean this up, perhaps by dropping use of URIs when unnecessary. We should also try fixing escaping for partition names as well as values, though names are unlikely to contain special characters.
cc @mallman @cloud-fan @yhuai
How was this patch tested?
Unit tests.