[SPARK-18635] [SQL] Partition name/values not escaped correctly in some cases #16071

ericl · 2016-11-30T01:47:54Z

What changes were proposed in this pull request?

Due to confusion between URIs and paths, in certain cases we escape partition values too many times, which causes some Hive client operations to fail or to write data to the wrong location. This PR fixes at least some of these cases.

To my understanding, this is how values, filesystem paths, and URIs interact:

- Hive stores raw (unescaped) partition values that are returned to you directly when you call listPartitions.
- Internally, we convert these raw values to filesystem paths via `ExternalCatalogUtils.[un]escapePathName`.
- In some circumstances we store URIs instead of filesystem paths. When a path is converted to a URI via `path.toURI`, the escaped partition values are further URI-encoded. This means that to get a path back from a URI, you must call `new Path(new URI(uriTxt))` in order to decode the URI-encoded string.
- In `CatalogStorageFormat` we store URIs as strings. This makes it easy to forget to URI-decode the value before converting it into a path.
- Finally, the Hive client itself uses mostly Paths for representing locations, and only URIs occasionally.
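
The layering described above can be sketched in plain Scala. This is only an illustration under stated assumptions: `escapePathName` here is a simplified, hypothetical stand-in for `ExternalCatalogUtils.escapePathName`, `java.net.URI` stands in for Hadoop's `Path`/`toUri` so the snippet has no Hadoop dependency, and the warehouse path is made up.

```scala
import java.net.URI

object PartitionEscapingSketch {
  // Simplified, hypothetical stand-in for ExternalCatalogUtils.escapePathName:
  // it percent-escapes only a small set of characters.
  private val special = Set('%', '/', ':', '=', '#', '?')

  def escapePathName(value: String): String =
    value.map(c => if (special(c)) f"%%${c.toInt}%02X" else c.toString).mkString

  // Step 1: the raw partition value, as the metastore would return it.
  val rawValue = "%"

  // Step 2: the value escaped for use as a directory name (hypothetical path).
  val fsPath = s"/warehouse/test/A=0/B=${escapePathName(rawValue)}"

  // Step 3: building a URI from the path quotes the '%' a second time.
  val uriTxt = new URI("file", null, fsPath, null, null).toString

  // Correct round trip: parse the stored string as a URI, then take the
  // decoded path component.
  val decodedPath = new URI(uriTxt).getPath

  // Mistake: treating the stored URI string as if it were already a path
  // keeps the doubly escaped value in the directory name.
  val mistakenPath = uriTxt.stripPrefix("file:")

  def main(args: Array[String]): Unit = {
    println(s"filesystem path: $fsPath")
    println(s"stored URI text: $uriTxt")
    println(s"decoded path:    $decodedPath")
    println(s"mistaken path:   $mistakenPath")
  }
}
```

The mistaken path points at a different directory than the decoded one, which is the kind of mismatch that can send data to the wrong location.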

In the future we should probably clean this up, perhaps by dropping use of URIs when unnecessary. We should also try fixing escaping for partition names as well as values, though names are unlikely to contain special characters.

cc @mallman @cloud-fan @yhuai

How was this patch tested?

Unit tests.

ericl · 2016-11-30T01:48:38Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionProviderCompatibilitySuite.scala

+//          spark.sql(s"""
+//            |alter table test partition (A=0, B='%')
+//            |rename to partition (A=100, B='%')""".stripMargin)
+//          assert(spark.sql("select * from test where a = 100").count() == 1)


@cloud-fan this seems to crash in some of the post-processing code in HiveExternalCatalog:renamePartitions. Any ideas there?

ericl · 2016-11-30T01:49:39Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionProviderCompatibilitySuite.scala

+
+        // TODO(ekl) fix overwrite table
+//        spark.sql("show partitions test").show(false)
+//        spark.sql("insert overwrite table test partition (a, b) select id, id, '%' from range(1)")


This crashes in the Hive client. Not sure why; it might be a Hive bug when deleting partitions that have been overwritten.

SparkQA · 2016-11-30T04:28:24Z

Test build #69370 has finished for PR 16071 at commit 4162229.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mallman · 2016-11-30T18:53:57Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionProviderCompatibilitySuite.scala

+          spark.sql(s"""
+            |alter table test partition (A=0, B='%')
+            |set location '${dir.getAbsolutePath}'""".stripMargin)
+          assert(spark.sql("select * from test").count() == 29)  // missing 1


Why is this 29 instead of 30?

Since the partition was moved, there are no files in the new location, hence one fewer file.

Ah. I see. So Hive/Spark doesn't move partition data when the partition location is changed. Might be clearer to call that out in your comment rather than just "missing 1".
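
The behaviour discussed in this thread can be modelled with a toy catalog. This is a sketch, not Spark's or Hive's implementation: the `Catalog` class, the directory names, and the one-row-per-partition layout are all hypothetical and chosen only so that the counts match the 30 and 29 mentioned above.

```scala
object SetLocationSketch {
  // Toy model: the catalog maps a partition spec to a directory.
  final case class Catalog(locations: Map[String, String]) {
    // Only the pointer changes; no files are copied or moved.
    def setLocation(partition: String, dir: String): Catalog =
      copy(locations = locations.updated(partition, dir))
  }

  // A query reads whatever rows exist in each partition's current directory.
  def countRows(catalog: Catalog, rowsInDir: Map[String, Int]): Int =
    catalog.locations.valuesIterator.map(dir => rowsInDir.getOrElse(dir, 0)).sum

  // Hypothetical layout: 30 partitions holding one row each.
  val original = Catalog((0 until 30).map(i => s"A=$i" -> s"/warehouse/test/A=$i").toMap)
  val rowsInDir = original.locations.values.map(dir => dir -> 1).toMap

  val before = countRows(original, rowsInDir)

  // Point one partition at a fresh, empty directory.
  val relocated = original.setLocation("A=0", "/tmp/empty-dir")
  val after = countRows(relocated, rowsInDir)

  def main(args: Array[String]): Unit =
    println(s"before: $before, after: $after")
}
```

The row that lived in the old directory is still on disk in this model, but no partition points at it any more, so the count drops by one.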

mallman · 2016-11-30T19:06:42Z

I can't vouch for how Path and URI work together to do the right thing; however, the test coverage looks good. LGTM overall.

ericl · 2016-11-30T21:25:10Z

OK, I've figured out the issues with the other test cases; they seem to be due to a separate bug: https://issues.apache.org/jira/browse/SPARK-18659

SparkQA · 2016-12-01T00:15:22Z

Test build #69431 has finished for PR 16071 at commit 1bd10ba.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…e cases

## What changes were proposed in this pull request?

Due to confusion between URI vs paths, in certain cases we escape partition values too many times, which causes some Hive client operations to fail or write data to the wrong location. This PR fixes at least some of these cases.

To my understanding this is how values, filesystem paths, and URIs interact.

- Hive stores raw (unescaped) partition values that are returned to you directly when you call listPartitions.
- Internally, we convert these raw values to filesystem paths via `ExternalCatalogUtils.[un]escapePathName`.
- In some circumstances we store URIs instead of filesystem paths. When a path is converted to a URI via `path.toURI`, the escaped partition values are further URI-encoded. This means that to get a path back from a URI, you must call `new Path(new URI(uriTxt))` in order to decode the URI-encoded string.
- In `CatalogStorageFormat` we store URIs as strings. This makes it easy to forget to URI-decode the value before converting it into a path.
- Finally, the Hive client itself uses mostly Paths for representing locations, and only URIs occasionally.

In the future we should probably clean this up, perhaps by dropping use of URIs when unnecessary. We should also try fixing escaping for partition names as well as values, though names are unlikely to contain special characters.

cc mallman cloud-fan yhuai

## How was this patch tested?

Unit tests.

Author: Eric Liang <[email protected]>

Closes #16071 from ericl/spark-18635.

(cherry picked from commit 88f559f)
Signed-off-by: Wenchen Fan <[email protected]>

cloud-fan · 2016-12-01T08:50:35Z

LGTM, merging to master/2.1!

Tue Nov 29 17:38:10 PST 2016

4162229

ericl force-pushed the spark-18635 branch from 0dec864 to 4162229 Compare November 30, 2016 01:54

ericl changed the title ~~[SPARK-18635] [SQL] Partition name/values not escaped correctly in some cases~~ [SPARK-18635] [SQL] [WIP] Partition name/values not escaped correctly in some cases Nov 30, 2016

mallman reviewed Nov 30, 2016

Wed Nov 30 13:24:29 PST 2016

1bd10ba

ericl changed the title ~~[SPARK-18635] [SQL] [WIP] Partition name/values not escaped correctly in some cases~~ [SPARK-18635] [SQL] Partition name/values not escaped correctly in some cases Nov 30, 2016

asfgit closed this in 88f559f Dec 1, 2016
