
Conversation

@ulysses-you (Contributor) commented Oct 14, 2022

### What changes were proposed in this pull request?

Extract the insertion field cast-check method so that we can validate partition values at `PartitioningUtils.normalizePartitionSpec`.

### Why are the changes needed?

Insertion follows the behavior of the config `spark.sql.storeAssignmentPolicy`, which by default fails if the value cannot be cast to the target data type. Altering a partition should also follow it. For example:

```SQL
CREATE TABLE t (c int) USING PARQUET PARTITIONED BY(p int);

-- This DDL should fail but worked:
ALTER TABLE t ADD PARTITION(p='aaa');

-- FAILED, which follows spark.sql.storeAssignmentPolicy
INSERT INTO t PARTITION(p='aaa') SELECT 1
```

### Does this PR introduce _any_ user-facing change?

Yes, the added partition value will follow the behavior of `spark.sql.storeAssignmentPolicy`.

To restore the previous behavior, set `spark.sql.legacy.skipPartitionSpecTypeValidation` to `true`.
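
For illustration, a minimal sketch of restoring the old behavior from a Spark shell (note that the flag was renamed during review to `spark.sql.legacy.skipTypeValidationOnAlterPartition`; see the review comments below):

```scala
// Illustrative only: with the legacy flag enabled, the unconvertible partition
// value is accepted again instead of failing type validation.
spark.sql("SET spark.sql.legacy.skipTypeValidationOnAlterPartition=true")
spark.sql("ALTER TABLE t ADD PARTITION (p='aaa')") // succeeds under the legacy flag
```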

### How was this patch tested?

Added a test.

@HyukjinKwon (Member):

cc @gengliangwang

Contributor Author (review comment):

Changed the partition column from int to string type to make the test work.

@ulysses-you (Contributor Author):

also cc @MaxGekk @cloud-fan

@gengliangwang (Member):

@ulysses-you Thanks for the ping. Could you add a legacy flag and migration guide?

@ulysses-you (Contributor Author):

@gengliangwang added legacy config and migration guide

github-actions bot added the DOCS label Oct 18, 2022
Contributor (review comment):

Suggested change:
```diff
-- Since Spark 3.4, Spark will do validation for partition spec follows the behavior of `spark.sql.storeAssignmentPolicy` which may cause exception if type conversion error, e.g. `ALTER TABLE .. ADD PARTITION(p='a')` if column p is int type. To restore the legacy behavior, set `spark.sql.legacy.skipPartitionSpecTypeValidation` to `true`.
+- Since Spark 3.4, Spark will do validation for partition spec in ADD PARTITION to follow the behavior of `spark.sql.storeAssignmentPolicy` which may cause an exception if type conversion fails, e.g. `ALTER TABLE .. ADD PARTITION(p='a')` if column `p` is int type. To restore the legacy behavior, set `spark.sql.legacy.skipPartitionSpecTypeValidation` to `true`.
```

Contributor (review comment):

Suggested change:
```diff
-.doc("When true, skip do validation for partition spec. E.g., " +
+.doc("When true, skip validation for partition spec in ADD PARTITION. E.g., " +
```

Contributor (review comment):

The name is weird as the input is already a `String`. How about `normalizePartitionStringValue`?
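
For reference, a hypothetical sketch of what such a helper could look like (`Cast` and `Literal` are Catalyst expressions; the merged signature may differ):

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.StructField

// Hypothetical sketch only: the real helper applies the cast dictated by
// spark.sql.storeAssignmentPolicy (an ANSI cast by default, which throws on
// 'aaa' -> INT); a plain Cast is used here for brevity and yields null instead.
def normalizePartitionStringValue(value: String, field: StructField): String = {
  val casted = Cast(Literal(value), field.dataType).eval()
  if (casted == null) null else casted.toString
}
```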

Contributor (review comment):

Let's use `checkError`.
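
A rough sketch of the suggested pattern, with the error class and parameters taken from the ANSI failure quoted later in this thread (exact values are illustrative):

```scala
// Illustrative use of the checkError test helper: assert on the error class and
// its message parameters rather than matching raw error message text.
checkError(
  exception = intercept[SparkNumberFormatException] {
    sql("ALTER TABLE t ADD PARTITION (p='aaa')")
  },
  errorClass = "CAST_INVALID_INPUT",
  parameters = Map(
    "expression" -> "'aaa'",
    "sourceType" -> "\"STRING\"",
    "targetType" -> "\"INT\"",
    "ansiConfig" -> "\"spark.sql.ansi.enabled\""))
```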

Member (review comment):

Suggested change:
```diff
-val SKIP_PARTITION_SPEC_TYPE_VALIDATION =
+val SKIP_TYPE_VALIDATION_ON_ALTER_PARTITION =
```

Member (review comment):

Suggested change:
```diff
-buildConf("spark.sql.legacy.skipPartitionSpecTypeValidation")
+buildConf("spark.sql.legacy.skipTypeValidationOnAlterPartition")
```

@ulysses-you (Contributor Author):

@cloud-fan @gengliangwang addressed comments

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan cloud-fan closed this in 9140795 Oct 24, 2022
@ulysses-you ulysses-you deleted the verify-partition branch October 27, 2022 02:24
HyukjinKwon added a commit that referenced this pull request Oct 31, 2022
…BLE PARTITION test

### What changes were proposed in this pull request?

This PR is a followup of #38257 that fixes the v2 test case when ANSI is enabled.

When ANSI mode is enabled, it fails (https://github.com/apache/spark/actions/runs/3353657619/jobs/5556624621) as below:

```
- ALTER TABLE .. ADD PARTITION using V2 catalog V2 command: SPARK-40798: Alter partition should verify partition value - legacy *** FAILED *** (22 milliseconds)
  org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The value 'aaa' of the type "STRING" cannot be cast to "INT" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
== SQL(line 1, position 1) ==
ALTER TABLE test_catalog.ns.tbl ADD PARTITION (p='aaa')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  at org.apache.spark.sql.errors.QueryExecutionErrors$.invalidInputInCastToNumberError(QueryExecutionErrors.scala:164)
  at org.apache.spark.sql.catalyst.util.UTF8StringUtils$.withException(UTF8StringUtils.scala:51)
  at org.apache.spark.sql.catalyst.util.UTF8StringUtils$.toIntExact(UTF8StringUtils.scala:34)
  at org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castToInt$2(Cast.scala:927)
  at org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castToInt$2$adapted(Cast.scala:927)
  at org.apache.spark.sql.catalyst.expressions.Cast.buildCast(Cast.scala:588)
  at org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castToInt$1(Cast.scala:927)
  at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:1285)
  at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:526)
  at org.apache.spark.sql.catalyst.analysis.ResolvePartitionSpec$.$anonfun$convertToPartIdent$1(ResolvePartitionSpec.scala:81)
  at scala.collection.immutable.List.map(List.scala:293)
  at org.apache.spark.sql.catalyst.analysis.ResolvePartitionSpec$.convertToPartIdent(ResolvePartitionSpec.scala:78)
  at org.apache.spark.sql.catalyst.analysis.ResolvePartitionSpec$.org$apache$spark$sql$catalyst$analysis$ResolvePartitionSpec$$resolvePartitionSpec(ResolvePartitionSpec.scala:71)
  at org.apache.spark.sql.catalyst.analysis.ResolvePartitionSpec$$anonfun$apply$2$$anonfun$applyOrElse$1.applyOrElse(ResolvePartitionSpec.scala:48)
  at org.apache.spark.sql.catalyst.analysis.ResolvePartitionSpec$$anonfun$apply$2$$anonfun$applyOrElse$1.applyOrElse(ResolvePartitionSpec.scala:41)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDownWithPruning$1(QueryPlan.scala:159)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:200)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:200)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:211)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:216)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at scala.collection.TraversableLike.map(TraversableLike.scala:286)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
  at scala.collection.AbstractTraversable.map(Traversable.scala:108)
```
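
A hypothetical sketch of how such a test can pin the configs to exercise the legacy path deterministically (`withSQLConf` is Spark's test helper; the actual change in this followup may differ):

```scala
// Hypothetical sketch: pin ANSI off for the legacy-flag test case so that the
// 'aaa' -> INT cast yields null instead of throwing under ANSI mode.
withSQLConf(
    "spark.sql.ansi.enabled" -> "false",
    "spark.sql.legacy.skipTypeValidationOnAlterPartition" -> "true") {
  sql("ALTER TABLE test_catalog.ns.tbl ADD PARTITION (p='aaa')")
}
```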

### Why are the changes needed?

To recover the broken ANSI build.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually ran this test case locally.

Closes #38443 from HyukjinKwon/SPARK-40798-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Review context from the migration guide:

- Valid hexadecimal strings should include only allowed symbols (0-9A-Fa-f).
- Valid values for `fmt` are case-insensitive `hex`, `base64`, `utf-8`, `utf8`.
- Since Spark 3.4, Spark throws only `PartitionsAlreadyExistException` when it creates partitions but some of them exist already. In Spark 3.3 or earlier, Spark can throw either `PartitionsAlreadyExistException` or `PartitionAlreadyExistsException`.
- Since Spark 3.4, Spark will do validation for partition spec in ALTER PARTITION to follow the behavior of `spark.sql.storeAssignmentPolicy` which may cause an exception if type conversion fails, e.g. `ALTER TABLE .. ADD PARTITION(p='a')` if column `p` is int type. To restore the legacy behavior, set `spark.sql.legacy.skipPartitionSpecTypeValidation` to `true`.
Contributor (review comment):

The conf should be `spark.sql.legacy.skipTypeValidationOnAlterPartition`.

Member (review comment):

nice catch

Member (review comment):

just made a PR :-) #38667

HyukjinKwon added a commit that referenced this pull request Nov 16, 2022
…t migration guide

### What changes were proposed in this pull request?

This PR is a followup of #38257 to fix a typo from `spark.sql.legacy.skipPartitionSpecTypeValidation` to `spark.sql.legacy.skipTypeValidationOnAlterPartition`.

### Why are the changes needed?

To show users the correct configuration name for legacy behaviours.

### Does this PR introduce _any_ user-facing change?

No, doc-only.

### How was this patch tested?

N/A

Closes #38667 from HyukjinKwon/SPARK-40798-followup3.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
beliefer pushed a commit to beliefer/spark that referenced this pull request Nov 24, 2022
…t migration guide
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…signmentPolicy
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…BLE PARTITION test
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…t migration guide
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022
…t migration guide
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
…t migration guide