[SPARK-27619][SQL] MapType should be prohibited in hash expressions #27580
Conversation
OK to test

sounds reasonable, cc @hvanhovell
the error message should mention the legacy config and how to restore the old behavior
we should also warn users about the consequence: logically same maps may have different hashcode.
ok to test

Test build #118432 has finished for PR 27580 at commit

Test build #118466 has finished for PR 27580 at commit

Retest this please.
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
nit: leading s seems unnecessary in this line and the next line.
Seems like it doesn't need to be protected at this moment.
nit: the doc at the end. It has to be in the next line.
All review comments are handled. Please check once. @HyukjinKwon @cloud-fan
nit: spark.sql.legacy.useHashOnMapType -> SQLConf.LEGACY_USE_HASH_ON_MAPTYPE.key
I used spark.sql.legacy.useHashOnMapType to maintain consistency with what I updated in the migration guide.
We should not hardcode the config name. Better to use ${SQLConf.LEGACY_USE_HASH_ON_MAPTYPE.key}
nit: true.toString() -> "true"
Test build #118532 has finished for PR 27580 at commit

Test build #118536 has finished for PR 27580 at commit

retest this please

Test build #118554 has finished for PR 27580 at commit
gentle ping @cloud-fan @HyukjinKwon @maropu

Test build #118662 has finished for PR 27580 at commit

retest this please.

retest this please

Test build #118778 has finished for PR 27580 at commit

gentle ping @maropu @HyukjinKwon @cloud-fan
    if (children.length < 1) {
      TypeCheckResult.TypeCheckFailure(
        s"input to function $prettyName requires at least one argument")
    } else if (children.forall(child => hasMapType(child.dataType)) &&
forall -> exists?
    } else if (children.forall(child => hasMapType(child.dataType)) &&
        !SQLConf.get.getConf(SQLConf.LEGACY_USE_HASH_ON_MAPTYPE)) {
      TypeCheckResult.TypeCheckFailure(
        s"input to function $prettyName cannot contain elements of MapType. Logically same maps " +
Logically -> In Spark,
Test build #118920 has finished for PR 27580 at commit

@iRakson It seems a valid test failure.

Test build #118946 has finished for PR 27580 at commit

retest this please

Test build #118960 has finished for PR 27580 at commit

thanks, merging to master/3.0!
### What changes were proposed in this pull request?
`hash()` and `xxhash64()` can no longer be used on elements of `MapType`. A new configuration, `spark.sql.legacy.useHashOnMapType`, is introduced to allow users to restore the previous behaviour.
When `spark.sql.legacy.useHashOnMapType` is set to false:
```
scala> spark.sql("select hash(map())");
org.apache.spark.sql.AnalysisException: cannot resolve 'hash(map())' due to data type mismatch: input to function hash cannot contain elements of MapType; line 1 pos 7;
'Project [unresolvedalias(hash(map(), 42), None)]
+- OneRowRelation
```
When `spark.sql.legacy.useHashOnMapType` is set to `true`:
```
scala> spark.sql("set spark.sql.legacy.useHashOnMapType=true");
res3: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> spark.sql("select hash(map())").first()
res4: org.apache.spark.sql.Row = [42]
```
### Why are the changes needed?
As discussed in the JIRA ticket, Spark SQL's map hashcodes depend on the insertion order of the entries, which is inconsistent with normal Scala behaviour and might confuse users.
Code snippet from the JIRA ticket:
```
val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
val b = spark.createDataset(Map(2->2, 1->1) :: Nil)
// Demonstration of how Scala Map equality is unaffected by insertion order:
assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
assert(Map(1->1, 2->2) == Map(2->2, 1->1))
assert(a.first() == b.first())
// In contrast, this will print two different hashcodes:
println(Seq(a, b).map(_.selectExpr("hash(*)").first()))
```
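The order sensitivity comes from folding each map entry into a running seed with a non-commutative mix step. Below is a minimal, Spark-free sketch of that effect; `mixHash` and `hashEntries` are hypothetical stand-ins for illustration, not Spark's actual Murmur3 implementation:

```scala
// Hypothetical, simplified mix step: non-commutative, so the order in
// which values are folded in changes the result (similar in spirit to
// how streaming hash algorithms combine inputs).
def mixHash(seed: Int, value: Int): Int = {
  val h = seed * 31 + value
  h ^ (h >>> 16)
}

// Fold the (key, value) entries into the seed in iteration order.
def hashEntries(entries: Seq[(Int, Int)]): Int =
  entries.foldLeft(42) { case (h, (k, v)) => mixHash(mixHash(h, k), v) }

// Logically equal maps, different entry orders, different hashes.
val ab = hashEntries(Seq(1 -> 1, 2 -> 2))
val ba = hashEntries(Seq(2 -> 2, 1 -> 1))
assert(ab != ba)
```

Because the combine step is not commutative, folding the same entries in a different order produces a different result, which is why two logically equal maps can hash differently.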
Also `MapType` is prohibited for aggregation / joins / equality comparisons apache#7819 and set operations apache#17236.
### Does this PR introduce any user-facing change?
Yes. Users can no longer use hash functions on elements of `MapType`. To restore the previous behaviour, set `spark.sql.legacy.useHashOnMapType` to `true`.
### How was this patch tested?
UT added.
Closes apache#27580 from iRakson/SPARK-27619.
Authored-by: iRakson <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit c913b9d)
Signed-off-by: Wenchen Fan <[email protected]>
- Since Spark 3.0, when casting string values to integral types (tinyint, smallint, int and bigint), datetime types (date, timestamp and interval) and the boolean type, the leading and trailing whitespace (<= ASCII 32) is trimmed before conversion, e.g. `cast(' 1\t' as int)` results in `1`, `cast(' 1\t' as boolean)` results in `true`, and `cast('2019-10-10\t' as date)` results in the date value `2019-10-10`. In Spark version 2.4 and earlier, casting strings to integrals and booleans did not trim the whitespace from both ends, so the foregoing results would be `null`, while for datetimes only the trailing spaces (= ASCII 32) were removed.
- Since Spark 3.0, An analysis exception will be thrown when hash expressions are applied on elements of MapType. To restore the behavior before Spark 3.0, set `spark.sql.legacy.useHashOnMapType` to true.
nit.
`An` -> `an`
to true -> to `true`
      TypeCheckResult.TypeCheckFailure(
        s"input to function $prettyName requires at least one argument")
    } else if (children.exists(child => hasMapType(child.dataType)) &&
        !SQLConf.get.getConf(SQLConf.LEGACY_USE_HASH_ON_MAPTYPE)) {
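The shape of the final check can be sketched without any Spark dependencies. Everything below (`DataType`, `hasMapType`, `checkHashInputs`) is a simplified stand-in for the real Spark classes, intended only to illustrate the `exists` plus legacy-flag logic, not the actual implementation:

```scala
// Simplified stand-in for Spark's DataType hierarchy.
sealed trait DataType
case object IntType extends DataType
case class ArrayType(elem: DataType) extends DataType
case class MapType(key: DataType, value: DataType) extends DataType
case class StructType(fields: Seq[DataType]) extends DataType

// Recursively check whether a type contains a MapType anywhere.
def hasMapType(dt: DataType): Boolean = dt match {
  case _: MapType     => true
  case ArrayType(e)   => hasMapType(e)
  case StructType(fs) => fs.exists(hasMapType)
  case _              => false
}

// Mirrors the diff: fail on empty input, or when ANY child contains a
// MapType (hence `exists`, not `forall`) and the legacy flag is off.
def checkHashInputs(children: Seq[DataType], legacyAllowMap: Boolean): Either[String, Unit] =
  if (children.isEmpty) Left("requires at least one argument")
  else if (children.exists(hasMapType) && !legacyAllowMap)
    Left("cannot contain elements of MapType")
  else Right(())
```

Note that `exists` is the key fix from the review: with `forall`, `hash(map(...), 1)` would slip through because not every argument is a map.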
indentation.
@dongjoon-hyun should i raise a followup for these changes?
        .booleanConf
        .createWithDefault(false)

    val LEGACY_USE_HASH_ON_MAPTYPE = buildConf("spark.sql.legacy.useHashOnMapType")
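Putting the diff hunks together, the flag appears to be declared with SQLConf's config-builder DSL roughly as in the fragment below. The `.doc(...)` string here is illustrative, not the actual text from the PR:

```scala
// Config fragment sketch (requires Spark's internal SQLConf context; the
// doc text is an assumption, only the name and default come from the diff).
val LEGACY_USE_HASH_ON_MAPTYPE = buildConf("spark.sql.legacy.useHashOnMapType")
  .doc("When true, hash expressions such as hash() and xxhash64() accept " +
    "MapType arguments, restoring the behavior before Spark 3.0.")
  .booleanConf
  .createWithDefault(false)
```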
Maybe, allowHashOnMapType?
ah "allow" is more precise. @iRakson feel free to send a followup to address all the comments.
+1, late LGTM. You can ignore the above comment because those are too minor, @iRakson.