[SPARK-27619][SQL] MapType should be prohibited in hash expressions #27580
Conversation
OK to test

sounds reasonable, cc @hvanhovell
the error message should mention the legacy config and how to restore the old behavior
we should also warn users about the consequence: logically same maps may have different hashcode.
ok to test

Test build #118432 has finished for PR 27580 at commit

Test build #118466 has finished for PR 27580 at commit

Retest this please.
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
nit: leading s seems unnecessary in this line and the next line.
Seems like it doesn't need to be protected at this moment.
nit: the doc at the end. It has to be in the next line.
All review comments are handled. Please check once. @HyukjinKwon @cloud-fan
nit: spark.sql.legacy.useHashOnMapType -> SQLConf.LEGACY_USE_HASH_ON_MAPTYPE.key
I used spark.sql.legacy.useHashOnMapType to maintain consistency with what I updated in the migration guide.
We should not hardcode the config name. Better to use ${SQLConf.LEGACY_USE_HASH_ON_MAPTYPE.key}
nit: true.toString() -> "true"
Test build #118532 has finished for PR 27580 at commit

Test build #118536 has finished for PR 27580 at commit

retest this please

Test build #118554 has finished for PR 27580 at commit
gentle ping @cloud-fan @HyukjinKwon @maropu

Test build #118662 has finished for PR 27580 at commit

retest this please.

retest this please

Test build #118778 has finished for PR 27580 at commit

gentle ping @maropu @HyukjinKwon @cloud-fan
    if (children.length < 1) {
      TypeCheckResult.TypeCheckFailure(
        s"input to function $prettyName requires at least one argument")
    } else if (children.forall(child => hasMapType(child.dataType)) &&
forall -> exists?
    } else if (children.forall(child => hasMapType(child.dataType)) &&
        !SQLConf.get.getConf(SQLConf.LEGACY_USE_HASH_ON_MAPTYPE)) {
      TypeCheckResult.TypeCheckFailure(
        s"input to function $prettyName cannot contain elements of MapType. Logically same maps " +
Logically -> In Spark,
Test build #118920 has finished for PR 27580 at commit

@iRakson It seems a valid test failure.

Test build #118946 has finished for PR 27580 at commit

retest this please

Test build #118960 has finished for PR 27580 at commit

thanks, merging to master/3.0!
### What changes were proposed in this pull request?
`hash()` and `xxhash64()` can no longer be used on elements of `MapType`. A new configuration, `spark.sql.legacy.useHashOnMapType`, is introduced to allow users to restore the previous behaviour.
When `spark.sql.legacy.useHashOnMapType` is set to false:
```
scala> spark.sql("select hash(map())");
org.apache.spark.sql.AnalysisException: cannot resolve 'hash(map())' due to data type mismatch: input to function hash cannot contain elements of MapType; line 1 pos 7;
'Project [unresolvedalias(hash(map(), 42), None)]
+- OneRowRelation
```
When `spark.sql.legacy.useHashOnMapType` is set to `true`:
```
scala> spark.sql("set spark.sql.legacy.useHashOnMapType=true");
res3: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> spark.sql("select hash(map())").first()
res4: org.apache.spark.sql.Row = [42]
```
### Why are the changes needed?
As discussed in the JIRA ticket, Spark SQL's map hashcodes depend on the insertion order of the entries, which is inconsistent with normal Scala behaviour and might confuse users.
Code snippet from the JIRA ticket:
```
val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
val b = spark.createDataset(Map(2->2, 1->1) :: Nil)
// Demonstration of how Scala Map equality is unaffected by insertion order:
assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
assert(Map(1->1, 2->2) == Map(2->2, 1->1))
assert(a.first() == b.first())
// In contrast, this will print two different hashcodes:
println(Seq(a, b).map(_.selectExpr("hash(*)").first()))
```
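The order sensitivity comes from folding each map entry into a running seed with a non-commutative mix step. Below is a minimal, Spark-free sketch of that effect; `mixHash` and `hashEntries` are hypothetical stand-ins for illustration, not Spark's actual Murmur3 implementation:

```scala
// Hypothetical, simplified mix step: non-commutative, so the order in
// which values are folded in changes the result (similar in spirit to
// how streaming hash algorithms combine inputs).
def mixHash(seed: Int, value: Int): Int = {
  val h = seed * 31 + value
  h ^ (h >>> 16)
}

// Fold the (key, value) entries into the seed in iteration order.
def hashEntries(entries: Seq[(Int, Int)]): Int =
  entries.foldLeft(42) { case (h, (k, v)) => mixHash(mixHash(h, k), v) }

// Logically equal maps, different entry orders, different hashes.
val ab = hashEntries(Seq(1 -> 1, 2 -> 2))
val ba = hashEntries(Seq(2 -> 2, 1 -> 1))
assert(ab != ba)
```

Because the combine step is not commutative, folding the same entries in a different order produces a different result, which is why two logically equal maps can hash differently.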
Also `MapType` is prohibited for aggregation / joins / equality comparisons apache#7819 and set operations apache#17236.
### Does this PR introduce any user-facing change?
Yes. Users can no longer use hash functions on elements of `MapType`. To restore the previous behaviour, set `spark.sql.legacy.useHashOnMapType` to `true`.
### How was this patch tested?
UT added.
Closes apache#27580 from iRakson/SPARK-27619.
Authored-by: iRakson <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit c913b9d)
Signed-off-by: Wenchen Fan <[email protected]>
- Since Spark 3.0, when casting string values to integral types (tinyint, smallint, int and bigint), datetime types (date, timestamp and interval) and the boolean type, the leading and trailing whitespace (<= ASCII 32) is trimmed before conversion, e.g. `cast(' 1\t' as int)` results in `1`, `cast(' 1\t' as boolean)` results in `true`, and `cast('2019-10-10\t' as date)` results in the date value `2019-10-10`. In Spark version 2.4 and earlier, casting strings to integrals and booleans did not trim the whitespace from both ends, so the foregoing results would be `null`, while for datetimes only the trailing spaces (= ASCII 32) were removed.
- Since Spark 3.0, An analysis exception will be thrown when hash expressions are applied on elements of MapType. To restore the behavior before Spark 3.0, set `spark.sql.legacy.useHashOnMapType` to true.
nit.
`An` -> `an`
to true -> to `true`
      TypeCheckResult.TypeCheckFailure(
        s"input to function $prettyName requires at least one argument")
    } else if (children.exists(child => hasMapType(child.dataType)) &&
        !SQLConf.get.getConf(SQLConf.LEGACY_USE_HASH_ON_MAPTYPE)) {
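The shape of the final check can be sketched without any Spark dependencies. Everything below (`DataType`, `hasMapType`, `checkHashInputs`) is a simplified stand-in for the real Spark classes, intended only to illustrate the `exists` plus legacy-flag logic, not the actual implementation:

```scala
// Simplified stand-in for Spark's DataType hierarchy.
sealed trait DataType
case object IntType extends DataType
case class ArrayType(elem: DataType) extends DataType
case class MapType(key: DataType, value: DataType) extends DataType
case class StructType(fields: Seq[DataType]) extends DataType

// Recursively check whether a type contains a MapType anywhere.
def hasMapType(dt: DataType): Boolean = dt match {
  case _: MapType     => true
  case ArrayType(e)   => hasMapType(e)
  case StructType(fs) => fs.exists(hasMapType)
  case _              => false
}

// Mirrors the diff: fail on empty input, or when ANY child contains a
// MapType (hence `exists`, not `forall`) and the legacy flag is off.
def checkHashInputs(children: Seq[DataType], legacyAllowMap: Boolean): Either[String, Unit] =
  if (children.isEmpty) Left("requires at least one argument")
  else if (children.exists(hasMapType) && !legacyAllowMap)
    Left("cannot contain elements of MapType")
  else Right(())
```

Note that `exists` is the key fix from the review: with `forall`, `hash(map(...), 1)` would slip through because not every argument is a map.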
indentation.
@dongjoon-hyun should i raise a followup for these changes?
        .booleanConf
        .createWithDefault(false)

    val LEGACY_USE_HASH_ON_MAPTYPE = buildConf("spark.sql.legacy.useHashOnMapType")
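Putting the diff hunks together, the flag appears to be declared with SQLConf's config-builder DSL roughly as in the fragment below. The `.doc(...)` string here is illustrative, not the actual text from the PR:

```scala
// Config fragment sketch (requires Spark's internal SQLConf context; the
// doc text is an assumption, only the name and default come from the diff).
val LEGACY_USE_HASH_ON_MAPTYPE = buildConf("spark.sql.legacy.useHashOnMapType")
  .doc("When true, hash expressions such as hash() and xxhash64() accept " +
    "MapType arguments, restoring the behavior before Spark 3.0.")
  .booleanConf
  .createWithDefault(false)
```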
Maybe, allowHashOnMapType?
ah "allow" is more precise. @iRakson feel free to send a followup to address all the comments.
+1, late LGTM. You can ignore the above comment because those are too minor, @iRakson.