-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-33386][SQL] Accessing array elements in ElementAt/Elt/GetArrayItem should failed if index is out of bound #30297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #130781 has finished for PR 30297 at commit
|
|
cc @maropu |
|
I think we need to update the ANSI doc accordingly if we change the behaviours: https://github.com/apache/spark/blame/master/docs/sql-ref-ansi-compliance.md#L110-L114 btw, is this change related to ANSI? The ANSI standard says something about these kinds of behaviours? @gengliangwang |
|
Also, its better to update the doc of the ANSI config (e.g., spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala Lines 2145 to 2149 in 90f6f39
|
| } | ||
| assert(ex.getMessage.contains("Invalid index: 4")) | ||
| } else { | ||
| checkAnswer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could you put this in a single line?
checkAnswer(df, Row(null))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add tests for -1 or 0?
| } | ||
| assert(ex.getMessage.contains("Invalid index: 5")) | ||
| } else { | ||
| checkAnswer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add fine-grained tests for element_at and elt via checkExceptionInExpression?
| """, | ||
| since = "2.4.0") | ||
| case class ElementAt(left: Expression, right: Expression) | ||
| case class ElementAt( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the usage above of ExpressionDescription , too?
| since = "2.0.0") | ||
| // scalastyle:on line.size.limit | ||
| case class Elt(children: Seq[Expression]) extends Expression { | ||
| case class Elt( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Seq(true, false).foreach { ansiEnabled => | ||
| withSQLConf(SQLConf.ANSI_ENABLED.key -> ansiEnabled.toString) { | ||
| val typeA = ArrayType(StringType) | ||
| val array = Literal.create(Seq("a", "b"), typeA) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: val array = Literal.create(Seq("a", "b"), ArrayType(StringType))?
| val ex = intercept[Exception] { | ||
| checkEvaluation(GetArrayItem(array, Literal(5)), null) | ||
| } | ||
| assert(stackTraceToString(ex).contains("Invalid index: 5")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add tests for -1?
|
|
||
| if (ansiEnabled) { | ||
| val ex = intercept[Exception] { | ||
| checkEvaluation(GetArrayItem(array, Literal(5)), null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use checkExceptionInExpression instead?
Change-Id: If7650cc45aa30fd3d6549536a8af6ca01a746c39
2. add more UT scenario. 3. update nullability calculation for ElementAt and update UT. Change-Id: Ieec69fe5171dfee2863ad93f84bb660076214de0
d1e0ae3 to
7dcba56
Compare
|
@maropu updated with all your comments, plz have a look when you have time. ^_^ |
| accesses elements from the last to the first. Returns NULL if the index exceeds the length | ||
| of the array. | ||
| accesses elements from the last to the first. If the index exceeds the length of the array, | ||
| Returns NULL if Ansi mode is off; Throws ArrayIndexOutOfBoundsException when Ansi mode is on. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ansi -> ANSI
| false | ||
| } else if (elements.length < math.abs(ordinal)) { | ||
| true | ||
| if (failOnError) false else true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: !failOnFalse
| if ($index >= $eval1.numElements() || $index < 0$nullCheck) { | ||
| if ($index >= $eval1.numElements() || $index < 0) { | ||
| $failOnErrorBranch | ||
| } else if (false$nullCheck) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
false$nullCheck -> $nullCheck
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, we can get rid of the entire else if (false$nullCheck) if containsNull == false.
| usage = "_FUNC_(n, input1, input2, ...) - Returns the `n`-th input, e.g., returns `input2` when `n` is 2.", | ||
| usage = """ | ||
| _FUNC_(n, input1, input2, ...) - Returns the `n`-th input, e.g., returns `input2` when `n` is 2. | ||
| If the index exceeds the length of the array, Returns NULL if Ansi mode is off; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, Ansi -> ANSI
| "field. 2. Spark will forbid using the reserved keywords of ANSI SQL as identifiers in " + | ||
| "the SQL parser.") | ||
| "the SQL parser. 3. Spark will returns null for null input for function `size`. " + | ||
| "4. Spark will throw ArrayIndexOutOfBoundsException if invalid indices " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can merge 1 and 4: 1. Spark will throw an exception at runtime if the inputs to a SQL operator/function are invalid, e.g. overflow in arithmetic operations, out-of-range index when accessing array elements.
|
|
||
| // CreateArray case invalid indices | ||
| assert(!ElementAt(array, Literal(0)).nullable) | ||
| if (ansiEnabled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: assert(ElementAt(array, Literal(4)).nullable == !ansiEnabled)
|
|
||
| // GetArrayStructFields case invalid indices | ||
| assert(!ElementAt(stArray3, Literal(0)).nullable) | ||
| if (ansiEnabled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
| Row(Map(Seq(Row(1, "a")) -> 1))) | ||
| } | ||
|
|
||
| test("SPARK-33391: element_at ArrayIndexOutOfBoundsException") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not mix unit test and end-to-end test. Let's add UT in CollectionExpressionsSuite(for ElementAt) and StringExpressionsSuite (for Elt).
For end-to-end test, we can put them in DataFrameSuite for all the three functions/operators.
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #130868 has finished for PR 30297 at commit
|
| checkAnswer(df, Row("2", "2")) | ||
| } | ||
|
|
||
| test("SPARK-33391: element_at ArrayIndexOutOfBoundsException") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, can we follow datetime.sql, which is run twice with and without ANSI mode? array.sql may be a good place to put the new end-to-end tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for @cloud-fan 's suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we move to array.sql, SPARK-33391 will be removed accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me too.
|
@leanken could you update the PR description. I think there is user-facing change when ANSI mode is enabled. |
| Seq(Date.valueOf("2018-01-01"))) | ||
| } | ||
|
|
||
| test("SPARK-33391: element_at ArrayIndexOutOfBoundsException") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, SPARK-33386 is better than SPARK-33391 because this test name has ArrayIndexOutOfBoundsException.
| checkEvaluation(GetArrayItem(nestedArray, Literal(0)), Seq(1)) | ||
| } | ||
|
|
||
| test("SPARK-33391: GetArrayItem ArrayIndexOutOfBoundsException") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto SPARK-33386.
| Sentences(Literal("\"quote"), Literal("\"quote"), Literal("\"quote")) :: Nil) | ||
| } | ||
|
|
||
| test("SPARK-33391: elt ArrayIndexOutOfBoundsException") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto SPARK-33386.
| } | ||
|
|
||
| val failOnErrorBranch = if (failOnError) { | ||
| s"""throw new ArrayIndexOutOfBoundsException("Invalid index: " + $index);""".stripMargin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we remove .stripMargin because this is a single line?
| } | ||
|
|
||
| val failOnErrorBranch = if (failOnError) { | ||
| s"""throw new ArrayIndexOutOfBoundsException("Invalid index: " + $index);""".stripMargin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we remove .stripMargin because this is a single line?
| If `spark.sql.ansi.enabled` is set to true, it throws ArrayIndexOutOfBoundsException | ||
| for invalid indices. | ||
| _FUNC_(map, key) - Returns value for given key, or NULL if the key is not contained in the map |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to make ElementAtbehavior consistent on map type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's about to support ANSI mode for map type in next PR.
| if (array.numElements() < math.abs(index)) { | ||
| null | ||
| if (failOnError) { | ||
| throw new ArrayIndexOutOfBoundsException(s"Invalid index: $index") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we include the total number of elements too in the error message? sometimes that is helpful for debugging.
| "throw an exception at runtime if the inputs to a SQL operator/function are invalid, " + | ||
| "e.g. overflow in arithmetic operations, out-of-range index when accessing array elements. " + | ||
| "2. Spark will forbid using the reserved keywords of ANSI SQL as identifiers in " + | ||
| "the SQL parser. 3. Spark will returns null for null input for function `size`.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will returns -> will return, and perhaps capital NULL?
2. move DataFrameSuite code into array.sql and ansi/array.sql 3. add numElements to exception message. 4. other code refine. Change-Id: Ieb322ed7b036fc3322fd3b814c8508bfef266378
|
@sunchao @dongjoon-hyun @cloud-fan updated, resolved your comments. |
updated. |
|
Kubernetes integration test starting |
| "" | ||
| } | ||
|
|
||
| val failOnErrorBranch = if (failOnError) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it should be indexOutOfBoundBranch
| "" | ||
| } | ||
|
|
||
| val failOnErrorBranch = if (failOnError) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it should be indexOutOfBoundBranch
| """.stripMargin | ||
| }.mkString) | ||
|
|
||
| val failOnErrorBranch = if (failOnError) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it should be indexOutOfBoundBranch
|
Kubernetes integration test status success |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #130962 has finished for PR 30297 at commit
|
|
Test build #130968 has finished for PR 30297 at commit
|
|
GA passed, merging to master, thanks! |
What changes were proposed in this pull request?
Instead of returning NULL, throws runtime ArrayIndexOutOfBoundsException when ansiMode is enable for
element_at,elt,GetArrayItemfunctions.Why are the changes needed?
For ansiMode.
Does this PR introduce any user-facing change?
When
spark.sql.ansi.enabled= true, Spark will throwArrayIndexOutOfBoundsExceptionif out-of-range index when accessing array elementsHow was this patch tested?
Added UT and existing UT.