- 
                Notifications
    You must be signed in to change notification settings 
- Fork 28.9k
[SPARK-23914][SQL] Add array_union function #21061
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Test build #89317 has finished for PR 21061 at commit  
 | 
| cc @ueshin | 
| Test build #89320 has finished for PR 21061 at commit  
 | 
| Test build #89335 has finished for PR 21061 at commit  
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should implement codegen version instead of using CodegenFallback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to let us know why we should implement codegen version? From the performance view, the codegen may not have advantage since union method takes longer time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, do we want to keep the chain length of the whole-stage codegen as possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wholestage codegen doesn't support CodegenFallback. So even this expression codegen has no performance advantage itself, it still can makes a difference because it breaks a query to non wholestage codegen and wholestage codegen parts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Thank you for your clarification
26c30b9    to
    809621b      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also support array_union and array_except by changing this 2nd loop with small other changes. This is why we introduced ArraySetUtils in this PR.
Other PRs will update ArraySetUtils appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is an instance of the usage of ArraySetUtils.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the final version of ArraySetUtils that supports three functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this abstraction is good or not. The final version seems complex because of a bunch of if-else.
I'd rather introduce abstract methods for the difference and override them in the subclasses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your good suggestion.
I will create a new abstract method for this part which will be overridden by each of three subclasses
| Test build #89466 has finished for PR 21061 at commit  
 | 
| Test build #89468 has finished for PR 21061 at commit  
 | 
| Test build #89470 has finished for PR 21061 at commit  
 | 
| retest this please | 
| Test build #89474 has finished for PR 21061 at commit  
 | 
| Test build #89487 has finished for PR 21061 at commit  
 | 
| retest this please | 
| Test build #89493 has finished for PR 21061 at commit  
 | 
| Test build #89510 has finished for PR 21061 at commit  
 | 
| Test build #89515 has finished for PR 21061 at commit  
 | 
| Test build #89526 has finished for PR 21061 at commit  
 | 
| Test build #89539 has finished for PR 21061 at commit  
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we shouldn't use iterator() to avoid box/unbox. Iterator is not specialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, great catch. I confirmed there is not iterator(), which is specialized, in OpenHashSet$mcI$sp`. I will rewrite this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if one of the two arguments is null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. I cannot see such a test case in Presto.
Let me think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Umm, when only one of the arguments is null, unexpected TreeNodeException occurs.
checkEvaluation(ArrayUnion(a20, a30), Seq("b", "a", "c", null))
After applying rule org.apache.spark.sql.catalyst.optimizer.EliminateDistinct in batch Eliminate Distinct, the structural integrity of the plan is broken., tree:
'Project [array_union([b,a,c], [null,null]) AS Optimized(array_union([b,a,c], [null,null]))#71]
+- OneRowRelation
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: After applying rule org.apache.spark.sql.catalyst.optimizer.EliminateDistinct in batch Eliminate Distinct, the structural integrity of the plan is broken., tree:
'Project [array_union([b,a,c], [null,null]) AS Optimized(array_union([b,a,c], [null,null]))#71]
+- OneRowRelation
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:106)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
	at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper$class.checkEvaluationWithOptimization(ExpressionEvalHelper.scala:252)
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure its reason (maybe because the element types are different?), but I meant something like:
checkEvaluation(ArrayUnion(a20, Literal.create(null, ArrayType(StringType))), ...?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Thanks. Your example returns null. Since the following test throws an exception, I think that it makes sense that your example returns null. WDYT?
    val df8 = Seq((null, Array("a"))).toDF("a", "b")
    intercept[AnalysisException] {
      df8.select(array_union($"a", $"b"))
    }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning null sounds good, but what do you mean by "Since the following test throws an exception"? What exception is the test throwing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following error occurs. When I looked at other tests, it does not look strange. This is because null has no type information.
cannot resolve 'array_union(NULL, `b`)' due to data type mismatch: Element type in both arrays must be the same;;
'Project [array_union(null, b#118) AS array_union(a, b)#121]
+- AnalysisBarrier
      +- Project [_1#114 AS a#117, _2#115 AS b#118]
         +- LocalRelation [_1#114, _2#115]
org.apache.spark.sql.AnalysisException: cannot resolve 'array_union(NULL, `b`)' due to data type mismatch: Element type in both arrays must be the same;;
'Project [array_union(null, b#118) AS array_union(a, b)#121]
+- AnalysisBarrier
      +- Project [_1#114 AS a#117, _2#115 AS b#118]
         +- LocalRelation [_1#114, _2#115]
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Maybe the purpose of the test is not what I thought.
Seems like what I wanted is included in the latest updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be in ArraySetUtils object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this abstraction is good or not. The final version seems complex because of a bunch of if-else.
I'd rather introduce abstract methods for the difference and override them in the subclasses.
| Test build #89621 has finished for PR 21061 at commit  
 | 
| Test build #89624 has finished for PR 21061 at commit  
 | 
| Test build #89625 has finished for PR 21061 at commit  
 | 
| Test build #92716 has finished for PR 21061 at commit  
 | 
| retest this please | 
| Test build #92717 has finished for PR 21061 at commit  
 | 
| Test build #92718 has finished for PR 21061 at commit  
 | 
| Test build #92720 has finished for PR 21061 at commit  
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except for a few comments.
| ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull)) | ||
| case dt => dt | ||
| }.getOrElse(StringType) | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
override def dataType: DataType = {
  val dataTypes = children.map(_.dataType.asInstanceOf[ArrayType])
  ArrayType(elementType, dataTypes.exists(_.containsNull))
}should work?
| // store elements into resultArray | ||
| var nullElementSize = 0 | ||
| var pos = 0 | ||
| Seq(array1, array2).foreach(array => { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: foreach { array =>?
| val a30 = Literal.create(Seq(null, null), ArrayType(IntegerType)) | ||
| val a31 = Literal.create(null, ArrayType(StringType)) | ||
|  | ||
| checkEvaluation(ArrayUnion(a00, a01), UnsafeArrayData.fromPrimitiveArray(Array(1, 2, 3, 4))) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we don't need to use UnsafeArrayData here. Seq(1, 2, 3, 4) should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
containsNull instead of cn?
| return fromPrimitiveArray(null, offset, length, elementSize); | ||
| } | ||
|  | ||
| public static boolean useGenericArrayData(int elementSize, int length) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: canUseGenericArrayData
| @since(2.4) | ||
| def array_union(col1, col2): | ||
| """ | ||
| Collection function: returns an array of the elements in the union of col1 and col2, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the array of col1 contains duplicate elements itself, what it does? de-duplicate them too?
E.g.,
df = spark.createDataFrame([Row(c1=["b", "a", "c", "c"], c2=["c", "d", "a", "f"])])
df.select(array_union(df.c1, df.c2)).collect()There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading the code, seems it de-duplicates all elements from two arrays. Is this behavior the same as Presto?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add the tests for duplication.
Yes, this will de-duplicate. I think that it is the same behavior as Presto.
| private static UnsafeArrayData fromPrimitiveArray( | ||
| public static UnsafeArrayData fromPrimitiveArray( | ||
| Object arr, int offset, int length, int elementSize) { | ||
| final long headerInBytes = calculateHeaderPortionInBytes(length); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this logic extracted to useGenericArrayData? If so, can we re-use it by calling the method here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this thread an answer to this question?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
| } | ||
|  | ||
|  | ||
| abstract class ArraySetLike extends BinaryArrayExpressionWithImplicitCast { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Describe what ArraySetLike is intended for by adding comment?
| // calculate result array size | ||
| hsLong = new OpenHashSet[Long] | ||
| val elements = evalIntLongPrimitiveType(array1, array2, null, true) | ||
| hsLong = new OpenHashSet[Long] | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we obtain unique elements of two arrays in the hash set, can't we get final array elements from it directly instead of scanning two arrays again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be. Originally, I took that approach.
After discussed with @ueshin, I decided to generate a result array from the original arrays instead of the hash. This is because we generate a result array in a unique deterministic order among the different paths in array_union.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, though I think there will be some performance issue.
| var i = 0 | ||
| while (i < array.numElements()) { | ||
| if (array.isNullAt(i)) { | ||
| if (!foundNullElement) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This two if can be combined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is not easy since we want to do nothing if array.isNullAt(i) && foundNullElement is true.
| @since(2.4) | ||
| def array_union(col1, col2): | ||
| """ | ||
| Collection function: returns an array of the elements in the union of col1 and col2, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading the code, seems it de-duplicates all elements from two arrays. Is this behavior the same as Presto?
| Test build #92769 has finished for PR 21061 at commit  
 | 
| retest this please | 
| Test build #92779 has finished for PR 21061 at commit  
 | 
| retest this please | 
| Test build #92789 has finished for PR 21061 at commit  
 | 
| Jenkins, retest this please. | 
| Test build #92807 has finished for PR 21061 at commit  
 | 
| @viirya Do you have any other comments on this? Thanks! | 
| return fromPrimitiveArray(null, offset, length, elementSize); | ||
| } | ||
|  | ||
| public static boolean canUseGenericArrayData(int elementSize, int length) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry, I've suggested this naming, but seems it is should be something like shouldUseGenericArrayData.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think both make sense. I will follow your suggestion.
| s"get$ptName($i)", s"set$ptName($pos, $value)", CodeGenerator.javaType(elementType), | ||
| if (elementType == LongType) "(long)" else "(int)", | ||
| s""" | ||
| |${ctx.createUnsafeArray(unsafeArray, size, elementType, s" $prettyName failed.")} | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we don't automatically choose to use GenericArrayData as the same as interpreted path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your comment is correct. It would be good to address this choice in another PR to update ctx.createUnsafeArray.
cc: @ueshin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean a refactoring around the usage of createUnsafeArray through new collection functions in another PR? If so, I'm okay with doing it in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I mean a refactoring the usage of createUnsafeArray thru new collection functions.
| Few minor comments. otherwise LGTM. | 
| Test build #92861 has finished for PR 21061 at commit  
 | 
| SGTM… On Thu, Jul 12, 2018, 12:19 PM Takuya UESHIN ***@***.***> wrote:
 ***@***.**** commented on this pull request.
 ------------------------------
 In
 sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 <#21061 (comment)>:
 > +    val i = ctx.freshName("i")
 +    val pos = ctx.freshName("pos")
 +    val value = ctx.freshName("value")
 +    val size = ctx.freshName("size")
 +    val (postFix, openHashElementType, getter, setter, javaTypeName, castOp, arrayBuilder) =
 +      if (elementTypeSupportEquals) {
 +        elementType match {
 +          case ByteType | ShortType | IntegerType | LongType =>
 +            val ptName = CodeGenerator.primitiveTypeName(elementType)
 +            val unsafeArray = ctx.freshName("unsafeArray")
 +            (if (elementType == LongType) s"$$mcJ$$sp" else s"$$mcI$$sp",
 +              if (elementType == LongType) "Long" else "Int",
 +              s"get$ptName($i)", s"set$ptName($pos, $value)", CodeGenerator.javaType(elementType),
 +              if (elementType == LongType) "(long)" else "(int)",
 +              s"""
 +                 |${ctx.createUnsafeArray(unsafeArray, size, elementType, s" $prettyName failed.")}
 Do you mean a refactoring around the usage of createUnsafeArray through
 new collection functions in another PR? If so, I'm okay with doing it in
 another PR.
 —
 You are receiving this because you were mentioned.
 Reply to this email directly, view it on GitHub
 <#21061 (comment)>, or mute
 the thread
 <https://github.com/notifications/unsubscribe-auth/AAEM9yfWA_RnPbO_fsjFD4rCxpHL0glqks5uFsA3gaJpZM4TS6Ng>
 .
 | 
| Thanks! merging to master. | 
What changes were proposed in this pull request?
The PR adds the SQL function
array_union. The behavior of the function is based on Presto's one.This function returns returns an array of the elements in the union of array1 and array2.
Note: The order of elements in the result is not defined.
How was this patch tested?
Added UTs