-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-23912][SQL]add array_distinct #21050
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seq.empty[Integer]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon Thanks for your comments. Will fix the problems. Sorry for the late reply.
Test build #89236 has finished for PR 21050 at commit
|
retest this please |
Test build #89276 has finished for PR 21050 at commit
|
Test build #89793 has finished for PR 21050 at commit
|
Test build #89813 has finished for PR 21050 at commit
|
Test build #89862 has finished for PR 21050 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't you check $array.isNullAt($j)
in this loop as well? Especially, when elementType
is primitive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering about cases with big arrays with lots of duplicated values. In such cases, $tempArray
is unnecesserily big. What about performing the filtering in two runs? The first run would calculate the result array size and the second would copy items from the source to the result?.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mn-mikke Thanks for your comments. I will have two runs as you suggested: one to get the distinct array size and another to copy the value. Will also add the $array.isNullAt($j)
cc @ueshin |
Test build #90003 has finished for PR 21050 at commit
|
retest this please |
Test build #90028 has finished for PR 21050 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess nested loop must be very slow. Can you use a hash table or something to check duplication? Maybe OpenHashSet
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin Thanks for your comment. I will change the code to use OpenHashSet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to keep the element order? If so, does distinct
keep the order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, distinct keeps the order.
Test build #90766 has finished for PR 21050 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, do we need to execute this then-block only for the first null element in $array
?
In other words, if we introduce a flag like foundNullElement
, we can avoid to execute a loop for the rest of null elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kiszk Thanks for your comment. I have changed the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it work for boolean
, float
, or double
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, It works for boolean
, float
and double
. Here are the generated code:
org.apache.spark.util.collection.OpenHashSet project_hs_1 = new org.apache.spark.util.collection.OpenHashSet(scala.reflect.ClassTag$.MODULE$.Boolean());
org.apache.spark.util.collection.OpenHashSet project_hs_1 = new org.apache.spark.util.collection.OpenHashSet(scala.reflect.ClassTag$.MODULE$.Double());
org.apache.spark.util.collection.OpenHashSet project_hs_1 = new org.apache.spark.util.collection.OpenHashSet(scala.reflect.ClassTag$.MODULE$.Float());
I also added UT for boolean
, float
and double
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, it should work. But, it is not specialized for boolean
, float
, or double
. nvm.
Thanks.
Test build #90765 has finished for PR 21050 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do. Thanks!
Test build #90801 has finished for PR 21050 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry for the super delay.
After I reviewed this for the first time, we found we can't use hash for some data types.
Could you see #21028 and follow the implementation there? In short, we can't support MapType
element, and we need to use nested loop for BinaryType
or other non-primitive types as you did before.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can skip some checks here for just counting the distinct array length, such as if (!($foundNullElement))
or if (!($hs.contains($getValue)))
.
We can simply do $foundNullElement = true
if null found, otherwise $hs.add($getValue)
, and the length will be $hs.size() + ($foundNullElement ? 1 : 0)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin Thanks for your review. I made some changes based on your comments. Could you please review one more time? Thanks!
Test build #91726 has finished for PR 21050 at commit
|
Test build #91732 has finished for PR 21050 at commit
|
|$openHashSet $hs = new $openHashSet($classTag); | ||
|for (int $i = 0; $i < $array.numElements(); $i++) { | ||
| if ($array.isNullAt($i)) { | ||
| if (!($foundNullElement)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to check this and can do simply $foundNullElement = true;
?
| } | ||
| } | ||
| else { | ||
| if (!($hs.contains($getValue1))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
""".stripMargin | ||
} | ||
|
||
private def setValueForbruteForceEval( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: setValueForBruteForceEval
?
s""" | ||
|int $j; | ||
|for ($j = 0; $j < $i; $j ++) { | ||
| if (!$inputArray.isNullAt($j) && $isEqual) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like {
is missing?
| $setValue; | ||
| $pos = $pos + 1; | ||
| } | ||
""".stripMargin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like }
is missing?
s""" | ||
|$distinctArray[$pos] = null; | ||
""". | ||
stripMargin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's okay to use s"$distinctArray[$pos] = null"
for 1-line code.
s""" | ||
|$distinctArray.setNullAt($pos); | ||
""". | ||
stripMargin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
if (!isPrimitive) { | ||
s""" | ||
|$distinctArray[$pos] = $getValue1; | ||
""".stripMargin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
} else { | ||
s""" | ||
|$distinctArray.set$primitiveValueTypeName($pos, $getValue1); | ||
""".stripMargin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
|${ev.value} = $distinctArray; | ||
""".stripMargin | ||
} else { | ||
val setValueForbruteForce = setValueForbruteForceEval(true, i, j, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this clause because primitive types always support equals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin Thanks fo your comments. I have made changes. Could you please take a look again? Thanks!
Test build #91992 has finished for PR 21050 at commit
|
Test build #91993 has finished for PR 21050 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except for some nits.
""".stripMargin | ||
} else { | ||
val setValueForBruteForce = setValueForBruteForceEval(false, i, j, | ||
inputArray, distinctArray, pos, getValue1, isEqual, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indent
val setNullForPrimitive = setNull(true, foundNullElement, distinctArray, pos) | ||
val classTag = s"scala.reflect.ClassTag$$.MODULE$$.$primitiveValueTypeName()" | ||
val setValueForFast = | ||
setValueForFastEval(true, hs, distinctArray, pos, getValue1, primitiveValueTypeName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indent
} | ||
j = j + 1 | ||
} | ||
if (i == j-1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: (i == j - 1)
?
} else { | ||
var foundNullElement = false | ||
var pos = 0 | ||
for(i <- 0 until data.length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: for (
?
val openHashSet = classOf[OpenHashSet[_]].getName | ||
val hs = ctx.freshName("hs") | ||
val classTag = s"scala.reflect.ClassTag$$.MODULE$$.Object()" | ||
if(elementTypeSupportEquals) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if (
?
|$openHashSet $hs = new $openHashSet($classTag); | ||
|for (int $i = 0; $i < $inputArray.numElements(); $i ++) { | ||
| if ($inputArray.isNullAt($i)) { | ||
| $setNullForPrimitive; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indent
Test build #92148 has finished for PR 21050 at commit
|
Thanks! merging to master. |
Thank you very much for your help! |
What changes were proposed in this pull request?
Add array_distinct to remove duplicate value from the array.
How was this patch tested?
Add unit tests