Skip to content

Conversation

huaxingao
Copy link
Contributor

What changes were proposed in this pull request?

Add array_distinct to remove duplicate value from the array.

How was this patch tested?

Add unit tests

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation ..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seq.empty[Integer]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon Thanks for your comments. Will fix the problems. Sorry for the late reply.

@SparkQA
Copy link

SparkQA commented Apr 12, 2018

Test build #89236 has finished for PR 21050 at commit 3bfb23e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArrayDistinct(child: Expression)

@kiszk
Copy link
Member

kiszk commented Apr 12, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Apr 12, 2018

Test build #89276 has finished for PR 21050 at commit 3bfb23e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArrayDistinct(child: Expression)

@SparkQA
Copy link

SparkQA commented Apr 24, 2018

Test build #89793 has finished for PR 21050 at commit e3f7ee6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArrayDistinct(child: Expression)

@SparkQA
Copy link

SparkQA commented Apr 25, 2018

Test build #89813 has finished for PR 21050 at commit 91c3076.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 26, 2018

Test build #89862 has finished for PR 21050 at commit f55abae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't you check $array.isNullAt($j) in this loop as well? Especially, when elementType is primitive?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering about cases with big arrays with lots of duplicated values. In such cases, $tempArray is unnecesserily big. What about performing the filtering in two runs? The first run would calculate the result array size and the second would copy items from the source to the result?.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mn-mikke Thanks for your comments. I will have two runs as you suggested: one to get the distinct array size and another to copy the value. Will also add the $array.isNullAt($j)

@gatorsmile
Copy link
Member

cc @ueshin

@SparkQA
Copy link

SparkQA commented May 2, 2018

Test build #90003 has finished for PR 21050 at commit e57056b.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented May 2, 2018

Test build #90028 has finished for PR 21050 at commit e57056b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess nested loop must be very slow. Can you use a hash table or something to check duplication? Maybe OpenHashSet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ueshin Thanks for your comment. I will change the code to use OpenHashSet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to keep the element order? If so, does distinct keep the order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, distinct keeps the order.

@SparkQA
Copy link

SparkQA commented May 18, 2018

Test build #90766 has finished for PR 21050 at commit 647528f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@kiszk kiszk May 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, do we need to execute this then-block only for the first null element in $array?
In other words, if we introduce a flag like foundNullElement, we can avoid to execute a loop for the rest of null elements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kiszk Thanks for your comment. I have changed the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it work for boolean, float, or double?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, It works for boolean, float and double. Here are the generated code:

org.apache.spark.util.collection.OpenHashSet project_hs_1 = new org.apache.spark.util.collection.OpenHashSet(scala.reflect.ClassTag$.MODULE$.Boolean());

org.apache.spark.util.collection.OpenHashSet project_hs_1 = new org.apache.spark.util.collection.OpenHashSet(scala.reflect.ClassTag$.MODULE$.Double());

org.apache.spark.util.collection.OpenHashSet project_hs_1 = new org.apache.spark.util.collection.OpenHashSet(scala.reflect.ClassTag$.MODULE$.Float());

I also added UT for boolean, float and double.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, it should work. But, it is not specialized for boolean, float, or double. nvm.
Thanks.

@SparkQA
Copy link

SparkQA commented May 18, 2018

Test build #90765 has finished for PR 21050 at commit 9a42723.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@kiszk kiszk May 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add test cases with complex types (e.g. Array[Binary] or others)? See #21361.
cc @ueshin

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. Thanks!

@SparkQA
Copy link

SparkQA commented May 18, 2018

Test build #90801 has finished for PR 21050 at commit 8ae2db0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry for the super delay.
After I reviewed this for the first time, we found we can't use hash for some data types.
Could you see #21028 and follow the implementation there? In short, we can't support MapType element, and we need to use nested loop for BinaryType or other non-primitive types as you did before.
Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can skip some checks here for just counting the distinct array length, such as if (!($foundNullElement)) or if (!($hs.contains($getValue))).

We can simply do $foundNullElement = true if null found, otherwise $hs.add($getValue), and the length will be $hs.size() + ($foundNullElement ? 1 : 0).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ueshin Thanks for your review. I made some changes based on your comments. Could you please review one more time? Thanks!

@SparkQA
Copy link

SparkQA commented Jun 12, 2018

Test build #91726 has finished for PR 21050 at commit 298954f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 13, 2018

Test build #91732 has finished for PR 21050 at commit ba0d60f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|$openHashSet $hs = new $openHashSet($classTag);
|for (int $i = 0; $i < $array.numElements(); $i++) {
| if ($array.isNullAt($i)) {
| if (!($foundNullElement)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to check this and can do simply $foundNullElement = true;?

| }
| }
| else {
| if (!($hs.contains($getValue1))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

""".stripMargin
}

private def setValueForbruteForceEval(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: setValueForBruteForceEval?

s"""
|int $j;
|for ($j = 0; $j < $i; $j ++) {
| if (!$inputArray.isNullAt($j) && $isEqual)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like { is missing?

| $setValue;
| $pos = $pos + 1;
| }
""".stripMargin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like } is missing?

s"""
|$distinctArray[$pos] = null;
""".
stripMargin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okay to use s"$distinctArray[$pos] = null" for 1-line code.

s"""
|$distinctArray.setNullAt($pos);
""".
stripMargin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

if (!isPrimitive) {
s"""
|$distinctArray[$pos] = $getValue1;
""".stripMargin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

} else {
s"""
|$distinctArray.set$primitiveValueTypeName($pos, $getValue1);
""".stripMargin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

|${ev.value} = $distinctArray;
""".stripMargin
} else {
val setValueForbruteForce = setValueForbruteForceEval(true, i, j,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this clause because primitive types always support equals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ueshin Thanks fo your comments. I have made changes. Could you please take a look again? Thanks!

@SparkQA
Copy link

SparkQA commented Jun 17, 2018

Test build #91992 has finished for PR 21050 at commit 2df2cff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 17, 2018

Test build #91993 has finished for PR 21050 at commit 2d77880.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for some nits.

""".stripMargin
} else {
val setValueForBruteForce = setValueForBruteForceEval(false, i, j,
inputArray, distinctArray, pos, getValue1, isEqual, "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indent

val setNullForPrimitive = setNull(true, foundNullElement, distinctArray, pos)
val classTag = s"scala.reflect.ClassTag$$.MODULE$$.$primitiveValueTypeName()"
val setValueForFast =
setValueForFastEval(true, hs, distinctArray, pos, getValue1, primitiveValueTypeName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indent

}
j = j + 1
}
if (i == j-1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: (i == j - 1)?

} else {
var foundNullElement = false
var pos = 0
for(i <- 0 until data.length) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for (?

val openHashSet = classOf[OpenHashSet[_]].getName
val hs = ctx.freshName("hs")
val classTag = s"scala.reflect.ClassTag$$.MODULE$$.Object()"
if(elementTypeSupportEquals) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if (?

|$openHashSet $hs = new $openHashSet($classTag);
|for (int $i = 0; $i < $inputArray.numElements(); $i ++) {
| if ($inputArray.isNullAt($i)) {
| $setNullForPrimitive;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indent

@huaxingao
Copy link
Contributor Author

@ueshin @kiszk Thanks for your comments. I fixed the problems.
I am not sure if I should use $i++ or $i ++ in the for loop. It seems other people use $i ++, so I also used $i ++

@SparkQA
Copy link

SparkQA commented Jun 21, 2018

Test build #92148 has finished for PR 21050 at commit 3f5d03b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member

ueshin commented Jun 21, 2018

Thanks! merging to master.

@asfgit asfgit closed this in 9de11d3 Jun 21, 2018
@huaxingao
Copy link
Contributor Author

Thank you very much for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants