
Conversation

@hvanhovell
Contributor

This PR implements a HyperLogLog-based approximate count distinct function using the new UDAF interface.

The implementation is inspired by the ClearSpring HyperLogLog implementation and should produce the same results.

There is still some documentation and testing left to do.

cc @yhuai

@yhuai
Contributor

yhuai commented Aug 21, 2015

ok to test

@rxin
Contributor

rxin commented Aug 21, 2015

This made my day. The approach is super cool.

Couple suggestions:

  1. Can we use HyperLogLogPlus? It's also in streamlib: https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java
  2. Can we write this in a way to make it more unit testable?

Beyond this, would be cool to have count-min sketch too! (future work) In the past I had created a ticket to track streaming algorithms: https://issues.apache.org/jira/browse/SPARK-6760

@SparkQA

SparkQA commented Aug 21, 2015

Test build #41377 has finished for PR 8362 at commit e178d9e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class HyperLogLog(child: Expression, relativeSD: Double = 0.05)

@hvanhovell
Contributor Author

Thanks.

I was aiming for compatibility with the existing approxCountDistinct, but we can also implement HLL++. HLL++ introduces three (orthogonal) refinements: 64-bit hashing, better low-cardinality corrections, and a sparse encoding scheme. The first two refinements are easy to add; the third will require a bit more effort.
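For readers unfamiliar with the register mechanics under discussion, here is a minimal pure-Python sketch of the dense HLL update and raw estimate (illustrative only — it follows the standard HLL formulation, not the patch's Scala code; the precision `P` is arbitrary and the low-cardinality corrections that HLL++ refines are omitted):

```python
import hashlib

P = 9          # precision (illustrative); m = 2**P registers
M = 1 << P
MASK64 = (1 << 64) - 1

def hash64(value):
    # Stand-in 64-bit hash; the patch uses a MurmurHash variant.
    digest = hashlib.sha1(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def update(registers, value):
    h = hash64(value)
    idx = h >> (64 - P)                # first P bits select a register
    rest = (h << P) & MASK64           # remaining 64 - P bits
    rank = 1                           # position of the leftmost 1-bit
    while rank <= 64 - P and rest & (1 << 63) == 0:
        rank += 1
        rest = (rest << 1) & MASK64
    registers[idx] = max(registers[idx], rank)

registers = [0] * M
for v in range(100000):
    update(registers, v)

# Raw estimate; HLL++'s refinements correct its bias at low cardinalities.
alpha = 0.7213 / (1 + 1.079 / M)
estimate = alpha * M * M / sum(2.0 ** -r for r in registers)
```

With m = 512 registers the expected relative error is about 1.04/sqrt(m) ≈ 4.6%, which is why testing can only assert that results land within a tolerance of the true cardinality.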

Unit testing this is a bit of a challenge. End-to-end (blackbox) testing is no problem, as long as we know what the result should be, or if we do random testing (results should be within 5% of the actual value). Testing parts of the algorithm is a bit of a PITA:

  • It is hard to reason about the results (the updated registers) HLL produces.
  • Register access code and HLL code are intertwined.

Both the ClearSpring and AggregateKnowledge implementations resort to blackbox testing. I will create some blackbox tests.

@rxin
Contributor

rxin commented Aug 21, 2015

Thanks - I think blackbox testing is fine. But it would be great to apply that at the "unit" testing level, i.e. running directly against the aggregate function, rather than against Spark SQL end to end.

@hvanhovell
Contributor Author

Implemented the initial non-sparse HLL++. I am going to take a look at the sparse version next week. The results are still equal to the ClearSpring HLL++ implementation in non-sparse mode.

I also need to clean up the docs for the main HLL++ class a bit.

@SparkQA

SparkQA commented Aug 28, 2015

Test build #41719 has finished for PR 8362 at commit 1ea722b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class HyperLogLogPlusPlus(child: Expression, relativeSD: Double = 0.05)

@rxin
Contributor

rxin commented Sep 7, 2015

@hvanhovell do you mind closing this pull request, and re-open when you feel it is ready for review again?

@hvanhovell
Contributor Author

@rxin the dense version of HLL++ is ready. We could also add this, and add the sparse logic in a follow-up PR. Let me know what you think. I'll close if you'd rather do everything in one go.

@rxin
Contributor

rxin commented Sep 8, 2015

Ah ok. Will add this to our sprint backlog and get somebody to review it soon.

@rxin
Contributor

rxin commented Sep 10, 2015

One quick note: https://github.com/twitter/algebird/pull/491/files

Anything we can learn from the above PR?

@MLnick
Contributor

MLnick commented Sep 14, 2015

@hvanhovell as discussed on the dev mailing list, perhaps it would be interesting to allow the return type to include the aggregated HLL registers. This could be (for example) in the form of StructType {'cardinality':Long, 'hll': Array[Byte]}, where the hll is in the same serialized form that can be used to instantiate say a StreamLib or Algebird HLL class for use outside of Spark.

Is it possible to specify input arguments for rsd? So SELECT APPROX DISTINCT(column, 0.1) FROM ...? If so, then another option is to add a further argument such as returnHLL: Boolean = false so that either the raw HLL or the cardinality is returned?

@MLnick
Contributor

MLnick commented Sep 14, 2015

@hvanhovell @rxin is it intended that this replace the existing approxCountDistinct implementation? And I assume this will happen automatically due to extending AggregateFunction2?

@yhuai
Contributor

yhuai commented Sep 14, 2015

@MLnick This one will replace the existing implementation. For now, we will do conversion as shown at https://github.com/apache/spark/pull/8362/files#diff-78b9b210b8cee72e7097bc1af44bd315L98. Later, we will remove the old implementation (AggregateFunction1 interface).

@hvanhovell
Contributor Author

@MLnick I am in the process of moving house, so I am a bit slow/late with my response :(...

I think it would be very useful to be able to return the HLL registers to users (it could also be nice to use in cost-based planning). I would rather give it a different name though, createHLLRegisters for instance (the name needs work), to make it clear that we are doing something different.

The UDAF should support an rsd parameter. Doesn't it? I'll add a test.


Should we round it up?

@davies
Contributor

davies commented Sep 28, 2015

I took a pass over this; it looks pretty good to me overall.

Currently, each grouping key needs 200 bytes (b=8 by default), so the sparse version could help reduce memory usage when the average number of distinct values is small (I believe that's a common case). Since we already support external aggregation (sort-based), it's not critical; it could be an optional improvement (separate PR).

I ran a small benchmark for this patch, and I'm surprised it's slower than 1.5 (using the old aggregation). The test code:

import time
from pyspark.sql.functions import approxCountDistinct

df = sqlContext.range(1 << 25).agg(approxCountDistinct("id"))
df.explain()
t = time.time()
print df.collect()
print time.time() - t

It took 3.4 seconds in 1.5, but 6.4 seconds with this patch.

The plan in 1.5:

Aggregate false, [APPROXIMATE COUNT(DISTINCT PartialApproxCountDistinct#2) AS APPROXIMATE COUNT(DISTINCT id)#1L]
 Exchange SinglePartition
  Aggregate true, [APPROXIMATE COUNT(DISTINCT id#0L) AS PartialApproxCountDistinct#2]
   Scan PhysicalRDD[id#0L]

The plan with this patch:

SortBasedAggregate(key=[], functions=[(hyperloglogplusplus(id#0L),mode=Final,isDistinct=false)], output=[APPROXIMATE COUNT(DISTINCT id)#1L])
 ConvertToSafe
  TungstenExchange SinglePartition
   ConvertToUnsafe
    SortBasedAggregate(key=[], functions=[(hyperloglogplusplus(id#0L),mode=Partial,isDistinct=false)], output=[MS[0]#30L,MS[1]#31L,MS[2]#32L,MS[3]#33L,MS[4]#34L,MS[5]#35L,MS[6]#36L,MS[7]#37L,MS[8]#38L,MS[9]#39L,MS[10]#40L,MS[11]#41L,MS[12]#42L,MS[13]#43L,MS[14]#44L,MS[15]#45L,MS[16]#46L,MS[17]#47L,MS[18]#48L,MS[19]#49L,MS[20]#50L,MS[21]#51L,MS[22]#52L,MS[23]#53L,MS[24]#54L,MS[25]#55L])
     Scan PhysicalRDD[id#0L]

Discussed this with @yhuai; the slowness may come from the new aggregation, which only supports AlgebraicAggregate in hash mode. We will fix that in 1.6.
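As an aside, the 200-bytes-per-key figure lines up with the 26 `MS[...]` buffer columns in the partial-aggregate plan above. A quick sketch of the arithmetic (the ten-registers-per-word packing is an assumption about the buffer layout, inferred from the plan output):

```python
b = 8                      # precision bits (the default mentioned above)
m = 1 << b                 # 256 six-bit registers
registers_per_word = 64 // 6               # 10 registers fit in a long
words = -(-m // registers_per_word)        # ceil(256 / 10) = 26 longs
buffer_bytes = words * 8                   # 208 bytes per grouping key
```

That is exactly the `MS[0]` through `MS[25]` columns in the partial aggregate's output, at 8 bytes each.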

@rxin
Contributor

rxin commented Sep 29, 2015

@yhuai can we make the non-codegen path use tungsten aggregate as well? Otherwise we would need to maintain two entirely separate codepaths.

@davies
Contributor

davies commented Sep 29, 2015

@hvanhovell @rxin Just realized that the tungsten aggregation does not support variable-length types in the aggregation buffer, so we can't have the sparse version without aggregation changes.


new line here.

@rxin
Contributor

rxin commented Sep 29, 2015

We can work on improving the aggregate operator.

@SparkQA

SparkQA commented Sep 30, 2015

Test build #43132 has finished for PR 8362 at commit a5fdd07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class HyperLogLogPlusPlus(child: Expression, relativeSD: Double = 0.05)

@davies
Contributor

davies commented Sep 30, 2015

LGTM, merging this into master, thanks!

@asfgit asfgit closed this in 16fd2a2 Sep 30, 2015

@hvanhovell Does HLL++ require using hash64? I took a look at its implementation. Looks like we will convert the input value to a Java string in many cases (https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/hash/MurmurHash.java#L135-L159). For our old function, the offer method of the HyperLogLog class uses hash internally, which has some specializations.

@hvanhovell
Contributor Author

@yhuai It doesn't. A 64-bit hash code is recommended though, especially when you want to approximate a billion or more unique values. I used the ClearSpring hash code because it enabled me to compare the results of my HLL++ implementation to theirs.

We could replace it with another, better-performing one; don't we have one in Spark? We could also scale down to 32 bits...

@hvanhovell
Contributor Author

A good article on HLL++ and the hashcode: http://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll

@yhuai
Contributor

yhuai commented Oct 14, 2015

Thanks for the pointer. Looks like we only have a 32-bit Murmur3 hasher in Spark's unsafe module (https://github.com/apache/spark/blob/master/unsafe/src/main/java/org/apache/spark/unsafe/hash/Murmur3_x86_32.java).
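To put numbers on the 32-vs-64-bit question: under an ideal w-bit hash, n distinct inputs collapse to roughly 2^w · (1 − e^(−n/2^w)) distinct hash values, and the sketch can only count the latter. A quick back-of-the-envelope (standard collision arithmetic, not from the patch):

```python
import math

def expected_distinct_hashes(n, bits=32):
    # Expected number of distinct hash values produced by n distinct
    # inputs under an ideal uniform hash with the given output width.
    space = 2.0 ** bits
    return space * (1.0 - math.exp(-n / space))

# At a billion distinct inputs, a 32-bit hash already collapses about
# 11% of them (biasing the estimate low by roughly that much), while
# a 64-bit hash loses a negligible fraction at the same scale.
ratio32 = expected_distinct_hashes(1_000_000_000, bits=32) / 1e9
ratio64 = expected_distinct_hashes(1_000_000_000, bits=64) / 1e9
```

So for the default 5% relative error target, 32 bits is fine well into the hundreds of millions, but past a billion the hash itself becomes the dominant error source.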

@hvanhovell
Contributor Author

Another thought on hashing: the ClearSpring hash is a generic hash function. We could use very specialized (and hopefully fast) hashing functions, since we know the type of our input.
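For a 64-bit integer column, one candidate in this spirit (an illustration, not what the patch does) is the MurmurHash3 64-bit finalizer, which mixes a fixed-width long directly, with no string conversion or byte-array loop:

```python
MASK64 = (1 << 64) - 1

def fmix64(x):
    # MurmurHash3's 64-bit finalizer: xor-shift and multiply rounds
    # that fully mix a 64-bit input; the constants are from Murmur3.
    x &= MASK64
    x ^= x >> 33
    x = (x * 0xFF51AFD7ED558CCD) & MASK64
    x ^= x >> 33
    x = (x * 0xC4CEB9FE1A85EC53) & MASK64
    x ^= x >> 33
    return x
```

Near-identical longs land far apart after mixing, which is exactly what the HLL register indexing needs.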

@rxin
Contributor

rxin commented Oct 15, 2015

We can create a hash expression and codegen that, and then just use hyperloglog(hash(field)).

