[SPARK-9741][SQL] Approximate Count Distinct using the new UDAF interface. #8362
Conversation
|
ok to test |
|
This made my day. The approach is super cool. Couple suggestions:
Beyond this, would be cool to have count-min sketch too! (future work) In the past I had created a ticket to track streaming algorithms: https://issues.apache.org/jira/browse/SPARK-6760 |
|
Test build #41377 has finished for PR 8362 at commit
|
|
Thanks. I was aiming for compatibility with the existing approxCountDistinct, but we can also implement HLL++. HLL++ introduces three (orthogonal) refinements: 64-bit hashing, better low-cardinality corrections, and a sparse encoding scheme. The first two refinements are easy to add. The third will require a bit more effort. Unit testing this is a bit of a challenge. End-to-end (blackbox) testing is no problem, as long as we know what the result should be, or if we do random testing (results should be within 5% of the actual value). Testing parts of the algorithm is a bit of a PITA.
Both the ClearSpring and AggregateKnowledge implementations resort to blackbox testing. I will create some blackbox tests. |
|
Thanks - I think blackbox testing is fine. But it would be great to apply that at the "unit" testing level, i.e. running directly against the aggregate function, rather than against Spark SQL end to end. |
|
Implemented initial non-sparse HLL++. I am going to take a look at the sparse version next week. The results are still equal to the ClearSpring HLL+ implementation in non-sparse mode. I also need to clean up the docs for the main HLL++ class a bit. |
|
Test build #41719 has finished for PR 8362 at commit
|
|
@hvanhovell do you mind closing this pull request, and re-open when you feel it is ready for review again? |
|
@rxin the dense version of HLL++ is ready. We could also add this, and add the sparse logic in a follow-up PR. Let me know what you think. I'll close if you'd rather do everything in one go. |
|
Ah ok. will add this to our sprint backlog and get somebody to review it soon. |
|
One quick note: https://github.com/twitter/algebird/pull/491/files — anything we can learn from the above PR? |
|
@hvanhovell as discussed on the dev mailing list, perhaps it would be interesting to allow the return type to include the aggregated HLL registers. This could be (for example) in the form of … Is it possible to specify input arguments for …? |
|
@hvanhovell @rxin is it intended that this replace the existing …? |
|
@MLnick This one will replace the existing implementation. For now, we will do conversion as shown at https://github.com/apache/spark/pull/8362/files#diff-78b9b210b8cee72e7097bc1af44bd315L98. Later, we will remove the old implementation (…). |
|
@MLnick I am in the process of moving house, so I am a bit slow/late with my response :(... I think it is very useful to be able to return the HLL registers to the users (it could also be nice to use in cost-based planning). I would rather give it a different name, though. The UDAF should support an rsd parameter, doesn't it? I'll add a test. |
Should we round it up?
|
I took a pass over this; it looks pretty good to me overall. Currently, each grouping key needs 200 bytes (b=8 by default), so the sparse version could help reduce memory usage when the average number of distinct values is small (which I believe is a common case). Since we already support external aggregation (sort-based), this is not critical and could be an optional improvement (separate PR). I ran a small benchmark for this patch, and I'm surprised it's slower than 1.5 (using the old aggregation). The test code: It took 3.4 seconds in 1.5, but 6.4 seconds with this patch. The plan in 1.5: The plan with this patch: Discussed this with @yhuai; the slowness may come from the new aggregation, which only supports AlgebraicAggregate in hash mode. We will fix that in 1.6. |
|
@yhuai can we make the non-codegen path use tungsten aggregate as well? Otherwise we would need to maintain two entirely separate code paths. |
|
@hvanhovell @rxin Just realized that the tungsten aggregation does not support var-length types in the aggregation buffer, so we can't have the sparse version without aggregation changes. |
new line here.
|
We can work on improving the aggregate operator. |
|
Test build #43132 has finished for PR 8362 at commit
|
|
LGTM, merging this into master, thanks! |
@hvanhovell Does HLL++ require using hash64? I took a look at its implementation. Looks like we will convert the input value to a Java string in many cases (https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/hash/MurmurHash.java#L135-L159). For our old function, the offer method of the HyperLogLog class uses hash internally, which has some specializations.
|
@yhuai It doesn't. A 64-bit hashcode is recommended though, especially when you want to approximate a billion or more unique values. I have used the ClearSpring hashcode because this enabled me to compare the results of my HLL++ implementation to theirs. We could replace it with another, better-performing one; don't we have one in Spark? We could also scale down to 32 bits... |
|
A good article on HLL++ and the hashcode: http://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll |
|
Thanks for the pointer. Looks like we only have a 32-bit Murmur3 hasher in Spark's unsafe module (https://github.com/apache/spark/blob/master/unsafe/src/main/java/org/apache/spark/unsafe/hash/Murmur3_x86_32.java). |
|
Another thought on hashing. The ClearSpring hash is a generic hash function. We could use very specialized (hopefully fast) hashing functions, because we know the type of our input. |
|
We can create a hash expression and codegen that, and then just use hyperloglog(hash(field)). |
This PR implements a HyperLogLog-based Approximate Count Distinct function using the new UDAF interface.
The implementation is inspired by the ClearSpring HyperLogLog implementation and should produce the same results.
There is still some documentation and testing left to do.
cc @yhuai