Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
package com.twitter.algebird.benchmark

import com.twitter.algebird.{ HyperLogLogMonoid, HLL }
import com.twitter.algebird.{ HyperLogLogMonoid, HLL, SparseHLL, DenseHLL }
import com.twitter.bijection._
import java.nio.ByteBuffer
import java.util.concurrent.TimeUnit
Expand Down Expand Up @@ -38,10 +38,18 @@ object HLLPresentBenchmark {
class HLLPresentBenchmark {
import HLLPresentBenchmark._

//don't cache the lazy values
def clone(hll: HLL): HLL = {
hll match {
case SparseHLL(bits, maxRhow) => SparseHLL(bits, maxRhow)
case DenseHLL(bits, v) => DenseHLL(bits, v)
}
}

@Benchmark
def timeBatchCreate(state: HLLPresentState, bh: Blackhole) = {
state.data.foreach { hll =>
bh.consume(hll.approximateSize)
bh.consume(clone(hll).approximateSize)
}
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,8 @@ object HyperLogLog {
/* Size of the hash in bits */
val hashSize = 128

val powersOfNegativeTwo: Array[Double] = 0.until(hashSize).map{ i => math.pow(2.0, -i) }.toArray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to instead of until?

I think it's the rhoW that's taken here? On inspection the code seems a little not right (I could be missing something), but it seems like this is the count of the number of leading 0s, so it should include hashSize for the case that the hash is all 0s.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're always setting aside at least 4 bits for the bucket, so I think the max is actually 124 - this should be slightly more than needed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry looks like we raced on actions there. Can you have hashSize itself rather than hashSize -1? the tests are green, so it seems unlikely this is wrong per say right? or do we have a test missing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. For the record the parts that look not right are in rhoW:

  • terminal condition is that contains returns false, but contains doesn't do any bounds checking; is it possible to overrun the array, however unlikely?
  • loop starts at 1 instead of 0. with even a small aggregation I imagine this doesn't matter, but isn't it technically incorrect?

Again I could be missing something, but if you think they might be worth addressing maybe lets file issues.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be negativePowersOfTwo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm I'm confused as to what's being asked here.

@ianoc I'm quite certain we don't need more than hashSize-1
@jnievelt IIRC the definition of rhoW is "index of first non-zero", not "number of zeros" so that might explain starting at 1?
@kamalmarhubi argh, good point


def hash(input: Array[Byte]): Array[Byte] = {
val (l0, l1) = Hash128.arrayByteHash.hash(input)
pairLongs2Bytes(l0, l1)
Expand Down Expand Up @@ -358,7 +360,7 @@ case class SparseHLL(bits: Int, maxRhow: Map[Int, Max[Byte]]) extends HLL {

lazy val zeroCnt = size - maxRhow.size

lazy val z = 1.0 / (zeroCnt.toDouble + maxRhow.values.map { mj => HyperLogLog.twopow(-mj.get) }.sum)
lazy val z = 1.0 / (zeroCnt.toDouble + maxRhow.values.map { mj => HyperLogLog.powersOfNegativeTwo(mj.get) }.sum)

def +(other: HLL) = {

Expand Down Expand Up @@ -447,7 +449,7 @@ case class DenseHLL(bits: Int, v: Bytes) extends HLL {
count += 1
res += 1.0
} else {
res += java.lang.Math.pow(2.0, -mj)
res += HyperLogLog.powersOfNegativeTwo(mj)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be negativePowersOfTwo, not powersOfNegativeTwo? the current name implies we are doing (-2)^(mj) which is not the same as (2)^(-mj)

}
idx += 1
}
Expand Down