The research paper claims the error rate to be 1.04/sqrt(m)
For log2m = 11 and w = 5 the theoretical error rate should be ~2.3% but in our tests we have seen a variance of upto 3.5% for inserting 100K unique alphanumeric UUIDs. Below is our pseudo code :
HLL hll = new HLL( 11, 5 );
HashFunction hashFunction = Hashing.murmur3_128();
for (String str : UUIDs) {
Hasher hasher = hashFunction.newHasher();
hasher.putString(str, StandardCharsets.UTF_8);
hll.addRaw(hasher.hash().asLong());
}
How can we reduce this additional 1.2% variance and fall in the theoretically claimed error rate ?