Higher log2m value not always producing more accurate estimates #15

@hossman

My understanding of the HLL algorithm (which may be flawed, in which case please correct me and close this issue) is that, for any fixed set of input values, the accuracy of the estimate from an HLL built from those values should increase as the "m" value used in the HLL increases.

I.e.:

If you build two HLL instances with different log2m settings and add the exact same set of (raw) values to both, then the HLL with the larger log2m will give you more accurate results than the HLL with the smaller log2m setting.

In my testing, however, I'm frequently encountering situations where "smaller" HLL instances produce more accurate cardinality estimates -- which I can't explain.

I've created a reproducible test case that demonstrates the problem, which I will post as a separate comment.
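
To make the comparison above concrete, here is a minimal sketch of the experiment in Python. It is not the test case mentioned above and does not use this library's API: `hll_estimate` is a hypothetical toy HyperLogLog (raw estimator plus the usual small-range linear-counting correction), and the SHA-1 based 64-bit hash is an assumption chosen only to keep the example self-contained and deterministic.

```python
import hashlib
import math


def hll_estimate(values, log2m):
    """Toy HyperLogLog cardinality estimate (illustration only, log2m >= 7)."""
    m = 1 << log2m
    bits = 64 - log2m                 # hash bits left after picking a register
    registers = [0] * m
    for v in values:
        # Assumption: the first 8 bytes of SHA-1 serve as a 64-bit hash.
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h & (m - 1)             # low log2m bits select the register
        w = h >> log2m                # remaining bits feed the rank
        # rank = 1-based position of the lowest set bit of w
        rank = (w & -w).bit_length() if w else bits + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)  # small-range (linear counting) correction
    return raw


if __name__ == "__main__":
    true_count = 100_000
    values = range(true_count)        # the same fixed input for every log2m
    for log2m in range(7, 16):
        est = hll_estimate(values, log2m)
        rel_err = abs(est - true_count) / true_count
        print(f"log2m={log2m:2d}  estimate={est:12.1f}  rel_err={rel_err:.4%}")
```

Running the script prints the relative error for each log2m on one fixed input set, so it is easy to see whether the error shrinks monotonically as log2m grows or whether a smaller setting occasionally comes out ahead for that particular input.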
