Skip to content

Questionable hash seed reuse between RepartitionExec and HashJoinExec #15620

@ctsk

Description

@ctsk

Describe the bug

HashJoinExec and RepartitionExec both instantiate a ahash::RandomState with the same seed:

Partitioning::Hash(exprs, num_partitions) => BatchPartitionerState::Hash {
exprs,
num_partitions,
// Use fixed random hash
random_state: ahash::RandomState::with_seeds(0, 0, 0, 0),
hash_buffer: vec![],
},

let random_state = RandomState::with_seeds(0, 0, 0, 0);

This means that both operators compute the same hash functions. This is problematic when a HashJoinExec has a RepartitionExec as a child: Because the RepartitionExec partitions based on the lowest k bits of the hash, each HashJoinStream finds that all the hashes it computes have the same k lowest bits! In theory this could make the HashTable work less efficiently / have more collisions than expected.

Why in theory? Because despite my best attempts, I could not construct a benchmark that showed that changing the hash seed for HJ made a performance difference. I suspect this is due to a combination of the fact that the underlying hashtable uses open addressing, hashbrown storing a bitmask based on the highest bits in the buckets to do some early filtering and other bottlenecks in the HJ.

Is there anything I missed?

To Reproduce

Use this branch: https://github.com/ctsk/datafusion/tree/experiment/collision-reproducer

I patched the HashJoinExec so that it emits a warning when all row hashes of a batch share common least significant bits.

Run RUST_LOG=warn ./bench.sh run tpch and see lots of warning emitted.

Expected behavior

No response

Additional context

Fix is simple: Change the hash seed in HashJoinExec - or don't provide any seed (use Default::default() like we do in AggregationExec.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions