
Conversation

rfourquet (Member) commented Oct 27, 2019

Cf. https://discourse.julialang.org/t/iterate-and-intersect-on-bitset-vs-int/30345/5 for motivation.

A shorter (stupid) benchmark:

using BenchmarkTools  # for @btime below

b1 = BitSet(1:1000)
b2 = BitSet(rand(1:1000, 500))
b3 = BitSet(rand(1:1000, 100))

function findval(set, x)  # linear search via iteration; x = 1001 is absent, forcing a full pass
    for v in set
        if v == x
            return true
        end
    end
    return false
end

Master:

julia> for b = (b1, b2, b3)
           @btime findval($b, 1001)
       end
  6.939 μs (0 allocations: 0 bytes)
  3.025 μs (0 allocations: 0 bytes)
  660.325 ns (0 allocations: 0 bytes)

PR:

julia> for b = (b1, b2, b3)
           @btime findval($b, 1001)
       end
  2.602 μs (0 allocations: 0 bytes)
  1.140 μs (0 allocations: 0 bytes)
  225.121 ns (0 allocations: 0 bytes)

EDIT: PR with _blsr:

julia> for b = (b1, b2, b3)
           @btime findval($b, 1001)
       end
  1.071 μs (0 allocations: 0 bytes)
  499.384 ns (0 allocations: 0 bytes)
  110.638 ns (0 allocations: 0 bytes)

This is a speedup of about 2.6x for b1 and b2, and almost 3x for b3.
With the _blsr version, the speedup over master grows to about 6x.

rfourquet added the collections (Data structures holding multiple items, e.g. sets) and performance (Must go faster) labels Oct 27, 2019
chethega (Contributor) commented Oct 27, 2019

Consider using the same algorithm as logical indexing, for a further ~40% speedup:

julia> @eval Base function iterate(s::BitSet, (word, idx) = (CHK0, 0))
           while word == 0                       # skip empty 64-bit chunks
               idx == length(s.bits) && return nothing
               idx += 1
               word = @inbounds s.bits[idx]
           end
           ret = (idx - 1 + s.offset) << 6 + trailing_zeros(word)
           return ret, (_blsr(word), idx)        # next state doesn't wait on TZCNT
       end

The reason for this is that TZCNT has high latency compared to BLSR, so we want it out of the loop-carried dependency chain. This is better for superscalar execution.
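
For readers unfamiliar with the instruction names, here is a minimal sketch of the trick, assuming Base's _blsr helper behaves like x & (x - 1); the names blsr and setbits below are illustrative, not from the PR:

blsr(x) = x & (x - 1)  # clear the lowest set bit; compiles to BLSR on x86 CPUs with BMI1

function setbits(word::UInt64)
    bits = Int[]
    while word != 0
        push!(bits, trailing_zeros(word))  # TZCNT feeds only the output...
        word = blsr(word)                  # ...the next iteration waits only on BLSR
    end
    return bits
end

setbits(UInt64(0b1011000))  # == [3, 4, 6]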

On my machine:

julia> for b = (b1, b2, b3) # master
           @btime findval($b, 1001)
       end
  9.352 μs (0 allocations: 0 bytes)
  3.639 μs (0 allocations: 0 bytes)
  896.489 ns (0 allocations: 0 bytes)

julia> for b = (b1, b2, b3) # this PR
           @btime findval($b, 1001)
       end
  3.269 μs (0 allocations: 0 bytes)
  1.345 μs (0 allocations: 0 bytes)
  282.189 ns (0 allocations: 0 bytes)

julia> for b = (b1, b2, b3) # the proposed improvement
           @btime findval($b, 1001)
       end
  1.353 μs (0 allocations: 0 bytes)
  623.172 ns (0 allocations: 0 bytes)
  136.925 ns (0 allocations: 0 bytes)

rfourquet (Member, Author) commented Oct 28, 2019

Wow, amazing! Actually, I thought that trailing_zeros(word) == 64 might be more expensive than word == 0, but just moving it out of the loop was not enough to get better performance.

So in your example you still compute b = trailing_zeros(word). I would then naïvely believe that the difference is that _blsr(word) is way faster than xor(word, one(UInt64) << b)... but do I understand correctly from your explanation that the difference actually comes from the fact that in your version, trailing_zeros is used only for the returned value ret, and not for the returned state, which means the CPU can start working on the next iteration even before having computed ret? Sorry for the fuzzy wording, but I'm not familiar with CPUs :D

> Consider using the same algorithm as logical indexing

For those wondering, this is the findall(::BitArray) method, which (I hadn't realized) basically does the same thing as BitSet iteration. And @chethega implemented a fast algorithm with this _blsr trick in #29746.

EDIT: actually, keeping b = trailing_zeros(word) in the while loop as it was originally, but using _blsr to compute the returned state (instead of masking word with a shifted bit) still results in a significant speedup over the original PR version. So in itself (independently of the "dependency chain" considerations), _blsr seems more efficient than masking with 1 << b (for an already-computed b).
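
For illustration, the two ways of computing the next state side by side (a sketch with hypothetical variable names; both clear the lowest set bit of word, but the masking version must wait for the TZCNT result b, while _blsr depends only on word):

word = UInt64(0b1011000)
b = trailing_zeros(word)                  # TZCNT: comparatively high latency
state_mask = xor(word, one(UInt64) << b)  # depends on b
state_blsr = word & (word - 1)            # _blsr: independent of b
@assert state_mask == state_blsr          # same result whenever word != 0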

mbauman (Member) commented Oct 28, 2019

Looks like we do have some BitSet perf tests in the collection group.

@nanosoldier runbenchmarks("collection", vs = ":master")

nanosoldier (Collaborator) commented:

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

vchuravy (Member) commented:

Would be good to add a benchmark for this.

mbauman (Member) commented Oct 28, 2019

We just need to actually iterate over the entire container to see any advantage. The current tests just do it once or twice.

https://github.com/JuliaCI/BaseBenchmarks.jl/blob/ce11d8badcb0e9f6d8b0518ffd57a472128af930/src/collection/CollectionBenchmarks.jl#L111-L114
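
A sketch of what such a benchmark could look like (hypothetical; an actual addition would go through BaseBenchmarks.jl), forcing a full pass over the set:

using BenchmarkTools

s = BitSet(1:1000)
@btime sum($s)                # full iteration plus a cheap reduction
@btime foreach(identity, $s)  # full iteration, discarding the values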

mbauman added the potential benchmark (Could make a good benchmark in BaseBenchmarks) label Oct 28, 2019
@JeffBezanson JeffBezanson merged commit 5e23b84 into master Oct 28, 2019
@JeffBezanson JeffBezanson deleted the rf/bitset-fastiter branch October 28, 2019 19:20