
Conversation

rfourquet (Member) commented Oct 27, 2019

Cf. https://discourse.julialang.org/t/iterate-and-intersect-on-bitset-vs-int/30345/5 for motivation.

A shorter (stupid) benchmark:

using BenchmarkTools  # for @btime below

b1 = BitSet(1:1000)
b2 = BitSet(rand(1:1000, 500))
b3 = BitSet(rand(1:1000, 100))

function findval(set, x)  # linear search via iteration; x = 1001 is absent, forcing a full pass
    for v in set
        if v == x
            return true
        end
    end
    return false
end

Master:

julia> for b = (b1, b2, b3)
           @btime findval($b, 1001)
       end
  6.939 μs (0 allocations: 0 bytes)
  3.025 μs (0 allocations: 0 bytes)
  660.325 ns (0 allocations: 0 bytes)

PR:

julia> for b = (b1, b2, b3)
           @btime findval($b, 1001)
       end
  2.602 μs (0 allocations: 0 bytes)
  1.140 μs (0 allocations: 0 bytes)
  225.121 ns (0 allocations: 0 bytes)

EDIT: PR with _blsr:

julia> for b = (b1, b2, b3)
           @btime findval($b, 1001)
       end
  1.071 μs (0 allocations: 0 bytes)
  499.384 ns (0 allocations: 0 bytes)
  110.638 ns (0 allocations: 0 bytes)

This is a speedup of about 2.6x for b1 and b2, and almost 3x for b3.
With the _blsr version, the speedup over master grows to about 6x.

rfourquet added the collections (Data structures holding multiple items, e.g. sets) and performance (Must go faster) labels Oct 27, 2019
chethega (Contributor) commented Oct 27, 2019

Consider using the same algorithm as logical indexing, for a further ~40% speedup:

julia> @eval Base function iterate(s::BitSet, (word, idx) = (CHK0, 0))
           while word == 0                       # skip empty 64-bit chunks
               idx == length(s.bits) && return nothing
               idx += 1
               word = @inbounds s.bits[idx]
           end
           ret = (idx - 1 + s.offset) << 6 + trailing_zeros(word)
           return ret, (_blsr(word), idx)        # next state doesn't wait on TZCNT
       end

The reason for this is that TZCNT has high latency compared to BLSR, so we want it out of the loop-carried dependency chain. This is better for superscalar execution.
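
For readers unfamiliar with the instruction names, here is a minimal sketch of the trick, assuming Base's _blsr helper behaves like x & (x - 1); the names blsr and setbits below are illustrative, not from the PR:

blsr(x) = x & (x - 1)  # clear the lowest set bit; compiles to BLSR on x86 CPUs with BMI1

function setbits(word::UInt64)
    bits = Int[]
    while word != 0
        push!(bits, trailing_zeros(word))  # TZCNT feeds only the output...
        word = blsr(word)                  # ...the next iteration waits only on BLSR
    end
    return bits
end

setbits(UInt64(0b1011000))  # == [3, 4, 6]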

On my machine:

julia> for b = (b1, b2, b3) # master
           @btime findval($b, 1001)
       end
  9.352 μs (0 allocations: 0 bytes)
  3.639 μs (0 allocations: 0 bytes)
  896.489 ns (0 allocations: 0 bytes)

julia> for b = (b1, b2, b3) # this PR
           @btime findval($b, 1001)
       end
  3.269 μs (0 allocations: 0 bytes)
  1.345 μs (0 allocations: 0 bytes)
  282.189 ns (0 allocations: 0 bytes)

julia> for b = (b1, b2, b3) # the proposed improvement
           @btime findval($b, 1001)
       end
  1.353 μs (0 allocations: 0 bytes)
  623.172 ns (0 allocations: 0 bytes)
  136.925 ns (0 allocations: 0 bytes)

rfourquet (Member, Author) commented Oct 28, 2019

Wow, amazing! Actually, I thought that trailing_zeros(word) == 64 might be more expensive than word == 0, but just moving it out of the loop was not enough to get better performance.

So in your example you still compute b = trailing_zeros(word). I would then naïvely believe that the difference is that _blsr(word) is way faster than xor(word, one(UInt64) << b)... but do I understand correctly from your explanation that the difference actually comes from the fact that in your version, trailing_zeros is used only for the returned value ret, and not for the returned state, which means the CPU can start working on the next iteration even before having computed ret? Sorry for the fuzzy wording, but I'm not familiar with CPUs :D

> Consider using the same algorithm as logical indexing

For those wondering, this is the findall(::BitArray) method, which (I hadn't realized) basically does the same thing as BitSet iteration. And @chethega implemented a fast algorithm with this _blsr trick in #29746.

EDIT: actually, keeping b = trailing_zeros(word) in the while loop as it was originally, but using _blsr to compute the returned state (instead of masking word with a shifted bit) still results in a significant speedup over the original PR version. So in itself (independently of the "dependency chain" considerations), _blsr seems more efficient than masking with 1 << b (for an already-computed b).
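
For illustration, the two ways of computing the next state side by side (a sketch with hypothetical variable names; both clear the lowest set bit of word, but the masking version must wait for the TZCNT result b, while _blsr depends only on word):

word = UInt64(0b1011000)
b = trailing_zeros(word)                  # TZCNT: comparatively high latency
state_mask = xor(word, one(UInt64) << b)  # depends on b
state_blsr = word & (word - 1)            # _blsr: independent of b
@assert state_mask == state_blsr          # same result whenever word != 0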

mbauman (Member) commented Oct 28, 2019

Looks like we do have some BitSet perf tests in the collection group.

@nanosoldier runbenchmarks("collection", vs = ":master")

nanosoldier (Collaborator) commented:

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

vchuravy (Member) commented:

Would be good to add a benchmark for this.

mbauman (Member) commented Oct 28, 2019

We just need to actually iterate over the entire container to see any advantage. The current tests just do it once or twice.

https://github.com/JuliaCI/BaseBenchmarks.jl/blob/ce11d8badcb0e9f6d8b0518ffd57a472128af930/src/collection/CollectionBenchmarks.jl#L111-L114
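
A sketch of what such a benchmark could look like (hypothetical; an actual addition would go through BaseBenchmarks.jl), forcing a full pass over the set:

using BenchmarkTools

s = BitSet(1:1000)
@btime sum($s)                # full iteration plus a cheap reduction
@btime foreach(identity, $s)  # full iteration, discarding the values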

mbauman added the potential benchmark (Could make a good benchmark in BaseBenchmarks) label Oct 28, 2019
@JeffBezanson JeffBezanson merged commit 5e23b84 into master Oct 28, 2019
@JeffBezanson JeffBezanson deleted the rf/bitset-fastiter branch October 28, 2019 19:20