Conversation

@tkf tkf commented Mar 7, 2020

This PR implements a performance optimization for foldl on CartesianIndices and product by executing them as nested loops rather than by invoking their custom iterate function in a single loop. Building on this optimization, we can easily add performance optimizations for other functions such as foldl(_, zip(arrays...)) and foreach(_, arrays...). For more context, see #9080 (comment), #9080 (comment), and #15648 (comment).

Since we already have automatic iterator-to-transducer conversion (#33526), iterator comprehensions wrapping product, i.e., anything of the form

(f(x, y, z) for x in xs, y in ys, z in zs if p(x, y, z))

can automatically get a performance boost.

I think this PR also addresses issue #9080.
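
To illustrate the idea, here is a minimal, hand-written two-argument sketch of the nested-loop strategy (the helper name foldl_product2 is hypothetical; it is not the code in this PR, which handles arbitrary numbers of iterators):

using Base.Iterators: product

# Fold over product(xs, ys) as two nested loops instead of driving the
# product's `iterate` in a single loop; the trivial loop state is what
# lets LLVM vectorize the inner loop.
function foldl_product2(op, init, xs, ys)
    acc = init
    for y in ys, x in xs            # first iterator of `product` varies fastest
        acc = op(acc, (x, y))
    end
    return acc
end

# Equivalent to foldl(op, product(xs, ys); init = 0.0):
foldl_product2((acc, xy) -> acc + xy[1] * xy[2], 0.0, 1:3, 1:4)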

Benchmarks (issue #9080)

using BenchmarkTools

function sumcart_manual(A::AbstractMatrix)
    s = 0.0
    @inbounds for j = 1:size(A,2), i = 1:size(A,1)
        s += A[i,j]
    end
    s
end

function sumcart_iter(A)
    s = 0.0
    @inbounds for I in CartesianIndices(size(A))
        s += A[I]
    end
    s
end

function sumcart_foldl(A)
    foldl(CartesianIndices(size(A)); init=0.0) do s, I
        @inbounds s + A[I]
    end
end

A = rand(10^4, 10^4);
@btime sumcart_manual($A);  # 126.509 ms (0 allocations: 0 bytes)
@btime sumcart_iter($A);    # 145.124 ms (0 allocations: 0 bytes)
@btime sumcart_foldl($A);   # 125.753 ms (0 allocations: 0 bytes)

Some more benchmarks

using BenchmarkTools
suite = BenchmarkGroup()
suite["sum(x == y for x in xs, y in ys)"] = @benchmarkable sum(
    x == y for x in (1, 2, 3, 4, 5, 6, 7, 8), y in $(rand(1:50, 10^3))
)
suite["sum(x * y for (x, y) in zip(A, transpose(B)))"] = @benchmarkable sum(
    x * y for (x, y) in zip($(rand(100, 100)), transpose($(rand(100, 100))))
)
suite["copyto!(A, transpose(B))"] = @benchmarkable begin
    A = $(zeros(100, 100))
    B = transpose($(rand(100, 100)))
    foreach(eachindex(A, B)) do I
        @inbounds A[I] = B[I]
    end
end
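
The numbers below can be reproduced with the usual BenchmarkTools workflow (the exact invocation is not shown in this PR, so take this as an assumption):

tune!(suite)          # pick per-benchmark evaluation and sample counts
results = run(suite)  # each entry displays as e.g. Trial(19.145 μs)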

Before (1.5.0-DEV.416):

  "sum(x * y for (x, y) in zip(A, transpose(B)))" => Trial(19.145 μs)
  "sum(x == y for x in xs, y in ys)" => Trial(5.190 μs)
  "copyto!(A, transpose(B))" => Trial(15.972 μs)

After:

  "sum(x * y for (x, y) in zip(A, transpose(B)))" => Trial(10.593 μs)
  "sum(x == y for x in xs, y in ys)" => Trial(616.000 ns)
  "copyto!(A, transpose(B))" => Trial(5.733 μs)

timholy commented Mar 7, 2020

Very nice addition! I also agree with your choice of language: it "addresses" #9080 but is not a fix for it. We want to be able to support both paradigms efficiently. My guess is that the same transformation, unwrapping the loops, needs to be applied to the iterator version, but as you say that is a much harder transformation to do automatically and might require some compiler magic. So it's great to have this.

first(iter::CartesianIndices) = CartesianIndex(map(first, iter.indices))
last(iter::CartesianIndices) = CartesianIndex(map(last, iter.indices))

# Use nested for-loop in `foldl` as it is much faster than `iterate`:
Member

... as it preserves LLVM's ability to vectorize.

E.g., don't just state that it is faster, but explain why.

Member Author

Is 37e5971 enough?
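
For context, a hedged two-dimensional sketch of what "use a nested for-loop in foldl" means (hypothetical helper; the PR's actual implementation is N-dimensional):

# Fold over a 2-D CartesianIndices with an explicit loop nest instead of its
# generic `iterate`; the simple induction variables preserve LLVM's ability
# to vectorize the inner loop.
function foldl_cartesian2(op, init, iter::CartesianIndices{2})
    ax1, ax2 = iter.indices
    acc = init
    for j in ax2, i in ax1              # column-major: first index varies fastest
        acc = op(acc, CartesianIndex(i, j))
    end
    return acc
end

# Usage, mirroring sumcart_foldl above:
# foldl_cartesian2((s, I) -> s + A[I], 0.0, CartesianIndices(A))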

base/reduce.jl Outdated
struct _InitialValue end
@inline function _foldl_impl(op::OP, init, array::AbstractArray) where {OP}
    if IndexStyle(array) isa IndexLinear
        return invoke(_foldl_impl, Tuple{Any,Any,Any}, op, init, array)
Member

That we have to use invoke here is a code smell for me. Why can't this be done with dispatch on the IndexStyle?

Member Author

I simply renamed the default implementation in a56c371. I could wrap it in a new function that takes an IndexStyle argument, but I think this is simpler. What do you think?

The reason I don't think IndexStyle is a good approach here is that this trait is insufficient for supporting more complex collections such as sparse arrays.
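
For reference, a rough sketch of the dispatch-on-IndexStyle alternative being discussed (hypothetical helper names; this is not the code in this PR):

# Route through the trait once, then dispatch on it instead of using `invoke`.
_foldl_impl_sketch(op, init, A::AbstractArray) =
    _foldl_array(IndexStyle(A), op, init, A)

# Linear indexing: a plain loop over eachindex.
function _foldl_array(::IndexLinear, op, init, A)
    acc = init
    for i in eachindex(A)
        acc = op(acc, A[i])
    end
    return acc
end

# Other index styles (e.g. IndexCartesian) would get their own methods; the
# point above is that this single trait still cannot express folds over
# collections such as sparse arrays.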

@vchuravy
Member

Bump. I think this shouldn't need to wait until we figure out the generic compiler work.


vtjnash commented Nov 13, 2021

Is this still needed? I see the linked issues are closed now, not a few months after Tim despaired of ever seeing #9080 solved.
