@ndinsmore
Contributor

This introduces a TypecastStyle interface to the ReinterpretArray methods to make it simpler to add some optimizations that were missing, particularly typecasting to a larger element type in a contiguous array.

The main changes are to the _getindex_ra & _setindex_ra methods, which are now dispatched based on TypecastStyle. This should make further optimization easier in the future.
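To illustrate the idea, here is a minimal, hypothetical sketch of trait-based dispatch in the spirit described above. The subtype names (WidensElements, etc.) and the load_element stand-in for _getindex_ra are illustrative assumptions, not the actual names or implementation in this PR:

```julia
# Illustrative sketch only: names and structure are assumptions, not the PR's code.
abstract type TypecastStyle end
struct WidensElements  <: TypecastStyle end  # e.g. reading UInt64 from Vector{UInt8}
struct NarrowsElements <: TypecastStyle end  # e.g. reading UInt8 from Vector{UInt64}
struct SameSize        <: TypecastStyle end

# Pick a style from the element types; a real implementation would derive this
# from the ReinterpretArray's type parameters.
typecast_style(::Type{T}, ::Type{S}) where {T,S} =
    sizeof(T) > sizeof(S) ? WidensElements() :
    sizeof(T) < sizeof(S) ? NarrowsElements() : SameSize()

# The _getindex_ra-like entry point dispatches on the style, so each case can
# be optimized independently.
load_element(::Type{T}, v::Vector{S}, i::Integer) where {T,S} =
    load_element(typecast_style(T, S), T, v, i)

function load_element(::WidensElements, ::Type{T}, v::Vector{S}, i::Integer) where {T,S}
    ratio = sizeof(T) ÷ sizeof(S)
    @boundscheck checkbounds(v, i * ratio)     # last source element touched
    GC.@preserve v unsafe_load(Ptr{T}(pointer(v)), i)
end

function load_element(::NarrowsElements, ::Type{T}, v::Vector{S}, i::Integer) where {T,S}
    ratio = sizeof(S) ÷ sizeof(T)
    @boundscheck checkbounds(v, cld(i, ratio)) # source element containing byte i
    GC.@preserve v unsafe_load(Ptr{T}(pointer(v)), i)
end

function load_element(::SameSize, ::Type{T}, v::Vector{S}, i::Integer) where {T,S}
    @boundscheck checkbounds(v, i)
    reinterpret(T, v[i])                       # same-size bits-type reinterpret
end
```

The point of the pattern is that a new fast path for one style can be added as a single extra method without touching the other cases.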

This was somewhat inspired by issue #31305. It provides a speedup of nearly 10x when bounds checking is active with an accessor-style function.

The following benchmark code was used:

using BenchmarkTools
v=zeros(UInt8,2048)
v[1:8:2048] .= UInt8(0):UInt8(255)



Base.@propagate_inbounds function unsafe(::Type{T}, v::Vector{S}, i::Integer) where {T,S}
    p = Base.unsafe_convert(Ptr{T}, pointer(v))
    size_ratio = div(sizeof(T),sizeof(S));
    @boundscheck checkbounds(v, i*size_ratio)
    @boundscheck checkbounds(v, i+sizeof(T)-1)
    unsafe_load(p, ((i-1)*size_ratio)+1)
end

Base.@propagate_inbounds function safe(::Type{T}, v::Vector, i::Integer) where {T}
    @inline reinterpret(T, v)[i]
end

function test_accessor(fun,v)
    ret = Int(0)
    for i = 1:(length(v)÷8)
        @inline ret += fun(Int,v,i)
    end
    return ret
end

function test_sum(::Type{T}, v) where {T}
    s = sum(reinterpret(T, v))
    return s
end

function test_dotadd(::Type{T}, v) where {T}
    reinterpret(T, v) .+= 1
    return v
end

The results on master:

julia> @btime safe(Int,$v,1);
  5.355 ns (0 allocations: 0 bytes)

julia> @btime unsafe(Int,$v,1);
  4.573 ns (0 allocations: 0 bytes)

julia> @btime test_accessor($safe,$v);
  2.233 μs (0 allocations: 0 bytes)

julia> @btime test_accessor($unsafe,$v);
  91.353 ns (0 allocations: 0 bytes)

julia> @btime test_sum($UInt64, $v);
  152.654 ns (1 allocation: 16 bytes)

julia> @btime test_dotadd($UInt64, $v);
  149.262 ns (0 allocations: 0 bytes)

vs. this branch:

julia> @btime safe(Int,$v,1);
  5.989 ns (0 allocations: 0 bytes)

julia> @btime unsafe(Int,$v,1);
  5.614 ns (0 allocations: 0 bytes)

julia> @btime test_accessor($safe,$v);
  175.539 ns (0 allocations: 0 bytes)

julia> @btime test_accessor($unsafe,$v);
  102.789 ns (0 allocations: 0 bytes)

julia> @btime test_sum($UInt64, $v);
  153.058 ns (1 allocation: 16 bytes)

julia> @btime test_dotadd($UInt64, $v);
  151.274 ns (0 allocations: 0 bytes)

@N5N3
Member

N5N3 commented Jan 6, 2023

I didn't look into the implementation carefully, but the StridedNonReshapeReinterpretArray part looks similar to #44186.
So perhaps you also need to make sure this PR won't break our GPU ecosystem.

@N5N3 N5N3 added the performance (Must go faster) and arrays labels on Jan 6, 2023
@ndinsmore
Contributor Author

You may be right about that particular concern; is there a way to test that?
I think the key here is that breaking up the _***index_ra methods makes adding optimizations a much more surgical process.

@ndinsmore ndinsmore closed this Feb 22, 2023