Skip to content

Conversation

@timholy
Copy link
Member

@timholy timholy commented Oct 3, 2013

Since many of the algorithms in base working on AbstractArrays use linear indexing, it seemed reasonable to write hand-tuned linear-indexing for the most commonly-used dimensions. This is still much slower than cartesian indexing, but a big improvement over what we have now. In 5 dimensions the improvement is something like 40-fold, as shown below.

Test script:

function linearfill!(A::AbstractArray)
    for i = 1:length(A)
        A[i] = i
    end
    A
end

A2 = zeros(1000,1000)
B2 = sub(A2, :, :)
linearfill!(A2)
print("Array 2d: ")
@time linearfill!(A2)
Ac = copy(A2)
linearfill!(B2)
print("SubArray 2d: ")
@time linearfill!(B2)
@assert A2 == Ac

...

and so forth up through 5 dimensions (always with 10^6 total elements).

Before this patch:

julia> include("/tmp/test_linear.jl")
Array 2d: elapsed time: 0.001823028 seconds (48 bytes allocated)
SubArray 2d: elapsed time: 0.025735556 seconds (64 bytes allocated)
Array 3d: elapsed time: 0.001867426 seconds (48 bytes allocated)
SubArray 3d: elapsed time: 0.372512698 seconds (103983712 bytes allocated)
Array 4d: elapsed time: 0.002026042 seconds (48 bytes allocated)
SubArray 4d: elapsed time: 2.425155532 seconds (521105744 bytes allocated)
Array 5d: elapsed time: 0.001645669 seconds (48 bytes allocated)
SubArray 5d: elapsed time: 3.028932076 seconds (662305792 bytes allocated)

In 5d, it's approximately 2000-fold slower to use a SubArray.

With this patch:

julia> include("/tmp/test_linear.jl")
Array 2d: elapsed time: 0.00195559 seconds (6896 bytes allocated)
SubArray 2d: elapsed time: 0.027324321 seconds (64 bytes allocated)
Array 3d: elapsed time: 0.001553236 seconds (48 bytes allocated)
SubArray 3d: elapsed time: 0.047140215 seconds (64 bytes allocated)
Array 4d: elapsed time: 0.001706639 seconds (48 bytes allocated)
SubArray 4d: elapsed time: 0.064958196 seconds (64 bytes allocated)
Array 5d: elapsed time: 0.001624899 seconds (48 bytes allocated)
SubArray 5d: elapsed time: 0.086291385 seconds (112 bytes allocated)

Now in 5d it's a mere 50-fold slower to use a SubArray. The main remaining issue is that div is approximately ten-fold more expensive than * and /.

@JeffBezanson
Copy link
Member

Is div really 10x slower than /? I tried a quick microbenchmark and it seems to be about 30% slower.

@kmsquire
Copy link
Member

kmsquire commented Oct 4, 2013

Might be worthwhile to add support for C's div function, which is supposedly gives a single instruction for calculating the quotient and remainder on some architectures (such as x86).

See: http://stackoverflow.com/questions/3895081/divide-and-get-remainder-at-the-same-time, http://en.cppreference.com/w/c/numeric/math/div

@timholy
Copy link
Member Author

timholy commented Oct 4, 2013

Hmm, you're right. I should have said that the profiler assigns that kind of blame to div (on my machine), but it must have been that statements were reordered and so it was getting blamed unfairly.

It is a bit of a mystery, then, why linear indexing is still so much slower even with this patch.

@kmsquire, if it meant a call out to libm, in theory it seems like it should be faster to do it in Julia because it can be inlined. But of course that guess could be wrong.

@kmsquire
Copy link
Member

kmsquire commented Oct 5, 2013

Okay, I can see that a call out to libm might be slower.

A better suggestion, then, is to add direct support to julia for divmod, which seems to be supported by llvm, and on x86, at least, is a single instruction (plus setup).

@JeffBezanson
Copy link
Member

Yes, we can add divmod, particularly after Keno's patch that makes tuples faster.

@kmsquire kmsquire mentioned this pull request Oct 5, 2013
@timholy
Copy link
Member Author

timholy commented Oct 8, 2013

Given Jeff's observation about div, I think it's really quite a puzzle why this SubArray linear indexing, even with this patch, is still so dang slow. So I decided to compare against C. A version of the same test, in both Julia and C, is here. I tried to arrange it so that the two were as comparable as possible, working on the same number of elements and performing the same computations.

Timing results, C (compiled with gcc -O2 gtindex.c -o gtindex):

tim@diva:/tmp$ ./gtindex 
Array: 1 seconds 390 milliseconds, sm = 4.99785e+08
SubArray: 1 seconds 390 milliseconds, sm = 4.99785e+08

Timing results, Julia:

julia> include("/tmp/gtindex.jl")
Array 2d: elapsed time: 1.814084096 seconds (83112 bytes allocated)
SubArray 2d: elapsed time: 21.909474699 seconds (99272 bytes allocated)

Not too bad on the Array, but quite horrible on the SubArray.

I'm still not great at reading assembly, so I thought I'd post the two machine codes (for the C version, I'm just assuming I've got the whole thing, but I'm not certain).

C (with gcc -O2 -S -masm=intel gtindex.c):

        .file   "gtindex.c"
        .intel_syntax noprefix
        .text
        .p2align 4,,15
        .globl  gtindex_sub
        .type   gtindex_sub, @function
gtindex_sub:
.LFB34:
        .cfi_startproc
        mov     r10d, DWORD PTR [rsp+8]
        sub     r10d, 1
        mov     edx, r10d
        mov     eax, r10d
        sar     edx, 31
        idiv    esi
        imul    esi, eax
        imul    eax, r9d
        sub     r10d, esi
        imul    r10d, r8d
        add     r10d, eax
        add     ecx, r10d
        movsx   rcx, ecx
        movsd   xmm0, QWORD PTR [rdi+rcx*8]
        ret
        .cfi_endproc

Julia:

julia> code_native(gtindex, (typeof(B), Int))
        .text
Filename: /tmp/gtindex.jl
Source line: 2
        push    RBP
        mov     RBP, RSP
        push    R14
        push    RBX
        mov     R14, RSI
        mov     RBX, RDI
Source line: 2
        movabs  RAX, 139752630807808
        mov     ESI, 1
        call    RAX
        cmp     RAX, -1
Source line: 4
        setne   CL
        movabs  RDX, -9223372036854775807
        cmp     R14, RDX
        setne   DL
Source line: 2
        test    RAX, RAX
        je      40
Source line: 4
        or      CL, DL
        je      32
Source line: 135
        cmp     QWORD PTR [RBX + 8], 0
        je      51
        cmp     QWORD PTR [RBX + 32], 0
        je      40
        mov     EAX, 26768960
Source line: 136
        pop     RBX
        pop     R14
        pop     RBP
        ret
Source line: 4
        movabs  RAX, 139752623343568
        mov     RDI, QWORD PTR [RAX]
        movabs  RAX, 139752609334176
        mov     ESI, 4
        call    RAX
Source line: 135
        movabs  RAX, 139752623343536
        mov     RDI, QWORD PTR [RAX]
        movabs  RAX, 139752609334176
        mov     ESI, 135
        call    RAX

The main thing I noticed are the call statements.

@simonster
Copy link
Member

One of the calls is for size, which can't get inlined because of the ternary operator in abstractarray.jl:8. I think the others are for exceptions and are probably never reached.

@timholy
Copy link
Member Author

timholy commented Oct 9, 2013

Thanks, @simonster; I was just about to dig into this myself. I've updated my PR to avoid the call to size, and it does make a modest difference.

Certainly one gap between the array performance and subarray performance is inlining. In playing with this I discovered that

function gtindx(A::Array, i::Integer)
    A[i]
end

gets inlined but even as simple a change as

function gtindx2(A::Array, i::Integer)
    ret = A[i]
end

does not. Hopefully #3796 will fix this. Based on my tests this may account for about a factor of 5 in performance, but in 5d we still have an additional factor of 8 that does not make a lot of sense.

One of the things that I noticed is that after each div, there is a sequence of about 5 boolean operations (comparison of div result to typemin(Int), comparison of the stride value to both 0 and -1, and some ands/ors of the results) followed by a jump (the je instruction). This jump is in the assembly, not just the LLVM, and in 5d there are 5 of them.

There turns out to be a more than two-fold difference in the execution time of code that calls this function:

function gtindex{T}(s::SubArray{T,5}, ind::Integer)
    @inbounds strd2 = s.dims[1]
    @inbounds strd3 = strd2*s.dims[2]
    @inbounds strd4 = strd3*s.dims[3]
    @inbounds strd5 = strd4*s.dims[4]
    ind -= 1
    ind0 = ind
    i5 = div(ind,strd5)
#    ind -= i5*strd5
    i4 = div(ind,strd4)
    i3 = div(ind,strd3)
    i2 = div(ind,strd2)
    @inbounds ret = s.parent[ind0]
    ret
end

depending on whether that one line is commented out! (Note that in neither case does the result of the function depend on what happens to ind, so presumably this is not a case of LLVM optimizing something away.) This indeed suggests that the div calls themselves are not that awful, but something about the result of the div is hurting performance. Perhaps indeed divmod and friends is the answer, if using it would avoid triggering this problem.

@timholy
Copy link
Member Author

timholy commented Oct 9, 2013

To elaborate on the last point: it seems completely nonsensical that an extra multiply and subtract should more than double the execution time of an algorithm that has 13 other arithmetic operations and is generating a cache miss on every 8th call. The C version demonstrates that the cache misses should be the only thing that matters: in terms of performance, the C subarray version is indistinguishable from the C plain-array version, so all these arithmetic operations must have negligible impact on performance (they are local to the CPU). Yet they dominate the performance for the Julia version. Something is fishy.

JeffBezanson added a commit that referenced this pull request Oct 23, 2013
Faster linear indexing for SubArrays, dims 1-5
@JeffBezanson JeffBezanson merged commit 81c4e1d into master Oct 23, 2013
@timholy timholy deleted the teh/subarray_linear branch November 4, 2013 11:24
timholy referenced this pull request Dec 2, 2013
I'm not even sure why we were generating code to check for a zero
denominator – this seems to still raise an exception as expected.
The other special case just isn't worth it to emit more than one
instruction for div – and div/rem since x86 does both together.
tecosaur added a commit to tecosaur/julia that referenced this pull request Oct 19, 2025
Contains the following commits:
5bfd1161e * STDLIBS_BY_VERSION: Check sorted & require same minor (JuliaLang#4414)
ce986129c * Fix completion on empty command (JuliaLang#4418)
8d74d35d1 * update SHA compat (JuliaLang#4436)
14c5ae327 * add a docstring to Registry module (JuliaLang#4432)
bbb9e6d23 * allow `generate` into an empty directory (JuliaLang#4430)
a1818b9a9 * Drop the REPL search keymap (JuliaLang#4425)
baa7981c7 * do not try to update registries in an unwritable folder (JuliaLang#4429)
01690b54b * make `.path` field consistently be relative manifest and convert to project relative upon writing to a project file (JuliaLang#4427)
3306ed522 * Deduplicate suggestions in package completions (JuliaLang#4431)
tecosaur added a commit to tecosaur/julia that referenced this pull request Oct 20, 2025
Contains the following commits:
5bfd1161e * STDLIBS_BY_VERSION: Check sorted & require same minor (JuliaLang#4414)
ce986129c * Fix completion on empty command (JuliaLang#4418)
8d74d35d1 * update SHA compat (JuliaLang#4436)
14c5ae327 * add a docstring to Registry module (JuliaLang#4432)
bbb9e6d23 * allow `generate` into an empty directory (JuliaLang#4430)
a1818b9a9 * Drop the REPL search keymap (JuliaLang#4425)
baa7981c7 * do not try to update registries in an unwritable folder (JuliaLang#4429)
01690b54b * make `.path` field consistently be relative manifest and convert to project relative upon writing to a project file (JuliaLang#4427)
3306ed522 * Deduplicate suggestions in package completions (JuliaLang#4431)
tecosaur added a commit to tecosaur/julia that referenced this pull request Oct 20, 2025
Contains the following commits:
5bfd1161e * STDLIBS_BY_VERSION: Check sorted & require same minor (JuliaLang#4414)
ce986129c * Fix completion on empty command (JuliaLang#4418)
8d74d35d1 * update SHA compat (JuliaLang#4436)
14c5ae327 * add a docstring to Registry module (JuliaLang#4432)
bbb9e6d23 * allow `generate` into an empty directory (JuliaLang#4430)
a1818b9a9 * Drop the REPL search keymap (JuliaLang#4425)
baa7981c7 * do not try to update registries in an unwritable folder (JuliaLang#4429)
01690b54b * make `.path` field consistently be relative manifest and convert to project relative upon writing to a project file (JuliaLang#4427)
3306ed522 * Deduplicate suggestions in package completions (JuliaLang#4431)
tecosaur added a commit to tecosaur/julia that referenced this pull request Oct 20, 2025
Contains the following commits:
5bfd1161e * STDLIBS_BY_VERSION: Check sorted & require same minor (JuliaLang#4414)
ce986129c * Fix completion on empty command (JuliaLang#4418)
8d74d35d1 * update SHA compat (JuliaLang#4436)
14c5ae327 * add a docstring to Registry module (JuliaLang#4432)
bbb9e6d23 * allow `generate` into an empty directory (JuliaLang#4430)
a1818b9a9 * Drop the REPL search keymap (JuliaLang#4425)
baa7981c7 * do not try to update registries in an unwritable folder (JuliaLang#4429)
01690b54b * make `.path` field consistently be relative manifest and convert to project relative upon writing to a project file (JuliaLang#4427)
3306ed522 * Deduplicate suggestions in package completions (JuliaLang#4431)
tecosaur added a commit to tecosaur/julia that referenced this pull request Oct 20, 2025
Contains the following commits:
5bfd1161e * STDLIBS_BY_VERSION: Check sorted & require same minor (JuliaLang#4414)
ce986129c * Fix completion on empty command (JuliaLang#4418)
8d74d35d1 * update SHA compat (JuliaLang#4436)
14c5ae327 * add a docstring to Registry module (JuliaLang#4432)
bbb9e6d23 * allow `generate` into an empty directory (JuliaLang#4430)
a1818b9a9 * Drop the REPL search keymap (JuliaLang#4425)
baa7981c7 * do not try to update registries in an unwritable folder (JuliaLang#4429)
01690b54b * make `.path` field consistently be relative manifest and convert to project relative upon writing to a project file (JuliaLang#4427)
3306ed522 * Deduplicate suggestions in package completions (JuliaLang#4431)
tecosaur added a commit to tecosaur/julia that referenced this pull request Oct 20, 2025
Contains the following commits:
5bfd1161e * STDLIBS_BY_VERSION: Check sorted & require same minor (JuliaLang#4414)
ce986129c * Fix completion on empty command (JuliaLang#4418)
8d74d35d1 * update SHA compat (JuliaLang#4436)
14c5ae327 * add a docstring to Registry module (JuliaLang#4432)
bbb9e6d23 * allow `generate` into an empty directory (JuliaLang#4430)
a1818b9a9 * Drop the REPL search keymap (JuliaLang#4425)
baa7981c7 * do not try to update registries in an unwritable folder (JuliaLang#4429)
01690b54b * make `.path` field consistently be relative manifest and convert to project relative upon writing to a project file (JuliaLang#4427)
3306ed522 * Deduplicate suggestions in package completions (JuliaLang#4431)
tecosaur added a commit to tecosaur/julia that referenced this pull request Oct 20, 2025
Contains the following commits:
5bfd1161e * STDLIBS_BY_VERSION: Check sorted & require same minor (JuliaLang#4414)
ce986129c * Fix completion on empty command (JuliaLang#4418)
8d74d35d1 * update SHA compat (JuliaLang#4436)
14c5ae327 * add a docstring to Registry module (JuliaLang#4432)
bbb9e6d23 * allow `generate` into an empty directory (JuliaLang#4430)
a1818b9a9 * Drop the REPL search keymap (JuliaLang#4425)
baa7981c7 * do not try to update registries in an unwritable folder (JuliaLang#4429)
01690b54b * make `.path` field consistently be relative manifest and convert to project relative upon writing to a project file (JuliaLang#4427)
3306ed522 * Deduplicate suggestions in package completions (JuliaLang#4431)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants