=== Optimization: Defer recursion into pointees #43658
Conversation
jl_egal Optimization: Defer recursion into pointees
@nanosoldier
@nanosoldier
Good idea, I like it! This sort of feels like the right way to do it to me, even if it's a tad slower in some cases. I wonder if it would help to ignore pointer fields completely in the first pass, instead of doing the NULL check.
Something went wrong when running your job: Unfortunately, the logs could not be uploaded.
Yay, thanks! 🎉
Yeah, I had that idea too, and I could go either way. I chose to do it this way because it's possible the NULL/NULL check could also allow us to return early: if a struct has 2 pointer fields and the second one is NULL in only one of the two objects, we can return false without ever recursing into the first pointee. So it's another instance of front-loading checking the bits in the struct before recursing to pointees. I think it's probably better to do it this way, since it's making the same tradeoff as the rest of the PR. Does that make sense to you, too?
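To make the early-return case concrete, here is a hypothetical Julia analogue of the idea (the real check operates on raw C NULL pointers inside the runtime; the `TwoPtrs` type, its `#undef` fields, and the helper name below are illustrative stand-ins, not the PR's actual code):
```julia
# Illustrative sketch only: #undef fields stand in for NULL pointers.
mutable struct TwoPtrs
    a::Any
    b::Any
    TwoPtrs() = new()          # leaves both fields #undef (the "NULL" analogue)
    TwoPtrs(a, b) = new(a, b)
end

# Checking the definedness bits of *all* fields before recursing lets us
# return false on a mismatch in field `b` without ever walking the
# (possibly deep) object stored in field `a`.
function pointer_bits_match(x::TwoPtrs, y::TwoPtrs)
    for i in 1:fieldcount(TwoPtrs)
        isdefined(x, i) == isdefined(y, i) || return false
    end
    return true
end
```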
Anyone know why Nanosoldier failed to upload the benchmark results? Is it because the PR is a draft or something? I'll mark it ready for review now.
The size has just gotten too big and no one has fixed it.
NULL fields are rare, so I think not worth special-casing. Actual timing is what matters though.
I get these results for your last benchmark (where the PR is slower):
before PR: […]
PR: […]
PR + ignore pointer fields on the first pass: […]
So there is a slight improvement (and it looks like I need a new laptop again 😂).
Okay cool, makes sense! I think the "NULL fields are rare, so not worth special-casing" argument is strong here. Thanks! 👍 👍 I can make that change. But first: is it possible to see the results of the nanosoldier run? Or are they lost to the mysteries of the universe? It would be nice to see whether this proves to be good or bad, and I'd like to have those numbers before making any changes so we can compare. Thanks!
@vtjnash has to manually retrieve them.
Oops, sorry, I am behind on email. Here you go:
Thanks @vtjnash! What do the percentages mean? It looks like some got quite a bit better and some quite a bit worse, but in ways that don't exactly make sense to me... so I feel like this is probably mostly just noise? Can any of you with more experience reading these weigh in? I'll give your suggestion a shot now, @JeffBezanson.
@nanosoldier runbenchmarks(ALL, vs="@5449d1bfabdaeeb321c179a8344dc2852a989764") |
Nanosoldier (especially recently) is pretty noisy.
@NHDaly, you need to code quote the part after the nanosoldier invocation.
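For reference, assuming the usual Nanosoldier convention of wrapping the command in backticks, the invocation above would look something like:
```
@nanosoldier `runbenchmarks(ALL, vs="@5449d1bfabdaeeb321c179a8344dc2852a989764")`
```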
@nanosoldier |
Something went wrong when running your job: Unfortunately, the logs could not be uploaded.
Oops, thanks @KristofferC. I did that right the first time (but it didn't work then either… 🤔 maybe I don't have permissions or something). Thanks @oscardssmith and @KristofferC.
It did work; it's just that there's a nanosoldier bug that means @vtjnash needs to post the log manually.
I don't know if this is the comparison you wanted, since there are a lot of unrelated commits in that command, but here you go: https://github.com/JuliaCI/NanosoldierReports/blob/master/benchmark/by_hash/18eef47_vs_5449d1b/report.md
@nanosoldier |
For more impact, we may want to update codegen.cpp to do the same ordering change as here, also for […]
Something went wrong when running your job: Unfortunately, the logs could not be uploaded.
Still not what I meant to run though: @nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Oh, huh, maybe I don't exactly understand how Nanosoldier works. I think I was trying to compare the latest commit on this branch against the previous commit. Did that not do that?
Yeah, makes sense. Good idea! I don't really feel like I have the chops to do that… should it be done in a separate PR? Or could someone help with that here? Also, I really don't know what to make of these Nanosoldier results… Some things appear to get quite a bit better, and some quite a bit worse, but it's hard to tell what's just noise? :(
@NHDaly Can you rebase on the latest master? That should fix the […]
Immutable struct comparisons with `===` can be arbitrarily expensive for
deeply recursive but (almost) equal objects. Whenever possible, it's
valuable to defer the potentially expensive recursion by first comparing
the struct fields for bitwise equality.
Before this commit, two structs are compared elementwise, in the order of
the struct definition, recursing when pointer fields are encountered.
This commit defers the recursion into pointed-to fields until after all
other non-pointer fields of the struct are compared.
This has two advantages:
1. It defers the expensive part of `===` comparison as long as possible,
in the hopes that we can exit early from dissimilarities discovered
elsewhere in the struct instances.
2. It improves cache-locality by scanning the whole struct before
jumping into any new places in memory (and reducing comparisons
needed on the current cache line after returning from the recursive
call).
The drawback is that you'll have to scan the pointer fields again, which
means potentially more cache misses if the struct is very large.
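As an illustration of the ordering change, here is a minimal Julia sketch of the two-pass idea (the actual change is in the C implementation of `jl_egal`; the function name here is hypothetical and the real code also handles unions, padding, and NULL fields):
```julia
# Hypothetical sketch of the two-pass field ordering; illustrative only.
function egal_two_pass(a::T, b::T) where {T}
    # Pass 1: compare all bits (non-pointer) fields first -- cheap and
    # cache-local, and a mismatch lets us skip the recursion entirely.
    for i in 1:fieldcount(T)
        if isbitstype(fieldtype(T, i))
            getfield(a, i) === getfield(b, i) || return false
        end
    end
    # Pass 2: only now rescan the struct and recurse into pointer fields.
    for i in 1:fieldcount(T)
        if !isbitstype(fieldtype(T, i))
            getfield(a, i) === getfield(b, i) || return false
        end
    end
    return true
end
```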
The best way to tell if this is helpful or harmful is benchmarking.
Here is the motivating benchmark, which indeed improves by 10x with this
commit, compared to master:
```julia
julia> using BenchmarkTools
julia> struct VN
val::Float32
next::Union{VN, Array} # put a mutable thing at the end, to prevent runtime from sharing instances
end
julia> struct NV
next::Union{NV, Array} # put a mutable thing at the end, to prevent runtime from sharing instances
val::Float32
end
julia> function make_chain_vn(n, sentinel)
head = VN(1, sentinel)
for i in 2:n
head = VN(rand(Int), head)
end
return head
end
make_chain_vn (generic function with 1 method)
julia> function make_chain_nv(n, sentinel)
head = NV(sentinel, 1)
for i in 2:n
head = NV(head, rand(Int))
end
return head
end
make_chain_nv (generic function with 1 method)
julia> vn1, vn2 = make_chain_vn(10000, []), make_chain_vn(10000, []);
julia> nv1, nv2 = make_chain_nv(10000, []), make_chain_nv(10000, []);
```
Master:
```
julia> @btime $vn1 === $vn2
7.562 ns (0 allocations: 0 bytes)
false
julia> @btime $nv1 === $nv2 # slower, since it recurses into pointers unnecessarily
76.952 μs (0 allocations: 0 bytes)
false
```
After this commit:
```
julia> @btime $vn1 === $vn2
8.597 ns (0 allocations: 0 bytes)
false
julia> @btime $nv1 === $nv2 # We get to skip the recursion and exit early. :)
10.280 ns (0 allocations: 0 bytes)
false
```
However, I think that there are probably other benchmarks where it
harms performance, so we'll have to see...
For example, here's one: in the exact opposite case from above, if the two
objects _are_ (almost) equal, necessitating checking every object, the
`NV` comparisons could have exited after all the recursive pointer
checks without ever comparing the fields, whereas now the fields are checked
first, so this gets slower.
I'm not exactly sure why the `VN` comparisons get somewhat slower too,
but it's maybe because of the second scan mentioned above.
```julia
julia> function make_chain_nv(n, sentinel)
head = NV(sentinel, 1)
for i in 2:n
head = NV(head, i)
end
return head
end
make_chain_nv (generic function with 1 method)
julia> function make_chain_vn(n, sentinel)
head = VN(1, sentinel)
for i in 2:n
head = VN(i, head)
end
return head
end
make_chain_vn (generic function with 1 method)
julia> vn1, vn2 = make_chain_vn(10000, []), make_chain_vn(10000, []);
julia> nv1, nv2 = make_chain_nv(10000, []), make_chain_nv(10000, []);
```
Master:
```
julia> @btime $vn1 === $vn2
95.996 μs (0 allocations: 0 bytes)
false
julia> @btime $nv1 === $nv2
82.192 μs (0 allocations: 0 bytes)
false
```
This commit:
```
julia> @btime $vn1 === $vn2
127.512 μs (0 allocations: 0 bytes)
false
julia> @btime $nv1 === $nv2
126.837 μs (0 allocations: 0 bytes)
```
* Ignore pointer fields completely in the first pass of ===
Delay the nullptr checks until all non-ptr fields have been compared, since we have to go back to those anyways to follow the pointers.
Co-authored-by: Jeff Bezanson <[email protected]>
Force-pushed from 7d8c933 to ba718e8.
Done, thanks
* jl_egal Optimization: Defer recursion into pointees
* Ignore pointer fields completely in the first pass of ===
Explicitly referencing #44712 from here.

We stumbled across this potential optimization while reading through the code for `compare_fields()`. Hopefully it's beneficial, but we'll see! :)
Co-Authored-By: @nystrom