Skip to content

Conversation

@cuviper
Copy link
Member

@cuviper cuviper commented Jan 29, 2021

We can avoid any stack-local nodes entirely by using Box::new_uninit, and since the nodes are mostly MaybeUninit fields, we only need a couple of actual writes before assume_init. This should help with the stack overflows in #81444, and may also improve performance in general.

r? @Mark-Simulacrum
cc @ssomers

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jan 29, 2021
@cuviper
Copy link
Member Author

cuviper commented Jan 29, 2021

Benchmark results on Ryzen 7 3800X running Fedora 33
 name                                           bench.master ns/iter  bench.node-init ns/iter  diff ns/iter   diff %  speedup 
 btree::map::clone_fat_100                      110,276               76,469                        -33,807  -30.66%   x 1.44 
 btree::map::clone_fat_100_and_clear            109,863               76,634                        -33,229  -30.25%   x 1.43 
 btree::map::clone_fat_100_and_drain_all        175,486               139,632                       -35,854  -20.43%   x 1.26 
 btree::map::clone_fat_100_and_drain_half       150,244               120,815                       -29,429  -19.59%   x 1.24 
 btree::map::clone_fat_100_and_into_iter        126,147               90,100                        -36,047  -28.58%   x 1.40 
 btree::map::clone_fat_100_and_pop_all          185,214               149,954                       -35,260  -19.04%   x 1.24 
 btree::map::clone_fat_100_and_remove_all       209,143               175,649                       -33,494  -16.01%   x 1.19 
 btree::map::clone_fat_100_and_remove_half      163,341               134,098                       -29,243  -17.90%   x 1.22 
 btree::map::clone_fat_val_100                  51,248                34,809                        -16,439  -32.08%   x 1.47 
 btree::map::clone_fat_val_100_and_clear        51,594                34,365                        -17,229  -33.39%   x 1.50 
 btree::map::clone_fat_val_100_and_drain_all    76,033                59,356                        -16,677  -21.93%   x 1.28 
 btree::map::clone_fat_val_100_and_drain_half   66,150                52,769                        -13,381  -20.23%   x 1.25 
 btree::map::clone_fat_val_100_and_into_iter    57,973                40,969                        -17,004  -29.33%   x 1.42 
 btree::map::clone_fat_val_100_and_pop_all      80,080                63,314                        -16,766  -20.94%   x 1.26 
 btree::map::clone_fat_val_100_and_remove_all   85,727                69,253                        -16,474  -19.22%   x 1.24 
 btree::map::clone_fat_val_100_and_remove_half  68,964                55,972                        -12,992  -18.84%   x 1.23 
 btree::map::clone_slim_100                     1,423                 1,323                            -100   -7.03%   x 1.08 
 btree::map::clone_slim_100_and_clear           1,431                 1,328                            -103   -7.20%   x 1.08 
 btree::map::clone_slim_100_and_drain_all       2,752                 2,612                            -140   -5.09%   x 1.05 
 btree::map::clone_slim_100_and_drain_half      2,312                 2,102                            -210   -9.08%   x 1.10 
 btree::map::clone_slim_100_and_into_iter       1,428                 1,070                            -358  -25.07%   x 1.33 
 btree::map::clone_slim_100_and_pop_all         2,981                 2,798                            -183   -6.14%   x 1.07 
 btree::map::clone_slim_100_and_remove_all      4,183                 4,026                            -157   -3.75%   x 1.04 
 btree::map::clone_slim_100_and_remove_half     2,360                 2,312                             -48   -2.03%   x 1.02 
 btree::map::clone_slim_10k                     187,223               179,762                        -7,461   -3.99%   x 1.04 
 btree::map::clone_slim_10k_and_clear           187,103               178,952                        -8,151   -4.36%   x 1.05 
 btree::map::clone_slim_10k_and_drain_all       328,127               309,945                       -18,182   -5.54%   x 1.06 
 btree::map::clone_slim_10k_and_drain_half      281,123               272,728                        -8,395   -2.99%   x 1.03 
 btree::map::clone_slim_10k_and_into_iter       187,027               150,913                       -36,114  -19.31%   x 1.24 
 btree::map::clone_slim_10k_and_pop_all         336,710               324,753                       -11,957   -3.55%   x 1.04 
 btree::map::clone_slim_10k_and_remove_all      508,120               490,692                       -17,428   -3.43%   x 1.04 
 btree::map::clone_slim_10k_and_remove_half     455,783               430,063                       -25,720   -5.64%   x 1.06 
 btree::map::find_rand_100                      9                     10                                  1   11.11%   x 0.90 
 btree::map::find_rand_10_000                   45                    45                                  0    0.00%   x 1.00 
 btree::map::find_seq_100                       10                    10                                  0    0.00%   x 1.00 
 btree::map::find_seq_10_000                    30                    30                                  0    0.00%   x 1.00 
 btree::map::first_and_last_0                   9                     9                                   0    0.00%   x 1.00 
 btree::map::first_and_last_100                 35                    33                                 -2   -5.71%   x 1.06 
 btree::map::first_and_last_10k                 54                    54                                  0    0.00%   x 1.00 
 btree::map::insert_rand_100                    39                    38                                 -1   -2.56%   x 1.03 
 btree::map::insert_rand_10_000                 39                    38                                 -1   -2.56%   x 1.03 
 btree::map::insert_seq_100                     44                    41                                 -3   -6.82%   x 1.07 
 btree::map::insert_seq_10_000                  91                    87                                 -4   -4.40%   x 1.05 
 btree::map::iter_0                             685                   688                                 3    0.44%   x 1.00 
 btree::map::iter_1                             12,947                12,923                            -24   -0.19%   x 1.00 
 btree::map::iter_100                           13,089                13,382                            293    2.24%   x 0.98 
 btree::map::iter_10k                           13,630                13,843                            213    1.56%   x 0.98 
 btree::map::iter_1m                            14,120                14,453                            333    2.36%   x 0.98 
 btree::map::iteration_1000                     3,919                 4,008                              89    2.27%   x 0.98 
 btree::map::iteration_100000                   473,040               477,477                         4,437    0.94%   x 0.99 
 btree::map::iteration_20                       73                    74                                  1    1.37%   x 0.99 
 btree::map::iteration_mut_1000                 4,065                 4,004                             -61   -1.50%   x 1.02 
 btree::map::iteration_mut_100000               468,690               434,082                       -34,608   -7.38%   x 1.08 
 btree::map::iteration_mut_20                   78                    73                                 -5   -6.41%   x 1.07 
 btree::map::range_included_excluded            395,607               384,372                       -11,235   -2.84%   x 1.03 
 btree::map::range_included_included            407,530               380,624                       -26,906   -6.60%   x 1.07 
 btree::map::range_included_unbounded           110,565               125,029                        14,464   13.08%   x 0.88 
 btree::map::range_unbounded_unbounded          74,247                75,505                          1,258    1.69%   x 0.98 
 btree::map::range_unbounded_vs_iter            136,143               138,270                         2,127    1.56%   x 0.98 
 btree::set::clone_100                          1,234                 1,195                             -39   -3.16%   x 1.03 
 btree::set::clone_100_and_clear                1,250                 1,199                             -51   -4.08%   x 1.04 
 btree::set::clone_100_and_drain_all            2,361                 2,321                             -40   -1.69%   x 1.02 
 btree::set::clone_100_and_drain_half           1,911                 1,829                             -82   -4.29%   x 1.04 
 btree::set::clone_100_and_into_iter            1,253                 1,208                             -45   -3.59%   x 1.04 
 btree::set::clone_100_and_pop_all              1,674                 1,634                             -40   -2.39%   x 1.02 
 btree::set::clone_100_and_remove_all           2,764                 2,258                            -506  -18.31%   x 1.22 
 btree::set::clone_100_and_remove_half          1,817                 1,738                             -79   -4.35%   x 1.05 
 btree::set::clone_10k                          147,634               141,749                        -5,885   -3.99%   x 1.04 
 btree::set::clone_10k_and_clear                139,936               132,077                        -7,859   -5.62%   x 1.06 
 btree::set::clone_10k_and_drain_all            250,283               243,180                        -7,103   -2.84%   x 1.03 
 btree::set::clone_10k_and_drain_half           218,285               211,095                        -7,190   -3.29%   x 1.03 
 btree::set::clone_10k_and_into_iter            143,524               135,397                        -8,127   -5.66%   x 1.06 
 btree::set::clone_10k_and_pop_all              191,210               179,708                       -11,502   -6.02%   x 1.06 
 btree::set::clone_10k_and_remove_all           325,636               283,829                       -41,807  -12.84%   x 1.15 
 btree::set::clone_10k_and_remove_half          343,480               338,654                        -4,826   -1.41%   x 1.01 
 btree::set::difference_random_100_vs_100       676                   700                                24    3.55%   x 0.97 
 btree::set::difference_random_100_vs_10k       2,393                 2,356                             -37   -1.55%   x 1.02 
 btree::set::difference_random_10k_vs_100       49,785                48,707                         -1,078   -2.17%   x 1.02 
 btree::set::difference_random_10k_vs_10k       161,528               161,777                           249    0.15%   x 1.00 
 btree::set::difference_staggered_100_vs_100    615                   605                               -10   -1.63%   x 1.02 
 btree::set::difference_staggered_100_vs_10k    2,123                 2,163                              40    1.88%   x 0.98 
 btree::set::difference_staggered_10k_vs_10k    59,483                60,204                            721    1.21%   x 0.99 
 btree::set::intersection_100_neg_vs_100_pos    14                    13                                 -1   -7.14%   x 1.08 
 btree::set::intersection_100_neg_vs_10k_pos    15                    15                                  0    0.00%   x 1.00 
 btree::set::intersection_100_pos_vs_100_neg    13                    13                                  0    0.00%   x 1.00 
 btree::set::intersection_100_pos_vs_10k_neg    15                    15                                  0    0.00%   x 1.00 
 btree::set::intersection_10k_neg_vs_100_pos    17                    14                                 -3  -17.65%   x 1.21 
 btree::set::intersection_10k_neg_vs_10k_pos    16                    16                                  0    0.00%   x 1.00 
 btree::set::intersection_10k_pos_vs_100_neg    14                    14                                  0    0.00%   x 1.00 
 btree::set::intersection_10k_pos_vs_10k_neg    16                    17                                  1    6.25%   x 0.94 
 btree::set::intersection_random_100_vs_100     655                   690                                35    5.34%   x 0.95 
 btree::set::intersection_random_100_vs_10k     2,013                 2,153                             140    6.95%   x 0.93 
 btree::set::intersection_random_10k_vs_100     2,067                 2,243                             176    8.51%   x 0.92 
 btree::set::intersection_random_10k_vs_10k     149,295               146,140                        -3,155   -2.11%   x 1.02 
 btree::set::intersection_staggered_100_vs_100  545                   530                               -15   -2.75%   x 1.03 
 btree::set::intersection_staggered_100_vs_10k  1,953                 1,983                              30    1.54%   x 0.98 
 btree::set::intersection_staggered_10k_vs_10k  53,822                53,558                           -264   -0.49%   x 1.00 
 btree::set::is_subset_100_vs_100               503                   510                                 7    1.39%   x 0.99 
 btree::set::is_subset_100_vs_10k               1,278                 1,607                             329   25.74%   x 0.80 
 btree::set::is_subset_10k_vs_100               2                     2                                   0    0.00%   x 1.00 
 btree::set::is_subset_10k_vs_10k               51,933                51,803                           -130   -0.25%   x 1.00 

It's a consistent win on clones, especially clone_fat*, and the rest of the results look noisy to me.

@ssomers
Copy link
Contributor

ssomers commented Feb 4, 2021

I rebased and tackled #80886 in the ssomers:btree-node-init branch (which I don't know how to formally offer because it doesn't compile on last week's master).

I got the same headroom as using box in #81444, i.e., supported value size grows from ~20kB to ~26kB but not to the ~37kB of Rust 1.33. And:

a similar benchmark improvement on an Intel Core i5-9500F running Windows x64
cargo-benchcmp.exe benchcmp o c --threshold 5
 name                                           o ns/iter  c ns/iter  diff ns/iter   diff %  speedup
 btree::map::clone_fat_100                      46,510     27,925          -18,585  -39.96%   x 1.67
 btree::map::clone_fat_100_and_clear            45,854     26,142          -19,712  -42.99%   x 1.75
 btree::map::clone_fat_100_and_drain_all        157,802    136,440         -21,362  -13.54%   x 1.16
 btree::map::clone_fat_100_and_drain_half       108,765    88,273          -20,492  -18.84%   x 1.23
 btree::map::clone_fat_100_and_into_iter        74,125     51,741          -22,384  -30.20%   x 1.43
 btree::map::clone_fat_100_and_pop_all          175,670    154,510         -21,160  -12.05%   x 1.14
 btree::map::clone_fat_100_and_remove_all       209,005    189,450         -19,555   -9.36%   x 1.10
 btree::map::clone_fat_100_and_remove_half      127,445    106,710         -20,735  -16.27%   x 1.19
 btree::map::clone_fat_val_100                  23,845     12,950          -10,895  -45.69%   x 1.84
 btree::map::clone_fat_val_100_and_clear        23,870     13,007          -10,863  -45.51%   x 1.84
 btree::map::clone_fat_val_100_and_drain_all    58,946     48,128          -10,818  -18.35%   x 1.22
 btree::map::clone_fat_val_100_and_drain_half   48,361     38,160          -10,201  -21.09%   x 1.27
 btree::map::clone_fat_val_100_and_into_iter    34,825     23,798          -11,027  -31.66%   x 1.46
 btree::map::clone_fat_val_100_and_pop_all      67,153     56,832          -10,321  -15.37%   x 1.18
 btree::map::clone_fat_val_100_and_remove_all   75,362     66,278           -9,084  -12.05%   x 1.14
 btree::map::clone_fat_val_100_and_remove_half  52,377     42,213          -10,164  -19.41%   x 1.24
 btree::map::clone_slim_100                     2,171      1,886              -285  -13.13%   x 1.15
 btree::map::clone_slim_100_and_clear           5,138      1,890            -3,248  -63.22%   x 2.72
 btree::map::clone_slim_100_and_drain_all       6,481      3,252            -3,229  -49.82%   x 1.99
 btree::map::clone_slim_100_and_drain_half      5,923      2,689            -3,234  -54.60%   x 2.20
 btree::map::clone_slim_100_and_into_iter       5,194      1,924            -3,270  -62.96%   x 2.70
 btree::map::clone_slim_100_and_pop_all         6,719      3,483            -3,236  -48.16%   x 1.93
 btree::map::clone_slim_100_and_remove_all      7,887      4,661            -3,226  -40.90%   x 1.69
 btree::map::clone_slim_100_and_remove_half     6,212      2,977            -3,235  -52.08%   x 2.09
 btree::map::clone_slim_10k                     262,265    236,255         -26,010   -9.92%   x 1.11
 btree::map::clone_slim_10k_and_clear           262,640    237,290         -25,350   -9.65%   x 1.11
 btree::map::clone_slim_10k_and_drain_all       402,575    369,490         -33,085   -8.22%   x 1.09
 btree::map::clone_slim_10k_and_drain_half      368,295    338,110         -30,185   -8.20%   x 1.09
 btree::map::clone_slim_10k_and_into_iter       262,753    236,507         -26,246   -9.99%   x 1.11
 btree::map::clone_slim_10k_and_pop_all         437,210    408,295         -28,915   -6.61%   x 1.07
 btree::map::clone_slim_10k_and_remove_all      603,210    569,725         -33,485   -5.55%   x 1.06
 btree::map::clone_slim_10k_and_remove_half     573,290    516,450         -56,840   -9.91%   x 1.11
 btree::map::find_rand_100                      14         13                   -1   -7.14%   x 1.08
 btree::map::find_rand_10_000                   61         56                   -5   -8.20%   x 1.09
 btree::map::find_seq_100                       14         12                   -2  -14.29%   x 1.17
 btree::map::find_seq_10_000                    41         37                   -4   -9.76%   x 1.11
 btree::map::first_and_last_0                   10         12                    2   20.00%   x 0.83
 btree::map::first_and_last_100                 50         42                   -8  -16.00%   x 1.19
 btree::map::first_and_last_10k                 61         68                    7   11.48%   x 0.90
 btree::map::insert_rand_100                    42         39                   -3   -7.14%   x 1.08
 btree::set::clone_100_and_pop_all              2,707      2,566              -141   -5.21%   x 1.05
 btree::set::clone_10k                          212,632    225,525          12,893    6.06%   x 0.94
 btree::set::difference_random_100_vs_100       853        752                -101  -11.84%   x 1.13
 btree::set::difference_random_10k_vs_100       62,564     58,306           -4,258   -6.81%   x 1.07
 btree::set::difference_staggered_100_vs_100    870        785                 -85   -9.77%   x 1.11
 btree::set::difference_staggered_10k_vs_10k    83,913     76,138           -7,775   -9.27%   x 1.10
 btree::set::intersection_staggered_100_vs_100  625        661                  36    5.76%   x 0.95
 btree::set::is_subset_100_vs_10k               1,619      1,186              -433  -26.74%   x 1.37

@Mark-Simulacrum
Copy link
Member

This looks overall good to me. I'm guessing there's a few places we dereference the Box during destruction or movement of the key/value pairs that could be changed to avoid putting keys and/or values on the stack too, but that can be tackled separately.

@bors r+ rollup=never (likely to at least be perf noisy)

@bors
Copy link
Collaborator

bors commented Feb 12, 2021

📌 Commit d828de6280c5cd6e4ce24961af111e19f28c7668 has been approved by Mark-Simulacrum

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 12, 2021
@bors
Copy link
Collaborator

bors commented Feb 12, 2021

⌛ Testing commit d828de6280c5cd6e4ce24961af111e19f28c7668 with merge ce13e5d02f6cffd21da078a31544ffbf4b8a070e...

@bors
Copy link
Collaborator

bors commented Feb 12, 2021

💔 Test failed - checks-actions

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Feb 12, 2021
@rust-log-analyzer

This comment has been minimized.

@ssomers
Copy link
Contributor

ssomers commented Feb 12, 2021

This PR needs to tackle #80886 such as in the ssomers:btree-node-init branch. I don't know how to formally offer such a change in this PR without mixing in a whole lot of changes such as #80886 itself.

@cuviper
Copy link
Member Author

cuviper commented Feb 12, 2021

Thanks @ssomers -- I rebased this PR and cherry-picked your fix.

@Mark-Simulacrum
Copy link
Member

@bors r+ rollup=never

@bors
Copy link
Collaborator

bors commented Feb 13, 2021

📌 Commit 5a58cf4 has been approved by Mark-Simulacrum

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 13, 2021
@bors
Copy link
Collaborator

bors commented Feb 13, 2021

⌛ Testing commit 5a58cf4 with merge 3c10a88...

@bors
Copy link
Collaborator

bors commented Feb 13, 2021

☀️ Test successful - checks-actions
Approved by: Mark-Simulacrum
Pushing 3c10a88 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Feb 13, 2021
@bors bors merged commit 3c10a88 into rust-lang:master Feb 13, 2021
@rustbot rustbot added this to the 1.52.0 milestone Feb 13, 2021
@bors bors mentioned this pull request Feb 13, 2021
@Mark-Simulacrum
Copy link
Member

It looks like this ended up being a pretty uniform regression on a number of benchmarks, including real world ones. It doesn't seem like it should be, since it's presumably doing strictly less work.

I think it would be good to examine a cachegrind diff or similar to try to narrow down exactly where the extra cost is sneaking in so we can perhaps mitigate it.

@cuviper
Copy link
Member Author

cuviper commented Feb 16, 2021

It doesn't seem like it should be, since it's presumably doing strictly less work.

Right, that's my expectation too. I'll try to look into this...

@ssomers
Copy link
Contributor

ssomers commented Feb 17, 2021

I wrote a standalone microbenchmark confirming that "strictly less work" takes ~3% more time.

running 2 tests
test heap  ... bench:         111 ns/iter (+/- 7)
test stack ... bench:         105 ns/iter (+/- 4)

The difference disappears when taking out the parent_idx field. Which to me suggests that the init function, by zooming in on the 16 bit len field, actually does more work than the stack constructor that knows it can spill over into the adjacent bits that include the parent_idx field.

@Mark-Simulacrum
Copy link
Member

Could we also do that? There's no reason we can't initialize more memory, though it seems odd that a 16-bit write would be slower than 32 (or 64).

It also seems like the difference you're seeing is potentially just noise, though hard to tell. Certainly those ranges overlap :)

@ssomers
Copy link
Contributor

ssomers commented Feb 17, 2021

I did try to let the init function write an uninit into parent_idx, and wasn't surprised this changes nothing. I didn't think of writing an actual 0 and indeed that speeds up the heap benchmark on par with the stack benchmark, on a 64 bit machine.

I'm not at all surprised a 16-bit write would be slower. Depending on CPU architecture (which I only knew little about last century) and instructions generated (which I know nothing about), it may need to first read 64 bits, surgically replace 16 bits, and write back 64 bits.

@ssomers
Copy link
Contributor

ssomers commented Feb 17, 2021

I finally managed to get Godbolt to produce some readable comparison. The stack implementation copies 128 zero bits at once with some fancy modern 128 bit instructions, while the heap implementation carefully copies 64 and 32 of those.

One wrong conclusion is that the difference is due to the example setting parent_idx to 0, where in reality it was already zeroed and the stack method has some snitch whispering to the compiler it can leave out that instruction. But when the example sets parent_idx to some non-zero value, the stack implementation emits the same instruction yet emits a mov qword ptr [rax + 8] instruction instead of a mov dword ptr [rax + 8] and still has the advantage.

So I think the difference is that somehow the stack implementation realizes it has 64 bits of space to dump bits in, while the directly-on-the-heap implementation writes only 32 bits. Note that it doesn't limit to 16 bits, the size of parent_id, but overwrites the other 16 bit field len using the value that it knows it already has written earlier. As if the compiler thinks that writing 32 bits is better than writing 16 bits, but writing 64 bits (to an 8 byte aligned address) is only useful when it's to the stack.

There's no difference between heap and stack initialization (on my machine) if the initialized field is u8 or u32 instead of u16, while the size of the uninitialized field doesn't matter (unless it exceeds 32 bits, obviously).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants