@im-0 (Contributor) commented Sep 9, 2025

Hi!

I am trying to write a software 3D rasterizer and want to use nalgebra with SIMD as the math library. Here I tried to add AoSoA SIMD support for Matrix4::transform_point() and friends. I also benchmarked my modifications to make sure that I didn't regress anything, and that SIMD support makes sense here at all.
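
To make the AoSoA idea concrete, here is a minimal sketch in plain Rust with no dependencies. It is not the code from this PR: `Lanes`, `Point3x4`, `Mat4`, and `transform_point_x4` are hypothetical names, and `[f32; 4]` stands in for a real SIMD type. Each coordinate holds four values, so one call transforms four points, and the per-lane loop is the part a SIMD type would execute as single instructions.

```rust
/// Four f32 lanes; a stand-in for a real SIMD vector type (assumption).
type Lanes = [f32; 4];

/// AoSoA point: each coordinate stores the values of four points.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Point3x4 {
    x: Lanes,
    y: Lanes,
    z: Lanes,
}

/// Row-major 4x4 matrix with scalar entries, shared by all four lanes.
type Mat4 = [[f32; 4]; 4];

/// Transforms four points at once, treating each as (x, y, z, 1) and
/// dividing by the resulting w. Hypothetical helper, not nalgebra API;
/// it assumes w is non-zero for every lane.
fn transform_point_x4(m: &Mat4, p: &Point3x4) -> Point3x4 {
    let mut out = Point3x4 { x: [0.0; 4], y: [0.0; 4], z: [0.0; 4] };
    for i in 0..4 {
        let (x, y, z) = (p.x[i], p.y[i], p.z[i]);
        let w = m[3][0] * x + m[3][1] * y + m[3][2] * z + m[3][3];
        out.x[i] = (m[0][0] * x + m[0][1] * y + m[0][2] * z + m[0][3]) / w;
        out.y[i] = (m[1][0] * x + m[1][1] * y + m[1][2] * z + m[1][3]) / w;
        out.z[i] = (m[2][0] * x + m[2][1] * y + m[2][2] * z + m[2][3]) / w;
    }
    out
}

fn main() {
    // Pure translation by (1, 2, 3).
    let m: Mat4 = [
        [1.0, 0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0, 2.0],
        [0.0, 0.0, 1.0, 3.0],
        [0.0, 0.0, 0.0, 1.0],
    ];
    // Four points lying on the x axis, stored coordinate by coordinate.
    let p = Point3x4 { x: [0.0, 1.0, 2.0, 3.0], y: [0.0; 4], z: [0.0; 4] };
    let q = transform_point_x4(&m, &p);
    println!("{:?}", q);
}
```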

Benchmarks were performed on various CPUs that I have. Here are the results:


Benchmark results on AMD Ryzen 9 5950X, Linux, compared to previous one:

$ RUSTFLAGS="-C target-cpu=x86-64-v2 -C target-feature=+avx" bench --all-features --bench nalgebra_bench -- _transform_
mat3_transform_vector2  time:   [317.64 ps 317.78 ps 318.02 ps]
						change: [+0.4783% +0.5229% +0.5801%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

mat4_transform_vector3  time:   [441.89 ps 442.50 ps 443.11 ps]
						change: [+0.2644% +0.3559% +0.4672%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

mat3_transform_point2   time:   [316.78 ps 317.11 ps 317.45 ps]
						change: [+0.3692% +0.4728% +0.5758%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

mat4_transform_point3   time:   [443.89 ps 443.96 ps 444.04 ps]
						change: [+0.9537% +1.0151% +1.0630%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

mat3_transform_vector2_x4wide
						time:   [425.44 ps 425.70 ps 425.89 ps]

mat4_transform_vector3_x4wide
						time:   [646.75 ps 646.85 ps 647.00 ps]
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

mat3_transform_point2_x4wide
						time:   [422.59 ps 422.71 ps 422.87 ps]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) high mild
  7 (7.00%) high severe

mat4_transform_point3_x4wide
						time:   [636.52 ps 636.61 ps 636.70 ps]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

mat4_transform_vector3_no_division
						time:   [443.83 ps 443.97 ps 444.13 ps]
						change: [-0.1875% -0.1171% -0.0495%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  7 (7.00%) high mild
  2 (2.00%) high severe

$ bench --all-features --bench nalgebra_bench -- _transform_
mat3_transform_vector2  time:   [316.72 ps 316.99 ps 317.22 ps]
						change: [-0.2793% -0.1789% -0.0874%] (p = 0.00 < 0.05)
						Change within noise threshold.

mat4_transform_vector3  time:   [441.41 ps 441.49 ps 441.58 ps]
						change: [+0.1228% +0.5653% +0.8103%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) low severe
  3 (3.00%) high mild
  6 (6.00%) high severe

mat3_transform_point2   time:   [317.29 ps 317.52 ps 317.74 ps]
						change: [+0.1923% +0.2853% +0.3731%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) low severe
  1 (1.00%) high mild
  6 (6.00%) high severe

mat4_transform_point3   time:   [441.69 ps 441.79 ps 441.91 ps]
						change: [-0.0066% +0.1238% +0.2511%] (p = 0.06 > 0.05)
						No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

mat3_transform_vector2_x4wide
						time:   [430.98 ps 431.02 ps 431.06 ps]
						change: [+1.5248% +1.6230% +1.7210%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

mat4_transform_vector3_x4wide
						time:   [646.64 ps 646.72 ps 646.81 ps]
						change: [-0.0589% -0.0249% +0.0012%] (p = 0.11 > 0.05)
						No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

mat3_transform_point2_x4wide
						time:   [431.82 ps 431.89 ps 431.96 ps]
						change: [+2.0913% +2.1413% +2.1815%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high severe

mat4_transform_point3_x4wide
						time:   [646.37 ps 646.41 ps 646.46 ps]
						change: [+1.5411% +1.5729% +1.6038%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe

mat4_transform_vector3_no_division
						time:   [443.87 ps 444.10 ps 444.31 ps]
						change: [-0.6030% -0.4238% -0.2411%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

Benchmark results on Intel i7-8565U, Linux, compared to previous one:

$ RUSTFLAGS="-C target-cpu=x86-64-v2 -C target-feature=+avx" bench --all-features --bench nalgebra_bench -- _transform_
mat3_transform_vector2  time:   [293.12 ps 295.95 ps 298.98 ps]
						change: [+1.4817% +2.1154% +2.6804%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  15 (15.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high severe

mat4_transform_vector3  time:   [540.54 ps 540.82 ps 541.10 ps]
						change: [-8.7714% -8.5103% -8.2102%] (p = 0.00 < 0.05)
						Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

mat3_transform_point2   time:   [305.57 ps 305.91 ps 306.43 ps]
						change: [+4.2420% +4.6490% +4.9390%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

mat4_transform_point3   time:   [546.64 ps 546.97 ps 547.39 ps]
						change: [-7.5737% -7.2913% -7.0061%] (p = 0.00 < 0.05)
						Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

mat3_transform_vector2_x4wide
						time:   [499.51 ps 499.85 ps 500.30 ps]
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  8 (8.00%) high severe

mat4_transform_vector3_x4wide
						time:   [774.40 ps 775.65 ps 777.02 ps]
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

mat3_transform_point2_x4wide
						time:   [526.26 ps 529.71 ps 535.84 ps]
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe

mat4_transform_point3_x4wide
						time:   [796.16 ps 796.74 ps 797.42 ps]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

mat4_transform_vector3_no_division
						time:   [529.04 ps 530.22 ps 531.70 ps]
						change: [-15.420% -13.707% -12.683%] (p = 0.00 < 0.05)
						Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  7 (7.00%) high mild
  3 (3.00%) high severe

$ bench --all-features --bench nalgebra_bench -- _transform_
mat3_transform_vector2  time:   [307.16 ps 307.34 ps 307.55 ps]
						change: [+4.7516% +5.1517% +5.5680%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high severe

mat4_transform_vector3  time:   [575.44 ps 576.15 ps 576.97 ps]
						change: [+5.2866% +5.6903% +6.0084%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild

mat3_transform_point2   time:   [301.50 ps 301.79 ps 302.15 ps]
						change: [+2.2124% +2.5777% +2.9744%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low severe
  4 (4.00%) high mild
  6 (6.00%) high severe

mat4_transform_point3   time:   [584.46 ps 584.88 ps 585.45 ps]
						change: [-4.1562% -3.8892% -3.5962%] (p = 0.00 < 0.05)
						Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild
  9 (9.00%) high severe

mat3_transform_vector2_x4wide
						time:   [530.59 ps 536.69 ps 544.38 ps]
						change: [+6.3284% +6.9464% +7.6364%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

mat4_transform_vector3_x4wide
						time:   [813.42 ps 814.31 ps 815.26 ps]
						change: [+4.3840% +4.9097% +5.3254%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild

mat3_transform_point2_x4wide
						time:   [523.32 ps 523.71 ps 524.08 ps]
						change: [-5.0939% -2.1361% -0.3320%] (p = 0.08 > 0.05)
						No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

mat4_transform_point3_x4wide
						time:   [842.23 ps 842.86 ps 843.56 ps]
						change: [+5.5553% +5.9587% +6.2648%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low severe
  1 (1.00%) high severe

mat4_transform_vector3_no_division
						time:   [566.75 ps 567.19 ps 567.60 ps]
						change: [-1.3344% -0.9935% -0.4433%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

Benchmark results on Apple M2 Max, Linux, compared to previous one:

$ cargo bench --all-features --bench nalgebra_bench -- _transform_
mat3_transform_vector2  time:   [315.56 ps 316.70 ps 317.92 ps]
						change: [-0.6503% -0.0844% +0.4491%] (p = 0.76 > 0.05)
						No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

mat4_transform_vector3  time:   [383.68 ps 383.82 ps 384.07 ps]
						change: [-0.0143% +0.1640% +0.4628%] (p = 0.24 > 0.05)
						No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe

mat3_transform_point2   time:   [318.32 ps 318.98 ps 319.63 ps]
						change: [+0.9643% +1.4702% +2.0348%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high severe

mat4_transform_point3   time:   [384.04 ps 384.21 ps 384.43 ps]
						change: [+0.1676% +0.2195% +0.2713%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

mat3_transform_vector2_x4wide
						time:   [309.13 ps 309.56 ps 310.00 ps]
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

mat4_transform_vector3_x4wide
						time:   [460.39 ps 460.46 ps 460.56 ps]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

mat3_transform_point2_x4wide
						time:   [308.68 ps 309.09 ps 309.52 ps]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

mat4_transform_point3_x4wide
						time:   [460.38 ps 460.42 ps 460.46 ps]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

mat4_transform_vector3_no_division
						time:   [383.69 ps 383.73 ps 383.78 ps]
						change: [-0.0192% +0.0117% +0.0459%] (p = 0.51 > 0.05)
						No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

Benchmark results on Broadcom BCM2711 (Raspberry Pi 4), Linux, compared to previous one:

mat3_transform_vector2  time:   [1.1179 ns 1.1185 ns 1.1191 ns]
						change: [+0.1194% +0.2209% +0.3215%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe

mat4_transform_vector3  time:   [1.6815 ns 1.6817 ns 1.6819 ns]
						change: [-0.0434% -0.0159% +0.0102%] (p = 0.26 > 0.05)
						No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

mat3_transform_point2   time:   [1.1170 ns 1.1174 ns 1.1179 ns]
						change: [-0.0609% +0.0024% +0.0661%] (p = 0.94 > 0.05)
						No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

mat4_transform_point3   time:   [1.6817 ns 1.6819 ns 1.6821 ns]
						change: [-0.0088% +0.0325% +0.0857%] (p = 0.20 > 0.05)
						No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

mat3_transform_vector2_x4wide
						time:   [2.2614 ns 2.2618 ns 2.2622 ns]
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

mat4_transform_vector3_x4wide
						time:   [3.3940 ns 3.3960 ns 3.3993 ns]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

mat3_transform_point2_x4wide
						time:   [2.2614 ns 2.2617 ns 2.2619 ns]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

mat4_transform_point3_x4wide
						time:   [3.3941 ns 3.3954 ns 3.3970 ns]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

mat4_transform_vector3_no_division
						time:   [1.9712 ns 1.9720 ns 1.9729 ns]
						change: [-0.0164% +0.0278% +0.0815%] (p = 0.26 > 0.05)
						No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

The results are mostly expected: there are no significant regressions in non-SIMD benchmarks; SIMD makes sense even on the potato CPU of Raspberry Pi 4. The only exception is the older Intel CPU (i7-8565U), where transform_vector() and transform_point() slightly regressed for a 2D case, but improved for a 3D case.
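
To put the scalar and `_x4wide` numbers side by side, a small worked calculation helps. The sketch below assumes that one `_x4wide` iteration transforms four points (one per lane), which the benchmark names suggest but the output does not state; `speedup_per_element` is a hypothetical helper. It uses the `mat4_transform_point3` median estimates reported above for the first Ryzen 9 5950X run and for the Raspberry Pi 4.

```rust
/// Throughput gain per element: scalar time divided by the wide time
/// amortized over `lanes` elements. Both times must use the same unit.
fn speedup_per_element(scalar_time: f64, wide_time: f64, lanes: u32) -> f64 {
    scalar_time / (wide_time / f64::from(lanes))
}

fn main() {
    // Median estimates in picoseconds, copied from the results above.
    // Assumption: each x4wide iteration processes 4 points.
    let ryzen = speedup_per_element(443.96, 636.61, 4);
    let rpi4 = speedup_per_element(1681.9, 3395.4, 4);
    println!("Ryzen 9 5950X:  {:.2}x per point", ryzen);
    println!("Raspberry Pi 4: {:.2}x per point", rpi4);
}
```

Under that assumption the gain is roughly 2.8x per point on the Ryzen and roughly 2x on the Raspberry Pi 4.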

It is also possible to leave existing functions as is, and add new SIMD-specific functions instead.

Other semi-related changes included in this PR:

  • Perspective3::project_vector() fixed so that result matches the result of matrix multiplication.
  • Added tests for {Orthographic3, Perspective3}::project_vector() (mainly to understand what exactly Perspective3 projection does for a vector).
  • Added codegen-units = 1 for benchmarks, as otherwise results are less consistent and change after unrelated code changes.
  • Improved documentation for Matrix*::transform_*() and Perspective3::project_*().
  • Added tests for Matrix*::transform_*().

P.S.: Added SIMD benchmarks require simba with this PR merged: dimforge/simba#76

@im-0 (Contributor, Author) commented Sep 9, 2025

If you decide to merge this PR, please merge it without squashing into a single commit. A single commit would make benchmarking before/after changes much more difficult.

@im-0 (Contributor, Author) commented Sep 10, 2025

I asked my friends to run the same benchmarks on other CPUs (preferably Intel) and got some interesting results. It turned out that some benchmarks regressed and some improved on all tested Intel CPUs. The regression was more or less stable across runs, but improvements were seemingly random. Then we tried to disable the Intel Turbo Boost and this "fixed" both regressions and improvements.

My benchmarking routine is basically the following:

  • cpupower frequency-set --governor performance
  • echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
  • git checkout $BEFORE_CHANGES
  • RUSTFLAGS="-C target-cpu=x86-64-v2 -C target-feature=+avx" cargo bench --all-features --bench nalgebra_bench -- --save-baseline base _transform_
  • git checkout $AFTER_CHANGES
  • RUSTFLAGS="-C target-cpu=x86-64-v2 -C target-feature=+avx" cargo bench --all-features --bench nalgebra_bench -- --baseline-lenient base _transform_

Intel i9-9900K:

mat3_transform_vector2  time:   [331.96 ps 335.43 ps 338.46 ps]
                        change: [-1.5766% -0.5575% +0.4401%] (p = 0.29 > 0.05)
                        No change in performance detected.
Found 20 outliers among 100 measurements (20.00%)
  15 (15.00%) low severe
  5 (5.00%) low mild

mat4_transform_vector3  time:   [633.70 ps 634.00 ps 634.22 ps]
                        change: [-0.1789% +0.1501% +0.4816%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
  9 (9.00%) low severe
  4 (4.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

mat3_transform_point2   time:   [333.65 ps 334.05 ps 334.41 ps]
                        change: [-0.6090% -0.2228% +0.1225%] (p = 0.25 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild

mat4_transform_point3   time:   [632.13 ps 632.38 ps 632.60 ps]
                        change: [-0.1452% +0.1086% +0.4231%] (p = 0.54 > 0.05)
                        No change in performance detected.
Found 19 outliers among 100 measurements (19.00%)
  9 (9.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

mat3_transform_vector2_x4wide
                        time:   [623.93 ps 624.03 ps 624.11 ps]
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

mat4_transform_vector3_x4wide
                        time:   [921.04 ps 921.18 ps 921.30 ps]
Found 16 outliers among 100 measurements (16.00%)
  7 (7.00%) low severe
  4 (4.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

mat3_transform_point2_x4wide
                        time:   [622.56 ps 622.84 ps 623.10 ps]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild

mat4_transform_point3_x4wide
                        time:   [921.65 ps 921.76 ps 921.87 ps]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) low severe

mat4_transform_vector3_no_division
                        time:   [658.94 ps 658.98 ps 659.03 ps]
                        change: [-0.4661% -0.0747% +0.3524%] (p = 0.77 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild

$ rustc -vV
rustc 1.89.0 (29483883e 2025-08-04) (Fedora 1.89.0-2.fc41)
binary: rustc
commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
commit-date: 2025-08-04
host: x86_64-unknown-linux-gnu
release: 1.89.0
LLVM version: 19.1.7

Intel i7-8565U:

mat3_transform_vector2  time:   [670.49 ps 670.56 ps 670.65 ps]
                        change: [-1.0910% -0.3353% +0.2519%] (p = 0.41 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low severe
  3 (3.00%) high mild
  8 (8.00%) high severe

mat4_transform_vector3  time:   [1.2667 ns 1.2670 ns 1.2672 ns]
                        change: [-0.3486% -0.0169% +0.2990%] (p = 0.86 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low severe
  5 (5.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

mat3_transform_point2   time:   [670.51 ps 670.60 ps 670.67 ps]
                        change: [-0.6555% -0.1881% +0.2037%] (p = 0.43 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

mat4_transform_point3   time:   [1.2670 ns 1.2671 ns 1.2673 ns]
                        change: [-0.3295% -0.0358% +0.2261%] (p = 0.83 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  7 (7.00%) high mild
  3 (3.00%) high severe

mat3_transform_vector2_x4wide
                        time:   [1.2511 ns 1.2520 ns 1.2534 ns]
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

mat4_transform_vector3_x4wide
                        time:   [1.8466 ns 1.8758 ns 1.9305 ns]
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

mat3_transform_point2_x4wide
                        time:   [1.2511 ns 1.2514 ns 1.2519 ns]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe

mat4_transform_point3_x4wide
                        time:   [1.8463 ns 1.8467 ns 1.8471 ns]
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) low severe
  4 (4.00%) high mild
  3 (3.00%) high severe

mat4_transform_vector3_no_division
                        time:   [1.3203 ns 1.3211 ns 1.3227 ns]
                        change: [-0.1767% +0.3608% +1.0624%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe

$ rustc -vV
rustc 1.89.0 (29483883e 2025-08-04) (Fedora 1.89.0-2.fc42)
binary: rustc
commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
commit-date: 2025-08-04
host: x86_64-unknown-linux-gnu
release: 1.89.0
LLVM version: 20.1.8

Qualcomm Snapdragon SC8280XP:

mat3_transform_vector2  time:   [336.41 ps 336.44 ps 336.47 ps]
                        change: [-0.1865% -0.1207% -0.0586%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low severe
  3 (3.00%) high mild
  3 (3.00%) high severe

mat4_transform_vector3  time:   [672.13 ps 672.51 ps 672.94 ps]
                        change: [+19.334% +19.508% +19.680%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  12 (12.00%) high severe

mat3_transform_point2   time:   [368.87 ps 370.03 ps 371.06 ps]
                        change: [-1.7333% -1.2399% -0.7379%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild

mat4_transform_point3   time:   [672.53 ps 672.63 ps 672.74 ps]
                        change: [+19.161% +19.290% +19.392%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

mat3_transform_vector2_x4wide
                        time:   [1.1753 ns 1.1758 ns 1.1763 ns]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

mat4_transform_vector3_x4wide
                        time:   [1.2727 ns 1.2732 ns 1.2739 ns]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe

mat3_transform_point2_x4wide
                        time:   [1.1764 ns 1.1765 ns 1.1766 ns]
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low severe
  2 (2.00%) high mild
  2 (2.00%) high severe

mat4_transform_point3_x4wide
                        time:   [1.2721 ns 1.2724 ns 1.2728 ns]
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low severe
  5 (5.00%) low mild
  5 (5.00%) high severe

mat4_transform_vector3_no_division
                        time:   [670.60 ps 670.65 ps 670.70 ps]
                        change: [+19.198% +19.363% +19.482%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

$ rustc -vV
rustc 1.89.0 (29483883e 2025-08-04) (Fedora 1.89.0-2.fc42)
binary: rustc
commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
commit-date: 2025-08-04
host: aarch64-unknown-linux-gnu
release: 1.89.0
LLVM version: 20.1.8

Intel i3-1115G4:

# Before
mat3_transform_vector2  time:   [269.47 ps 269.73 ps 269.97 ps]
Found 31 outliers among 100 measurements (31.00%)
  19 (19.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  9 (9.00%) high severe

mat4_transform_vector3  time:   [715.00 ps 715.42 ps 715.80 ps]
Found 34 outliers among 100 measurements (34.00%)
  24 (24.00%) low severe
  2 (2.00%) high mild
  8 (8.00%) high severe

mat3_transform_point2   time:   [269.38 ps 269.67 ps 269.94 ps]
Found 30 outliers among 100 measurements (30.00%)
  21 (21.00%) low severe
  1 (1.00%) high mild
  8 (8.00%) high severe

mat4_transform_point3   time:   [714.56 ps 715.05 ps 715.52 ps]
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

mat4_transform_vector3_no_division
                        time:   [740.11 ps 740.61 ps 741.10 ps]
Found 13 outliers among 100 measurements (13.00%)
  9 (9.00%) low severe
  4 (4.00%) low mild

# After
mat3_transform_vector2  time:   [269.35 ps 269.56 ps 269.79 ps]
Found 32 outliers among 100 measurements (32.00%)
  24 (24.00%) low severe
  3 (3.00%) high mild
  5 (5.00%) high severe

mat4_transform_vector3  time:   [714.96 ps 715.42 ps 715.83 ps]
Found 22 outliers among 100 measurements (22.00%)
  11 (11.00%) low severe
  4 (4.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe

mat3_transform_point2   time:   [270.14 ps 273.05 ps 276.49 ps]
Found 22 outliers among 100 measurements (22.00%)
  5 (5.00%) low severe
  7 (7.00%) low mild
  10 (10.00%) high severe

mat4_transform_point3   time:   [715.12 ps 715.77 ps 716.56 ps]
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  8 (8.00%) high severe

mat3_transform_vector2_x4wide
                        time:   [783.04 ps 783.23 ps 783.41 ps]
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low severe
  2 (2.00%) high mild
  2 (2.00%) high severe

mat4_transform_vector3_x4wide
                        time:   [867.52 ps 879.03 ps 892.57 ps]
Found 23 outliers among 100 measurements (23.00%)
  5 (5.00%) low severe
  5 (5.00%) low mild
  1 (1.00%) high mild
  12 (12.00%) high severe

mat3_transform_point2_x4wide
                        time:   [783.32 ps 784.07 ps 785.11 ps]
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) low severe
  3 (3.00%) high mild
  6 (6.00%) high severe

mat4_transform_point3_x4wide
                        time:   [871.11 ps 872.93 ps 874.44 ps]
Found 20 outliers among 100 measurements (20.00%)
  11 (11.00%) low severe
  7 (7.00%) low mild
  2 (2.00%) high severe

mat4_transform_vector3_no_division
                        time:   [740.93 ps 746.02 ps 752.11 ps]
Found 27 outliers among 100 measurements (27.00%)
  9 (9.00%) low severe
  5 (5.00%) low mild
  13 (13.00%) high severe

$ rustc -vV
rustc 1.88.0 (6b00bc388 2025-06-23) (gentoo)
binary: rustc
commit-hash: 6b00bc3880198600130e1cf62b8f8a93494488cc
commit-date: 2025-06-23
host: x86_64-unknown-linux-gnu
release: 1.88.0
LLVM version: 20.1.7
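
Because the i3-1115G4 results were collected as separate before/after runs, they carry no `change:` lines. A rough comparison can be made by hand from the median estimates; the sketch below does that for the benchmarks present in both runs. `relative_change_percent` is a hypothetical helper, and unlike Criterion's own comparison it yields no confidence interval or p-value, so it only indicates the size of the difference.

```rust
/// Relative change of `after` with respect to `before`, in percent.
/// Positive values mean the "after" run was slower.
fn relative_change_percent(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

fn main() {
    // (benchmark, before, after): median estimates in picoseconds,
    // copied from the i3-1115G4 "Before" and "After" runs above.
    let runs = [
        ("mat3_transform_vector2", 269.73, 269.56),
        ("mat4_transform_vector3", 715.42, 715.42),
        ("mat3_transform_point2", 269.67, 273.05),
        ("mat4_transform_point3", 715.05, 715.77),
        ("mat4_transform_vector3_no_division", 740.61, 746.02),
    ];
    for (name, before, after) in runs {
        println!("{}: {:+.2}%", name, relative_change_percent(before, after));
    }
}
```

All five differences stay within about 1.3%, with `mat3_transform_point2` showing the largest one.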

Intel Core Ultra 5 135H:

mat3_transform_vector2  time:   [229.52 ps 239.71 ps 250.38 ps]
                        change: [+12.450% +18.380% +24.916%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

mat4_transform_vector3  time:   [304.63 ps 316.86 ps 330.10 ps]
                        change: [+0.7303% +7.5166% +15.112%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

mat3_transform_point2   time:   [202.05 ps 208.41 ps 215.52 ps]
                        change: [-1.5674% +5.0112% +12.211%] (p = 0.14 > 0.05)
                        No change in performance detected.

mat4_transform_point3   time:   [298.05 ps 309.35 ps 320.75 ps]
                        change: [-4.2077% +1.6736% +7.5483%] (p = 0.57 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

mat3_transform_vector2_x4wide
                        time:   [535.45 ps 546.86 ps 559.21 ps]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

mat4_transform_vector3_x4wide
                        time:   [569.78 ps 585.57 ps 601.68 ps]

mat3_transform_point2_x4wide
                        time:   [540.24 ps 554.80 ps 570.01 ps]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

mat4_transform_point3_x4wide
                        time:   [547.72 ps 560.44 ps 573.66 ps]
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe

mat4_transform_vector3_no_division
                        time:   [553.34 ps 569.96 ps 587.30 ps]
                        change: [+65.222% +75.055% +85.760%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

rustc 1.89.0 (29483883e 2025-08-04)
binary: rustc
commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
commit-date: 2025-08-04
host: x86_64-pc-windows-msvc
release: 1.89.0
LLVM version: 20.1.7

Given the Turbo Boost-related weirdness and a significant consistent regression on Qualcomm SC8280XP, I think it will be better to just add separate SIMD-supporting functions. Will update this PR soon.

This makes Perspective3::project_vector() consistent with the same
perspective projection applied by just multiplying the underlying matrix
by the source vector converted to homogeneous coordinates.

During benchmarking I found that the default value of `codegen-units`
leads to inconsistent results across recompilations (clean vs.
incremental). Also, it sometimes leads to a significant performance
degradation of benchmarks unrelated to the code changes.

The following benchmarks were performed on Linux after
`cpupower frequency-set --governor performance` and with default
RUSTFLAGS.

AMD Ryzen 9 5950X:

	mat3_transform_vector2  time:   [314.35 ps 314.44 ps 314.56 ps]
	Found 11 outliers among 100 measurements (11.00%)
	  3 (3.00%) high mild
	  8 (8.00%) high severe

	mat4_transform_vector3  time:   [440.44 ps 440.95 ps 441.45 ps]
	Found 13 outliers among 100 measurements (13.00%)
	  1 (1.00%) high mild
	  12 (12.00%) high severe

	mat3_transform_point2   time:   [314.40 ps 314.48 ps 314.60 ps]
	Found 9 outliers among 100 measurements (9.00%)
	  4 (4.00%) high mild
	  5 (5.00%) high severe

	mat4_transform_point3   time:   [436.98 ps 437.03 ps 437.08 ps]
	Found 8 outliers among 100 measurements (8.00%)
	  1 (1.00%) low mild
	  2 (2.00%) high mild
	  5 (5.00%) high severe

	mat3_transform_vector2_x4wide
							time:   [422.74 ps 422.85 ps 422.98 ps]
	Found 9 outliers among 100 measurements (9.00%)
	  1 (1.00%) low mild
	  3 (3.00%) high mild
	  5 (5.00%) high severe

	mat4_transform_vector3_x4wide
							time:   [635.36 ps 635.49 ps 635.63 ps]
	Found 7 outliers among 100 measurements (7.00%)
	  1 (1.00%) low mild
	  4 (4.00%) high mild
	  2 (2.00%) high severe

	mat3_transform_point2_x4wide
							time:   [422.21 ps 422.31 ps 422.47 ps]
	Found 12 outliers among 100 measurements (12.00%)
	  2 (2.00%) low mild
	  4 (4.00%) high mild
	  6 (6.00%) high severe

	mat4_transform_point3_x4wide
							time:   [635.19 ps 635.27 ps 635.37 ps]
	Found 10 outliers among 100 measurements (10.00%)
	  1 (1.00%) low severe
	  4 (4.00%) high mild
	  5 (5.00%) high severe

	mat4_transform_vector3_no_division
							time:   [439.86 ps 439.96 ps 440.07 ps]
	Found 13 outliers among 100 measurements (13.00%)
	  2 (2.00%) low severe
	  1 (1.00%) low mild
	  4 (4.00%) high mild
	  6 (6.00%) high severe

Intel Core i7-8565U:

	mat3_transform_vector2  time:   [294.41 ps 294.52 ps 294.63 ps]
	Found 7 outliers among 100 measurements (7.00%)
	  1 (1.00%) low severe
	  4 (4.00%) high mild
	  2 (2.00%) high severe

	mat4_transform_vector3  time:   [583.88 ps 587.06 ps 592.15 ps]
	Found 20 outliers among 100 measurements (20.00%)
	  4 (4.00%) low severe
	  1 (1.00%) low mild
	  3 (3.00%) high mild
	  12 (12.00%) high severe

	mat3_transform_point2   time:   [309.23 ps 309.52 ps 309.88 ps]
	Found 9 outliers among 100 measurements (9.00%)
	  2 (2.00%) low severe
	  2 (2.00%) high mild
	  5 (5.00%) high severe

	mat4_transform_point3   time:   [557.59 ps 558.26 ps 559.52 ps]
	Found 15 outliers among 100 measurements (15.00%)
	  3 (3.00%) low severe
	  7 (7.00%) low mild
	  3 (3.00%) high mild
	  2 (2.00%) high severe

	mat3_transform_vector2_x4wide
							time:   [557.77 ps 558.22 ps 558.75 ps]
	Found 3 outliers among 100 measurements (3.00%)
	  1 (1.00%) low severe
	  1 (1.00%) low mild
	  1 (1.00%) high severe

	mat4_transform_vector3_x4wide
							time:   [801.89 ps 802.37 ps 802.89 ps]
	Found 8 outliers among 100 measurements (8.00%)
	  1 (1.00%) low severe
	  4 (4.00%) high mild
	  3 (3.00%) high severe

	mat3_transform_point2_x4wide
							time:   [574.76 ps 575.10 ps 575.44 ps]
	Found 12 outliers among 100 measurements (12.00%)
	  2 (2.00%) low severe
	  3 (3.00%) low mild
	  1 (1.00%) high mild
	  6 (6.00%) high severe

	mat4_transform_point3_x4wide
							time:   [801.65 ps 802.63 ps 803.94 ps]
	Found 9 outliers among 100 measurements (9.00%)
	  1 (1.00%) low severe
	  2 (2.00%) low mild
	  2 (2.00%) high mild
	  4 (4.00%) high severe

	mat4_transform_vector3_no_division
							time:   [568.86 ps 569.60 ps 570.52 ps]
	Found 3 outliers among 100 measurements (3.00%)
	  2 (2.00%) high mild
	  1 (1.00%) high severe

Apple M2 Max:

	mat3_transform_vector2  time:   [316.93 ps 317.90 ps 318.91 ps]
	Found 3 outliers among 100 measurements (3.00%)
	  2 (2.00%) high mild
	  1 (1.00%) high severe

	mat4_transform_vector3  time:   [383.72 ps 383.77 ps 383.83 ps]
	Found 9 outliers among 100 measurements (9.00%)
	  5 (5.00%) high mild
	  4 (4.00%) high severe

	mat3_transform_point2   time:   [317.60 ps 318.44 ps 319.31 ps]
	Found 3 outliers among 100 measurements (3.00%)
	  1 (1.00%) low mild
	  2 (2.00%) high mild

	mat4_transform_point3   time:   [383.75 ps 383.78 ps 383.81 ps]
	Found 4 outliers among 100 measurements (4.00%)
	  1 (1.00%) high mild
	  3 (3.00%) high severe

	mat3_transform_vector2_x4wide
							time:   [308.93 ps 309.36 ps 309.80 ps]
	Found 2 outliers among 100 measurements (2.00%)
	  1 (1.00%) high mild
	  1 (1.00%) high severe

	mat4_transform_vector3_x4wide
							time:   [460.46 ps 460.50 ps 460.55 ps]
	Found 16 outliers among 100 measurements (16.00%)
	  6 (6.00%) low mild
	  5 (5.00%) high mild
	  5 (5.00%) high severe

	mat3_transform_point2_x4wide
							time:   [308.88 ps 309.27 ps 309.69 ps]
	Found 2 outliers among 100 measurements (2.00%)
	  2 (2.00%) high mild

	mat4_transform_point3_x4wide
							time:   [460.48 ps 460.52 ps 460.57 ps]
	Found 5 outliers among 100 measurements (5.00%)
	  3 (3.00%) high mild
	  2 (2.00%) high severe

	mat4_transform_vector3_no_division
							time:   [383.77 ps 383.86 ps 383.98 ps]
	Found 9 outliers among 100 measurements (9.00%)
	  4 (4.00%) high mild
	  5 (5.00%) high severe

Broadcom BCM2711 (Raspberry Pi 4):

	mat3_transform_vector2  time:   [1.1169 ns 1.1172 ns 1.1175 ns]
	Found 11 outliers among 100 measurements (11.00%)
	  4 (4.00%) high mild
	  7 (7.00%) high severe

	mat4_transform_vector3  time:   [1.6819 ns 1.6821 ns 1.6824 ns]
	Found 11 outliers among 100 measurements (11.00%)
	  8 (8.00%) high mild
	  3 (3.00%) high severe

	mat3_transform_point2   time:   [1.1168 ns 1.1169 ns 1.1171 ns]
	Found 10 outliers among 100 measurements (10.00%)
	  7 (7.00%) high mild
	  3 (3.00%) high severe

	mat4_transform_point3   time:   [1.6818 ns 1.6820 ns 1.6823 ns]
	Found 8 outliers among 100 measurements (8.00%)
	  5 (5.00%) high mild
	  3 (3.00%) high severe

	mat3_transform_vector2_x4wide
							time:   [2.2615 ns 2.2619 ns 2.2624 ns]
	Found 8 outliers among 100 measurements (8.00%)
	  5 (5.00%) high mild
	  3 (3.00%) high severe

	mat4_transform_vector3_x4wide
							time:   [3.3941 ns 3.3947 ns 3.3954 ns]
	Found 4 outliers among 100 measurements (4.00%)
	  2 (2.00%) high mild
	  2 (2.00%) high severe

	mat3_transform_point2_x4wide
							time:   [2.2615 ns 2.2619 ns 2.2622 ns]
	Found 8 outliers among 100 measurements (8.00%)
	  7 (7.00%) high mild
	  1 (1.00%) high severe

	mat4_transform_point3_x4wide
							time:   [3.3943 ns 3.3957 ns 3.3973 ns]
	Found 11 outliers among 100 measurements (11.00%)
	  1 (1.00%) low mild
	  4 (4.00%) high mild
	  6 (6.00%) high severe

	mat4_transform_vector3_no_division
							time:   [1.9711 ns 1.9719 ns 1.9728 ns]
	Found 11 outliers among 100 measurements (11.00%)
	  5 (5.00%) high mild
	  6 (6.00%) high severe
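
To compare the scalar and `_x4wide` rows, the timings can be normalized per element. The sketch below does this for the `mat4_transform_point3` median values listed above. It assumes that one `_x4wide` iteration transforms four elements and one scalar iteration transforms a single element; `per_element_speedup` is a hypothetical helper, and the figures are taken at face value from the logs.

```rust
/// Per-element speedup of a wide (SIMD) benchmark over its scalar counterpart.
/// Assumption: one wide iteration processes `lanes` elements.
fn per_element_speedup(scalar_ps: f64, wide_ps: f64, lanes: u32) -> f64 {
    let wide_per_element_ps = wide_ps / f64::from(lanes);
    scalar_ps / wide_per_element_ps
}

fn main() {
    const LANES: u32 = 4;
    // (CPU, mat4_transform_point3 median in ps, mat4_transform_point3_x4wide median in ps)
    let rows = [
        ("AMD Ryzen 9 5950X", 437.03, 635.27),
        ("Intel Core i7-8565U", 558.26, 802.63),
        ("Apple M2 Max", 383.78, 460.52),
        ("Broadcom BCM2711", 1682.0, 3395.7),
    ];
    for (cpu, scalar_ps, wide_ps) in rows {
        println!("{cpu}: {:.2}x", per_element_speedup(scalar_ps, wide_ps, LANES));
    }
}
```

A value of 4.0 would mean the wide variant costs the same per iteration as the scalar one; values below 4.0 reflect the extra per-iteration cost of the wide version.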

rustc -vV

	rustc 1.89.0 (29483883e 2025-08-04) (Fedora 1.89.0-2.fc42)
	binary: rustc
	commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
	commit-date: 2025-08-04
	host: x86_64-unknown-linux-gnu
	release: 1.89.0
	LLVM version: 20.1.8

	rustc 1.89.0 (29483883e 2025-08-04) (Fedora 1.89.0-2.fc42)
	binary: rustc
	commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
	commit-date: 2025-08-04
	host: aarch64-unknown-linux-gnu
	release: 1.89.0
	LLVM version: 20.1.8

im-0 force-pushed the improve-cg-transform-and-project branch from c6e5a14 to fbba657 on September 10, 2025 23:09

im-0 commented Sep 10, 2025

Done! Still requires Simba with dimforge/simba#76


im-0 commented Sep 13, 2025

FYI: I filed a Rust issue about performance regressions with fat LTO and default codegen-units: rust-lang/rust#146497


im-0 commented Sep 23, 2025

All the benchmarks that I did are invalid because of this: #1547 😕
