SIMD CG transform and related improvements #1543

im-0 · 2025-09-09T16:12:03Z

Hi!

I am trying to write a software 3D rasterizer and want to use nalgebra with SIMD as a math library. Here I tried to add AoSoA SIMD support for Matrix4::transform_point() and friends. I also benchmarked my modifications to make sure that I didn't regress anything, and to make sure that SIMD support makes sense here at all.
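For readers unfamiliar with the AoSoA (array of structures of arrays) layout mentioned above, here is a minimal sketch in plain Rust. It is not the code from this PR and does not use nalgebra or simba: the names `Lanes`, `Point3x4`, `affine_row` and `transform_point_x4` are hypothetical, a plain `[f32; 4]` stands in for a real SIMD type, the matrix is row-major for readability, and the homogeneous divide assumes a non-zero `w` in every lane.

```rust
/// Four f32 lanes. This stands in for a real SIMD type (assumption for
/// illustration only; the PR itself works with nalgebra/simba types).
type Lanes = [f32; 4];

/// Four 3D points stored AoSoA-style: one lane array per coordinate,
/// so lane `i` of `x`, `y` and `z` together forms point `i`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Point3x4 {
    x: Lanes,
    y: Lanes,
    z: Lanes,
}

/// Hypothetical helper: evaluates `r0*x + r1*y + r2*z + r3` in every lane,
/// i.e. one matrix row applied to four points with an implicit w = 1.
fn affine_row(row: [f32; 4], p: &Point3x4) -> Lanes {
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        out[i] = row[0] * p.x[i] + row[1] * p.y[i] + row[2] * p.z[i] + row[3];
    }
    out
}

/// Transforms four points at once by a row-major 4x4 homogeneous matrix,
/// then divides by the resulting w (assumed non-zero in this sketch).
fn transform_point_x4(m: &[[f32; 4]; 4], p: &Point3x4) -> Point3x4 {
    let mut x = affine_row(m[0], p);
    let mut y = affine_row(m[1], p);
    let mut z = affine_row(m[2], p);
    let w = affine_row(m[3], p);
    for i in 0..4 {
        x[i] /= w[i];
        y[i] /= w[i];
        z[i] /= w[i];
    }
    Point3x4 { x, y, z }
}
```

With a real SIMD type, each per-lane loop would collapse into a handful of vector instructions, which is what makes a single call that transforms four points attractive for a rasterizer.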

Benchmarks were performed on various CPUs that I have. Here are the results:


Benchmark results on AMD Ryzen 9 5950X, Linux, compared to previous one:

$ RUSTFLAGS="-C target-cpu=x86-64-v2 -C target-feature=+avx" bench --all-features --bench nalgebra_bench -- _transform_
mat3_transform_vector2  time:   [317.64 ps 317.78 ps 318.02 ps]
						change: [+0.4783% +0.5229% +0.5801%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

mat4_transform_vector3  time:   [441.89 ps 442.50 ps 443.11 ps]
						change: [+0.2644% +0.3559% +0.4672%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

mat3_transform_point2   time:   [316.78 ps 317.11 ps 317.45 ps]
						change: [+0.3692% +0.4728% +0.5758%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

mat4_transform_point3   time:   [443.89 ps 443.96 ps 444.04 ps]
						change: [+0.9537% +1.0151% +1.0630%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

mat3_transform_vector2_x4wide
						time:   [425.44 ps 425.70 ps 425.89 ps]

mat4_transform_vector3_x4wide
						time:   [646.75 ps 646.85 ps 647.00 ps]
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

mat3_transform_point2_x4wide
						time:   [422.59 ps 422.71 ps 422.87 ps]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) high mild
  7 (7.00%) high severe

mat4_transform_point3_x4wide
						time:   [636.52 ps 636.61 ps 636.70 ps]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

mat4_transform_vector3_no_division
						time:   [443.83 ps 443.97 ps 444.13 ps]
						change: [-0.1875% -0.1171% -0.0495%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  7 (7.00%) high mild
  2 (2.00%) high severe

$ bench --all-features --bench nalgebra_bench -- _transform_
mat3_transform_vector2  time:   [316.72 ps 316.99 ps 317.22 ps]
						change: [-0.2793% -0.1789% -0.0874%] (p = 0.00 < 0.05)
						Change within noise threshold.

mat4_transform_vector3  time:   [441.41 ps 441.49 ps 441.58 ps]
						change: [+0.1228% +0.5653% +0.8103%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) low severe
  3 (3.00%) high mild
  6 (6.00%) high severe

mat3_transform_point2   time:   [317.29 ps 317.52 ps 317.74 ps]
						change: [+0.1923% +0.2853% +0.3731%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) low severe
  1 (1.00%) high mild
  6 (6.00%) high severe

mat4_transform_point3   time:   [441.69 ps 441.79 ps 441.91 ps]
						change: [-0.0066% +0.1238% +0.2511%] (p = 0.06 > 0.05)
						No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

mat3_transform_vector2_x4wide
						time:   [430.98 ps 431.02 ps 431.06 ps]
						change: [+1.5248% +1.6230% +1.7210%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

mat4_transform_vector3_x4wide
						time:   [646.64 ps 646.72 ps 646.81 ps]
						change: [-0.0589% -0.0249% +0.0012%] (p = 0.11 > 0.05)
						No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

mat3_transform_point2_x4wide
						time:   [431.82 ps 431.89 ps 431.96 ps]
						change: [+2.0913% +2.1413% +2.1815%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high severe

mat4_transform_point3_x4wide
						time:   [646.37 ps 646.41 ps 646.46 ps]
						change: [+1.5411% +1.5729% +1.6038%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe

mat4_transform_vector3_no_division
						time:   [443.87 ps 444.10 ps 444.31 ps]
						change: [-0.6030% -0.4238% -0.2411%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

Benchmark results on Intel i7-8565U, Linux, compared to previous one:

$ RUSTFLAGS="-C target-cpu=x86-64-v2 -C target-feature=+avx" bench --all-features --bench nalgebra_bench -- _transform_
mat3_transform_vector2  time:   [293.12 ps 295.95 ps 298.98 ps]
						change: [+1.4817% +2.1154% +2.6804%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  15 (15.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high severe

mat4_transform_vector3  time:   [540.54 ps 540.82 ps 541.10 ps]
						change: [-8.7714% -8.5103% -8.2102%] (p = 0.00 < 0.05)
						Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

mat3_transform_point2   time:   [305.57 ps 305.91 ps 306.43 ps]
						change: [+4.2420% +4.6490% +4.9390%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

mat4_transform_point3   time:   [546.64 ps 546.97 ps 547.39 ps]
						change: [-7.5737% -7.2913% -7.0061%] (p = 0.00 < 0.05)
						Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

mat3_transform_vector2_x4wide
						time:   [499.51 ps 499.85 ps 500.30 ps]
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  8 (8.00%) high severe

mat4_transform_vector3_x4wide
						time:   [774.40 ps 775.65 ps 777.02 ps]
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

mat3_transform_point2_x4wide
						time:   [526.26 ps 529.71 ps 535.84 ps]
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe

mat4_transform_point3_x4wide
						time:   [796.16 ps 796.74 ps 797.42 ps]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

mat4_transform_vector3_no_division
						time:   [529.04 ps 530.22 ps 531.70 ps]
						change: [-15.420% -13.707% -12.683%] (p = 0.00 < 0.05)
						Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  7 (7.00%) high mild
  3 (3.00%) high severe

$ bench --all-features --bench nalgebra_bench -- _transform_
mat3_transform_vector2  time:   [307.16 ps 307.34 ps 307.55 ps]
						change: [+4.7516% +5.1517% +5.5680%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high severe

mat4_transform_vector3  time:   [575.44 ps 576.15 ps 576.97 ps]
						change: [+5.2866% +5.6903% +6.0084%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild

mat3_transform_point2   time:   [301.50 ps 301.79 ps 302.15 ps]
						change: [+2.2124% +2.5777% +2.9744%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low severe
  4 (4.00%) high mild
  6 (6.00%) high severe

mat4_transform_point3   time:   [584.46 ps 584.88 ps 585.45 ps]
						change: [-4.1562% -3.8892% -3.5962%] (p = 0.00 < 0.05)
						Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild
  9 (9.00%) high severe

mat3_transform_vector2_x4wide
						time:   [530.59 ps 536.69 ps 544.38 ps]
						change: [+6.3284% +6.9464% +7.6364%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

mat4_transform_vector3_x4wide
						time:   [813.42 ps 814.31 ps 815.26 ps]
						change: [+4.3840% +4.9097% +5.3254%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild

mat3_transform_point2_x4wide
						time:   [523.32 ps 523.71 ps 524.08 ps]
						change: [-5.0939% -2.1361% -0.3320%] (p = 0.08 > 0.05)
						No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

mat4_transform_point3_x4wide
						time:   [842.23 ps 842.86 ps 843.56 ps]
						change: [+5.5553% +5.9587% +6.2648%] (p = 0.00 < 0.05)
						Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low severe
  1 (1.00%) high severe

mat4_transform_vector3_no_division
						time:   [566.75 ps 567.19 ps 567.60 ps]
						change: [-1.3344% -0.9935% -0.4433%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

Benchmark results on Apple M2 Max, Linux, compared to previous one:

$ cargo bench --all-features --bench nalgebra_bench -- _transform_
mat3_transform_vector2  time:   [315.56 ps 316.70 ps 317.92 ps]
						change: [-0.6503% -0.0844% +0.4491%] (p = 0.76 > 0.05)
						No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

mat4_transform_vector3  time:   [383.68 ps 383.82 ps 384.07 ps]
						change: [-0.0143% +0.1640% +0.4628%] (p = 0.24 > 0.05)
						No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe

mat3_transform_point2   time:   [318.32 ps 318.98 ps 319.63 ps]
						change: [+0.9643% +1.4702% +2.0348%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high severe

mat4_transform_point3   time:   [384.04 ps 384.21 ps 384.43 ps]
						change: [+0.1676% +0.2195% +0.2713%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

mat3_transform_vector2_x4wide
						time:   [309.13 ps 309.56 ps 310.00 ps]
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

mat4_transform_vector3_x4wide
						time:   [460.39 ps 460.46 ps 460.56 ps]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

mat3_transform_point2_x4wide
						time:   [308.68 ps 309.09 ps 309.52 ps]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

mat4_transform_point3_x4wide
						time:   [460.38 ps 460.42 ps 460.46 ps]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

mat4_transform_vector3_no_division
						time:   [383.69 ps 383.73 ps 383.78 ps]
						change: [-0.0192% +0.0117% +0.0459%] (p = 0.51 > 0.05)
						No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

Benchmark results on Broadcom BCM2711 (Raspberry Pi 4), Linux, compared to previous one:

mat3_transform_vector2  time:   [1.1179 ns 1.1185 ns 1.1191 ns]
						change: [+0.1194% +0.2209% +0.3215%] (p = 0.00 < 0.05)
						Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe

mat4_transform_vector3  time:   [1.6815 ns 1.6817 ns 1.6819 ns]
						change: [-0.0434% -0.0159% +0.0102%] (p = 0.26 > 0.05)
						No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

mat3_transform_point2   time:   [1.1170 ns 1.1174 ns 1.1179 ns]
						change: [-0.0609% +0.0024% +0.0661%] (p = 0.94 > 0.05)
						No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

mat4_transform_point3   time:   [1.6817 ns 1.6819 ns 1.6821 ns]
						change: [-0.0088% +0.0325% +0.0857%] (p = 0.20 > 0.05)
						No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

mat3_transform_vector2_x4wide
						time:   [2.2614 ns 2.2618 ns 2.2622 ns]
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

mat4_transform_vector3_x4wide
						time:   [3.3940 ns 3.3960 ns 3.3993 ns]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

mat3_transform_point2_x4wide
						time:   [2.2614 ns 2.2617 ns 2.2619 ns]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

mat4_transform_point3_x4wide
						time:   [3.3941 ns 3.3954 ns 3.3970 ns]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

mat4_transform_vector3_no_division
						time:   [1.9712 ns 1.9720 ns 1.9729 ns]
						change: [-0.0164% +0.0278% +0.0815%] (p = 0.26 > 0.05)
						No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

The results are mostly expected: there are no significant regressions in non-SIMD benchmarks; SIMD makes sense even on the potato CPU of Raspberry Pi 4. The only exception is the older Intel CPU (i7-8565U), where transform_vector() and transform_point() slightly regressed for a 2D case, but improved for a 3D case.
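As a rough way to read the numbers above, the sketch below converts a scalar time and an `_x4wide` time into a per-point speedup. It assumes, from the benchmark names only, that each `_x4wide` iteration transforms four points while each scalar iteration transforms one; `x4wide_speedup` is a hypothetical helper, not part of the benchmark suite.

```rust
/// Estimated per-point speedup of a 4-wide benchmark over its scalar twin.
/// Assumption (not confirmed in the thread): one `_x4wide` iteration handles
/// 4 points, one scalar iteration handles 1. Both times must share a unit.
fn x4wide_speedup(scalar_time: f64, wide_time: f64) -> f64 {
    const LANES: f64 = 4.0;
    (scalar_time * LANES) / wide_time
}
```

Under that assumption, the Ryzen 9 5950X medians for mat4_transform_point3 with the AVX flags (443.96 ps scalar, 636.61 ps 4-wide) give `x4wide_speedup(443.96, 636.61)`, roughly 2.8x, and the Raspberry Pi 4 medians (1.6819 ns scalar, 3.3954 ns 4-wide) give roughly 2.0x. Times this small also include measurement overhead, so these ratios are only indicative.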

It is also possible to leave existing functions as is, and add new SIMD-specific functions instead.

Other semi-related changes included in this PR:

Perspective3::project_vector() fixed so that result matches the result of matrix multiplication.
Added tests for {Orthographic3, Perspective3}::project_vector() (mainly to understand what exactly Perspective3 projection does for a vector).
Added codegen-units = 1 for benchmarks, as otherwise results are less consistent and change after unrelated code changes.
Improved documentation for Matrix*::transform*() and Perspective3::project_*().
Added tests for Matrix*::transform_*().

P.S.: Added SIMD benchmarks require simba with this PR merged: dimforge/simba#76

im-0 · 2025-09-09T16:26:59Z

If you decide to merge this PR, please merge it without squashing it into a single commit. A single commit would make benchmarking before and after the changes much more difficult.

im-0 · 2025-09-10T20:45:17Z

I asked my friends to run the same benchmarks on other CPUs (preferably Intel) and got some interesting results. It turned out that some benchmarks regressed and some improved on all tested Intel CPUs. The regressions were more or less stable across runs, but the improvements were seemingly random. Then we tried disabling Intel Turbo Boost, and this "fixed" both the regressions and the improvements.

My benchmarking routine is basically the following:

cpupower frequency-set --governor performance
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
git checkout $BEFORE_CHANGES
RUSTFLAGS="-C target-cpu=x86-64-v2 -C target-feature=+avx" cargo bench --all-features --bench nalgebra_bench -- --save-baseline base _transform_
git checkout $AFTER_CHANGES
RUSTFLAGS="-C target-cpu=x86-64-v2 -C target-feature=+avx" cargo bench --all-features --bench nalgebra_bench -- --baseline-lenient base _transform_


Intel i9-9900K:

mat3_transform_vector2  time:   [331.96 ps 335.43 ps 338.46 ps]
                        change: [-1.5766% -0.5575% +0.4401%] (p = 0.29 > 0.05)
                        No change in performance detected.
Found 20 outliers among 100 measurements (20.00%)
  15 (15.00%) low severe
  5 (5.00%) low mild

mat4_transform_vector3  time:   [633.70 ps 634.00 ps 634.22 ps]
                        change: [-0.1789% +0.1501% +0.4816%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
  9 (9.00%) low severe
  4 (4.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

mat3_transform_point2   time:   [333.65 ps 334.05 ps 334.41 ps]
                        change: [-0.6090% -0.2228% +0.1225%] (p = 0.25 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild

mat4_transform_point3   time:   [632.13 ps 632.38 ps 632.60 ps]
                        change: [-0.1452% +0.1086% +0.4231%] (p = 0.54 > 0.05)
                        No change in performance detected.
Found 19 outliers among 100 measurements (19.00%)
  9 (9.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

mat3_transform_vector2_x4wide
                        time:   [623.93 ps 624.03 ps 624.11 ps]
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

mat4_transform_vector3_x4wide
                        time:   [921.04 ps 921.18 ps 921.30 ps]
Found 16 outliers among 100 measurements (16.00%)
  7 (7.00%) low severe
  4 (4.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

mat3_transform_point2_x4wide
                        time:   [622.56 ps 622.84 ps 623.10 ps]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild

mat4_transform_point3_x4wide
                        time:   [921.65 ps 921.76 ps 921.87 ps]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) low severe

mat4_transform_vector3_no_division
                        time:   [658.94 ps 658.98 ps 659.03 ps]
                        change: [-0.4661% -0.0747% +0.3524%] (p = 0.77 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild

$ rustc -vV
rustc 1.89.0 (29483883e 2025-08-04) (Fedora 1.89.0-2.fc41)
binary: rustc
commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
commit-date: 2025-08-04
host: x86_64-unknown-linux-gnu
release: 1.89.0
LLVM version: 19.1.7

Intel i7-8565U:

mat3_transform_vector2  time:   [670.49 ps 670.56 ps 670.65 ps]
                        change: [-1.0910% -0.3353% +0.2519%] (p = 0.41 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low severe
  3 (3.00%) high mild
  8 (8.00%) high severe

mat4_transform_vector3  time:   [1.2667 ns 1.2670 ns 1.2672 ns]
                        change: [-0.3486% -0.0169% +0.2990%] (p = 0.86 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low severe
  5 (5.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

mat3_transform_point2   time:   [670.51 ps 670.60 ps 670.67 ps]
                        change: [-0.6555% -0.1881% +0.2037%] (p = 0.43 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

mat4_transform_point3   time:   [1.2670 ns 1.2671 ns 1.2673 ns]
                        change: [-0.3295% -0.0358% +0.2261%] (p = 0.83 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  7 (7.00%) high mild
  3 (3.00%) high severe

mat3_transform_vector2_x4wide
                        time:   [1.2511 ns 1.2520 ns 1.2534 ns]
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

mat4_transform_vector3_x4wide
                        time:   [1.8466 ns 1.8758 ns 1.9305 ns]
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

mat3_transform_point2_x4wide
                        time:   [1.2511 ns 1.2514 ns 1.2519 ns]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe

mat4_transform_point3_x4wide
                        time:   [1.8463 ns 1.8467 ns 1.8471 ns]
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) low severe
  4 (4.00%) high mild
  3 (3.00%) high severe

mat4_transform_vector3_no_division
                        time:   [1.3203 ns 1.3211 ns 1.3227 ns]
                        change: [-0.1767% +0.3608% +1.0624%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe

$ rustc -vV
rustc 1.89.0 (29483883e 2025-08-04) (Fedora 1.89.0-2.fc42)
binary: rustc
commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
commit-date: 2025-08-04
host: x86_64-unknown-linux-gnu
release: 1.89.0
LLVM version: 20.1.8

Qualcomm Snapdragon SC8280XP:

mat3_transform_vector2  time:   [336.41 ps 336.44 ps 336.47 ps]
                        change: [-0.1865% -0.1207% -0.0586%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low severe
  3 (3.00%) high mild
  3 (3.00%) high severe

mat4_transform_vector3  time:   [672.13 ps 672.51 ps 672.94 ps]
                        change: [+19.334% +19.508% +19.680%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  12 (12.00%) high severe

mat3_transform_point2   time:   [368.87 ps 370.03 ps 371.06 ps]
                        change: [-1.7333% -1.2399% -0.7379%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild

mat4_transform_point3   time:   [672.53 ps 672.63 ps 672.74 ps]
                        change: [+19.161% +19.290% +19.392%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

mat3_transform_vector2_x4wide
                        time:   [1.1753 ns 1.1758 ns 1.1763 ns]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

mat4_transform_vector3_x4wide
                        time:   [1.2727 ns 1.2732 ns 1.2739 ns]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe

mat3_transform_point2_x4wide
                        time:   [1.1764 ns 1.1765 ns 1.1766 ns]
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low severe
  2 (2.00%) high mild
  2 (2.00%) high severe

mat4_transform_point3_x4wide
                        time:   [1.2721 ns 1.2724 ns 1.2728 ns]
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low severe
  5 (5.00%) low mild
  5 (5.00%) high severe

mat4_transform_vector3_no_division
                        time:   [670.60 ps 670.65 ps 670.70 ps]
                        change: [+19.198% +19.363% +19.482%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

$ rustc -vV
rustc 1.89.0 (29483883e 2025-08-04) (Fedora 1.89.0-2.fc42)
binary: rustc
commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
commit-date: 2025-08-04
host: aarch64-unknown-linux-gnu
release: 1.89.0
LLVM version: 20.1.8

Intel i3-1115G4:

# Before
mat3_transform_vector2  time:   [269.47 ps 269.73 ps 269.97 ps]
Found 31 outliers among 100 measurements (31.00%)
  19 (19.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  9 (9.00%) high severe

mat4_transform_vector3  time:   [715.00 ps 715.42 ps 715.80 ps]
Found 34 outliers among 100 measurements (34.00%)
  24 (24.00%) low severe
  2 (2.00%) high mild
  8 (8.00%) high severe

mat3_transform_point2   time:   [269.38 ps 269.67 ps 269.94 ps]
Found 30 outliers among 100 measurements (30.00%)
  21 (21.00%) low severe
  1 (1.00%) high mild
  8 (8.00%) high severe

mat4_transform_point3   time:   [714.56 ps 715.05 ps 715.52 ps]
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

mat4_transform_vector3_no_division
                        time:   [740.11 ps 740.61 ps 741.10 ps]
Found 13 outliers among 100 measurements (13.00%)
  9 (9.00%) low severe
  4 (4.00%) low mild

# After
mat3_transform_vector2  time:   [269.35 ps 269.56 ps 269.79 ps]
Found 32 outliers among 100 measurements (32.00%)
  24 (24.00%) low severe
  3 (3.00%) high mild
  5 (5.00%) high severe

mat4_transform_vector3  time:   [714.96 ps 715.42 ps 715.83 ps]
Found 22 outliers among 100 measurements (22.00%)
  11 (11.00%) low severe
  4 (4.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe

mat3_transform_point2   time:   [270.14 ps 273.05 ps 276.49 ps]
Found 22 outliers among 100 measurements (22.00%)
  5 (5.00%) low severe
  7 (7.00%) low mild
  10 (10.00%) high severe

mat4_transform_point3   time:   [715.12 ps 715.77 ps 716.56 ps]
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  8 (8.00%) high severe

mat3_transform_vector2_x4wide
                        time:   [783.04 ps 783.23 ps 783.41 ps]
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low severe
  2 (2.00%) high mild
  2 (2.00%) high severe

mat4_transform_vector3_x4wide
                        time:   [867.52 ps 879.03 ps 892.57 ps]
Found 23 outliers among 100 measurements (23.00%)
  5 (5.00%) low severe
  5 (5.00%) low mild
  1 (1.00%) high mild
  12 (12.00%) high severe

mat3_transform_point2_x4wide
                        time:   [783.32 ps 784.07 ps 785.11 ps]
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) low severe
  3 (3.00%) high mild
  6 (6.00%) high severe

mat4_transform_point3_x4wide
                        time:   [871.11 ps 872.93 ps 874.44 ps]
Found 20 outliers among 100 measurements (20.00%)
  11 (11.00%) low severe
  7 (7.00%) low mild
  2 (2.00%) high severe

mat4_transform_vector3_no_division
                        time:   [740.93 ps 746.02 ps 752.11 ps]
Found 27 outliers among 100 measurements (27.00%)
  9 (9.00%) low severe
  5 (5.00%) low mild
  13 (13.00%) high severe

fr0@calculate ~/nalgebra $ rustc -vV
rustc 1.88.0 (6b00bc388 2025-06-23) (gentoo)
binary: rustc
commit-hash: 6b00bc3880198600130e1cf62b8f8a93494488cc
commit-date: 2025-06-23
host: x86_64-unknown-linux-gnu
release: 1.88.0
LLVM version: 20.1.7

Intel Core Ultra 5 135H:

mat3_transform_vector2  time:   [229.52 ps 239.71 ps 250.38 ps]
                        change: [+12.450% +18.380% +24.916%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

mat4_transform_vector3  time:   [304.63 ps 316.86 ps 330.10 ps]
                        change: [+0.7303% +7.5166% +15.112%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

mat3_transform_point2   time:   [202.05 ps 208.41 ps 215.52 ps]
                        change: [-1.5674% +5.0112% +12.211%] (p = 0.14 > 0.05)
                        No change in performance detected.

mat4_transform_point3   time:   [298.05 ps 309.35 ps 320.75 ps]
                        change: [-4.2077% +1.6736% +7.5483%] (p = 0.57 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

mat3_transform_vector2_x4wide
                        time:   [535.45 ps 546.86 ps 559.21 ps]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

mat4_transform_vector3_x4wide
                        time:   [569.78 ps 585.57 ps 601.68 ps]

mat3_transform_point2_x4wide
                        time:   [540.24 ps 554.80 ps 570.01 ps]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

mat4_transform_point3_x4wide
                        time:   [547.72 ps 560.44 ps 573.66 ps]
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe

mat4_transform_vector3_no_division
                        time:   [553.34 ps 569.96 ps 587.30 ps]
                        change: [+65.222% +75.055% +85.760%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

rustc 1.89.0 (29483883e 2025-08-04)
binary: rustc
commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
commit-date: 2025-08-04
host: x86_64-pc-windows-msvc
release: 1.89.0
LLVM version: 20.1.7

Given the Turbo Boost-related weirdness and a significant consistent regression on Qualcomm SC8280XP, I think it will be better to just add separate SIMD-supporting functions. Will update this PR soon.

This makes Perspective3::project_vector() consistent with the same perspective projection applied by multiplying the underlying matrix by the source vector converted to homogeneous coordinates.

During benchmarking I found that leaving `codegen-units` at its default value leads to inconsistent results across recompilations (clean vs. incremental). It also sometimes leads to a significant performance degradation in benchmarks unrelated to the code changes.

The following benchmarks were performed on Linux after `cpupower frequency-set --governor performance` and with the default RUSTFLAGS.

AMD Ryzen 9 5950X:

mat3_transform_vector2  time:   [314.35 ps 314.44 ps 314.56 ps]
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

mat4_transform_vector3  time:   [440.44 ps 440.95 ps 441.45 ps]
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) high mild
  12 (12.00%) high severe

mat3_transform_point2   time:   [314.40 ps 314.48 ps 314.60 ps]
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

mat4_transform_point3   time:   [436.98 ps 437.03 ps 437.08 ps]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

mat3_transform_vector2_x4wide
						time:   [422.74 ps 422.85 ps 422.98 ps]
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

mat4_transform_vector3_x4wide
						time:   [635.36 ps 635.49 ps 635.63 ps]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

mat3_transform_point2_x4wide
						time:   [422.21 ps 422.31 ps 422.47 ps]
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  6 (6.00%) high severe

mat4_transform_point3_x4wide
						time:   [635.19 ps 635.27 ps 635.37 ps]
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild
  5 (5.00%) high severe

mat4_transform_vector3_no_division
						time:   [439.86 ps 439.96 ps 440.07 ps]
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  6 (6.00%) high severe

Intel Core i7-8565U:

mat3_transform_vector2  time:   [294.41 ps 294.52 ps 294.63 ps]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild
  2 (2.00%) high severe

mat4_transform_vector3  time:   [583.88 ps 587.06 ps 592.15 ps]
Found 20 outliers among 100 measurements (20.00%)
  4 (4.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  12 (12.00%) high severe

mat3_transform_point2   time:   [309.23 ps 309.52 ps 309.88 ps]
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low severe
  2 (2.00%) high mild
  5 (5.00%) high severe

mat4_transform_point3   time:   [557.59 ps 558.26 ps 559.52 ps]
Found 15 outliers among 100 measurements (15.00%)
  3 (3.00%) low severe
  7 (7.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

mat3_transform_vector2_x4wide
						time:   [557.77 ps 558.22 ps 558.75 ps]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high severe

mat4_transform_vector3_x4wide
						time:   [801.89 ps 802.37 ps 802.89 ps]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild
  3 (3.00%) high severe

mat3_transform_point2_x4wide
						time:   [574.76 ps 575.10 ps 575.44 ps]
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild
  6 (6.00%) high severe

mat4_transform_point3_x4wide
						time:   [801.65 ps 802.63 ps 803.94 ps]
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe

mat4_transform_vector3_no_division
						time:   [568.86 ps 569.60 ps 570.52 ps]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Apple M2 Max:

mat3_transform_vector2  time:   [316.93 ps 317.90 ps 318.91 ps]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

mat4_transform_vector3  time:   [383.72 ps 383.77 ps 383.83 ps]
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

mat3_transform_point2   time:   [317.60 ps 318.44 ps 319.31 ps]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

mat4_transform_point3   time:   [383.75 ps 383.78 ps 383.81 ps]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

mat3_transform_vector2_x4wide
						time:   [308.93 ps 309.36 ps 309.80 ps]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

mat4_transform_vector3_x4wide
						time:   [460.46 ps 460.50 ps 460.55 ps]
Found 16 outliers among 100 measurements (16.00%)
  6 (6.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

mat3_transform_point2_x4wide
						time:   [308.88 ps 309.27 ps 309.69 ps]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

mat4_transform_point3_x4wide
						time:   [460.48 ps 460.52 ps 460.57 ps]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

mat4_transform_vector3_no_division
						time:   [383.77 ps 383.86 ps 383.98 ps]
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

Broadcom BCM2711 (Raspberry Pi 4):

mat3_transform_vector2  time:   [1.1169 ns 1.1172 ns 1.1175 ns]
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

mat4_transform_vector3  time:   [1.6819 ns 1.6821 ns 1.6824 ns]
Found 11 outliers among 100 measurements (11.00%)
  8 (8.00%) high mild
  3 (3.00%) high severe

mat3_transform_point2   time:   [1.1168 ns 1.1169 ns 1.1171 ns]
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe

mat4_transform_point3   time:   [1.6818 ns 1.6820 ns 1.6823 ns]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

mat3_transform_vector2_x4wide
						time:   [2.2615 ns 2.2619 ns 2.2624 ns]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

mat4_transform_vector3_x4wide
						time:   [3.3941 ns 3.3947 ns 3.3954 ns]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

mat3_transform_point2_x4wide
						time:   [2.2615 ns 2.2619 ns 2.2622 ns]
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

mat4_transform_point3_x4wide
						time:   [3.3943 ns 3.3957 ns 3.3973 ns]
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  6 (6.00%) high severe

mat4_transform_vector3_no_division
						time:   [1.9711 ns 1.9719 ns 1.9728 ns]
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

rustc -vV

rustc 1.89.0 (29483883e 2025-08-04) (Fedora 1.89.0-2.fc42)
binary: rustc
commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
commit-date: 2025-08-04
host: x86_64-unknown-linux-gnu
release: 1.89.0
LLVM version: 20.1.8

rustc 1.89.0 (29483883e 2025-08-04) (Fedora 1.89.0-2.fc42)
binary: rustc
commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
commit-date: 2025-08-04
host: aarch64-unknown-linux-gnu
release: 1.89.0
LLVM version: 20.1.8


im-0 · 2025-09-10T23:10:34Z

Done! Still requires Simba with dimforge/simba#76

im-0 · 2025-09-13T04:40:01Z

FYI: I filed a Rust issue about performance regressions with fat LTO and default codegen-units: rust-lang/rust#146497

im-0 · 2025-09-23T21:52:15Z

All the benchmarks that I did are invalid because of this: #1547 😕

im-0 added 9 commits September 11, 2025 00:33

test: add tests for Matrix*:transform_*()

55685fc

fix: negate z component in Perspective3::project_vector()

bd9b913

This makes Perspective3::project_vector() consistent with the result of applying the same perspective projection by multiplying the underlying matrix by the source vector converted to homogeneous coordinates.
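The consistency claim in this commit message can be checked with a small standalone sketch that does not depend on nalgebra. It assumes an OpenGL-style, right-handed perspective matrix stored row-major; the helper names `perspective`, `project_vector_homogeneous` and `project_vector_shortcut` are hypothetical and only illustrate the idea, they are not the crate's API. Under this assumed layout, multiplying by the homogeneous vector `(x, y, z, 0)` and dividing by `w` gives a z component equal to the negated `(2, 2)` matrix entry.

```rust
/// Builds a perspective matrix, row-major.
/// ASSUMPTION: OpenGL-style, right-handed convention (camera looks down -z).
pub fn perspective(aspect: f32, fovy: f32, znear: f32, zfar: f32) -> [[f32; 4]; 4] {
    let f = 1.0 / (fovy / 2.0).tan();
    [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (zfar + znear) / (znear - zfar), 2.0 * zfar * znear / (znear - zfar)],
        [0.0, 0.0, -1.0, 0.0],
    ]
}

/// Reference path: full matrix product with the homogeneous vector
/// (x, y, z, 0), followed by division by w. Requires v[2] != 0.
pub fn project_vector_homogeneous(m: &[[f32; 4]; 4], v: [f32; 3]) -> [f32; 3] {
    let h = [v[0], v[1], v[2], 0.0];
    let mut r = [0.0f32; 4];
    for (row, out) in m.iter().zip(r.iter_mut()) {
        // Dot product of one matrix row with the homogeneous vector.
        *out = row.iter().zip(h.iter()).map(|(a, b)| a * b).sum();
    }
    [r[0] / r[3], r[1] / r[3], r[2] / r[3]]
}

/// Shortcut path: only the diagonal entries are needed because w = 0
/// removes the translation-like column. Note the negated z component.
pub fn project_vector_shortcut(m: &[[f32; 4]; 4], v: [f32; 3]) -> [f32; 3] {
    let inverse_denom = -1.0 / v[2];
    [
        m[0][0] * v[0] * inverse_denom,
        m[1][1] * v[1] * inverse_denom,
        -m[2][2],
    ]
}
```

Calling both functions with the same matrix and a vector whose z is non-zero should give results that agree up to floating-point rounding, which is the kind of property a test for this fix could assert.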

test: add tests for {Orthographic3, Perspective3}::project_vector()

7323d18

test: add benchmarks for Matrix*:transform*()

7b7a81b

feat: AoSoA SIMD Matrix*:simd_transform_*()

9fba307
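As a rough illustration of the AoSoA idea named in this commit title, here is a minimal sketch in plain Rust without any SIMD crate. The type `Vec3x4` and the function `transform_vector_x4` are hypothetical names, and fixed-size arrays stand in for real SIMD lane types; this is not how the PR or nalgebra implements `simd_transform_*()`. Four vectors are stored component-wise, so every arithmetic step works on four lanes at once.

```rust
/// Four 3D vectors stored lane-wise (AoSoA): one array per component.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Vec3x4 {
    pub x: [f32; 4],
    pub y: [f32; 4],
    pub z: [f32; 4],
}

/// Applies the upper-left 3x3 block of a row-major 4x4 matrix to four
/// vectors at once. The fourth column (translation) is ignored, which is
/// the usual treatment of direction vectors.
pub fn transform_vector_x4(m: &[[f32; 4]; 4], v: &Vec3x4) -> Vec3x4 {
    let mut out = Vec3x4 { x: [0.0; 4], y: [0.0; 4], z: [0.0; 4] };
    // Lanes are independent of each other, so the compiler is free to map
    // each line of the loop body onto 4-wide vector instructions.
    for lane in 0..4 {
        out.x[lane] = m[0][0] * v.x[lane] + m[0][1] * v.y[lane] + m[0][2] * v.z[lane];
        out.y[lane] = m[1][0] * v.x[lane] + m[1][1] * v.y[lane] + m[1][2] * v.z[lane];
        out.z[lane] = m[2][0] * v.x[lane] + m[2][1] * v.y[lane] + m[2][2] * v.z[lane];
    }
    out
}
```

The matrix is shared by all lanes while the vector data is batched, which is why the `_x4wide` benchmarks process four inputs per call.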

docs: document results of Matrix*::transform*() and Perspective3:project_*()

0b614ac

test: SIMD test for Matrix*::simd_transform_vector()

fbba657

im-0 force-pushed the improve-cg-transform-and-project branch from c6e5a14 to fbba657 Compare September 10, 2025 23:09
