SIMD CG transform and related improvements #1543
If you decide to merge this PR, please merge it without squashing into a single commit. A single commit would make before/after benchmarking much more difficult.
I asked my friends to run the same benchmarks on other CPUs (preferably Intel) and got some interesting results. Some benchmarks regressed and some improved on all tested Intel CPUs. The regressions were more or less stable across runs, but the improvements were seemingly random. Then we tried disabling Intel Turbo Boost, and this "fixed" both the regressions and the improvements. My benchmarking routine is basically the following:
Intel i9-9900K:
Intel i7-8565U:
Qualcomm Snapdragon SC8280XP:
Intel i3-1115G4:
Intel Core Ultra 5 135H:
Given the Turbo Boost-related weirdness and a significant, consistent regression on the Qualcomm SC8280XP, I think it will be better to just add separate SIMD-supporting functions. I will update this PR soon.
This makes `Perspective3::project_vector()` consistent with the same perspective projection applied by multiplying the underlying matrix by the source vector converted to homogeneous coordinates.
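To make "multiplying the underlying matrix by the source vector converted to homogeneous coordinates" concrete, here is a minimal, dependency-free sketch. It does not use nalgebra and is not the PR's implementation: `Mat4`, `perspective`, and `mul_vec4` are hypothetical helpers, and the matrix layout (right-handed, OpenGL-style, depth mapped to [-1, 1]) is an assumption. What is done with the resulting `w` component afterwards is deliberately left out.

```rust
/// Row-major 4x4 matrix (hypothetical helper type, not nalgebra's `Matrix4`).
type Mat4 = [[f32; 4]; 4];

/// Builds a right-handed, OpenGL-style perspective matrix.
/// The exact convention is an assumption made for this sketch.
fn perspective(aspect: f32, fovy: f32, znear: f32, zfar: f32) -> Mat4 {
    let f = 1.0 / (fovy / 2.0).tan();
    [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [
            0.0,
            0.0,
            (zfar + znear) / (znear - zfar),
            2.0 * zfar * znear / (znear - zfar),
        ],
        [0.0, 0.0, -1.0, 0.0],
    ]
}

/// Multiplies a 4x4 matrix by a homogeneous column vector.
fn mul_vec4(m: &Mat4, v: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0; 4];
    for (row, o) in m.iter().zip(out.iter_mut()) {
        *o = row.iter().zip(v.iter()).map(|(a, b)| a * b).sum();
    }
    out
}

fn main() {
    let m = perspective(1.0, std::f32::consts::FRAC_PI_2, 1.0, 3.0);
    // A vector (as opposed to a point) gets w = 0 in homogeneous coordinates,
    // so the fourth matrix column does not contribute to the product.
    let v = mul_vec4(&m, [1.0, 2.0, -2.0, 0.0]);
    println!("{:?}", v);
}
```

With `w = 0` the depth-offset term in the third row drops out, which is the difference between transforming a vector and transforming a point with the same matrix.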
During benchmarking I found that leaving `codegen-units` at its default value leads to inconsistent results across recompilations (clean vs. incremental). Sometimes it also leads to a significant performance degradation in benchmarks unrelated to the code changes.
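As a sketch of what such a setting looks like in a `Cargo.toml`: `codegen-units` is a standard Cargo profile key, but placing it under `[profile.bench]` is an assumption here, and the PR may configure it in a different profile.

```toml
# Assumed placement: the profile Cargo uses for `cargo bench`.
[profile.bench]
# Compile each crate as a single codegen unit so that code generation
# does not depend on how the crate happens to be partitioned.
codegen-units = 1
```

The trade-off is a longer compile time in exchange for code generation that is less sensitive to unrelated edits.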
The following benchmarks were performed on Linux after running `cpupower frequency-set --governor performance`, with default RUSTFLAGS. Times are Criterion's lower bound / estimate / upper bound; outliers are counted among 100 measurements.

AMD Ryzen 9 5950X:

| Benchmark | Time | Outliers |
| --- | --- | --- |
| mat3_transform_vector2 | 314.35 ps / 314.44 ps / 314.56 ps | 11 (3 high mild, 8 high severe) |
| mat4_transform_vector3 | 440.44 ps / 440.95 ps / 441.45 ps | 13 (1 high mild, 12 high severe) |
| mat3_transform_point2 | 314.40 ps / 314.48 ps / 314.60 ps | 9 (4 high mild, 5 high severe) |
| mat4_transform_point3 | 436.98 ps / 437.03 ps / 437.08 ps | 8 (1 low mild, 2 high mild, 5 high severe) |
| mat3_transform_vector2_x4wide | 422.74 ps / 422.85 ps / 422.98 ps | 9 (1 low mild, 3 high mild, 5 high severe) |
| mat4_transform_vector3_x4wide | 635.36 ps / 635.49 ps / 635.63 ps | 7 (1 low mild, 4 high mild, 2 high severe) |
| mat3_transform_point2_x4wide | 422.21 ps / 422.31 ps / 422.47 ps | 12 (2 low mild, 4 high mild, 6 high severe) |
| mat4_transform_point3_x4wide | 635.19 ps / 635.27 ps / 635.37 ps | 10 (1 low severe, 4 high mild, 5 high severe) |
| mat4_transform_vector3_no_division | 439.86 ps / 439.96 ps / 440.07 ps | 13 (2 low severe, 1 low mild, 4 high mild, 6 high severe) |

Intel Core i7-8565U:

| Benchmark | Time | Outliers |
| --- | --- | --- |
| mat3_transform_vector2 | 294.41 ps / 294.52 ps / 294.63 ps | 7 (1 low severe, 4 high mild, 2 high severe) |
| mat4_transform_vector3 | 583.88 ps / 587.06 ps / 592.15 ps | 20 (4 low severe, 1 low mild, 3 high mild, 12 high severe) |
| mat3_transform_point2 | 309.23 ps / 309.52 ps / 309.88 ps | 9 (2 low severe, 2 high mild, 5 high severe) |
| mat4_transform_point3 | 557.59 ps / 558.26 ps / 559.52 ps | 15 (3 low severe, 7 low mild, 3 high mild, 2 high severe) |
| mat3_transform_vector2_x4wide | 557.77 ps / 558.22 ps / 558.75 ps | 3 (1 low severe, 1 low mild, 1 high severe) |
| mat4_transform_vector3_x4wide | 801.89 ps / 802.37 ps / 802.89 ps | 8 (1 low severe, 4 high mild, 3 high severe) |
| mat3_transform_point2_x4wide | 574.76 ps / 575.10 ps / 575.44 ps | 12 (2 low severe, 3 low mild, 1 high mild, 6 high severe) |
| mat4_transform_point3_x4wide | 801.65 ps / 802.63 ps / 803.94 ps | 9 (1 low severe, 2 low mild, 2 high mild, 4 high severe) |
| mat4_transform_vector3_no_division | 568.86 ps / 569.60 ps / 570.52 ps | 3 (2 high mild, 1 high severe) |

Apple M2 Max:

| Benchmark | Time | Outliers |
| --- | --- | --- |
| mat3_transform_vector2 | 316.93 ps / 317.90 ps / 318.91 ps | 3 (2 high mild, 1 high severe) |
| mat4_transform_vector3 | 383.72 ps / 383.77 ps / 383.83 ps | 9 (5 high mild, 4 high severe) |
| mat3_transform_point2 | 317.60 ps / 318.44 ps / 319.31 ps | 3 (1 low mild, 2 high mild) |
| mat4_transform_point3 | 383.75 ps / 383.78 ps / 383.81 ps | 4 (1 high mild, 3 high severe) |
| mat3_transform_vector2_x4wide | 308.93 ps / 309.36 ps / 309.80 ps | 2 (1 high mild, 1 high severe) |
| mat4_transform_vector3_x4wide | 460.46 ps / 460.50 ps / 460.55 ps | 16 (6 low mild, 5 high mild, 5 high severe) |
| mat3_transform_point2_x4wide | 308.88 ps / 309.27 ps / 309.69 ps | 2 (2 high mild) |
| mat4_transform_point3_x4wide | 460.48 ps / 460.52 ps / 460.57 ps | 5 (3 high mild, 2 high severe) |
| mat4_transform_vector3_no_division | 383.77 ps / 383.86 ps / 383.98 ps | 9 (4 high mild, 5 high severe) |

Broadcom BCM2711 (Raspberry Pi 4):

| Benchmark | Time | Outliers |
| --- | --- | --- |
| mat3_transform_vector2 | 1.1169 ns / 1.1172 ns / 1.1175 ns | 11 (4 high mild, 7 high severe) |
| mat4_transform_vector3 | 1.6819 ns / 1.6821 ns / 1.6824 ns | 11 (8 high mild, 3 high severe) |
| mat3_transform_point2 | 1.1168 ns / 1.1169 ns / 1.1171 ns | 10 (7 high mild, 3 high severe) |
| mat4_transform_point3 | 1.6818 ns / 1.6820 ns / 1.6823 ns | 8 (5 high mild, 3 high severe) |
| mat3_transform_vector2_x4wide | 2.2615 ns / 2.2619 ns / 2.2624 ns | 8 (5 high mild, 3 high severe) |
| mat4_transform_vector3_x4wide | 3.3941 ns / 3.3947 ns / 3.3954 ns | 4 (2 high mild, 2 high severe) |
| mat3_transform_point2_x4wide | 2.2615 ns / 2.2619 ns / 2.2622 ns | 8 (7 high mild, 1 high severe) |
| mat4_transform_point3_x4wide | 3.3943 ns / 3.3957 ns / 3.3973 ns | 11 (1 low mild, 4 high mild, 6 high severe) |
| mat4_transform_vector3_no_division | 1.9711 ns / 1.9719 ns / 1.9728 ns | 11 (5 high mild, 6 high severe) |

`rustc -vV` (host: x86_64-unknown-linux-gnu):

    rustc 1.89.0 (29483883e 2025-08-04) (Fedora 1.89.0-2.fc42)
    binary: rustc
    commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
    commit-date: 2025-08-04
    host: x86_64-unknown-linux-gnu
    release: 1.89.0
    LLVM version: 20.1.8

`rustc -vV` (host: aarch64-unknown-linux-gnu):

    rustc 1.89.0 (29483883e 2025-08-04) (Fedora 1.89.0-2.fc42)
    binary: rustc
    commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
    commit-date: 2025-08-04
    host: aarch64-unknown-linux-gnu
    release: 1.89.0
    LLVM version: 20.1.8
c6e5a14 to fbba657
Done! Still requires Simba with dimforge/simba#76
FYI: I filed a Rust issue about performance regressions with fat LTO and default
All the benchmarks that I ran are invalid because of this: #1547 😕
Hi!

I am trying to write a software 3D rasterizer and want to use `nalgebra` with SIMD as a math library. Here I tried to add AoSoA SIMD support for `Matrix4::transform_point()` and friends. I also benchmarked my modifications to make sure that I didn't regress anything, and to make sure that SIMD support makes sense here at all.

Benchmarks were performed on various CPUs that I have. Here are the results:
Benchmark results on AMD Ryzen 9 5950X, Linux, compared to previous one:
Benchmark results on Intel i7-8565U, Linux, compared to previous one:
Benchmark results on Apple M2 Max, Linux, compared to previous one:
Benchmark results on Broadcom BCM2711 (Raspberry Pi 4), Linux, compared to previous one:
The results are mostly expected: there are no significant regressions in non-SIMD benchmarks; SIMD makes sense even on the potato CPU of Raspberry Pi 4. The only exception is the older Intel CPU (i7-8565U), where `transform_vector()` and `transform_point()` slightly regressed for the 2D case, but improved for the 3D case.

It is also possible to leave the existing functions as is, and add new SIMD-specific functions instead.
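For readers unfamiliar with the AoSoA (array of structures of arrays) layout mentioned above, here is a minimal sketch in plain Rust. It is not the PR's code: `Point3x4`, `Mat4`, and `transform_point_x4` are hypothetical names, plain `[f32; 4]` arrays stand in for a SIMD scalar type, and the matrix is assumed to be affine (last row `0, 0, 0, 1`), so no perspective division is performed.

```rust
/// Number of points packed into one wide value (an assumption for this sketch).
const LANES: usize = 4;

/// Four 3D points stored component-wise: the "structure of arrays" part.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Point3x4 {
    x: [f32; LANES],
    y: [f32; LANES],
    z: [f32; LANES],
}

/// Row-major 4x4 matrix, assumed affine (last row is 0, 0, 0, 1).
type Mat4 = [[f32; 4]; 4];

/// Transforms four points at once with the same matrix.
/// Each loop iteration handles one lane; with a real SIMD type the
/// loop body would be a handful of wide multiplies and adds.
fn transform_point_x4(m: &Mat4, p: &Point3x4) -> Point3x4 {
    let mut out = *p;
    for i in 0..LANES {
        out.x[i] = m[0][0] * p.x[i] + m[0][1] * p.y[i] + m[0][2] * p.z[i] + m[0][3];
        out.y[i] = m[1][0] * p.x[i] + m[1][1] * p.y[i] + m[1][2] * p.z[i] + m[1][3];
        out.z[i] = m[2][0] * p.x[i] + m[2][1] * p.y[i] + m[2][2] * p.z[i] + m[2][3];
    }
    out
}

fn main() {
    // Uniform scale by 2 followed by a translation by (10, 20, 30).
    let m: Mat4 = [
        [2.0, 0.0, 0.0, 10.0],
        [0.0, 2.0, 0.0, 20.0],
        [0.0, 0.0, 2.0, 30.0],
        [0.0, 0.0, 0.0, 1.0],
    ];
    // The outer Vec is the "array of" part: a sequence of 4-wide packets.
    let batch = vec![Point3x4 {
        x: [0.0, 1.0, 2.0, 3.0],
        y: [0.0; LANES],
        z: [1.0; LANES],
    }];
    let out: Vec<Point3x4> = batch.iter().map(|p| transform_point_x4(&m, p)).collect();
    println!("{:?}", out[0]);
}
```

The point of the layout is that every matrix coefficient is applied to all lanes at once, so one call does the work of four scalar `transform_point()` calls.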
Other semi-related changes included in this PR:

- `Perspective3::project_vector()` fixed so that the result matches the result of matrix multiplication.
- `{Orthographic3, Perspective3}::project_vector()` (mainly to understand what exactly the `Perspective3` projection does for a vector).
- `codegen-units = 1` for benchmarks, as otherwise results are less consistent and change after unrelated code changes.
- `Matrix*::transform_*()`.
P.S.: The added SIMD benchmarks require `simba` with this PR merged: dimforge/simba#76