-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
Description
Background
We currently use sort_by_key to sort PhaseItems in the sort phase, and sort_unstable_by to sort extracted sprites in queue_sprites.
sort_by_key- This sort is stable (i.e., does not reorder equal elements) and O(m * n * log(n)) worst-case, where the key function is O(m).sort_unstable_by- This sort is unstable (i.e., may reorder equal elements), in-place (i.e., does not allocate), and O(n * log(n)) worst-case.
For large numbers of items to sort, such as in bevymark or many_sprites where n > 100_000, and given sort keys that are a small and fixed number of bits such as the f32 that we are commonly using, a radix sort could/should be much faster.
crates
I investigated sorting 10M random f32s in the range 0.1f32..1000.0f32 (the default perspective camera near and far planes) and observed the following with the available radix sort crates:
1053.388ms stable
391.499ms unstable
60.359ms rdst_single
16.067ms rdst
79.550ms voracious_sort
79.945ms voracious_stable_sort
13.895ms voracious_mt_sort
65.653ms radsort
- multi-threading
rdstandvoracious_mt_sortresults above are multi-threaded usingrayon, all the rest of the results are single-threadedrdsthas a hard dependency onrayon, though it can be configured to do single-threaded sorting. I created an issue about making multi-threading optional as we already have a threadpool in bevy and maybe we don't want to add another: Make rayon optional nessex/rdst#2voracious_radix_sorthas a non-default multi-threading featureradsortis single-threaded only
- According to crates.io, the crates have the following footprints:
radsort- 17.1kBrdst- 34.3kBvoracious_radix_sort- 178kB
- Flexibility
radsortwas the easiest to integrate using itssort_by_keyfunctionrdstandvoracious_radix_sortboth require implementation of some traits on the type being sorted, and require that the type implementCopy. This cannot be done for batched sprites currently becauseBatchedPhaseItemcontains aRange<f32>for which it is not possible to implementCopy
Proof of Concept
I made a branch here that uses radsort: https://github.com/superdump/bevy/tree/render-radix-sort .
The sorting of ExtractedSprites in queue_sprites is significantly improved by using radsort::sort_by_key like this:
radsort::sort_by_key(extracted_sprites, |extracted_sprite| {
(
extracted_sprite.transform.translation.z,
match extracted_sprite.image_handle_id {
HandleId::Id(uuid, id) => {
((uuid.as_u128() & ((1 << 64) - 1)) << 64) | (id as u128)
}
HandleId::AssetPathId(id) => {
((id.source_path_id().value() as u128) << 64)
| (id.label_id().value() as u128)
}
},
)
});With that, on an M1 Max, the median execution time of queue_sprites in bevymark -- 20000 8 over 1500 frames increased from 9.82ms to 17.09ms, which makes no sense to me. In many_sprites it decreased from 11.34ms to 9.47ms.
The median execution time of sort_phase_system for the relevant phase (Transparent2d for sprites, Opaque3d for 3D meshes) in many_cubes -- sphere it decreased from 0.728ms to 0.094ms. In bevymark -- 20000 8 it increased from 0.184ms to 1.95ms, which makes no sense. And in many_sprites it increased from 0.106ms to 1.19ms.
This is quite confusing. I haven't been able to figure out why the sort performance gets worse. radsort claims both best and worst execution time of O(n), space complexity of O(n), and that it is a stable sort so it does not reorder equal items. On the branch, all the sort_key implementations are inlined.
Next steps
- Figure out why
radsortis much slower in many cases - Try
voracious_radix_sortorrdstto see if they perform consistently better, thoughrdstwould likely not be approved unless its multi-threading were made optional.- Make multi-threading optional in
rdstif it turns out to be a preferable solution due to the smaller crate footprint
- Make multi-threading optional in
Metadata
Metadata
Assignees
Labels
Type
Projects
Status