Conversation

@Spartan322 (Member) commented Oct 31, 2025

This should slightly improve the cache-friendliness of the pixel search, at the memory cost of what is practically a copy of the province pixel map. I did not benchmark performance, but on Linux built with GCC the relative cost of load_map_images appears to shrink.

@Spartan322 Spartan322 marked this pull request as draft October 31, 2025 00:19
@schombert

This is not more cache friendly. Creating the new copy loads all of the existing data into the cache, and then the new copy occupies cache space as well. So you pay the same cost as the existing version -- loading the same data into the cache (if it wasn't there already) -- plus you consume additional cache space for the copy, possibly evicting something useful in the process.

@Spartan322 (Member Author)

Well, the profiling I have actually done shows a consistent decrease in the relative time spent in load_map_images with this as the only change; the data is processed into a linear allocation, which should inherently speed up cache access.

@schombert

The version prior to your change also has the data in a linear allocation -- see the implementation of the `colour_at` function and the implementation of `get_pixel_index_from_pos`, which is `return pos.x + pos.y * dims.x;`. So the line

```cpp
const colour_t province_colour = colour_at(province_data, pixel_index);
```

already resolved to

```cpp
idx = get_pixel_index_from_pos(pos) * 3;
province_colour = { colour_data[idx + 2], colour_data[idx + 1], colour_data[idx] };
```

which is the exact same lookup pattern in terms of indexing into a contiguous linear allocation as the new version, but without making an additional copy.

@schombert

One difference between the two is that colour_t stores an internal 32-bit integer, so the copy stores 4 bytes per pixel where the source data stores 3. The copy therefore wastes an additional byte per pixel, making it less cache friendly. However, maybe spacing the data in 4-byte strides instead of 3-byte strides enabled the compiler to perform some optimization on the copy that it couldn't on the original version.

@Spartan322 (Member Author)

Regardless, the improvement, while notably consistent, is still fairly marginal, and we only perform this calculation once. It did shrink the function in the flame graph, but I'm still undecided whether the improvement is worth it, hence leaving this as a draft; it would probably be better to process it with SIMD operations.
