Fix case where `get_partial_current_state_deltas` could return >100 rows #18960

anoadragon453 · 2025-09-23T10:43:09Z

get_partial_current_state_deltas is supposed to return max 100 rows. We relied on this assumption in #18926, forgoing batching of state deltas. It turns out that get_partial_current_state_deltas can potentially return millions of rows, and we saw this in production on element.io.

This PR fixes the case where every state delta in current_state_delta_stream has a count of 1, meaning total will be 100. Before this change, that would result in clipped_stream_id = max_stream_id, meaning we'd potentially pull out millions of rows.

I haven't written a regression test for this and I'm not sure how to easily seed current_state_delta_stream with the correct data. I suppose we could create a room and change the topic >100 times? Alternatively we could write directly to the table, but such a test would need to be updated if the schema changes. I was able to write a unit test without writing SQL directly.

Pull Request Checklist

Pull request is based on the develop branch
Pull request includes a changelog file. The entry should:
- Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
- Use markdown where necessary, mostly for code blocks.
- End with either a period (.) or an exclamation mark (!).
- Start with a capital letter.
- Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
Code style is correct (run the linters)

This fixes the case where every state delta in `current_state_delta_stream` has a count of 1, meaning `total` will be 100. Before this change, that would result in `clipped_stream_id = max_stream_id`, meaning we'd potentially pull out millions of rows.

tests/storage/test_state.py

reivilibre

nothing materially wrong with this, just some thoughts, what do you think?

tests/storage/test_state.py

reivilibre · 2025-09-24T11:15:17Z

synapse/storage/databases/main/state_deltas.py


            total = 0

            for stream_id, count in txn:


wondering if instead of doing this with a for loop, we should just ask the DB for the result we want directly?

I think this would do it, as long as you still handle the case where there are no result rows (less than 100 left, so no need to clamp)

SELECT stream_id FROM current_state_delta_stream WHERE stream_id > ? AND stream_id <= ? ORDER BY stream_id ASC LIMIT 1 OFFSET 99

I actually just ended up using LIMIT in the second query in 0e40bf4.

Am I missing something - is it really this simple? All the unit tests appear to pass 🤷

@reivilibre

Credit to @reivilibre for the idea.

Simplify the queries to just a single one.

anoadragon453 · 2025-09-25T17:25:35Z

Complement CI is failing due to a flaky test.

Edit: And sytest appears to just be stalling...

reivilibre

I mean.... it's weird if the tests don't catch this, probably a smell that we're missing a test

reivilibre · 2025-09-26T10:09:06Z

synapse/storage/databases/main/state_deltas.py

-            # don't select toooo many.
            sql = """
-                SELECT stream_id, count(*)
+                SELECT stream_id, room_id, type, state_key, event_id, prev_event_id


afraid this doesn't work AFAICT, because stream_id is not unique, but we always want to process the entire state with the same stream_id (never process some of the rows but not others)

So would this mean that if a group of state deltas with the same stream_id that was >100 items would cause this function to deadlock?

I suppose we should just return the whole group, even if it's larger than limit, in that case.

Edit: especially since the docstring states that if the list of returned state deltas is empty, then we are up to date. Potentially skipping over a lot of state deltas!

We now: 1. Group state deltas by `stream_id` and get their count 2. Count (in the DB) until we go over our limit. We then get the `stream_id` we could naively go up to, as well as the clamped `stream_id` which would keep us under our limit. The second query fetches rows up to and including the clamped `stream_id`. We also add a unit test that injects multiple state deltas with the same `stream_id`, which correctly failed when tested against the previous implementation in this PR.

anoadragon453 added 2 commits September 23, 2025 11:39

newsfile

6557703

anoadragon453 marked this pull request as ready for review September 23, 2025 10:45

anoadragon453 requested a review from a team as a code owner September 23, 2025 10:45

Add regression unit test

e771351

anoadragon453 commented Sep 23, 2025

View reviewed changes

tests/storage/test_state.py Show resolved Hide resolved

reivilibre requested changes Sep 24, 2025

View reviewed changes

anoadragon453 added 2 commits September 25, 2025 16:41

Add a limit parameter to speed up unit tests

24ff9f1

Credit to @reivilibre for the idea.

Just use LIMIT on the DB

0e40bf4

Simplify the queries to just a single one.

anoadragon453 requested a review from reivilibre September 25, 2025 18:02

reivilibre requested changes Sep 26, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix case where `get_partial_current_state_deltas` could return >100 rows #18960

Fix case where `get_partial_current_state_deltas` could return >100 rows #18960

anoadragon453 commented Sep 23, 2025 •

edited

Loading

Uh oh!

Uh oh!

reivilibre left a comment

Uh oh!

Uh oh!

reivilibre Sep 24, 2025

Uh oh!

anoadragon453 Sep 25, 2025

Uh oh!

anoadragon453 commented Sep 25, 2025 •

edited

Loading

Uh oh!

reivilibre left a comment

Uh oh!

reivilibre Sep 26, 2025

Uh oh!

anoadragon453 Sep 26, 2025 •

edited

Loading

Uh oh!

Uh oh!

Fix case where get_partial_current_state_deltas could return >100 rows #18960

Are you sure you want to change the base?

Fix case where get_partial_current_state_deltas could return >100 rows #18960

Conversation

anoadragon453 commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Checklist

Uh oh!

Uh oh!

reivilibre left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

reivilibre Sep 24, 2025

Choose a reason for hiding this comment

Uh oh!

anoadragon453 Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

anoadragon453 commented Sep 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

reivilibre left a comment

Choose a reason for hiding this comment

Uh oh!

reivilibre Sep 26, 2025

Choose a reason for hiding this comment

Uh oh!

anoadragon453 Sep 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Fix case where `get_partial_current_state_deltas` could return >100 rows #18960

Fix case where `get_partial_current_state_deltas` could return >100 rows #18960

anoadragon453 commented Sep 23, 2025 •

edited

Loading

anoadragon453 commented Sep 25, 2025 •

edited

Loading

anoadragon453 Sep 26, 2025 •

edited

Loading