Skip to content

Conversation

anoadragon453
Copy link
Member

@anoadragon453 anoadragon453 commented Sep 23, 2025

get_partial_current_state_deltas is supposed to return max 100 rows. We relied on this assumption in #18926, forgoing batching of state deltas. It turns out that get_partial_current_state_deltas can potentially return millions of rows, and we saw this in production on element.io.

This PR fixes the case where every state delta in current_state_delta_stream has a count of 1, meaning total will be 100. Before this change, that would result in clipped_stream_id = max_stream_id, meaning we'd potentially pull out millions of rows.

I haven't written a regression test for this and I'm not sure how to easily seed current_state_delta_stream with the correct data. I suppose we could create a room and change the topic >100 times? Alternatively we could write directly to the table, but such a test would need to be updated if the schema changes. I was able to write a unit test without writing SQL directly.

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

This fixes the case where every state delta in
`current_state_delta_stream` has a count of 1, meaning `total` will be
100.

Before this change, that would result in `clipped_stream_id =
max_stream_id`, meaning we'd potentially pull out millions of rows.
@anoadragon453 anoadragon453 marked this pull request as ready for review September 23, 2025 10:45
@anoadragon453 anoadragon453 requested a review from a team as a code owner September 23, 2025 10:45
Copy link
Contributor

@reivilibre reivilibre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing materially wrong with this, just some thoughts, what do you think?


total = 0

for stream_id, count in txn:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering if instead of doing this with a for loop, we should just ask the DB for the result we want directly?

I think this would do it, as long as you still handle the case where there are no result rows (less than 100 left, so no need to clamp)

SELECT stream_id
FROM current_state_delta_stream
WHERE stream_id > ? AND stream_id <= ?
ORDER BY stream_id ASC
LIMIT 1 OFFSET 99

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually just ended up using LIMIT in the second query in 0e40bf4.

Am I missing something - is it really this simple? All the unit tests appear to pass 🤷

Simplify the queries to just a single one.
@anoadragon453
Copy link
Member Author

anoadragon453 commented Sep 25, 2025

Complement CI is failing due to a flaky test.

Edit: And sytest appears to just be stalling...

Copy link
Contributor

@reivilibre reivilibre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean.... it's weird if the tests don't catch this, probably a smell that we're missing a test

# don't select toooo many.
sql = """
SELECT stream_id, count(*)
SELECT stream_id, room_id, type, state_key, event_id, prev_event_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afraid this doesn't work AFAICT, because stream_id is not unique, but we always want to process the entire state with the same stream_id (never process some of the rows but not others)

Copy link
Member Author

@anoadragon453 anoadragon453 Sep 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So would this mean that if a group of state deltas with the same stream_id that was >100 items would cause this function to deadlock?

I suppose we should just return the whole group, even if it's larger than limit, in that case.

Edit: especially since the docstring states that if the list of returned state deltas is empty, then we are up to date. Potentially skipping over a lot of state deltas!

We now:

1. Group state deltas by `stream_id` and get their count
2. Count (in the DB) until we go over our limit.

We then get the `stream_id` we could naively go up to, as well as the
clamped `stream_id` which would keep us under our limit.

The second query fetches rows up to and including the clamped
`stream_id`.

We also add a unit test that injects multiple state deltas with the same
`stream_id`, which correctly failed when tested against the previous
implementation in this PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants