
Conversation


@untitaker untitaker commented May 15, 2025

  1. If the flusher dies, the consumer will crash too instead of filling up redis further. I believe this is what we recently saw ~13h ago in US: A flusher died because of a kafka blip, but the consumer kept filling up a particular shard.
  2. If the flusher is not fast enough and there are too many segments that would be "ready" to flush, we apply backpressure.
  3. If redis OOMs, we stop the entire consumer and therefore prevent data from being flushed out prematurely.

I'm a bit concerned that point 3 is not acceptable for oncall since it basically stops the consumer until redis is scaled up (unlike other memory-based backpressure where we eventually recover automatically), but from a product POV it's also bad to flush partial segments prematurely. In practice we'll have to overprovision redis' memory, and set early alerts to scale up redis soon enough.
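For illustration, here is a minimal sketch of how these three behaviors could fit together on the consumer side. Everything below except max_memory_percentage (which appears in the diff) is a hypothetical name, and a plain exception stands in for arroyo's backpressure mechanism:

# Minimal sketch of the three behaviors described above; all names except
# max_memory_percentage are assumptions, not the code in this PR.


class Backpressure(Exception):
    """Stand-in for the real backpressure signal raised towards arroyo."""


class SpanBufferChecks:
    def __init__(self, flusher, buffer, max_ready_segments: int, max_memory_percentage: float):
        self.flusher = flusher
        self.buffer = buffer
        self.max_ready_segments = max_ready_segments
        self.max_memory_percentage = max_memory_percentage

    def check(self) -> None:
        # 1. If the flusher process died, crash the consumer too instead of
        #    continuing to fill up Redis.
        if not self.flusher.is_alive():
            raise RuntimeError("flusher process has crashed")

        # 2. If too many segments are ready to flush, i.e. the flusher cannot
        #    keep up, apply backpressure so the consumer stops pulling from Kafka.
        if self.buffer.ready_segment_count() > self.max_ready_segments:
            raise Backpressure("too many segments waiting to be flushed")

        # 3. If Redis is close to full, pause the whole consumer rather than
        #    flushing partial segments prematurely.
        used, available = self.buffer.redis_memory()
        if available > 0 and used / available > self.max_memory_percentage:
            raise Backpressure("redis memory usage above threshold")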

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label May 15, 2025
@untitaker untitaker changed the title ref/span buffer backpressure ref(span-buffer): Add basic backpressure May 15, 2025

codecov bot commented May 15, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 17 lines in your changes missing coverage. Please review.

⚠️ Parser warning

The parser emitted a warning. Please review your JUnit XML file:

Warning while parsing testcase attributes: Limit of string is 1000 chars, for name, we got 2083 at 1:157236 in /home/runner/work/sentry/sentry/.artifacts/pytest.junit.xml
Files with missing lines                        Patch %   Lines
src/sentry/spans/consumers/process/flusher.py   52.94%    16 Missing ⚠️
src/sentry/processing/backpressure/memory.py    85.71%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #91707      +/-   ##
==========================================
+ Coverage   87.62%   87.67%   +0.04%     
==========================================
  Files       10353    10342      -11     
  Lines      586779   586792      +13     
  Branches    22585    22528      -57     
==========================================
+ Hits       514194   514457     +263     
+ Misses      72157    71892     -265     
- Partials      428      443      +15     

@untitaker untitaker marked this pull request as ready for review May 15, 2025 12:46
@untitaker untitaker requested a review from a team as a code owner May 15, 2025 12:46
Comment on lines +313 to +314
key = self._get_queue_key(shard)
p.zcard(key)
Contributor

Please consider measuring this in bytes rather than set cardinality, as spans can be very different from each other.

A way to do this is to keep a sharded counter of the size of what you write: when you add a batch of spans you increment it, and when you flush you decrement it. Then you can rely on the amount of data in Redis rather than the number of spans.

Or, better, measure the amount of free memory in Redis like backpressure does. Then you cannot go wrong with the signal that triggers backpressure.
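For illustration, a minimal sketch of the sharded byte counter idea; the key layout, helper names, and num_shards parameter are assumptions, and this is not the approach that ultimately landed (see the reply below):

# Hypothetical sketch of the sharded byte counter suggested above; key names
# and helpers are illustrative, not the code that landed in this PR.
import redis

client = redis.StrictRedis()


def _bytes_key(shard: int) -> str:
    return f"span-buf:bytes:{shard}"  # hypothetical key layout


def add_spans(shard: int, spans: list[bytes]) -> None:
    with client.pipeline(transaction=False) as p:
        # ... the existing writes for the span payloads would go here ...
        p.incrby(_bytes_key(shard), sum(len(s) for s in spans))
        p.execute()


def flush_spans(shard: int, flushed: list[bytes]) -> None:
    client.decrby(_bytes_key(shard), sum(len(s) for s in flushed))


def buffered_bytes(num_shards: int) -> int:
    # A backpressure check would compare this total against a byte budget
    # instead of comparing ZCARD against a span-count limit.
    return sum(int(client.get(_bytes_key(shard)) or 0) for shard in range(num_shards))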

Member Author

Done. Also had a conversation with Jan and had to split this up into two kinds of backpressure; see the updated PR description.

@untitaker untitaker requested review from a team as code owners May 19, 2025 14:38
@untitaker untitaker changed the title ref(span-buffer): Add basic backpressure ref(span-buffer): Add backpressure May 19, 2025
@untitaker untitaker requested a review from jan-auer May 19, 2025 14:44
used = sum(x.used for x in memory_infos)
available = sum(x.available for x in memory_infos)
if available > 0 and used / available > self.max_memory_percentage:
logger.fatal("Pausing consumer due to Redis being full")
Member

Let's debounce this so we don't spam Sentry with too many errors
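One possible shape for the debounce, assuming a simple time-based guard; the class, attribute, and interval names below are made up for illustration:

# Possible time-based debounce for the fatal log above; names are illustrative.
import logging
import time

logger = logging.getLogger(__name__)

LOG_INTERVAL = 60.0  # seconds between repeated "Redis full" reports (hypothetical value)


class RedisFullLogger:
    def __init__(self) -> None:
        self._last_logged = 0.0

    def log_redis_full(self) -> None:
        now = time.monotonic()
        if now - self._last_logged >= LOG_INTERVAL:
            self._last_logged = now
            logger.fatal("Pausing consumer due to Redis being full")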

self.process.start()
self.process_restarts += 1
else:
raise RuntimeError("flusher process is dead")
Member

Suggested change
raise RuntimeError("flusher process is dead")
raise RuntimeError("flusher process has crashed")

Ideally we would have the inner error linked here, which is not trivial. We can note in the runbook to check for errors from the flusher process. Can we initialize the SDK with a special tag within the flusher process so this gets easier?
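A sketch of what the special tag could look like, assuming the flusher process initializes (or inherits) the sentry_sdk scope; the tag name is an assumption:

# Hedged sketch: tag events reported from the flusher process so its crashes
# are easy to find; the tag name "spans_flusher" is an assumption.
import sentry_sdk


def flusher_main() -> None:
    # Applies to the current (process-local) scope, so events from the
    # flusher process carry the tag while the main consumer's events do not.
    sentry_sdk.set_tag("spans_flusher", "1")
    # ... the flusher loop would run here ...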


return trees

def get_stored_segments(self) -> int:
Member

The return value of this is no longer needed and we use this just to record metrics. Let's rename the function accordingly.
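For example, the renamed method could look roughly like this; the new name, metric key, metrics helper, and assigned_shards attribute are assumptions, while _get_queue_key and ZCARD come from the snippet commented on above:

# Rough sketch of the rename; only _get_queue_key/ZCARD are taken from the
# diff, everything else (names, metric key) is illustrative.
from sentry.utils import metrics


def record_stored_segments(self) -> None:
    with self.client.pipeline(transaction=False) as p:
        for shard in self.assigned_shards:
            p.zcard(self._get_queue_key(shard))
        counts = p.execute()

    metrics.gauge("spans.buffer.stored_segments", sum(counts))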

def poll(self) -> None:
self.next_step.poll()

def submit(self, message: Message[FilteredPayload | int]) -> None:
Member

Please add a short docstring that shows the two conditions under which we return backpressure. Let's also document in which of the two cases the flusher itself will stop, and why.
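A possible docstring along these lines, based only on the PR description above; the exact conditions, wording, and the MessageRejected reference are assumptions rather than the merged text:

# Possible docstring for submit(); wording is an assumption based on the PR
# description, not the text that was merged.
def submit(self, message) -> None:
    """
    Submit a message, raising MessageRejected to apply backpressure when:

    1. Too many segments are ready to flush, i.e. the flusher cannot keep up.
       The flusher keeps running in this case so it can drain the backlog.
    2. Redis memory usage exceeds max_memory_percentage. Here the flusher is
       stopped as well, so partial segments are not flushed out prematurely
       while Redis is full.
    """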

@untitaker untitaker enabled auto-merge (squash) May 20, 2025 09:50
@untitaker untitaker merged commit b2f34c1 into master May 20, 2025
60 checks passed
@untitaker untitaker deleted the ref/span-buffer-backpressure branch May 20, 2025 10:08
jan-auer added a commit that referenced this pull request May 20, 2025
* master: (58 commits)
  link: cleanup link (#91687)
  ref: create project_id index for organizationonboardingtask (#91918)
  storybook: smaller last edited (#91875)
  issues: fix chonk stacktrace alignment (#91891)
  alert: drop custom alert (#91892)
  insights: fix bar height (#91895)
  ref(span-buffer): Move max-memory-percentage to right CLI (#91924)
  ref(js): Factor button functionality (#91763)
  tests(resolve_groups): Clean up the tests (#91779)
  ref(span-buffer): Add backpressure (#91707)
  fix(nextjs-insights): project id is not passed to explore link (#91920)
  fix(crons): Floor seconds / microsecond on recorded dateClock (#91890)
  fix(uptime): Fix bug with the uptime_checks dataset in the events endpoint (#91824)
  ref: add state-only migration to reflect existing indexes in prod (#91901)
  ref: remove unnecssary metaclass (#91906)
  fix(stats): use data category title name (#91913)
  feat(issues): Add success messages to some actions (#91899)
  test(taskworker): Lower exec time (#91907)
  chore(aci): manually add spans for delayed workflow processing (#91908)
  chore(aci): remove uses of WorkflowFireHistory rollout columns (#91904)
  ...

sentry bot commented May 20, 2025

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ AssertionError: cannot start a process twice sentry.spans.consumers.process.flusher in submit
  • ‼️ AssertionError: cannot start a process twice sentry.spans.consumers.process.flusher in submit
  • ‼️ AssertionError sentry.utils.kafka in run_processor_with_signals
  • ‼️ KafkaException: KafkaError{code=_DESTROY,val=-197,str="Failed to get committed offsets: Local: Broker handle destroyed"} sentry.utils.kafka in run_processor_with_signals


andrewshie-sentry pushed a commit that referenced this pull request May 20, 2025
@github-actions github-actions bot locked and limited conversation to collaborators Jun 5, 2025
