
Conversation


@untitaker untitaker commented May 15, 2025

  1. If the flusher dies, the consumer will crash too instead of filling up redis further. I believe this is what we recently saw ~13h ago in US: A flusher died because of a kafka blip, but the consumer kept filling up a particular shard.
  2. If the flusher is not fast enough and there are too many segments that would be "ready" to flush, we apply backpressure.
  3. If redis OOMs, we stop the entire consumer and therefore prevent data from being flushed out prematurely.

I'm a bit concerned that point 3 is not acceptable for oncall since it basically stops the consumer until redis is scaled up (unlike other memory-based backpressure where we eventually recover automatically), but from a product POV it's also bad to flush partial segments prematurely. In practice we'll have to overprovision redis' memory, and set early alerts to scale up redis soon enough.
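For illustration, here is a minimal sketch of how these three behaviors could fit together on the consumer side. Everything below except max_memory_percentage (which appears in the diff) is a hypothetical name, and a plain exception stands in for arroyo's backpressure mechanism:

# Minimal sketch of the three behaviors described above; all names except
# max_memory_percentage are assumptions, not the code in this PR.


class Backpressure(Exception):
    """Stand-in for the real backpressure signal raised towards arroyo."""


class SpanBufferChecks:
    def __init__(self, flusher, buffer, max_ready_segments: int, max_memory_percentage: float):
        self.flusher = flusher
        self.buffer = buffer
        self.max_ready_segments = max_ready_segments
        self.max_memory_percentage = max_memory_percentage

    def check(self) -> None:
        # 1. If the flusher process died, crash the consumer too instead of
        #    continuing to fill up Redis.
        if not self.flusher.is_alive():
            raise RuntimeError("flusher process has crashed")

        # 2. If too many segments are ready to flush, i.e. the flusher cannot
        #    keep up, apply backpressure so the consumer stops pulling from Kafka.
        if self.buffer.ready_segment_count() > self.max_ready_segments:
            raise Backpressure("too many segments waiting to be flushed")

        # 3. If Redis is close to full, pause the whole consumer rather than
        #    flushing partial segments prematurely.
        used, available = self.buffer.redis_memory()
        if available > 0 and used / available > self.max_memory_percentage:
            raise Backpressure("redis memory usage above threshold")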

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label May 15, 2025
@untitaker untitaker changed the title ref/span buffer backpressure ref(span-buffer): Add basic backpressure May 15, 2025

codecov bot commented May 15, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 17 lines in your changes missing coverage. Please review.

⚠️ Parser warning

The parser emitted a warning. Please review your JUnit XML file:

Warning while parsing testcase attributes: Limit of string is 1000 chars, for name, we got 2083 at 1:157236 in /home/runner/work/sentry/sentry/.artifacts/pytest.junit.xml
Files with missing lines                        Patch %   Lines
src/sentry/spans/consumers/process/flusher.py   52.94%    16 Missing ⚠️
src/sentry/processing/backpressure/memory.py    85.71%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #91707      +/-   ##
==========================================
+ Coverage   87.62%   87.67%   +0.04%     
==========================================
  Files       10353    10342      -11     
  Lines      586779   586792      +13     
  Branches    22585    22528      -57     
==========================================
+ Hits       514194   514457     +263     
+ Misses      72157    71892     -265     
- Partials      428      443      +15     

@untitaker untitaker marked this pull request as ready for review May 15, 2025 12:46
@untitaker untitaker requested a review from a team as a code owner May 15, 2025 12:46
Comment on lines +313 to +314
key = self._get_queue_key(shard)
p.zcard(key)
Contributor

Please consider measuring this in bytes rather than set cardinality, as spans can be very different from each other.

A way to do this is to keep a sharded counter of the size of what you write: when you add a batch of spans you increment it, and when you flush you decrement it. Then you can rely on the amount of data in Redis rather than the number of spans.

Or, better, measure the amount of free memory in Redis like backpressure does. Then you cannot go wrong with the signal that triggers backpressure.
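For illustration, a minimal sketch of the sharded byte counter idea; the key layout, helper names, and num_shards parameter are assumptions, and this is not the approach that ultimately landed (see the reply below):

# Hypothetical sketch of the sharded byte counter suggested above; key names
# and helpers are illustrative, not the code that landed in this PR.
import redis

client = redis.StrictRedis()


def _bytes_key(shard: int) -> str:
    return f"span-buf:bytes:{shard}"  # hypothetical key layout


def add_spans(shard: int, spans: list[bytes]) -> None:
    with client.pipeline(transaction=False) as p:
        # ... the existing writes for the span payloads would go here ...
        p.incrby(_bytes_key(shard), sum(len(s) for s in spans))
        p.execute()


def flush_spans(shard: int, flushed: list[bytes]) -> None:
    client.decrby(_bytes_key(shard), sum(len(s) for s in flushed))


def buffered_bytes(num_shards: int) -> int:
    # A backpressure check would compare this total against a byte budget
    # instead of comparing ZCARD against a span-count limit.
    return sum(int(client.get(_bytes_key(shard)) or 0) for shard in range(num_shards))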

Member Author

Done. Also had a conversation with Jan and had to split this up into two kinds of backpressure; see the updated PR description.

@untitaker untitaker requested review from a team as code owners May 19, 2025 14:38
@untitaker untitaker changed the title ref(span-buffer): Add basic backpressure ref(span-buffer): Add backpressure May 19, 2025
@untitaker untitaker requested a review from jan-auer May 19, 2025 14:44
used = sum(x.used for x in memory_infos)
available = sum(x.available for x in memory_infos)
if available > 0 and used / available > self.max_memory_percentage:
logger.fatal("Pausing consumer due to Redis being full")
Member

Let's debounce this so we don't spam Sentry with too many errors
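One possible shape for the debounce, assuming a simple time-based guard; the class, attribute, and interval names below are made up for illustration:

# Possible time-based debounce for the fatal log above; names are illustrative.
import logging
import time

logger = logging.getLogger(__name__)

LOG_INTERVAL = 60.0  # seconds between repeated "Redis full" reports (hypothetical value)


class RedisFullLogger:
    def __init__(self) -> None:
        self._last_logged = 0.0

    def log_redis_full(self) -> None:
        now = time.monotonic()
        if now - self._last_logged >= LOG_INTERVAL:
            self._last_logged = now
            logger.fatal("Pausing consumer due to Redis being full")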

self.process.start()
self.process_restarts += 1
else:
raise RuntimeError("flusher process is dead")
Member

Suggested change
raise RuntimeError("flusher process is dead")
raise RuntimeError("flusher process has crashed")

Ideally we would have the inner error linked here, which is not trivial. We can note in the runbook to check for errors from the flusher process. Can we initialize the SDK with a special tag within the flusher process so this gets easier?
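A sketch of what the special tag could look like, assuming the flusher process initializes (or inherits) the sentry_sdk scope; the tag name is an assumption:

# Hedged sketch: tag events reported from the flusher process so its crashes
# are easy to find; the tag name "spans_flusher" is an assumption.
import sentry_sdk


def flusher_main() -> None:
    # Applies to the current (process-local) scope, so events from the
    # flusher process carry the tag while the main consumer's events do not.
    sentry_sdk.set_tag("spans_flusher", "1")
    # ... the flusher loop would run here ...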


return trees

def get_stored_segments(self) -> int:
Member

The return value of this is no longer needed and we use this just to record metrics. Let's rename the function accordingly.
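For example, the renamed method could look roughly like this; the new name, metric key, metrics helper, and assigned_shards attribute are assumptions, while _get_queue_key and ZCARD come from the snippet commented on above:

# Rough sketch of the rename; only _get_queue_key/ZCARD are taken from the
# diff, everything else (names, metric key) is illustrative.
from sentry.utils import metrics


def record_stored_segments(self) -> None:
    with self.client.pipeline(transaction=False) as p:
        for shard in self.assigned_shards:
            p.zcard(self._get_queue_key(shard))
        counts = p.execute()

    metrics.gauge("spans.buffer.stored_segments", sum(counts))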

def poll(self) -> None:
self.next_step.poll()

def submit(self, message: Message[FilteredPayload | int]) -> None:
Member

Please add a short docstring that shows the two conditions under which we return backpressure. Let's also document in which of the two cases the flusher itself will stop, and why.
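A possible docstring along these lines, based only on the PR description above; the exact conditions, wording, and the MessageRejected reference are assumptions rather than the merged text:

# Possible docstring for submit(); wording is an assumption based on the PR
# description, not the text that was merged.
def submit(self, message) -> None:
    """
    Submit a message, raising MessageRejected to apply backpressure when:

    1. Too many segments are ready to flush, i.e. the flusher cannot keep up.
       The flusher keeps running in this case so it can drain the backlog.
    2. Redis memory usage exceeds max_memory_percentage. Here the flusher is
       stopped as well, so partial segments are not flushed out prematurely
       while Redis is full.
    """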

@untitaker untitaker enabled auto-merge (squash) May 20, 2025 09:50
@untitaker untitaker merged commit b2f34c1 into master May 20, 2025
60 checks passed
@untitaker untitaker deleted the ref/span-buffer-backpressure branch May 20, 2025 10:08
jan-auer added a commit that referenced this pull request May 20, 2025
* master: (58 commits)
  link: cleanup link (#91687)
  ref: create project_id index for organizationonboardingtask (#91918)
  storybook: smaller last edited (#91875)
  issues: fix chonk stacktrace alignment (#91891)
  alert: drop custom alert (#91892)
  insights: fix bar height (#91895)
  ref(span-buffer): Move max-memory-percentage to right CLI (#91924)
  ref(js): Factor button functionality (#91763)
  tests(resolve_groups): Clean up the tests (#91779)
  ref(span-buffer): Add backpressure (#91707)
  fix(nextjs-insights): project id is not passed to explore link (#91920)
  fix(crons): Floor seconds / microsecond on recorded dateClock (#91890)
  fix(uptime): Fix bug with the uptime_checks dataset in the events endpoint (#91824)
  ref: add state-only migration to reflect existing indexes in prod (#91901)
  ref: remove unnecssary metaclass (#91906)
  fix(stats): use data category title name (#91913)
  feat(issues): Add success messages to some actions (#91899)
  test(taskworker): Lower exec time (#91907)
  chore(aci): manually add spans for delayed workflow processing (#91908)
  chore(aci): remove uses of WorkflowFireHistory rollout columns (#91904)
  ...

sentry bot commented May 20, 2025

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ AssertionError: cannot start a process twice sentry.spans.consumers.process.flusher in submit
  • ‼️ AssertionError: cannot start a process twice sentry.spans.consumers.process.flusher in submit
  • ‼️ AssertionError sentry.utils.kafka in run_processor_with_signals
  • ‼️ KafkaException: KafkaError{code=_DESTROY,val=-197,str="Failed to get committed offsets: Local: Broker handle destroyed"} sentry.utils.kafka in run_processor_with_signals


andrewshie-sentry pushed a commit that referenced this pull request May 20, 2025
@github-actions github-actions bot locked and limited conversation to collaborators Jun 5, 2025
