ref(span-buffer): Add backpressure #91707
Conversation
Codecov Report

Attention: Patch coverage is incomplete; 17 lines in the changed files are missing coverage.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/sentry/spans/consumers/process/flusher.py | 52.94% | 16 Missing |
| src/sentry/processing/backpressure/memory.py | 85.71% | 1 Missing |
Additional details and impacted files
@@ Coverage Diff @@
## master #91707 +/- ##
==========================================
+ Coverage 87.62% 87.67% +0.04%
==========================================
Files 10353 10342 -11
Lines 586779 586792 +13
Branches 22585 22528 -57
==========================================
+ Hits 514194 514457 +263
+ Misses 72157 71892 -265
- Partials 428 443 +15

key = self._get_queue_key(shard)
p.zcard(key)
Please consider measuring this in bytes rather than set cardinality, since spans can differ a lot in size.
One way to do this is to keep a sharded counter of the number of bytes you write: increment it when you add a batch of spans and decrement it when you flush. Then you can rely on the amount of data in Redis rather than on the number of spans.
Or, better, measure the amount of free memory in Redis, like the backpressure machinery does. Then you cannot go wrong with the signal used to trigger backpressure.
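To make the first suggestion concrete, here is a minimal sketch of a sharded byte counter kept alongside the span data, assuming a redis-py client; the key layout and helper names are illustrative, not the PR's actual code.

```python
import redis

client = redis.Redis()


def _size_key(shard: int) -> str:
    # Hypothetical key layout for the per-shard byte counter.
    return f"span-buf:sz:{shard}"


def on_spans_added(shard: int, spans: list[bytes]) -> None:
    # Increment by the payload size we just wrote to this shard.
    client.incrby(_size_key(shard), sum(len(s) for s in spans))


def on_spans_flushed(shard: int, spans: list[bytes]) -> None:
    # Decrement by what was flushed, so the counter tracks bytes still buffered.
    client.decrby(_size_key(shard), sum(len(s) for s in spans))


def buffered_bytes(shards: list[int]) -> int:
    # Sum the per-shard counters in one round trip.
    with client.pipeline(transaction=False) as p:
        for shard in shards:
            p.get(_size_key(shard))
        return sum(int(v or 0) for v in p.execute())
```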
Done. I also had a conversation with Jan and ended up splitting this into two kinds of backpressure; see the updated PR description.
used = sum(x.used for x in memory_infos)
available = sum(x.available for x in memory_infos)
if available > 0 and used / available > self.max_memory_percentage:
    logger.fatal("Pausing consumer due to Redis being full")
Let's debounce this so we don't spam Sentry with too many errors
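One possible shape for the debounce, as a rough sketch; the interval and the module-level state are assumptions, not what the PR ended up doing:

```python
import logging
import time

logger = logging.getLogger(__name__)

_DEBOUNCE_SECONDS = 60.0
_last_logged = 0.0


def log_redis_full() -> None:
    # Emit the fatal log at most once per debounce interval, so a sustained
    # full-Redis condition does not flood Sentry with duplicate events.
    global _last_logged
    now = time.monotonic()
    if now - _last_logged >= _DEBOUNCE_SECONDS:
        _last_logged = now
        logger.fatal("Pausing consumer due to Redis being full")
```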
    self.process.start()
    self.process_restarts += 1
else:
    raise RuntimeError("flusher process is dead")
Suggested change:
-    raise RuntimeError("flusher process is dead")
+    raise RuntimeError("flusher process has crashed")
Ideally we would have the inner error linked here, which is not trivial. We can note in the runbook to check for errors from the flusher process. Can we initialize the SDK with a special tag within the flusher process so this gets easier?
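A hedged sketch of the tagging idea, assuming the SDK is already initialized inside the spawned flusher process; the tag name is made up for illustration:

```python
import sentry_sdk


def flusher_run() -> None:
    # Inside the flusher process: tag every event it reports so its crashes
    # are easy to find when the parent only sees "process has crashed".
    sentry_sdk.set_tag("spans.flusher", "true")
    ...  # main flush loop goes here
```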
src/sentry/spans/buffer.py (outdated)

        return trees

    def get_stored_segments(self) -> int:
The return value of this is no longer needed; we use it only to record metrics. Let's rename the function accordingly.
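A possible shape after the rename, as a sketch; `_get_queue_key` and `zcard` come from the diff above, while the new name, the `metrics` helper, and the other attributes are assumptions:

```python
from sentry.utils import metrics


def record_stored_segments(self) -> None:
    # Formerly get_stored_segments() -> int; the count is now only emitted
    # as a metric, so nothing is returned.
    with self.client.pipeline(transaction=False) as p:
        for shard in self.assigned_shards:
            p.zcard(self._get_queue_key(shard))
        metrics.timing("spans.buffer.stored_segments", sum(p.execute()))
```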
    def poll(self) -> None:
        self.next_step.poll()

    def submit(self, message: Message[FilteredPayload | int]) -> None:
Please add a short docstring that describes the two conditions under which we apply backpressure. Let's also document in which of the two cases the flusher itself will stop, and why.
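A possible docstring, drafted from the PR description; the exact wording, and the claim about when the flusher stops, are my reading rather than the author's text:

```python
def submit(self, message: Message[FilteredPayload | int]) -> None:
    """
    Submit a batch of spans to the buffer, applying backpressure in two cases:

    1. The flusher has fallen behind and too many segments are ready to
       flush: we reject the message so the consumer slows down while the
       flusher keeps draining Redis.
    2. Redis is close to its memory limit: the whole consumer pauses. Since
       nothing new is admitted and we do not want to flush partial segments
       prematurely, the flusher is effectively stopped in this case as well.
    """
```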
* master: (58 commits)
  link: cleanup link (#91687)
  ref: create project_id index for organizationonboardingtask (#91918)
  storybook: smaller last edited (#91875)
  issues: fix chonk stacktrace alignment (#91891)
  alert: drop custom alert (#91892)
  insights: fix bar height (#91895)
  ref(span-buffer): Move max-memory-percentage to right CLI (#91924)
  ref(js): Factor button functionality (#91763)
  tests(resolve_groups): Clean up the tests (#91779)
  ref(span-buffer): Add backpressure (#91707)
  fix(nextjs-insights): project id is not passed to explore link (#91920)
  fix(crons): Floor seconds / microsecond on recorded dateClock (#91890)
  fix(uptime): Fix bug with the uptime_checks dataset in the events endpoint (#91824)
  ref: add state-only migration to reflect existing indexes in prod (#91901)
  ref: remove unnecssary metaclass (#91906)
  fix(stats): use data category title name (#91913)
  feat(issues): Add success messages to some actions (#91899)
  test(taskworker): Lower exec time (#91907)
  chore(aci): manually add spans for delayed workflow processing (#91908)
  chore(aci): remove uses of WorkflowFireHistory rollout columns (#91904)
  ...
Suspect Issues

This pull request was deployed and Sentry observed the following issues:
1. If the flusher dies, the consumer will crash too instead of filling up Redis further (sketched below). I believe this is what we recently saw ~13h ago in US: a flusher died because of a Kafka blip, but the consumer kept filling up a particular shard.
2. If the flusher is not fast enough and there are too many segments that would be "ready" to flush, we apply backpressure.
3. If Redis OOMs, we stop the entire consumer and therefore prevent data from being flushed out prematurely.

I'm a bit concerned that point 3 is not acceptable for oncall, since it basically stops the consumer until Redis is scaled up (unlike other memory-based backpressure, where we eventually recover automatically), but from a product POV it's also bad to flush partial segments prematurely. In practice we'll have to overprovision Redis' memory and set early alerts to scale up Redis soon enough.
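A minimal sketch of the liveness check behind point 1, assuming the flusher runs as a multiprocessing.Process owned by the consumer strategy; the attribute names mirror the diff above, but the surrounding structure is assumed, and the real code also attempts restarts (see the restart counter above) before giving up:

```python
def poll(self) -> None:
    # If the flusher subprocess has died, crash the consumer rather than
    # keep accepting spans and filling up Redis.
    if not self.process.is_alive():
        raise RuntimeError("flusher process has crashed")
    self.next_step.poll()
```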