
Commit 4ae109a

feat(utils): Add stub for rate-limit-based CircuitBreaker class (#74557)
This adds a new `CircuitBreaker` class, which will eventually contain the methods necessary for a rate-limit-based circuit breaker implementation. Our current circuit breaker implementation has four drawbacks that this is aiming to solve:

- The standalone `circuit_breaker_activated` function checks how many errors have been tallied, but doesn't have an accompanying way to actually do the tallying, forcing a reimplementation of the error-tracking piece in each scenario in which it's used.
- The `with_circuit_breaker` wrapper fixes that, but only works if the errors are allowed to bubble from the (potentially fairly low-level) spot where we're actually making the request all the way up to the (potentially much higher-level) spot where we're deciding whether or not to make the request in the first place. Handle the errors gracefully anywhere below the `with_circuit_breaker` call and the breaker has no idea they happened, negating the entire purpose of keeping track.
- By catching errors with a simple `try-except`, `with_circuit_breaker` a) counts every kind of exception as an errored request, when that may or may not be accurate, and b) can only track instances of actual `Exception`s, not any other problem (like a 500 response) which may occur.
- Both methods reset the breaker from broken to closed the moment a single request goes through. That works well if the issue is a service having a complete outage, but not if the service is instead overloaded such that it's timing out on, say, 95% of requests. In that case, a single request succeeding isn't in fact a good indicator that the service is working and/or has been fixed.

To solve these problems, the new implementation:

- Will include a way to track errors, so that particular wheel doesn't have to be reinvented over and over (solving problem 1) _and_ so that it can be called selectively and in non-`Exception` spots (solving both halves of problem 3).
- Will do the tracking in a separate method from the one checking whether requests should be made, so those two functions can happen at different levels of the call stack (solving problem 2; see the sketch after this message).
- Will be based on a rate limiter rather than an immediately-resetting tally, so errors don't have to be consecutive to trip the breaker (solving the initial-outage-detection part of problem 4) and so that we don't conclude that an overloaded service has caught back up when it hasn't (solving the service-restoration-detection part of problem 4).

This PR adds the class (along with a `CircuitBreakerConfig` class), documentation of how to use it, a constructor which sets up the appropriate rate limits, and tests of said constructor. Future PRs will add other helper methods and add and test the core error-tracking and request-gating methods.

The initial use of the circuit breaker will be for Seer-based grouping during ingest. Once we're happy with how it's working, we can go back and refactor the places which use the current circuit breaker to use this new one.
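
To make the "tally low, gate high" split in the second bullet above concrete, here is a minimal sketch of the intended call pattern. Only the `CircuitBreaker`/`CircuitBreakerConfig` API added in this PR is assumed; the breaker key, the config numbers, and the `call_seer`/`fetch_similar_issues` helpers are illustrative stand-ins, and `should_allow_request()`/`record_error()` are the request-gating and error-tracking methods the follow-up PRs will add (they also appear in the class docstring below).

    from sentry.utils.circuit_breaker2 import CircuitBreaker, CircuitBreakerConfig

    # Illustrative config: trip the breaker after 200 errors within an hour, then block
    # all requests for two minutes before entering RECOVERY
    config: CircuitBreakerConfig = {
        "error_limit": 200,
        "error_limit_window": 3600,
        "broken_state_duration": 120,
    }
    breaker = CircuitBreaker("seer_similarity", config)  # hypothetical breaker key


    def _post_to_seer(payload):
        # Hypothetical low-level transport helper, standing in for the real request code
        raise TimeoutError


    def call_seer(payload):
        # Low-level spot: the error is caught and tallied right here, so it never has to
        # bubble up for the breaker to find out about it
        try:
            return _post_to_seer(payload)
        except TimeoutError:
            breaker.record_error()
            return None


    def fetch_similar_issues(event_data):
        # Higher-level spot: decide whether to make the request at all, without ever
        # seeing the low-level errors
        if not breaker.should_allow_request():
            return []
        return call_seer(event_data) or []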
1 parent 7f16895 commit 4ae109a

File tree

2 files changed: +303 -0 lines changed

Lines changed: 182 additions & 0 deletions (the new `sentry.utils.circuit_breaker2` module)
@@ -0,0 +1,182 @@
"""
NOTE: This is a replacement for the current circuit breaker implementation, which is why it is
`circuit_breaker2`. It's first going to be used for the Seer similarity service, then once we're
confident it works we can replace use of the original for the severity service with use of this one
and get rid of the old one, at which point this can lose the `2`.
"""

import logging
from enum import Enum
from typing import NotRequired, TypedDict

from django.conf import settings

from sentry.ratelimits.sliding_windows import Quota, RedisSlidingWindowRateLimiter

logger = logging.getLogger(__name__)

# XXX: If either of these values changes, the `CircuitBreakerConfig` docstrings
# need to be updated
# How many times stricter to be with the error limit during recovery
DEFAULT_RECOVERY_STRICTNESS = 10
# How many times the length of the error window to make the recovery window
DEFAULT_RECOVERY_WINDOW_MULTIPLIER = 2


class CircuitBreakerState(Enum):
    OK = "circuit_okay"
    BROKEN = "circuit_broken"
    RECOVERY = "recovery"


class CircuitBreakerConfig(TypedDict):
    # The number of errors within the given time period necessary to trip the breaker
    error_limit: int
    # The time period, in seconds, over which we're tracking errors
    error_limit_window: int
    # How long, in seconds, to stay in the BROKEN state (blocking all requests) before entering the
    # RECOVERY phase
    broken_state_duration: int
    # The number of errors within the given time period necessary to trip the breaker while in
    # RECOVERY. Will be set automatically to 10% of `error_limit` if not provided.
    recovery_error_limit: NotRequired[int]
    # The length, in seconds, of each time bucket ("granule") used by the underlying rate limiter -
    # effectively the resolution of the time window. Will be set automatically based on
    # `error_limit_window` if not provided.
    error_limit_window_granularity: NotRequired[int]
    # How long, in seconds, to stay in the RECOVERY state (allowing requests but with a stricter
    # error limit) before returning to normal operation. Will be set to twice `error_limit_window`
    # if not provided.
    recovery_duration: NotRequired[int]


class CircuitBreaker:
    """
    A circuit breaker to be used to temporarily block requests to or calls of a service or function
    which is throwing too many errors.

    The breaker has three states: circuit OK, circuit BROKEN, and circuit in RECOVERY. (These states
    respectively correspond to the closed, open, and half-open states of the traditional circuit
    breaker model, but are hopefully easier to keep straight than a model where closed is good and
    open is bad.)

    In an OK state (normal operation), the breaker tracks errors but allows through all requests.
    If the frequency of errors passes a given threshold, it moves to BROKEN state.

    In a BROKEN state, all requests are blocked. Once a set amount of time has passed, it moves
    to RECOVERY state.

    RECOVERY state is identical to OK state, except that the threshold for the circuit breaking
    (moving back into BROKEN state) is much stricter. Once a set amount of time has passed
    without the breaker being tripped, it moves back to OK state.

    The overall idea is to stop hitting a service which seems to be failing, but periodically make
    short attempts to use it in order to be able to resume requests once it comes back up.

    Usage:

    # See `CircuitBreakerConfig` class for config options
    breaker = CircuitBreaker("squirrel_chasing", config)

    def get_top_dogs(payload):
        # Check the state of the breaker before calling the service
        try:
            if breaker.should_allow_request():
                response = call_chase_simulation_service("/hall-of-fame", payload)
            else:
                logger.warning("Request blocked by circuit breaker!")
                return None

        # Call `record_error` only in `except` blocks whose errors should count towards the quota
        except TimeoutError:
            breaker.record_error()
            return "timeout"  # or reraise
        except BadInputError:
            return "bad input"
        except Exception:
            breaker.record_error()
            return "unknown error"

        # Call `record_error` for other problems which should count as errors
        if response.status == 500:
            breaker.record_error()
            return f"got {response.status}"

        return format_hof_entries(response)

    The `breaker.should_allow_request()` check can alternatively be used outside of `get_top_dogs`,
    to prevent calls to it. In that case, the original `breaker` object can be imported alongside
    `get_top_dogs` or reinstantiated with the same config - it has no state of its own, instead
    relying on redis-backed rate limiters and redis itself to track error count and breaker status.
    """

    def __init__(self, key: str, config: CircuitBreakerConfig):
        self.key = key
        self.broken_state_key = f"{key}.circuit_breaker.broken"
        self.recovery_state_key = f"{key}.circuit_breaker.in_recovery"

        self.error_limit = config["error_limit"]
        default_recovery_error_limit = max(self.error_limit // DEFAULT_RECOVERY_STRICTNESS, 1)
        self.recovery_error_limit = config.get("recovery_error_limit", default_recovery_error_limit)

        self.window = config["error_limit_window"]
        self.window_granularity = config.get(
            "error_limit_window_granularity", max(self.window // 20, 5)
        )

        self.broken_state_duration = config["broken_state_duration"]
        self.recovery_duration = config.get(
            "recovery_duration", self.window * DEFAULT_RECOVERY_WINDOW_MULTIPLIER
        )

        self.limiter = RedisSlidingWindowRateLimiter()
        self.redis_pipeline = self.limiter.client.pipeline()

        self.primary_quota = Quota(
            self.window,
            self.window_granularity,
            self.error_limit,
            f"{key}.circuit_breaker.ok",
        )
        self.recovery_quota = Quota(
            self.window,
            self.window_granularity,
            self.recovery_error_limit,
            f"{key}.circuit_breaker.recovery",
        )

        # In the following sanity checks, if we're in dev, throw an error on bad config so it can be
        # fixed permanently. In prod, just warn and fix it ourselves.
        log = logger.error if settings.DEBUG else logger.warning

        if self.recovery_error_limit >= self.error_limit:
            log(
                "Circuit breaker '%s' has a recovery error limit (%d) greater than or equal"
                + " to its primary error limit (%d). Using the stricter error-limit-based"
                + " default (%d) instead.",
                key,
                self.recovery_error_limit,
                self.error_limit,
                default_recovery_error_limit,
            )
            self.recovery_error_limit = default_recovery_error_limit

        # XXX: If we discover we have a config where we want this combo to work, we can consider
        # using the `MockCircuitBreaker._clear_quota` helper, which is currently only used in tests,
        # to clear out the main quota when we switch to the BROKEN state. (It will need tests of its
        # own if so.)
        if self.broken_state_duration + self.recovery_duration < self.window:
            default_recovery_duration = self.window - self.broken_state_duration
            log(
                "Circuit breaker '%s' has BROKEN and RECOVERY state durations (%d and %d sec, respectively)"
                + " which together are less than the main error limit window (%d sec). This can lead to the"
                + " breaker getting tripped unexpectedly, until the original spike in errors clears the"
                + " main time window. Extending RECOVERY period to %d seconds, to give the primary quota time"
                + " to clear.",
                key,
                self.broken_state_duration,
                self.recovery_duration,
                self.window,
                default_recovery_duration,
            )
            self.recovery_duration = default_recovery_duration
Lines changed: 121 additions & 0 deletions (tests for the new `CircuitBreaker` class)
@@ -0,0 +1,121 @@
from unittest import TestCase
from unittest.mock import ANY, MagicMock, patch

from django.conf import settings
from redis.client import Pipeline

from sentry.ratelimits.sliding_windows import Quota, RedisSlidingWindowRateLimiter
from sentry.testutils.helpers.datetime import freeze_time
from sentry.utils.circuit_breaker2 import CircuitBreaker, CircuitBreakerConfig

# Note: These need to be relatively big. If the limit is too low, the RECOVERY quota isn't big
# enough to be useful, and if the window is too short, redis (which doesn't seem to listen to the
# @freeze_time decorator) will expire the state keys.
DEFAULT_CONFIG: CircuitBreakerConfig = {
    "error_limit": 200,
    "error_limit_window": 3600,  # 1 hr
    "broken_state_duration": 120,  # 2 min
}


@freeze_time()
class CircuitBreakerTest(TestCase):
    def setUp(self) -> None:
        self.config = DEFAULT_CONFIG
        self.breaker = CircuitBreaker("dogs_are_great", self.config)

        # Clear all existing keys from redis
        self.breaker.redis_pipeline.flushall()
        self.breaker.redis_pipeline.execute()

    def test_sets_default_values(self):
        breaker = self.breaker

        assert breaker.__dict__ == {
            "key": "dogs_are_great",
            "broken_state_key": "dogs_are_great.circuit_breaker.broken",
            "recovery_state_key": "dogs_are_great.circuit_breaker.in_recovery",
            "error_limit": 200,
            "recovery_error_limit": 20,
            "window": 3600,
            "window_granularity": 180,
            "broken_state_duration": 120,
            "recovery_duration": 7200,
            # These can't be compared with a simple equality check and therefore are tested
            # individually below
            "limiter": ANY,
            "primary_quota": ANY,
            "recovery_quota": ANY,
            "redis_pipeline": ANY,
        }
        assert isinstance(breaker.limiter, RedisSlidingWindowRateLimiter)
        assert isinstance(breaker.primary_quota, Quota)
        assert isinstance(breaker.recovery_quota, Quota)
        assert breaker.primary_quota.__dict__ == {
            "window_seconds": 3600,
            "granularity_seconds": 180,
            "limit": 200,
            "prefix_override": "dogs_are_great.circuit_breaker.ok",
        }
        assert breaker.recovery_quota.__dict__ == {
            "window_seconds": 3600,
            "granularity_seconds": 180,
            "limit": 20,
            "prefix_override": "dogs_are_great.circuit_breaker.recovery",
        }
        assert isinstance(breaker.redis_pipeline, Pipeline)

    @patch("sentry.utils.circuit_breaker2.logger")
    def test_fixes_too_loose_recovery_limit(self, mock_logger: MagicMock):
        config: CircuitBreakerConfig = {
            **DEFAULT_CONFIG,
            "error_limit": 200,
            "recovery_error_limit": 400,
        }

        for settings_debug_value, expected_log_function in [
            (True, mock_logger.error),
            (False, mock_logger.warning),
        ]:
            settings.DEBUG = settings_debug_value
            breaker = CircuitBreaker("dogs_are_great", config)

            expected_log_function.assert_called_with(
                "Circuit breaker '%s' has a recovery error limit (%d) greater than or equal"
                + " to its primary error limit (%d). Using the stricter error-limit-based"
                + " default (%d) instead.",
                breaker.key,
                400,
                200,
                20,
            )
            assert breaker.recovery_error_limit == 20

    @patch("sentry.utils.circuit_breaker2.logger")
    def test_fixes_mismatched_state_durations(self, mock_logger: MagicMock):
        config: CircuitBreakerConfig = {
            **DEFAULT_CONFIG,
            "error_limit_window": 600,
            "broken_state_duration": 100,
            "recovery_duration": 200,
        }
        for settings_debug_value, expected_log_function in [
            (True, mock_logger.error),
            (False, mock_logger.warning),
        ]:
            settings.DEBUG = settings_debug_value
            breaker = CircuitBreaker("dogs_are_great", config)

            expected_log_function.assert_called_with(
                "Circuit breaker '%s' has BROKEN and RECOVERY state durations (%d and %d sec, respectively)"
                + " which together are less than the main error limit window (%d sec). This can lead to the"
                + " breaker getting tripped unexpectedly, until the original spike in errors clears the"
                + " main time window. Extending RECOVERY period to %d seconds, to give the primary quota time"
                + " to clear.",
                breaker.key,
                100,
                200,
                600,
                500,
            )
            assert breaker.recovery_duration == 500

0 commit comments
