Adding retry process for dead letter queue #924
Conversation
Leftover copypasta from sentry tests
- warn when we have a bug in the queue with no items
- various debug messages
Co-authored-by: Mathieu Leplatre <[email protected]>
- turn back into a method
- allow callers to pass a bug_id to filter size by bug
- Return iterator of items from get()
- Return dict of bug id, items from get_all()
- Return backend.get_all() from retrieve (instead of a flat list)
grahamalama left a comment
Made a few comments as a first pass, I'll think more about how to best manage the queue in another review.
jbi/retry.py
Outdated
CONSTANT_RETRY = getenv("CONSTANT_RETRY", "false") == "true"
RETRY_TIMEOUT_DAYS = getenv("RETRY_TIMEOUT_DAYS", 7)
If we're specifying these values in jbi/environment.py, we should probably access them with
settings = get_settings()
...
settings.constant_retry
...
settings.retry_timeout_days
However, if these settings only pertain to this script (and won't matter to the FastAPI app), maybe we can move them out of environment.py and access them like you have here. That, or build a new BaseSettings object so we get validation.
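For illustration, a minimal sketch of that second option (a dedicated settings object). It assumes pydantic is already a dependency; the class and field names here are hypothetical, not the project's actual ones:

```python
# Hypothetical sketch only -- assumes pydantic v2 with pydantic-settings; with
# pydantic v1 the import would be `from pydantic import BaseSettings` instead.
from pydantic_settings import BaseSettings


class RetrySettings(BaseSettings):
    """Settings only the retry script cares about, validated on load."""

    constant_retry: bool = False   # read from the CONSTANT_RETRY env var
    retry_timeout_days: int = 7    # read from the RETRY_TIMEOUT_DAYS env var


retry_settings = RetrySettings()
```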
I ran with moving them out of environment.py since they're only used in one place. This feels a little weird because we now have two entry points into the same application, but since we're sharing so much code I think it makes sense. Let me know if you think going another direction is better.
jbi/retry.py
Outdated
try:
    runner.execute_action(item.payload, ACTIONS)
    await queue.done(item)
except Exception as ex:
Should we split runner.execute_action(item.payload, ACTIONS) and await queue.done(item) into separate try/except blocks? There's a small but non-zero chance that we successfully execute an action, but then something goes wrong when we try to mark the item as done.
We could even retry marking the item as done if something goes wrong.
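Roughly what that split could look like, reusing item, queue, runner, ACTIONS, and logger from the surrounding module. This is a sketch of the idea, not the PR's actual code:

```python
try:
    runner.execute_action(item.payload, ACTIONS)
except Exception:
    logger.exception("failed to reprocess event %s", item.identifier)
    # ...record the failure and skip the remaining events for this bug...
else:
    try:
        await queue.done(item)
    except Exception:
        logger.exception("failed to mark event %s as done", item.identifier)
        # could retry queue.done(item) once or twice here before giving up
```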
Mapping this out (this comment might get messy).
Current code workflow looks like this:
- Process the event successfully via execute_action
- Fail to mark it as done (can't delete the file; likely an IO error or maybe a k8s config issue)
- Skip processing future events for this bug
- On the next retry, reprocess the same event even though we processed it successfully the first time
- Mark it as done successfully this time
- Process other events in the queue for this bug as normal
On the surface, that doesn't seem like a problem. We might process the same event twice (several hours apart), but ultimately we're updating the same data to the same values. What problems may occur here that I'm not thinking of?
If we alter the code to note that we failed to mark the item as done, it would look like:
- Process the event successfully via execute_action
- Fail to mark it as done
- Update the existing file (rename or rewrite) to note that we failed to delete it. Or maybe write an additional file?
- Continue processing other events for this bug since we didn't actually fail to update
- On the next retry, check for this note so we know to skip execute_action and just try to mark it as done
- Continue as normal
I think this adds more complexity (more branching, more test cases) and is likely to run into the same IO issue that prevented us from deleting the file. I don't think we gain anything, unless it would stop us from overwriting something in Jira that a user had changed manually.
Does that stream of thoughts seem right? Am I missing anything?
Edited to add: in most event-based architectures you're guaranteed to get an event at least once. Duplicate events will happen sometimes due to networking hiccups, code issues, etc. But they probably don't have a multi-hour retry time like we do.
> What problems may occur here that I'm not thinking of?
Some operations are creation, like comments for example.
We may end up with a comment being posted twice. I don't think it's critical, and I value simple code over handling this (especially because once this is set up, failing to write to disk should not happen easily).
That's a good thing to note. If that becomes an issue, I think the easiest solution is to do upserts for all create/update commands. That might be a little cumbersome through the Jira API but hopefully we can create a simple wrapper for ourselves if it is.
This would also cover the case where something was created but didn't make it into Jira (for whatever reason), and then we receive an update for it later.
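A hypothetical sketch of what such a wrapper could look like. The client protocol, method names, and marker convention below are made up for illustration and are not the project's actual Jira client API:

```python
from typing import Any, Optional, Protocol


class JiraClient(Protocol):
    """Placeholder protocol -- the real client's method names may differ."""

    def get_comments(self, issue_key: str) -> list[dict[str, Any]]: ...
    def create_comment(self, issue_key: str, body: str) -> dict[str, Any]: ...
    def update_comment(self, issue_key: str, comment_id: str, body: str) -> dict[str, Any]: ...


def upsert_comment(client: JiraClient, issue_key: str, marker: str, body: str) -> dict[str, Any]:
    """Create the comment if none carries `marker` yet, otherwise update it in place.

    Reprocessing the same event then rewrites one comment instead of posting a duplicate.
    """
    existing: Optional[dict[str, Any]] = next(
        (c for c in client.get_comments(issue_key) if marker in c.get("body", "")),
        None,
    )
    if existing is None:
        return client.create_comment(issue_key, f"{body}\n\n{marker}")
    return client.update_comment(issue_key, existing["id"], f"{body}\n\n{marker}")
```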
This also means we're marking a webhook event's time property as not optional
This allows us to fetch the item identifiers in the queue without loading the items into memory
Also, document QueueItemRetrievalError
…ion into dlq-retry-class
jbi/retry.py
Outdated
# skip and delete item if we have exceeded max_timeout
if item.timestamp < min_event_timestamp:
    logger.warning("removing expired event %s", item.identifier)
    await queue.done(item)
Do we need another method for this? discard(), or something with better semantics than done()? Especially if the queue logs things like "item X is done".
The queue should be just an IO operator and not have any real logic in it. It feels a little weird to have two different functions that do the same functional operation but with potentially different debug logs. Am I thinking about it wrong?
Looking at the logs the queue emits, we don't currently have a conflict problem. The debug-level logs say things like "Removed {event} from queue for bug {bug_id}." and "Removed directory for bug {bug_id}". Our log level will be above debug in prod, and we won't alert on those.
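For concreteness, a hypothetical sketch of the discard() alternative being weighed here: a thin method that shares the IO with done() and only changes the log line. The class, method bodies, and log messages are illustrative, not the actual queue code:

```python
import logging

logger = logging.getLogger(__name__)


class DeadLetterQueue:  # illustrative, not the actual queue class
    async def done(self, item) -> None:
        await self._delete(item)
        logger.debug("Removed %s from queue", item.identifier)

    async def discard(self, item) -> None:
        """Same IO as done(), but logged as an expiry instead of a success."""
        logger.info("Discarding expired event %s", item.identifier)
        await self._delete(item)

    async def _delete(self, item) -> None:
        ...  # remove the item's file from the backend
```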
Most of my comments are nitpicks 😊 I think there's a way to do something slightly more elegant when it comes to mocking responses of the queue, but I won't veto shipping this. We can always refactor in follow-ups if we want to hit our milestone this week 😉
jbi/retry.py
Outdated
skipped_events = await queue.list(bug_id)
if (
    len(skipped_events) > 1
):  # if this isn't the only event for the bug
What does this comment mean? What are we checking for in this condition?
The current event we're processing will exist in the queue, so len(queue.list(bug_id)) will always be at least 1. We only skip other events if there is more than one event pending for the current bug. Does that make sense, and can I make this comment clearer?
Switched this to just pull size(). But same deal: it will count the current event, so we need to subtract 1.
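For reference, a sketch of the adjusted check, reusing queue, bug_id, and logger from the diff above, and assuming size() is awaitable like the other queue methods and counts the event currently being processed:

```python
pending = await queue.size(bug_id)
other_events = pending - 1  # size() includes the event we just failed on
if other_events > 0:
    logger.info("skipping %d other event(s) queued for bug %s", other_events, bug_id)
```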
Co-authored-by: Mathieu Leplatre <[email protected]>
…ion into dlq-retry-class
Built on top of #918.
This adds the retry process and associated unit tests to reprocess failed events from the Dead Letter Queue.
Added unit tests for scenarios outlined in ADR.
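For orientation, here is a condensed sketch of the retry flow discussed in this thread, pieced together from the fragments visible in the review. Names and structure may differ from the PR's actual jbi/retry.py; in particular, get_all() is assumed to return a mapping of bug id to pending items, oldest first:

```python
import logging
from datetime import datetime, timedelta, timezone

logger = logging.getLogger(__name__)

RETRY_TIMEOUT_DAYS = 7  # see the settings discussion above


async def retry_failed(queue, runner, actions) -> None:
    """Reprocess queued events per bug, skipping a bug's remaining events after a failure."""
    min_event_timestamp = datetime.now(timezone.utc) - timedelta(days=RETRY_TIMEOUT_DAYS)

    for bug_id, items in (await queue.get_all()).items():
        for item in items:
            # Drop events that have exceeded the retry timeout.
            if item.timestamp < min_event_timestamp:
                logger.warning("removing expired event %s", item.identifier)
                await queue.done(item)
                continue
            try:
                runner.execute_action(item.payload, actions)
                await queue.done(item)
            except Exception:
                logger.exception("failed to reprocess event %s", item.identifier)
                # Preserve ordering: don't process this bug's later events after a failure.
                break
```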