
Conversation

Contributor

@alexcottner alexcottner commented Mar 26, 2024

Built on top of #918 .

This adds the retry process and associated unit tests to reprocess failed events from the Dead Letter Queue.
Added unit tests for scenarios outlined in ADR.

@alexcottner alexcottner added the enhancement New feature or request label Mar 26, 2024
Contributor

@grahamalama grahamalama left a comment

Made a few comments as a first pass, I'll think more about how to best manage the queue in another review.

jbi/retry.py Outdated
Comment on lines 12 to 13
CONSTANT_RETRY = getenv("CONSTANT_RETRY", "false") == "true"
RETRY_TIMEOUT_DAYS = getenv("RETRY_TIMEOUT_DAYS", 7)
Contributor

If we're specifying these values in jbi/environment.py, we should probably access them with

settings = get_settings()
...
settings.constant_retry
...
settings.retry_timeout_days

However, if these settings only pertain to this script (and won't matter to the FastAPI app), maybe we can move them out of environment.py and access them like you have here. That, or build a new BaseSettings object so we get validation.
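
For illustration, a minimal sketch of that second option — a dedicated BaseSettings class for the retry script. The class and function names here are assumptions for the example, not the actual jbi code:

import os
from functools import lru_cache

from pydantic_settings import BaseSettings  # pydantic v2; in v1 this lives in pydantic itself


class RetrySettings(BaseSettings):
    # Hypothetical settings class scoped to the retry script; field names mirror
    # the env vars from the snippet above (CONSTANT_RETRY, RETRY_TIMEOUT_DAYS).
    constant_retry: bool = False
    retry_timeout_days: int = 7


@lru_cache
def get_retry_settings() -> RetrySettings:
    # Cached so repeated calls return the same validated instance.
    return RetrySettings()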

Contributor Author

@alexcottner alexcottner Mar 27, 2024

I ran with moving them out of environment.py since they're only used in one place. This feels a little weird because we have 2 entry points into the same application. But we're sharing so much code I think it makes sense. Let me know if you think going another direction is better.
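
The second entry point can stay very small; a sketch only, with main() standing in for whatever the retry loop in jbi/retry.py is actually called:

import asyncio


async def main():
    # placeholder for the actual retry loop in jbi/retry.py
    ...


if __name__ == "__main__":
    # The retry script runs as its own process, separate from the FastAPI app,
    # while importing the same shared jbi modules.
    asyncio.run(main())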

jbi/retry.py Outdated
Comment on lines 41 to 44
try:
    runner.execute_action(item.payload, ACTIONS)
    await queue.done(item)
except Exception as ex:
Contributor

Should we split runner.execute_action(item.payload, ACTIONS) and await queue.done(item) into separate try / except blocks? There's a small but non-zero chance that we successfully execute an action, but then something goes wrong when we try to mark the item as done.

We could even retry marking the item as done if something goes wrong.
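
A sketch of what that split might look like, reusing the names from the snippet above (the log messages are illustrative):

try:
    runner.execute_action(item.payload, ACTIONS)
except Exception:
    # the action itself failed; leave the item in the queue for the next retry run
    logger.exception("failed to reprocess event %s", item.identifier)
else:
    try:
        await queue.done(item)
    except Exception:
        # the action succeeded but we couldn't remove the item,
        # so the same event will be re-executed on the next retry run
        logger.exception("failed to mark event %s as done", item.identifier)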

Contributor Author

@alexcottner alexcottner Mar 27, 2024

Mapping this out (this comment might get messy).

Current code workflow looks like this:

  1. Process event successfully via execute_action
  2. Fail to mark as done (can't delete file, likely IO error or maybe a k8s config issue)
  3. Skip processing future events for this bug
  4. On next retry, we reprocess the same event even though we processed it successfully the first time
  5. Mark it as done successfully this time
  6. Process other events in queue for this bug as normal

On the surface, that doesn't seem like a problem. We might process the same event twice (several hours apart), but ultimately we're updating the same data to the same values. What problems may occur here that I'm not thinking of?

If we alter the code to note that we failed to mark the item as done, it would look like:

  1. Process event successfully via execute_action
  2. Fail to mark as done
  3. Update the existing file (rename or rewrite) to note that we failed to delete it. Or maybe write an additional file?
  4. Continue processing other events for this bug since we didn't actually fail to update
  5. On next retry, check for this note so we know to skip execute_action and just try to mark it as done
  6. Continue as normal

I think this adds more complexity (more branching, more test cases) and is likely to run into the same IO issue that prevented us from deleting the file in the first place. I don't think we gain anything, unless it would stop us from overwriting something in Jira that a user had changed manually in the meantime.

Does that stream of thoughts seem right? Am I missing anything?

Edited to add: in most event-based architectures you're guaranteed to get each event at least once. Duplicated events will happen sometimes due to networking hiccups, code issues, etc. But those systems probably don't have a multi-hour retry interval like we do.

Contributor

What problems may occur here that I'm not thinking of?

Some operations are creations, comments for example.
We may end up with a comment being posted twice. I don't think that's critical, and I value simple code over handling it (especially because once this is set up, failing to write to disk shouldn't happen easily).

Contributor Author

That's a good thing to note. If that becomes an issue, I think the easiest solution is to do upserts for all create/update commands. That might be a little cumbersome through the Jira API but hopefully we can create a simple wrapper for ourselves if it is.

It would also cover the case where something was created but didn't make it into Jira (for whatever reason) and we then receive an update for it later.
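
If we ever needed it, the wrapper could be something like the sketch below. jira_client, find_comment_by_marker, and the marker convention are all hypothetical here, not existing jbi or Jira client APIs:

def upsert_comment(jira_client, issue_key: str, marker: str, body: str):
    # Look for a previously posted comment carrying our marker; create it if
    # it's missing, update it otherwise, so replayed events don't duplicate it.
    existing = find_comment_by_marker(jira_client, issue_key, marker)  # hypothetical lookup
    text = f"{marker}\n{body}"
    if existing is None:
        jira_client.create_comment(issue_key, text)  # assumed client method
    else:
        jira_client.update_comment(issue_key, existing.id, text)  # assumed client method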

grahamalama and others added 6 commits March 27, 2024 13:49
This also means we're marking a webhook event's time property as not optional
This allows us to fetch the item identifiers in the queue without loading the items into memory
Also, document QueueItemRetrievalError
@alexcottner alexcottner marked this pull request as ready for review March 27, 2024 21:39
@alexcottner alexcottner requested a review from a team as a code owner March 27, 2024 21:39
jbi/retry.py Outdated

# skip and delete item if we have exceeded max_timeout
if item.timestamp < min_event_timestamp:
    logger.warning("removing expired event %s", item.identifier)
    await queue.done(item)
Contributor

Do we need another method for this? discard(), or something with better semantics than done()? Especially if the queue logs things like "item X is done".

Contributor Author

The queue should be just an IO operator and not have any real logic in it. It feels a little weird to have two different functions that do the same functional operation but with potentially different debug logs. Am I thinking about it wrong?

Looking at the logs the queue emits, we don't currently have a conflict problem. The debug-level logs say things like "Removed {event} from queue for bug {bug_id}." and "Removed directory for bug {bug_id}". Our log level will be above debug in prod, and we won't alert on those.
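
For reference, the discard() floated above would be little more than a logging alias over the same IO operation — a sketch assuming the queue's existing done() method:

async def discard(self, item):
    # Identical removal to done(); only the log message changes, which is why
    # it arguably doesn't pull its weight as a separate method.
    logger.info("Discarding expired event %s", item.identifier)
    await self.done(item)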

@leplatrem
Contributor

Most of my comments are nitpicks 😊

I think there's a slightly more elegant way to handle mocking responses from the queue, but I won't veto this if we want to ship it. We can always refactor in follow-ups if we want to hit our milestone this week 😉

jbi/retry.py Outdated
skipped_events = await queue.list(bug_id)
if (
    len(skipped_events) > 1
):  # if this isn't the only event for the bug
Contributor

What does this comment mean? What are we checking for in this condition?

Contributor Author

The current event we're processing will exist in the queue, so len(queue.list(bug_id)) will always be at least 1. We only skip other events if there is more than one event pending for the current bug. Does that make sense, and can I make this comment clearer?

Contributor Author

Switched this to just pull size(). But it's the same deal: it counts the current event, so we need to subtract 1.
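
Roughly the size()-based version described above (method name from the discussion; exact signature assumed, log message illustrative):

# size() counts the event currently being processed, so subtract 1 to get
# the number of *other* events still pending for this bug
pending_others = await queue.size(bug_id) - 1
if pending_others > 0:
    logger.info("skipping %s pending event(s) for bug %s", pending_others, bug_id)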

Base automatically changed from dlq-class to main April 10, 2024 19:01
@alexcottner alexcottner merged commit 9fa5115 into main Apr 11, 2024
@alexcottner alexcottner deleted the dlq-retry-class branch April 11, 2024 13:35