Adding retry process for dead letter queue #924
Conversation
Leftover copypasta from sentry tests
- warn when we have a bug in the queue with no items
- various debug messages
Co-authored-by: Mathieu Leplatre <[email protected]>
- turn back into a method
- allow callers to pass a bug_id to filter size by bug
- Return iterator of items from get()
- Return dict of bug id, items from get_all()
- Return backend.get_all() from retrieve (instead of a flat list)
grahamalama left a comment
Made a few comments as a first pass, I'll think more about how to best manage the queue in another review.
jbi/retry.py
Outdated
CONSTANT_RETRY = getenv("CONSTANT_RETRY", "false") == "true"
RETRY_TIMEOUT_DAYS = getenv("RETRY_TIMEOUT_DAYS", 7)
If we're specifying these values in jbi/environment.py, we should probably access them with
settings = get_settings()
...
settings.constant_retry
...
settings.retry_timeout_days
However, if these settings only pertain to this script (and won't matter to the FastAPI app), maybe we can move them out of environment.py and access them like you have here. That, or build a new BaseSettings object so we get validation.
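For illustration, a minimal sketch of that second option (a dedicated settings object). It assumes pydantic is already a dependency; the class and field names here are hypothetical, not the project's actual ones:

```python
# Hypothetical sketch only -- assumes pydantic v2 with pydantic-settings; with
# pydantic v1 the import would be `from pydantic import BaseSettings` instead.
from pydantic_settings import BaseSettings


class RetrySettings(BaseSettings):
    """Settings only the retry script cares about, validated on load."""

    constant_retry: bool = False   # read from the CONSTANT_RETRY env var
    retry_timeout_days: int = 7    # read from the RETRY_TIMEOUT_DAYS env var


retry_settings = RetrySettings()
```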
I ran with moving them out of environment.py since they're only used in one place. This feels a little weird because we now have two entry points into the same application, but since we're sharing so much code I think it makes sense. Let me know if you think going another direction is better.
jbi/retry.py
Outdated
try:
    runner.execute_action(item.payload, ACTIONS)
    await queue.done(item)
except Exception as ex:
Should we split runner.execute_action(item.payload, ACTIONS) and await queue.done(item) into separate try/except blocks? There's a small but non-zero chance that we successfully execute an action, but then something goes wrong when we try to mark the item as done.
We could even retry marking the item as done if something goes wrong.
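Roughly what that split could look like, reusing item, queue, runner, ACTIONS, and logger from the surrounding module. This is a sketch of the idea, not the PR's actual code:

```python
try:
    runner.execute_action(item.payload, ACTIONS)
except Exception:
    logger.exception("failed to reprocess event %s", item.identifier)
    # ...record the failure and skip the remaining events for this bug...
else:
    try:
        await queue.done(item)
    except Exception:
        logger.exception("failed to mark event %s as done", item.identifier)
        # could retry queue.done(item) once or twice here before giving up
```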
Mapping this out (this comment might get messy).
Current code workflow looks like this:
- Process the event successfully via execute_action
- Fail to mark it as done (can't delete the file; likely an IO error or maybe a k8s config issue)
- Skip processing future events for this bug
- On the next retry, reprocess the same event even though we processed it successfully the first time
- Mark it as done successfully this time
- Process other events in the queue for this bug as normal
On the surface, that doesn't seem like a problem. We might process the same event twice (several hours apart), but ultimately we're updating the same data to the same values. What problems may occur here that I'm not thinking of?
If we alter the code to note that we failed to mark the item as done, it would look like:
- Process the event successfully via execute_action
- Fail to mark it as done
- Update the existing file (rename or rewrite) to note that we failed to delete it. Or maybe write an additional file?
- Continue processing other events for this bug since we didn't actually fail to update
- On the next retry, check for this note so we know to skip execute_action and just try to mark it as done
- Continue as normal
I think this adds more complexity (more branching, more test cases) and is likely to run into the same IO issue that prevented us from deleting the file. I don't think we gain anything, unless it would stop us from overwriting something in Jira that a user had changed manually.
Does that stream of thoughts seem right? Am I missing anything?
Edited to add: in most event-based architectures you're guaranteed to get an event at least once. Duplicate events will happen sometimes due to networking hiccups, code issues, etc. But they probably don't have a multi-hour retry time like we do.
> What problems may occur here that I'm not thinking of?
Some operations are creation, like comments for example.
We may end up with a comment being posted twice. I don't think it's critical, and I value simple code over handling this (especially because once this is set up, failing to write to disk should not happen easily).
That's a good thing to note. If that becomes an issue, I think the easiest solution is to do upserts for all create/update commands. That might be a little cumbersome through the Jira API but hopefully we can create a simple wrapper for ourselves if it is.
This would also cover the case where something was created but didn't make it into Jira (for whatever reason), and then we receive an update for it later.
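A hypothetical sketch of what such a wrapper could look like. The client protocol, method names, and marker convention below are made up for illustration and are not the project's actual Jira client API:

```python
from typing import Any, Optional, Protocol


class JiraClient(Protocol):
    """Placeholder protocol -- the real client's method names may differ."""

    def get_comments(self, issue_key: str) -> list[dict[str, Any]]: ...
    def create_comment(self, issue_key: str, body: str) -> dict[str, Any]: ...
    def update_comment(self, issue_key: str, comment_id: str, body: str) -> dict[str, Any]: ...


def upsert_comment(client: JiraClient, issue_key: str, marker: str, body: str) -> dict[str, Any]:
    """Create the comment if none carries `marker` yet, otherwise update it in place.

    Reprocessing the same event then rewrites one comment instead of posting a duplicate.
    """
    existing: Optional[dict[str, Any]] = next(
        (c for c in client.get_comments(issue_key) if marker in c.get("body", "")),
        None,
    )
    if existing is None:
        return client.create_comment(issue_key, f"{body}\n\n{marker}")
    return client.update_comment(issue_key, existing["id"], f"{body}\n\n{marker}")
```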
This also means we're marking a webhook event's time property as not optional
This allows us to fetch the item identifiers in the queue without loading the items into memory
Also, document QueueItemRetrievalError
…ion into dlq-retry-class
jbi/retry.py
Outdated
# skip and delete item if we have exceeded max_timeout
if item.timestamp < min_event_timestamp:
    logger.warning("removing expired event %s", item.identifier)
    await queue.done(item)
Do we need another method for this? discard(), or something with better semantics than done()? Especially if the queue logs things like "item X is done".
The queue should be just an IO operator and not have any real logic in it. It feels a little weird to have two different functions that do the same functional operation but with potentially different debug logs. Am I thinking about it wrong?
Looking at the logs the queue emits, we don't currently have a conflict problem. The debug-level logs say things like "Removed {event} from queue for bug {bug_id}." and "Removed directory for bug {bug_id}". Our log level will be above debug in prod, and we won't alert on those.
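For concreteness, a hypothetical sketch of the discard() alternative being weighed here: a thin method that shares the IO with done() and only changes the log line. The class, method bodies, and log messages are illustrative, not the actual queue code:

```python
import logging

logger = logging.getLogger(__name__)


class DeadLetterQueue:  # illustrative, not the actual queue class
    async def done(self, item) -> None:
        await self._delete(item)
        logger.debug("Removed %s from queue", item.identifier)

    async def discard(self, item) -> None:
        """Same IO as done(), but logged as an expiry instead of a success."""
        logger.info("Discarding expired event %s", item.identifier)
        await self._delete(item)

    async def _delete(self, item) -> None:
        ...  # remove the item's file from the backend
```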
Most of my comments are nitpicks 😊 I think there's a way to do something slightly more elegant when it comes to mocking responses of the queue, but I won't veto shipping this. We can always refactor in follow-ups if we want to hit our milestone this week 😉
jbi/retry.py
Outdated
skipped_events = await queue.list(bug_id)
if (
    len(skipped_events) > 1
):  # if this isn't the only event for the bug
What does this comment mean? What are we checking for in this condition?
The current event we're processing will exist in the queue, so len(queue.list(bug_id)) will always be at least 1. We only skip other events if there is more than one event pending for the current bug. Does that make sense, and can I make this comment clearer?
Switched this to just pull size(). But same deal: it will count the current event, so we need to subtract 1.
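For reference, a sketch of the adjusted check, reusing queue, bug_id, and logger from the diff above, and assuming size() is awaitable like the other queue methods and counts the event currently being processed:

```python
pending = await queue.size(bug_id)
other_events = pending - 1  # size() includes the event we just failed on
if other_events > 0:
    logger.info("skipping %d other event(s) queued for bug %s", other_events, bug_id)
```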
Co-authored-by: Mathieu Leplatre <[email protected]>
…ion into dlq-retry-class
Built on top of #918.
This adds the retry process and associated unit tests to reprocess failed events from the Dead Letter Queue.
Added unit tests for scenarios outlined in ADR.
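For orientation, here is a condensed sketch of the retry flow discussed in this thread, pieced together from the fragments visible in the review. Names and structure may differ from the PR's actual jbi/retry.py; in particular, get_all() is assumed to return a mapping of bug id to pending items, oldest first:

```python
import logging
from datetime import datetime, timedelta, timezone

logger = logging.getLogger(__name__)

RETRY_TIMEOUT_DAYS = 7  # see the settings discussion above


async def retry_failed(queue, runner, actions) -> None:
    """Reprocess queued events per bug, skipping a bug's remaining events after a failure."""
    min_event_timestamp = datetime.now(timezone.utc) - timedelta(days=RETRY_TIMEOUT_DAYS)

    for bug_id, items in (await queue.get_all()).items():
        for item in items:
            # Drop events that have exceeded the retry timeout.
            if item.timestamp < min_event_timestamp:
                logger.warning("removing expired event %s", item.identifier)
                await queue.done(item)
                continue
            try:
                runner.execute_action(item.payload, actions)
                await queue.done(item)
            except Exception:
                logger.exception("failed to reprocess event %s", item.identifier)
                # Preserve ordering: don't process this bug's later events after a failure.
                break
```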