Conversation


@thampiotr thampiotr commented Nov 27, 2024

PR Description

Fixes a deadlock that can occur when using modules and running a lot of components (or when Alloy is starved of CPU).

The symptom was repeated "failed to submit node for evaluation - Alloy is likely overloaded and cannot keep up with evaluating components - will retry" log messages and slowly growing memory, caused by many calls to /metrics getting stuck; in one user's case this continued until an OOMKill.

Notes to the Reviewer

  • When EvaluateDependants is called, we submit the dependants of a node that published new exports to a global worker pool for evaluation, so the new export changes can propagate through the graph.
  • The worker pool has a bounded queue to limit memory usage.
  • In extreme cases where there are, for example, 3k components on one Alloy instance, the worker pool's default queue of 1024 items can become full.
  • We don't return an error when the queue is full; instead we retry with backoff, in the hope that the overload is temporary (for example, a slow component evaluation or resource starvation) and the task can be submitted later.
  • We retry indefinitely.
  • There are other code paths (see below) that interact with the loader on component evaluation and require the write mutex to complete. Only once these code paths get the write lock on the loader mutex will they complete and free up space in the worker pool queue.
  • The problem is that these code paths are stuck forever, because we hold the loader read mutex while we back off. We have a deadlock: the backoff holds the read mutex and cannot add more tasks to the full queue, while the queued evaluation tasks cannot drain because they are waiting for the write mutex (see the sketch after this list).
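
Below is a minimal Go sketch of the deadlock shape described above. All identifiers (`workerPool`, `loaderMut`, `evaluateDependants`, `evaluateNode`) and the durations are illustrative assumptions, not the actual Alloy code:

```go
package sketch

import (
	"errors"
	"sync"
	"time"
)

// workerPool mimics a pool with a bounded task queue (Alloy's default is 1024 items).
type workerPool struct {
	queue chan func()
}

// TrySubmit enqueues a task, or fails immediately if the queue is full.
func (p *workerPool) TrySubmit(task func()) error {
	select {
	case p.queue <- task:
		return nil
	default:
		return errors.New("queue full")
	}
}

var loaderMut sync.RWMutex

// evaluateDependants mirrors the problematic behaviour: it holds the loader
// read lock and retries forever when the queue is full.
func evaluateDependants(p *workerPool, dependants []func()) {
	loaderMut.RLock()
	defer loaderMut.RUnlock()
	for _, task := range dependants {
		for {
			if err := p.TrySubmit(task); err == nil {
				break
			}
			// Backoff while still holding the read lock: nothing that needs
			// the write lock can make progress, so the queue never drains.
			time.Sleep(100 * time.Millisecond)
		}
	}
}

// evaluateNode mirrors the queued work: it needs the write lock to complete
// and free a slot in the queue, but the write lock cannot be acquired while
// the reader above spins -> deadlock.
func evaluateNode() {
	loaderMut.Lock()
	defer loaderMut.Unlock()
	// ... evaluate the component ...
}
```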

The code path that grabs the loader write mutex:

I think that the default queue size of 1024 is still a good default and more than enough for normal, correct operation, so I'm choosing not to expose it as a configuration knob in order to keep Alloy simpler.

I also think that the backoff behaviour doesn't need to change, other than aborting it after some time. This way we can recover from other deadlocks while still printing a good number of error logs in the process.
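
Purely as an illustration of that direction, here is a sketch (building on the `workerPool` type from the earlier sketch) that keeps the backoff but aborts after a maximum total wait, so the loader read lock is eventually released. The function name, durations, and error message are assumptions for the example, not the actual values used in this change:

```go
// submitWithBoundedRetry keeps retrying with capped exponential backoff,
// but gives up once maxWait has elapsed instead of looping forever.
func submitWithBoundedRetry(p *workerPool, task func(), maxWait time.Duration) error {
	deadline := time.Now().Add(maxWait)
	backoff := 100 * time.Millisecond
	for {
		if err := p.TrySubmit(task); err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			// Returning an error lets the caller release the loader read lock
			// and log the failure, giving the system a chance to recover.
			return errors.New("failed to submit node for evaluation: timed out")
		}
		time.Sleep(backoff)
		if backoff < 2*time.Second {
			backoff *= 2
		}
	}
}
```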

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@thampiotr thampiotr force-pushed the thampiotr/deadlock-when-submitting-node branch from 21cde1e to 3765cf3 Compare November 27, 2024 16:17
@thampiotr thampiotr marked this pull request as ready for review November 27, 2024 16:20
@thampiotr thampiotr requested a review from a team as a code owner November 27, 2024 16:20
@wildum wildum left a comment

LGTM, nice catch

@thampiotr thampiotr merged commit a319983 into main Nov 27, 2024
18 checks passed
@thampiotr thampiotr deleted the thampiotr/deadlock-when-submitting-node branch November 27, 2024 17:37
wildum pushed a commit that referenced this pull request Nov 28, 2024
* Fix deadlock due to infinite retry

* changelog
ptodev pushed a commit that referenced this pull request Dec 3, 2024
* Fix deadlock due to infinite retry

* changelog
ptodev added a commit that referenced this pull request Dec 3, 2024
* Fixed an issue in the `otlp.exporter.prometheus` component (#2102)

* Fixed an issue in the `otlp.exporter.prometheus` component

* Fixed an issue in the `otlp.exporter.prometheus` component

* Fix potential deadlock in import statements (#2129)

* Fix potential deadlock in import statements

* change

* typo

* fix: race condition UI remotecfg (#2160)

* Refactor ui remotecfg components to avoid race condition

* Fix accidental cast to pointer that should have been struct

* Update changelog

* fix: fully prevent panic in remotecfg ui (#2164)

* Fully prevent panic in remotecfg ui

* Address PR feedback

* Fix deadlock due to infinite retry (#2174)

* Fix deadlock due to infinite retry

* changelog

* Update ckit to fix memberlist logging issues (#2186)

* Upgrade ckit and changelog

* go mod tidy

* `loki.source.podlogs`: Fix issue which disables clustering unintentionally. (#2187)

* Fix issue which disables clustering unintentionally.

* prometheus.operator.*: allow setting informer_sync_timeout (#2161)

* prometheus.operator.*: allow setting informer_sync_timeout

* default to 1m

* docs

* fix(pyroscope): allow slashes in tag name (#2172)

* loki.source.podlogs: For clustering only take into account some labels (#2190)

* Only take into account some labels

* Reword docs

* fix: crash when updating import.http config (#2204)

* fix: crash when updating import.http config

* fix key/pattern logic for the attribute processor (#2124)

* fix: Update postgres exporter (#2019)

* Update postgres exporter

* Update changelog

* Use postgres exporter branch that implements exporter package

* Add TODO for future maintainers

* Update VERSION file

* Add missing changelog entry

* Fix pyroscope.write issues with pyroscope.receive_http (#2201)

* Fix pyroscope.write issues with pyroscope.receive_http

The nodejs Pyroscope SDK sends profiles with a `Connection: close` header.
This header was copied to the upstream request, causing connection churn
towards Pyroscope, which can be quite bad on the CPU when using TLS. Do not
copy the `Connection` header from the incoming request to fix this issue.

Additionally, `pyroscope.write` had a single `http.Client` used for
forwarding data from `pyroscope.receive_http`, which may not work if
multiple endpoints are configured with different options. To fix this,
store a `http.Client` for each endpoint.

---------

Co-authored-by: YusifAghalar <[email protected]>
Co-authored-by: Piotr <[email protected]>
Co-authored-by: Sam DeHaan <[email protected]>
Co-authored-by: Craig Peterson <[email protected]>
Co-authored-by: Marc Sanmiquel <[email protected]>
Co-authored-by: Sergei Nikolaev <[email protected]>
Co-authored-by: William Dumont <[email protected]>
Co-authored-by: Sam DeHaan <[email protected]>
Co-authored-by: Gergely Madarász <[email protected]>
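
For the pyroscope.write / pyroscope.receive_http fix quoted in the commit message above, the following is a hedged Go sketch of the two ideas: skip the `Connection` header when forwarding, and keep one `http.Client` per configured endpoint. The types and function names here are illustrative, not the actual component code:

```go
package sketch

import (
	"io"
	"net/http"
)

// endpoint holds its own http.Client so per-endpoint options (TLS, timeouts,
// etc.) are respected when forwarding received profiles.
type endpoint struct {
	url    string
	client *http.Client
}

// forward sends a received profile upstream without copying the Connection
// header, so an SDK sending `Connection: close` no longer causes connection
// churn towards Pyroscope.
func (e *endpoint) forward(incoming *http.Request, body io.Reader) (*http.Response, error) {
	out, err := http.NewRequest(http.MethodPost, e.url, body)
	if err != nil {
		return nil, err
	}
	for name, values := range incoming.Header {
		if http.CanonicalHeaderKey(name) == "Connection" {
			continue // hop-by-hop header: do not propagate it upstream
		}
		for _, v := range values {
			out.Header.Add(name, v)
		}
	}
	return e.client.Do(out)
}
```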
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 28, 2024