Conversation


@thampiotr thampiotr commented Nov 27, 2024

PR Description

Fixes a deadlock that can occur when using modules and running a lot of components (or when Alloy is starved of CPU).

The symptom was repeated "failed to submit node for evaluation - Alloy is likely overloaded and cannot keep up with evaluating components - will retry" log messages and slowly growing memory, caused by many calls to /metrics getting stuck; in one user's case this continued until an OOMKill.

Notes to the Reviewer

  • When EvaluateDependants is called, we submit the dependants of a node that published new exports to a global worker pool for evaluation, so the new export changes can propagate through the graph.
  • The worker pool has a bounded queue to limit memory usage.
  • In extreme cases where there are, for example, 3k components on one Alloy instance, the worker pool's default queue of 1024 items can become full.
  • We don't return an error when the queue is full; instead we retry with backoff, in the hope that the overload is temporary (for example, a slow component evaluation or resource starvation) and the task can be submitted later.
  • We retry indefinitely.
  • There are other code paths (see below) that interact with the loader on component evaluation and require the write mutex to complete. Only once these code paths get the write lock on the loader mutex will they complete and free up space in the worker pool queue.
  • The problem is that these code paths are stuck forever, because we hold the loader read mutex while we back off. We have a deadlock: the backoff holds the read mutex and cannot add more tasks to the full queue, while the queued evaluation tasks cannot drain because they are waiting for the write mutex (see the sketch after this list).
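
Below is a minimal Go sketch of the deadlock shape described above. All identifiers (`workerPool`, `loaderMut`, `evaluateDependants`, `evaluateNode`) and the durations are illustrative assumptions, not the actual Alloy code:

```go
package sketch

import (
	"errors"
	"sync"
	"time"
)

// workerPool mimics a pool with a bounded task queue (Alloy's default is 1024 items).
type workerPool struct {
	queue chan func()
}

// TrySubmit enqueues a task, or fails immediately if the queue is full.
func (p *workerPool) TrySubmit(task func()) error {
	select {
	case p.queue <- task:
		return nil
	default:
		return errors.New("queue full")
	}
}

var loaderMut sync.RWMutex

// evaluateDependants mirrors the problematic behaviour: it holds the loader
// read lock and retries forever when the queue is full.
func evaluateDependants(p *workerPool, dependants []func()) {
	loaderMut.RLock()
	defer loaderMut.RUnlock()
	for _, task := range dependants {
		for {
			if err := p.TrySubmit(task); err == nil {
				break
			}
			// Backoff while still holding the read lock: nothing that needs
			// the write lock can make progress, so the queue never drains.
			time.Sleep(100 * time.Millisecond)
		}
	}
}

// evaluateNode mirrors the queued work: it needs the write lock to complete
// and free a slot in the queue, but the write lock cannot be acquired while
// the reader above spins -> deadlock.
func evaluateNode() {
	loaderMut.Lock()
	defer loaderMut.Unlock()
	// ... evaluate the component ...
}
```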

The code path that grabs the loader write mutex:

I think that the default queue size of 1024 is still a good default and more than enough for normal, correct operation, so I'm choosing not to expose it as a configuration knob in order to keep Alloy simpler.

I also think that the backoff behaviour doesn't need to change, other than aborting it after some time. This way we can recover from other deadlocks while still printing a good number of error logs in the process.
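
Purely as an illustration of that direction, here is a sketch (building on the `workerPool` type from the earlier sketch) that keeps the backoff but aborts after a maximum total wait, so the loader read lock is eventually released. The function name, durations, and error message are assumptions for the example, not the actual values used in this change:

```go
// submitWithBoundedRetry keeps retrying with capped exponential backoff,
// but gives up once maxWait has elapsed instead of looping forever.
func submitWithBoundedRetry(p *workerPool, task func(), maxWait time.Duration) error {
	deadline := time.Now().Add(maxWait)
	backoff := 100 * time.Millisecond
	for {
		if err := p.TrySubmit(task); err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			// Returning an error lets the caller release the loader read lock
			// and log the failure, giving the system a chance to recover.
			return errors.New("failed to submit node for evaluation: timed out")
		}
		time.Sleep(backoff)
		if backoff < 2*time.Second {
			backoff *= 2
		}
	}
}
```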

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@thampiotr thampiotr force-pushed the thampiotr/deadlock-when-submitting-node branch from 21cde1e to 3765cf3 Compare November 27, 2024 16:17
@thampiotr thampiotr marked this pull request as ready for review November 27, 2024 16:20
@thampiotr thampiotr requested a review from a team as a code owner November 27, 2024 16:20
@wildum wildum left a comment

LGTM, nice catch

@thampiotr thampiotr merged commit a319983 into main Nov 27, 2024
18 checks passed
@thampiotr thampiotr deleted the thampiotr/deadlock-when-submitting-node branch November 27, 2024 17:37
wildum pushed a commit that referenced this pull request Nov 28, 2024
* Fix deadlock due to infinite retry

* changelog
ptodev pushed a commit that referenced this pull request Dec 3, 2024
* Fix deadlock due to infinite retry

* changelog
ptodev added a commit that referenced this pull request Dec 3, 2024
* Fixed an issue in the `otlp.exporter.prometheus` component (#2102)

* Fixed an issue in the `otlp.exporter.prometheus` component

* Fixed an issue in the `otlp.exporter.prometheus` component

* Fix potential deadlock in import statements (#2129)

* Fix potential deadlock in import statements

* change

* typo

* fix: race condition UI remotecfg (#2160)

* Refactor ui remotecfg components to avoid race condition

* Fix accidental cast to pointer that should have been struct

* Update changelog

* fix: fully prevent panic in remotecfg ui (#2164)

* Fully prevent panic in remotecfg ui

* Address PR feedback

* Fix deadlock due to infinite retry (#2174)

* Fix deadlock due to infinite retry

* changelog

* Update ckit to fix memberlist logging issues (#2186)

* Upgrade ckit and changelog

* go mod tidy

* `loki.source.podlogs`: Fix issue which disables clustering unintentionally. (#2187)

* Fix issue which disables clustering unintentionally.

* prometheus.operator.*: allow setting informer_sync_timeout (#2161)

* prometheus.operator.*: allow setting informer_sync_timeout

* default to 1m

* docs

* fix(pyroscope): allow slashes in tag name (#2172)

* loki.source.podlogs: For clustering only take into account some labels (#2190)

* Only take into account some labels

* Reword docs

* fix: crash when updating import.http config (#2204)

* fix: crash when updating import.http config

* fix key/pattern logic for the attribute processor (#2124)

* fix: Update postgres exporter (#2019)

* Update postgres exporter

* Update changelog

* Use postgres exporter branch that implements exporter package

* Add TODO for future maintainers

* Update VERSION file

* Add missing changelog entry

* Fix pyroscope.write issues with pyroscope.receive_http (#2201)

* Fix pyroscope.write issues with pyroscope.receive_http

The nodejs Pyroscope SDK sends profiles with a `Connection: close` header.
This header was copied to the upstream request, causing connection churn
towards Pyroscope, which can be quite bad on the CPU when using TLS. Do not
copy the `Connection` header from the incoming request to fix this issue.

Additionally, `pyroscope.write` had a single `http.Client` used for
forwarding data from `pyroscope.receive_http`, which may not work if
multiple endpoints are configured with different options. To fix this,
store a `http.Client` for each endpoint.

---------

Co-authored-by: YusifAghalar <[email protected]>
Co-authored-by: Piotr <[email protected]>
Co-authored-by: Sam DeHaan <[email protected]>
Co-authored-by: Craig Peterson <[email protected]>
Co-authored-by: Marc Sanmiquel <[email protected]>
Co-authored-by: Sergei Nikolaev <[email protected]>
Co-authored-by: William Dumont <[email protected]>
Co-authored-by: Sam DeHaan <[email protected]>
Co-authored-by: Gergely Madarász <[email protected]>
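
For the pyroscope.write / pyroscope.receive_http fix quoted in the commit message above, the following is a hedged Go sketch of the two ideas: skip the `Connection` header when forwarding, and keep one `http.Client` per configured endpoint. The types and function names here are illustrative, not the actual component code:

```go
package sketch

import (
	"io"
	"net/http"
)

// endpoint holds its own http.Client so per-endpoint options (TLS, timeouts,
// etc.) are respected when forwarding received profiles.
type endpoint struct {
	url    string
	client *http.Client
}

// forward sends a received profile upstream without copying the Connection
// header, so an SDK sending `Connection: close` no longer causes connection
// churn towards Pyroscope.
func (e *endpoint) forward(incoming *http.Request, body io.Reader) (*http.Response, error) {
	out, err := http.NewRequest(http.MethodPost, e.url, body)
	if err != nil {
		return nil, err
	}
	for name, values := range incoming.Header {
		if http.CanonicalHeaderKey(name) == "Connection" {
			continue // hop-by-hop header: do not propagate it upstream
		}
		for _, v := range values {
			out.Header.Add(name, v)
		}
	}
	return e.client.Do(out)
}
```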
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 28, 2024