Fix deadlock due to infinite retry #2174
Merged
Conversation
Force-pushed from 21cde1e to 3765cf3
wildum approved these changes on Nov 27, 2024
wildum (Contributor) left a comment:
LGTM, nice catch
wildum pushed a commit that referenced this pull request on Nov 28, 2024:
* Fix deadlock due to infinite retry
* changelog
ptodev pushed a commit that referenced this pull request on Dec 3, 2024:
* Fix deadlock due to infinite retry
* changelog
ptodev added a commit that referenced this pull request on Dec 3, 2024:
* Fixed an issue in the `otlp.exporter.prometheus` component (#2102)
* Fix potential deadlock in import statements (#2129)
  * change
  * typo
* fix: race condition UI remotecfg (#2160)
  * Refactor ui remotecfg components to avoid race condition
  * Fix accidental cast to pointer that should have been struct
  * Update changelog
* fix: fully prevent panic in remotecfg ui (#2164)
  * Address PR feedback
* Fix deadlock due to infinite retry (#2174)
  * changelog
* Update ckit to fix memberlist logging issues (#2186)
  * Upgrade ckit and changelog
  * go mod tidy
* `loki.source.podlogs`: Fix issue which disables clustering unintentionally. (#2187)
* prometheus.operator.*: allow setting informer_sync_timeout (#2161)
  * default to 1m
  * docs
* fix(pyroscope): allow slashes in tag name (#2172)
* loki.source.podlogs: For clustering only take into account some labels (#2190)
  * Reword docs
* fix: crash when updating import.http config (#2204)
* fix key/pattern logic for the attribute processor (#2124)
* fix: Update postgres exporter (#2019)
  * Use postgres exporter branch that implements exporter package
  * Add TODO for future maintainers
  * Update VERSION file
  * Add missing changelog entry
* Fix pyroscope.write issues with pyroscope.receive_http (#2201)

  The nodejs Pyroscope SDK sends profiles with a `Connection: close` header. This header was copied to the upstream request, causing connection churn towards Pyroscope, which can be quite bad on the CPU when using TLS. Do not copy the `Connection` header from the incoming request to fix this issue. Additionally, `pyroscope.write` had a single `http.Client` used for forwarding data from `pyroscope.receive_http`, which may not work if multiple endpoints are configured with different options. To fix this, store an `http.Client` for each endpoint.

Co-authored-by: YusifAghalar <[email protected]>
Co-authored-by: Piotr <[email protected]>
Co-authored-by: Sam DeHaan <[email protected]>
Co-authored-by: Craig Peterson <[email protected]>
Co-authored-by: Marc Sanmiquel <[email protected]>
Co-authored-by: Sergei Nikolaev <[email protected]>
Co-authored-by: William Dumont <[email protected]>
Co-authored-by: Sam DeHaan <[email protected]>
Co-authored-by: Gergely Madarász <[email protected]>
PR Description
Fixes a deadlock when using modules and running a lot of components (or being starved of CPU).
The symptom was repeated "failed to submit node for evaluation - Alloy is likely overloaded and cannot keep up with evaluating components - will retry" logs and slowly growing memory, because many calls to `/metrics` were stuck, until an OOMKill in one user's case.

Notes to the Reviewer
When `EvaluateDependants` is called, we submit the dependants of a node that published new exports to a global worker pool for evaluation, so the new export changes can propagate through the graph. If the worker pool's queue is full, the submission is retried indefinitely while the loader's read lock is held. The code path that grabs the loader write mutex is `loader.Apply`, which takes the write lock of the loader's `mut`; it blocks behind the held read lock, so neither side can make progress.

I think the default queue size of 1024 is still a good default, more than enough for normal, correct operation, so I'm choosing not to expose it as a configuration knob in order to keep Alloy simpler.
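To make the lock cycle concrete, here is a minimal Go sketch of the shape of the problem. This is not Alloy's actual code: `workerPool`, `loader`, and the method names are illustrative stand-ins.

```go
// Package loadersketch is a minimal sketch of the deadlock shape described
// above. workerPool and loader are stand-ins, not Alloy's real types.
package loadersketch

import (
	"sync"
	"time"
)

type workerPool struct {
	queue chan func() // bounded queue of pending evaluations
}

// trySubmit is a non-blocking enqueue: it fails when the queue is full.
func (p *workerPool) trySubmit(task func()) bool {
	select {
	case p.queue <- task:
		return true
	default:
		return false
	}
}

type loader struct {
	mut  sync.RWMutex
	pool *workerPool
}

// evaluateDependants holds the loader's read lock and, before the fix,
// retried submission forever whenever the queue stayed full.
func (l *loader) evaluateDependants(task func()) {
	l.mut.RLock()
	defer l.mut.RUnlock()
	for !l.pool.trySubmit(task) {
		// If the queue can only drain through work that needs the write
		// lock below, this loop never makes progress while it pins the
		// read lock: a deadlock.
		time.Sleep(100 * time.Millisecond)
	}
}

// apply stands in for loader.Apply: it wants the write lock, and a waiting
// writer also blocks new readers, so the whole loader stalls.
func (l *loader) apply() {
	l.mut.Lock()
	defer l.mut.Unlock()
	// ... reload configuration ...
}
```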
I also think the backoff behaviour doesn't need to change beyond aborting it after some time. This way we can still recover from other deadlocks while printing a good number of error logs in the process.
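Below is a hedged sketch of that shape, reusing the types from the sketch above (and assuming `fmt` is added to its imports): the retry-with-backoff loop is kept, but it gives up after an overall deadline so the read lock is released and a blocked writer can proceed. `retryTimeout` and the backoff constants are illustrative, not the values the PR uses.

```go
// evaluateDependantsWithTimeout keeps the retry-with-backoff behaviour but
// aborts after retryTimeout instead of spinning forever.
func (l *loader) evaluateDependantsWithTimeout(task func()) error {
	const retryTimeout = time.Minute // assumed bound, not the PR's exact value

	l.mut.RLock()
	defer l.mut.RUnlock()

	deadline := time.Now().Add(retryTimeout)
	backoff := 50 * time.Millisecond
	for !l.pool.trySubmit(task) {
		if time.Now().After(deadline) {
			// Giving up releases the read lock on return, letting a
			// blocked writer (e.g. apply) proceed, so the system can
			// recover after having logged a series of retry errors.
			return fmt.Errorf("failed to submit node for evaluation within %s", retryTimeout)
		}
		time.Sleep(backoff)
		if backoff < time.Second {
			backoff *= 2 // simple exponential backoff, capped at 1s
		}
	}
	return nil
}
```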