Collector should only run one gohai at a time

Hi there! I'm running datadog-agent version 5.9.1-1 and running into an odd issue. I don't think it's strictly a Datadog bug, but I think there are some things the collector should do to mitigate it.

Basically, what we see happening is a buildup of gohai processes, each chewing up a CPU core, with kernel errors saying `rcu_sched self-detected stall on CPU`. We spent a little time debugging this, and we don't see anything really wrong with gohai itself.

But we did notice that it was running at a higher priority than usual. This is because, in our environment, sshd runs at high priority, so daemons started from within an ssh session (and apt calling /etc/init.d/datadog-agent is one) inherit that priority. This higher priority causes some contention and, eventually, the stalls.

One change I think Datadog should make is to only spawn one gohai process at a time. If the previous one hasn't come back within the polling period, it seems counterproductive for the collector to spawn another. This seems like a pretty simple patch, though I'm not sure how y'all would like it implemented.
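To make the idea concrete, here is a rough sketch of the kind of guard I have in mind. It is not taken from the agent's code; `SingleInstanceRunner` and `maybe_spawn` are names I made up for illustration, and I'm assuming the collector launches gohai as a child process it can keep a handle on.

```python
import subprocess
import sys


class SingleInstanceRunner(object):
    """Runs a command, but never more than one instance at a time.

    Hypothetical helper for illustration only, not part of the agent.
    """

    def __init__(self, cmd):
        self.cmd = cmd    # argv list for the command, e.g. the gohai binary
        self.proc = None  # handle to the most recent child, if any

    def maybe_spawn(self):
        """Start the command unless the previous run is still alive.

        Returns True if a new process was started, False if it was skipped.
        """
        # poll() returns None while the child is still running; once the
        # child has exited it returns the exit code and reaps the process.
        if self.proc is not None and self.proc.poll() is None:
            return False
        self.proc = subprocess.Popen(self.cmd)
        return True
```

On each polling period the collector would call `maybe_spawn()` and simply skip the run (maybe logging a warning) when it returns False, so a stuck gohai never gets a second copy piled on top of it.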

I don't know that there's a reasonable way for the agent to run gohai at lower priority. And some sites might want it running at a higher priority, so I think that's something for me to shake out in our infrastructure.
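For reference, this is roughly how a child's niceness can be adjusted at launch time from Python on a POSIX system. It is only a sketch to show the inheritance behavior: `spawn_with_lower_priority` and `child_niceness` are hypothetical names, and I'm not claiming this is how the agent starts gohai or that it should.

```python
import os
import subprocess
import sys


def spawn_with_lower_priority(cmd, increment=5):
    """Start cmd with its niceness raised by `increment` (POSIX only).

    A child inherits the parent's niceness; os.nice() runs in the child
    between fork and exec, so only the child is affected.
    """
    return subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        preexec_fn=lambda: os.nice(increment),
    )


def child_niceness(increment=5):
    """Spawn a probe process and report the niceness it ends up with."""
    # os.nice(0) adds nothing and returns the current niceness.
    probe = [sys.executable, "-c", "import os; print(os.nice(0))"]
    proc = spawn_with_lower_priority(probe, increment)
    out, _ = proc.communicate()
    return int(out.decode().strip())
```

Note that the increment is relative to whatever the parent already has, which is exactly why a daemon started from a high-priority sshd session ends up at high priority too.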


Collector should only run one gohai at a time #3085

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Collector should only run one gohai at a time #3085

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions