Description
Hi there! I'm running datadog-agent version 5.9.1-1 and running into an odd issue. I don't think it's strictly a Datadog bug, but I think there are some things the collector should do to mitigate it.
Basically, what we see happening is a buildup of gohai processes, each chewing up a CPU core, with kernel errors saying rcu_sched self-detected stall on CPU. We spent a little time debugging this, and we don't see anything really wrong with gohai itself.
But we did notice that it was running at higher priority. This is because, in our environment, sshd runs at high priority, so daemons started from within an ssh session (and apt calling /etc/init.d/datadog-agent is one) inherit that priority. This higher priority causes some contention and eventually stalls.
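To illustrate the inheritance described above, here is a small sketch (not agent code) that compares the nice value of a process with that of a child it spawns. The helper name child_niceness is made up for this example, and it assumes Python 3 on a Unix-like system, where os.getpriority is available.

```python
import os
import subprocess
import sys


def child_niceness():
    """Spawn a child Python process and return the nice value it reports.

    Hypothetical helper for illustration only; it is not part of the agent.
    """
    out = subprocess.check_output(
        [sys.executable, "-c",
         "import os; print(os.getpriority(os.PRIO_PROCESS, 0))"]
    )
    return int(out.decode().strip())


def parent_niceness():
    """Return the nice value of the current process."""
    return os.getpriority(os.PRIO_PROCESS, 0)
```

When nothing resets the priority in between, child_niceness() and parent_niceness() return the same value, which is how a daemon restarted from a high-priority ssh session ends up at that priority too.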
One change I think Datadog should make is to only spawn one gohai process at a time. If the previous one hasn't come back within the polling period, it seems counterproductive for the collector to spawn another. This seems like a pretty simple patch, though I'm not sure how y'all would like it implemented.
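As a rough sketch of that guard, and not the collector's actual code: the class and method names below are hypothetical, and it assumes Python 3 with the subprocess module. The idea is just to check whether the previous child is still alive before starting a new one.

```python
import subprocess


class SingleFlightRunner(object):
    """Start a command only if the previously started instance has exited.

    Hypothetical illustration; the real collector would also need to read
    the child's output, which this sketch discards.
    """

    def __init__(self, argv):
        self.argv = argv
        self._proc = None

    def maybe_spawn(self):
        """Return True if a new process was started, False if one is still running."""
        # poll() returns None while the child is still alive.
        if self._proc is not None and self._proc.poll() is None:
            return False
        self._proc = subprocess.Popen(
            self.argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return True
```

Calling maybe_spawn() once per polling period would then skip the run whenever the previous gohai is still stuck, instead of piling another process on top of it.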
I don't know that there's a reasonable way for the agent to run gohai at lower priority. And some sites might want it running at a higher priority, so I think that's something for me to shake out in our infrastructure.