I'm not sure if this is more of a feature request or a bug report.
It's a feature request in the sense that I'm requesting more info in the data returned by node's event loop monitor, but it's a bug report in the sense that the current stats are biased towards small measurements.
In its current state, monitorEventLoopDelay is not very practical for reporting on sporadic event loop delays due to excessive synchronous work.
The stats are skewed towards measurements close to the sampling resolution: no samples are taken while a big delay is in progress, and these missing samples are not taken into account when computing the stats.
- Version: v12.18.1
- Platform: OS X
What steps will reproduce the bug?
Take this example where we do some sync work for 30s, report the event loop delay stats every 60s, and use a resolution of 50ms:
```js
import { monitorEventLoopDelay } from 'perf_hooks';

const resolution = 50;
const eventLoopDelayReportFreqInMs = 60000;

const doSomeSyncWorkFor30s = () => {
  setTimeout(() => {
    const s = Date.now();
    while (true) {
      const arr = new Array(1000);
      arr.fill('hello');
      if (Date.now() - s > 30000) {
        break;
      }
    }
  }, 150);
};

export const startEventLoopDelayMonitor = () => {
  const monitorHistogram = monitorEventLoopDelay({ resolution });
  monitorHistogram.enable();
  setInterval(() => {
    // Values are returned in nanoseconds. We transform them to milliseconds.
    // We subtract the resolution from the values to get a more meaningful metric.
    // However, sometimes the internal timer gets triggered in less time than the
    // "resolution" value. This is why during idle times we may get negative
    // "min" values.
    const percentile25 = monitorHistogram.percentile(25) / 1000000 - resolution;
    const percentile50 = monitorHistogram.percentile(50) / 1000000 - resolution;
    const percentile75 = monitorHistogram.percentile(75) / 1000000 - resolution;
    const percentile95 = monitorHistogram.percentile(95) / 1000000 - resolution;
    const percentile99 = monitorHistogram.percentile(99) / 1000000 - resolution;
    const min = monitorHistogram.min / 1000000 - resolution;
    const max = monitorHistogram.max / 1000000 - resolution;
    const mean = monitorHistogram.mean / 1000000 - resolution;
    const stddev = monitorHistogram.stddev / 1000000;
    console.log({
      percentile25,
      percentile50,
      percentile75,
      percentile95,
      percentile99,
      min,
      max,
      mean,
      stddev
    });
  }, eventLoopDelayReportFreqInMs);
};

startEventLoopDelayMonitor();
doSomeSyncWorkFor30s();
```

If you run this, the first logged stats will be something like:
```
{
  percentile25: 0.3971829999999983,
  percentile50: 1.8717430000000022,
  percentile75: 3.477375000000002,
  percentile95: 5.050238999999998,
  percentile99: 5.115774999999999,
  min: 0.003968000000000416,
  max: 30014.771071,
  mean: 54.257005300869565,
  stddev: 1250.17883463233
}
```

Besides the max value, these stats fail to tell the story of how the event loop was blocked for 50% of the time between the start of the script and the first log 60s later.
What is the expected behavior?
The main point of monitoring Node.js's event loop delay is to see if async work is being delayed because of excessive synchronous work.
If this were running on a server, I would have liked these stats to give me some information on what percentage of requests would have been delayed because of heavy sync work over one minute.
In this scenario, if I had a server that handled 1 request/second, 50% of the requests over 1 minute would have been delayed because of sync work. However, the event loop monitor stats paint a picture that everything is "mostly fine" except for 1 sample.
What do you see instead?
It seems to me that the percentiles are based on the number of samples, but they fail to take into account the varying number of samples over fixed periods of time.
IIUC, there is a timer running at the libuv level which is set to trigger every `resolution` ms. However, if this timer gets delayed by `resolution * N` ms, then we end up with around N fewer samples over the same period of time than if there hadn't been any sync work. This means that the bigger the delay, the fewer samples we take of big delays. This biases the stats towards small measurements.
In numbers:
30s is 50% of 60s, so I would have expected the percentiles to show something that could have been interpreted as "the event loop was blocked around 50% of the time in the last sampled period".
However:
- During the 30s of sync work we only took 1 sample.
- During the other 30s, without any delay and with a resolution of 50ms, we took ~600 samples (which are usually around 0 - 5 ms).
- 1 / 601 ≈ 0.0017
So from this point of view, yeah, the 1 sample is less than 1% of all samples, but that doesn't seem really helpful.
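Spelling that arithmetic out in a quick sketch (the numbers below just restate the scenario above; nothing here comes from the perf_hooks API):

```js
// Rough sample-count math for the scenario above: a 60s reporting window,
// a 50ms resolution, and a single 30s synchronous block.
const windowMs = 60000;
const resolutionMs = 50;
const blockedMs = 30000;

const idleSamples = (windowMs - blockedMs) / resolutionMs; // ~600 samples of ~0-5ms
const blockedSamples = 1;                                  // one ~30000ms sample
const blockedShare = blockedSamples / (idleSamples + blockedSamples);

console.log(blockedShare.toFixed(4)); // ~0.0017, i.e. the 30s block is <1% of all samples
```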
I understand that from a technical standpoint, if the event loop is blocked, you can't take any more samples. However, as a thought experiment, imagine that even if it is blocked you could still start a timer every `resolution` ms. Their reported delay values would be something like:
```
timer 1:   30000ms
timer 2:   29950ms
timer 3:   29900ms
...
timer 600: 50ms
```
If the histogram had these samples plus the 600 samples of ~1.8ms delays taken while the loop was idle, the stats would look much different:
```
{
  percentile25: 1.8,
  percentile50: 25.9,
  percentile75: 15012.5,
  percentile95: 27002.5,
  percentile99: 29400.5,
  min: 1.8,
  max: 30000,
  mean: 7513.4,
  stddev: 9691.441089607537
}
```
I came up with these numbers by doing this in Python:
```python
import numpy as np

smol = [1.8] * 600
big = list(range(50, 30001, 50))
samples = np.array(smol + big)

np.percentile(samples, 25)
np.percentile(samples, 50)
# ... etc
```

Additional information
Given that it's not actually feasible to add extra samples during periods where the event loop is blocked, I think that, from the user's perspective, we're missing the total number of samples taken. With that, it would be possible to approximately "un-skew" these values based on the number of samples, the resolution value and the total time during which the event loop monitor has been running. Although that would only give very rough approximations 🤔
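Just to illustrate, here is a rough user-land sketch of the thought experiment above: a self-rescheduling timer that measures its own delay and, when a delay spans several missed ticks, also records the intermediate samples that monitorEventLoopDelay never takes. Nothing here is part of perf_hooks; the names (startUnskewedMonitor, recordSamples) are made up, and a real implementation would want a proper histogram instead of a plain array:

```js
// Hypothetical user-land approximation of the thought experiment above.
// It does NOT use monitorEventLoopDelay; it keeps its own sample list so that
// long blocks can be "filled in" with the samples that were never taken.
const recordSamples = (samples, delayMs, resolutionMs) => {
  // Push the observed delay plus one synthetic sample per missed tick,
  // mirroring "timer 1: 30000ms, timer 2: 29950ms, ..., timer 600: 50ms".
  let d = delayMs;
  do {
    samples.push(d);
    d -= resolutionMs;
  } while (d > 0);
};

const startUnskewedMonitor = (resolutionMs = 50) => {
  const samples = [];
  let expected = Date.now() + resolutionMs;

  const tick = () => {
    // How late did this tick fire compared to when we asked for it?
    const delay = Math.max(Date.now() - expected, 0);
    recordSamples(samples, delay, resolutionMs);
    expected = Date.now() + resolutionMs;
    setTimeout(tick, resolutionMs);
  };
  setTimeout(tick, resolutionMs);

  return {
    // Naive nearest-rank percentile over all (real + synthetic) samples.
    percentile(p) {
      if (samples.length === 0) return 0;
      const sorted = [...samples].sort((a, b) => a - b);
      const idx = Math.min(
        sorted.length - 1,
        Math.max(0, Math.ceil((p / 100) * sorted.length) - 1)
      );
      return sorted[idx];
    },
    reset() {
      samples.length = 0;
    }
  };
};
```

With something like this running next to the reproduction script, the 30s block would dominate the upper percentiles rather than showing up only in max. That said, exposing the total sample count from the built-in histogram, as suggested above, would allow a similar (if rough) correction without running a second timer.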