I'm not sure if this is more of a feature request or a bug report.
It's a feature request in the sense that I'm requesting more info in the data returned by node's event loop monitor, but it's a bug report in the sense that the current stats are biased towards small measurements.
In its current state, monitorEventLoopDelay is not very practical for reporting on sporadic event loop delays due to excessive synchronous work.
The stats are skewed towards measurements close to the sampling resolution: no samples are taken while a big delay is in progress, and these missing samples are not taken into account when computing the stats.
- Version: v12.18.1
- Platform: OS X
What steps will reproduce the bug?
Take this example where we do some sync work for 30s, report the event loop delay stats every 60s, and use a resolution of 50ms:
```js
import { monitorEventLoopDelay } from 'perf_hooks';

const resolution = 50;
const eventLoopDelayReportFreqInMs = 60000;

const doSomeSyncWorkFor30s = () => {
  setTimeout(() => {
    const s = Date.now();
    while (true) {
      const arr = new Array(1000);
      arr.fill('hello');
      if (Date.now() - s > 30000) {
        break;
      }
    }
  }, 150);
};

export const startEventLoopDelayMonitor = () => {
  const monitorHistogram = monitorEventLoopDelay({ resolution });
  monitorHistogram.enable();
  setInterval(() => {
    // Values are returned in nanoseconds. We transform them to milliseconds.
    // We subtract the resolution from the values to get a more meaningful metric.
    // However, sometimes the internal timer gets triggered in less time than the
    // "resolution" value. This is why during idle times we may get negative
    // "min" values.
    const percentile25 = monitorHistogram.percentile(25) / 1000000 - resolution;
    const percentile50 = monitorHistogram.percentile(50) / 1000000 - resolution;
    const percentile75 = monitorHistogram.percentile(75) / 1000000 - resolution;
    const percentile95 = monitorHistogram.percentile(95) / 1000000 - resolution;
    const percentile99 = monitorHistogram.percentile(99) / 1000000 - resolution;
    const min = monitorHistogram.min / 1000000 - resolution;
    const max = monitorHistogram.max / 1000000 - resolution;
    const mean = monitorHistogram.mean / 1000000 - resolution;
    const stddev = monitorHistogram.stddev / 1000000;
    console.log({
      percentile25,
      percentile50,
      percentile75,
      percentile95,
      percentile99,
      min,
      max,
      mean,
      stddev
    });
  }, eventLoopDelayReportFreqInMs);
};

startEventLoopDelayMonitor();
doSomeSyncWorkFor30s();
```

If you run this, the first logged stats will be something like:
```
{
  percentile25: 0.3971829999999983,
  percentile50: 1.8717430000000022,
  percentile75: 3.477375000000002,
  percentile95: 5.050238999999998,
  percentile99: 5.115774999999999,
  min: 0.003968000000000416,
  max: 30014.771071,
  mean: 54.257005300869565,
  stddev: 1250.17883463233
}
```

Besides the max value, these stats fail to tell the story of how the event loop was blocked for 50% of the time between the start of the script and the first log 60s later.
What is the expected behavior?
The main point of monitoring Node.js's event loop delay is to see if async work is being delayed because of excessive synchronous work.
If this were running on a server, I would have liked these stats to give me some information on what percentage of requests would have been delayed because of heavy sync work over one minute.
In this scenario, if I had a server that handled 1 request/second, 50% of the requests over 1 minute would have been delayed because of sync work. However, the event loop monitor stats paint a picture that everything is "mostly fine" except for 1 sample.
What do you see instead?
It seems to me that the percentiles are based on the number of samples, but they fail to take into account the varying number of samples over fixed periods of time.
IIUC, there is a timer running at the libuv level which is set to trigger every `resolution` ms. However, if this timer gets delayed by `resolution * N` ms, then we end up with around N fewer samples over the same period of time than if there hadn't been any sync work. This means that the bigger the delay, the fewer samples we take of big delays. This biases the stats towards small measurements.
In numbers:
30s is 50% of 60s, so I would have expected the percentiles to show something that could have been interpreted as "the event loop was blocked around 50% of the time in the last sampled period".
However:
- During the 30s of sync work we only took 1 sample.
- During the other 30s, without any delay and with a resolution of 50ms, we took ~600 samples (which are usually around 0 - 5 ms).
- 1 / 601 ≈ 0.0017
So from this point of view, yeah, the 1 sample is less than 1% of all samples, but that doesn't seem really helpful.
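Spelling that arithmetic out in a quick sketch (the numbers below just restate the scenario above; nothing here comes from the perf_hooks API):

```js
// Rough sample-count math for the scenario above: a 60s reporting window,
// a 50ms resolution, and a single 30s synchronous block.
const windowMs = 60000;
const resolutionMs = 50;
const blockedMs = 30000;

const idleSamples = (windowMs - blockedMs) / resolutionMs; // ~600 samples of ~0-5ms
const blockedSamples = 1;                                  // one ~30000ms sample
const blockedShare = blockedSamples / (idleSamples + blockedSamples);

console.log(blockedShare.toFixed(4)); // ~0.0017, i.e. the 30s block is <1% of all samples
```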
I understand that from a technical standpoint, if the event loop is blocked, you can't take any more samples. However, as a thought experiment, imagine that even if it is blocked you could still start a timer every `resolution` ms. Their reported delay values would be something like:
```
timer 1:   30000ms
timer 2:   29950ms
timer 3:   29900ms
...
timer 600: 50ms
```
If the histogram had these samples plus the 600 samples of ~1.8ms delays taken while the loop was idle, the stats would look much different:
```
{
  percentile25: 1.8,
  percentile50: 25.9,
  percentile75: 15012.5,
  percentile95: 27002.5,
  percentile99: 29400.5,
  min: 1.8,
  max: 30000,
  mean: 7513.4,
  stddev: 9691.441089607537
}
```
I came up with these numbers by doing this in Python:
```python
import numpy as np

smol = [1.8] * 600
big = list(range(50, 30001, 50))
samples = np.array(smol + big)

np.percentile(samples, 25)
np.percentile(samples, 50)
# ... etc
```

Additional information
Given that it's not actually feasible to add extra samples during periods where the event loop is blocked, I think that, from the user's perspective, we're missing the total number of samples taken. With that, it would be possible to approximately "un-skew" these values based on the number of samples, the resolution value and the total time during which the event loop monitor has been running. Although that would only give very rough approximations 🤔
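Just to illustrate, here is a rough user-land sketch of the thought experiment above: a self-rescheduling timer that measures its own delay and, when a delay spans several missed ticks, also records the intermediate samples that monitorEventLoopDelay never takes. Nothing here is part of perf_hooks; the names (startUnskewedMonitor, recordSamples) are made up, and a real implementation would want a proper histogram instead of a plain array:

```js
// Hypothetical user-land approximation of the thought experiment above.
// It does NOT use monitorEventLoopDelay; it keeps its own sample list so that
// long blocks can be "filled in" with the samples that were never taken.
const recordSamples = (samples, delayMs, resolutionMs) => {
  // Push the observed delay plus one synthetic sample per missed tick,
  // mirroring "timer 1: 30000ms, timer 2: 29950ms, ..., timer 600: 50ms".
  let d = delayMs;
  do {
    samples.push(d);
    d -= resolutionMs;
  } while (d > 0);
};

const startUnskewedMonitor = (resolutionMs = 50) => {
  const samples = [];
  let expected = Date.now() + resolutionMs;

  const tick = () => {
    // How late did this tick fire compared to when we asked for it?
    const delay = Math.max(Date.now() - expected, 0);
    recordSamples(samples, delay, resolutionMs);
    expected = Date.now() + resolutionMs;
    setTimeout(tick, resolutionMs);
  };
  setTimeout(tick, resolutionMs);

  return {
    // Naive nearest-rank percentile over all (real + synthetic) samples.
    percentile(p) {
      if (samples.length === 0) return 0;
      const sorted = [...samples].sort((a, b) => a - b);
      const idx = Math.min(
        sorted.length - 1,
        Math.max(0, Math.ceil((p / 100) * sorted.length) - 1)
      );
      return sorted[idx];
    },
    reset() {
      samples.length = 0;
    }
  };
};
```

With something like this running next to the reproduction script, the 30s block would dominate the upper percentiles rather than showing up only in max. That said, exposing the total sample count from the built-in histogram, as suggested above, would allow a similar (if rough) correction without running a second timer.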