Conversation

simosund
Contributor

Currently netstacklat may report latency for head-of-line (HOL) blocked TCP reads. TCP HOL-blocking can occur on packet loss or reordering, when the TCP stack has to delay delivering a segment to the socket receive queue until the preceding TCP segments have been received.

As this HOL-blocking depends on factors external to the local host (i.e. the packet loss and latency of the entire network path from the sender), it is not the type of latency we typically want to highlight with netstacklat. This PR addresses the issue by not recording the latency for reads that may have experienced HOL-blocking.

See the commit messages for further details on how this is accomplished and some limitations.

The 'tcp-socket-read' hook currently reports the latency for the skb
containing the last TCP segment read from the socket. However, this
segment might have been head-of-line (HOL) blocked by a missing
previous segment. In that case, netstacklat's reported latency will
include HOL-blocking periods that depend on external factors (such as
network packet loss, and the network latency, which affects the
retransmission time). As netstacklat is primarily intended to identify
issues within the local host (in the network stack or receiving
applications), filter out any socket reads where the last read SKB
might have experienced HOL-blocking.

To exclude HOL-blocked reads, track whenever a TCP segment arrives
out-of-order (ooo), i.e. ahead of the expected next sequence. These
segments are kept in a separate ooo-queue and later merged back into
the socket receive queue once the missing segment gap has been
filled. The ooo-segments are therefore the only segments that may
experience HOL-blocking (as non-ooo segments are immediately added to
the socket receive queue). Specifically, track the right edge (the
maximum) of the ooo sequence range. If the final byte of any socket
read is lower than the right edge of the ooo sequence range, filter it
out as potentially HOL-blocked. If the last read byte is ahead of the
ooo-range, the last byte must have been read from a segment that
arrived in-order and therefore hasn't experienced HOL-blocking, so
it's safe to include the latency measurement. Furthermore, also
invalidate the ooo-range when the last read byte is ahead of it, to
prevent an old ooo-range value from sticking around until the
sequence number wraps around.
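
For illustration, a rough sketch of what this could look like in BPF C
is shown below. The map layout, the choice of keying on the socket
pointer, and all struct/function names are made up for the example and
are not the exact netstacklat code.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

struct ooo_range {
	__u32 right_edge; /* max end_seq seen among ooo segments */
	bool active;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, __u64);             /* struct sock pointer */
	__type(value, struct ooo_range);
} ooo_ranges SEC(".maps");

/* true if sequence a is after b, handling 32-bit wraparound */
static __always_inline bool seq_after(__u32 a, __u32 b)
{
	return (__s32)(a - b) > 0;
}

SEC("fentry/tcp_data_queue_ofo")
int BPF_PROG(track_ooo, struct sock *sk, struct sk_buff *skb)
{
	struct tcp_skb_cb *cb = (struct tcp_skb_cb *)&skb->cb[0];
	__u32 end_seq = BPF_CORE_READ(cb, end_seq);
	struct ooo_range new_range = { .right_edge = end_seq, .active = true };
	__u64 key = (__u64)sk;
	struct ooo_range *range;

	/* Extend the tracked right edge of the ooo sequence range */
	range = bpf_map_lookup_elem(&ooo_ranges, &key);
	if (!range || !range->active || seq_after(end_seq, range->right_edge))
		bpf_map_update_elem(&ooo_ranges, &key, &new_range, BPF_ANY);
	return 0;
}

SEC("fentry/tcp_recv_timestamp")
int BPF_PROG(filter_read, struct msghdr *msg, struct sock *sk)
{
	struct tcp_sock *tp = (struct tcp_sock *)sk;
	__u32 copied_seq = BPF_CORE_READ(tp, copied_seq);
	__u64 key = (__u64)sk;
	struct ooo_range *range;

	range = bpf_map_lookup_elem(&ooo_ranges, &key);
	if (range && range->active) {
		if (!seq_after(copied_seq, range->right_edge))
			return 0; /* possibly HOL-blocked, skip sample */
		range->active = false; /* read is past the ooo-range */
	}

	/* ...record the tcp-socket-read latency sample as usual... */
	return 0;
}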

There are some scenarios that may cause this ooo-filtering to fail.
- If the program is started in the middle of an ongoing flow, there
  may be ooo segments that arrived before we could record them in
  the tracked ooo-range. This may cause the latency for some initial
  HOL-blocked reads to be reported.

- If multiple reads are done on the socket concurrently, we may not
  correctly track the last read byte. The kernel does not hold a lock
  on the TCP socket at the time our hooked function
  tcp_recv_timestamp() runs. If two reads are done in parallel, it's
  therefore possible that for both reads we will check the last read
  byte (tcp_sock.copied_seq) after the second read has updated it. We
  may then incorrectly conclude that the first read was ahead of the
  ooo-range when it was not, and record its latency when we should
  have excluded it. In practice I believe this issue should be quite
  rare, as most applications will probably not attempt multiple
  concurrent reads on a single connected TCP socket (as you then
  cannot know which part of the payload each read will return).

- The tracked ooo-range may be concurrently updated at
  tcp_data_queue_ofo() and tcp_recv_timestamp(), leading to
  inconsistent state.

The last of these issues will be addressed in a later commit that
adds bpf spin locks.

Signed-off-by: Simon Sundberg <[email protected]>
Currently there is no synchronization between the updates and checks
of the per-socket ooo-range used to filter out potentially HOL-blocked
measurements. While the kernel holds a socket lock when the ooo-range
is updated in tcp_data_queue_ofo(), the kernel does not seem to hold a
socket lock by the time we check the ooo-range in
tcp_recv_timestamp(). To prevent reading the ooo-range while it is
concurrently updated (which could result in an inconsistent state),
add a bpf_spin_lock protecting both the updates and the checks of the
ooo-range.
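
As an illustration (reusing the made-up struct and seq_after() helper
from the previous sketch, so the names are assumptions rather than the
actual patch), the lock could live in the map value and be taken
around both the update and the check:

struct ooo_range {
	struct bpf_spin_lock lock;
	__u32 right_edge; /* max end_seq seen among ooo segments */
	bool active;
};

/* Called from the fentry hook on tcp_data_queue_ofo() with the
 * socket's existing map entry and the new segment's end sequence. */
static __always_inline void update_ooo_range(struct ooo_range *range,
					     __u32 end_seq)
{
	bpf_spin_lock(&range->lock);
	if (!range->active || seq_after(end_seq, range->right_edge)) {
		range->right_edge = end_seq;
		range->active = true;
	}
	bpf_spin_unlock(&range->lock);
}

/* Called from the fentry hook on tcp_recv_timestamp(); returns true
 * if the read may have been HOL-blocked and should be filtered out. */
static __always_inline bool check_ooo_range(struct ooo_range *range,
					    __u32 copied_seq)
{
	bool blocked = false;

	bpf_spin_lock(&range->lock);
	if (range->active) {
		if (seq_after(copied_seq, range->right_edge))
			range->active = false; /* read passed the range */
		else
			blocked = true;
	}
	bpf_spin_unlock(&range->lock);
	return blocked;
}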

Signed-off-by: Simon Sundberg <[email protected]>
To make netstacklat only report latency caused by the local network
stack and applications, it by default excludes TCP reads that may be
head-of-line (HOL) blocked by a missing segment.

However, in some scenarios, it may be desirable to still include these
HOL-blocked measurements. For example, in a data center where the full
network is under your control, high latency caused by drops in the
data center network could still be an issue you wish netstacklat to
detect. Therefore, add the -y/--include-tcp-hol-delay option, which
disables the mechanism used to filter out potentially HOL-blocked
reads.

An alternative design to simply disabling the filtering mechanism
would be to separate the potentially HOL-blocked measurements into
their own histogram (e.g. "tcp-socket-read-hol-blocked"). However,
that would further increase the data cardinality (which may be a
concern when using it together with ebpf-exporter). Furthermore, by
simply disabling the filtering instead we can reduce the overhead a
bit and also allow the option on systems where the
tcp_data_queue_ofo() function cannot be probed.
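
For illustration, the option could roughly be wired up as below; the
variable and skeleton names are assumptions for this sketch, not
necessarily what the patch uses. When the filter is disabled, the
rodata constant lets the verifier dead-code eliminate the check at
load time.

/* BPF side: rodata flag that user space may set before loading.
 * The name is an assumption for this sketch. */
const volatile bool include_tcp_hol_delay = false;

SEC("fentry/tcp_recv_timestamp")
int BPF_PROG(netstacklat_tcp_recv_timestamp, struct msghdr *msg,
	     struct sock *sk)
{
	/* may_be_hol_blocked() would wrap the ooo-range lookup and
	 * check from the earlier sketches */
	if (!include_tcp_hol_delay && may_be_hol_blocked(sk))
		return 0; /* skip potentially HOL-blocked sample */

	/* ...record the tcp-socket-read latency as before... */
	return 0;
}

/* User-space side, with a libbpf skeleton, before loading: */
obj = netstacklat_bpf__open();
obj->rodata->include_tcp_hol_delay = config.include_tcp_hol_delay;
err = netstacklat_bpf__load(obj);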

Signed-off-by: Simon Sundberg <[email protected]>
@simosund
Contributor Author

simosund commented Sep 29, 2025

In the following test I demonstrate the issue this PR attempts to fix and what the impact of this patch is. Here I'm fetching some 100 kB files over a link with 34 ms latency and packet loss varying between 0%, 0.1%, 0.5%, 1%, 2%, 5%, 10%, and finally back to 0% again.

Interactive versions of the graphs below, together with some additional data, are available in this Grafana snapshot (the "before" data corresponds to the earlier half of the snapshot, and "after" to the later half).

NOTE: "Before" is actually still from this PR, just with the HOL-blocking filter it introduces disabled, roughly corresponding to passing the new --include-tcp-hol-delay option. It sets a constant in the rodata, so the JIT compiler should dead-code eliminate the whole thing.

Impact on latency distribution

Before

[graph: netstacklat_tcpread_withoutholfilt]

As can be seen, as soon as there's some packet loss we start seeing a small fraction of significantly higher latencies, several of them starting in the 16ms+ bucket (that bucket ends just below the configured 34ms link latency). As the packet loss is increased, we see a roughly proportional growth in the fraction of high latency values caused by HOL-blocking. At higher packet loss rates we even see some latencies stretching into the 500ms+ range (likely from timeouts or the same segment needing to be retransmitted multiple times).

After

[graph: netstacklat_tcpread_withholfilt]

The filter successfully avoids recording any latency values that suffer from considerable HOL-blocking. We can still see a slight shift in the latency distribution for the non-HOL-blocked reads, though.

Number of measurements

NOTE: I've updated this section, as in the initial posting I accidentally used the wrong hook (tcp-socket-enqueued) and consequently found no difference in the number of measurements between before and after (which is to be expected, as this PR has no effect on the tcp-socket-enqueued hook). I've now updated this with the correct figures for tcp-socket-read.

Before:

[graph: netstacklat_tcpread_count_withoutholfilt]

After:

[graph: netstacklat_tcpread_count_withholfilt]

As expected, we see a small drop in the number of measurements when we exclude the potentially HOL-blocked reads. As the packet loss increases, the relative reduction in samples grows. Overall, the reduction is pretty small though, suggesting that we're not discarding a large number of valid samples.

eBPF program runtime

Before

[graph: netstacklat_runtime_withoutholfilt]

After

[graph: netstacklat_runtime_withholfilt]

The filter clearly adds a bit of overhead to the netstacklat_tcp_recv_timestamp program (which is the one recording the tcp-socket-read latency). When there's no packet loss, the difference appears to be quite small, with both before and after showing around 250ns per invocation. The overhead does, however, seem to grow as the packet loss increases, reaching around 400ns at 1% packet loss (vs around 280ns before). Past 1%, the overhead drops back down to around 330ns (vs 280-320ns before).

Clearly there is some variation in the measured runtime here and the numbers are not super exact. Overall though, I would say that the additional overhead is measurable but most of the time not too problematic. As this shows the average overhead, there is a bit of a tradeoff between the additional work of looking up the ooo-range and comparing it to the last read byte, and the lower overhead in the instances where we exclude the measurement and then don't have to do the hashmap lookups to update the histogram.

Here we can also see the netstacklat_tcp_queue_ofo() program, which is responsible for tracking the upper end of the ooo sequence range whenever there's an out-of-order packet. In the before case it's basically a NOP (but still measures in at 30ns here). In the after case it still only takes 70-100ns. As it only runs when there are out-of-order packets, the overall overhead from it is fairly low. In these runs, it only ran about 2.5% as frequently as netstacklat_tcp_recv_timestamp at 0.1% packet loss, and about 25% as frequently at 10% packet loss.

Total (estimated) eBPF CPU overhead

Before

[graph: netstacklat_estimatedcpu_withoutholfilt]

After

[graph: netstacklat_estimatedcpu_withholfilt]

This is the estimated CPU overhead for all eBPF programs from netstacklat (based on the cumulative runtime for each eBPF program invocation as measured by kernel.bpf_stats_enabled). The largest difference is at 0.1% packet loss, where the total CPU utilization increased by around 10% (from ~0.175% to ~0.195%). If we focus only on the affected programs, netstacklat_tcp_recv_timestamp and the new netstacklat_tcp_data_queue_ofo, this increases to about 35% (from ~0.033% to ~0.045%). For all the other cases the relative increase in overhead is lower, with around 25% at 1% packet loss, 10% at 10% loss, and 5% at 0% loss.

@simosund
Contributor Author

When you have time @netoptimizer, could you check if tcp_data_queue_ofo() is traceable on your systems, i.e. that you can attach an fentry program to it?

$ sudo bpftrace -lv 'fentry:tcp_data_queue_ofo'
kfunc:vmlinux:tcp_data_queue_ofo
    struct sock * sk
    struct sk_buff * skb

I discovered that on some systems I could not trace tcp_data_queue_ofo(), presumably because it has been inlined. If tcp_data_queue_ofo() is unavailable, I'm not really sure how to solve this issue. We could possibly kprobe tcp_data_queue() at the offset where tcp_data_queue_ofo() is called (the only call site), but kprobes with offsets seem cumbersome (based on Toke's blog post) and hardly portable.
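
One way to at least degrade gracefully on such systems (just a sketch,
not something this PR implements) could be for the loader to check
vmlinux BTF for the function and fall back to the
--include-tcp-hol-delay behaviour when it is missing. The skeleton
field names below are assumed:

#include <bpf/btf.h>
#include <bpf/libbpf.h>

/* Returns true if tcp_data_queue_ofo() is visible in vmlinux BTF,
 * i.e. fentry attachment has a chance of working. */
static bool tcp_data_queue_ofo_traceable(void)
{
	struct btf *vmlinux_btf = btf__load_vmlinux_btf();
	bool found;

	if (!vmlinux_btf)
		return false;

	found = btf__find_by_name_kind(vmlinux_btf, "tcp_data_queue_ofo",
				       BTF_KIND_FUNC) > 0;
	btf__free(vmlinux_btf);
	return found;
}

/* Before loading the skeleton: skip the ooo-tracking program and
 * disable the HOL filter if the function cannot be traced. */
if (!tcp_data_queue_ofo_traceable()) {
	bpf_program__set_autoload(obj->progs.netstacklat_tcp_data_queue_ofo,
				  false);
	obj->rodata->include_tcp_hol_delay = true;
}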
