Conversation

simosund
Contributor

Currently netstacklat may report latency for head-of-line (HOL) blocked TCP reads. TCP HOL-blocking can occur on packet loss or reordering, when the TCP stack has to delay delivering a segment to the socket receive queue until the preceding TCP segments have been received.

As this HOL-blocking depends on factors external to the local host (i.e. the packet loss and latency of the entire network path from the sender), it is not the type of latency we typically want to highlight with netstacklat. This PR addresses the issue by not recording the latency for reads that may have experienced HOL-blocking.

See the commit messages for further details on how this is accomplished and some limitations.

The 'tcp-socket-read' hook currently reports the latency for the skb
containing the last TCP segment read from the socket. However, this
segment might have been head-of-line (HOL) blocked by a missing
previous segment. In that case, netstacklat's reported latency will
include HOL-blocking periods that depend on external factors (such as
network packet loss, and the network latency, which affects the
retransmission time). As netstacklat is primarily intended to identify
issues within the local host (in the network stack or receiving
applications), filter out any socket reads where the last read SKB
might have experienced HOL-blocking.

To exclude HOL-blocked reads, track whenever a TCP segment arrives
out-of-order (ooo), i.e. ahead of the expected next sequence. These
segments are kept in a separate ooo-queue and later merged back into
the socket receive queue once the missing segment gap has been
filled. The ooo-segments are therefore the only segments that may
experience HOL-blocking (as non-ooo segments are immediately added to
the socket receive queue). Specifically, track the right edge (the
maximum) of the ooo sequence range. If the final byte of any socket
read is lower than the right edge of the ooo sequence range, filter it
out as potentially HOL-blocked. If the last read byte is ahead of the
ooo-range, the last byte must have been read from a segment that
arrived in-order and therefore hasn't experienced HOL-blocking, so
it's safe to include the latency measurement. Furthermore, also
invalidate the ooo-range when the last read byte is ahead of it, to
prevent an old ooo-range value from sticking around until the
sequence number wraps around.
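
For illustration, a rough sketch of what this could look like in BPF C
is shown below. The map layout, the choice of keying on the socket
pointer, and all struct/function names are made up for the example and
are not the exact netstacklat code.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

struct ooo_range {
	__u32 right_edge; /* max end_seq seen among ooo segments */
	bool active;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, __u64);             /* struct sock pointer */
	__type(value, struct ooo_range);
} ooo_ranges SEC(".maps");

/* true if sequence a is after b, handling 32-bit wraparound */
static __always_inline bool seq_after(__u32 a, __u32 b)
{
	return (__s32)(a - b) > 0;
}

SEC("fentry/tcp_data_queue_ofo")
int BPF_PROG(track_ooo, struct sock *sk, struct sk_buff *skb)
{
	struct tcp_skb_cb *cb = (struct tcp_skb_cb *)&skb->cb[0];
	__u32 end_seq = BPF_CORE_READ(cb, end_seq);
	struct ooo_range new_range = { .right_edge = end_seq, .active = true };
	__u64 key = (__u64)sk;
	struct ooo_range *range;

	/* Extend the tracked right edge of the ooo sequence range */
	range = bpf_map_lookup_elem(&ooo_ranges, &key);
	if (!range || !range->active || seq_after(end_seq, range->right_edge))
		bpf_map_update_elem(&ooo_ranges, &key, &new_range, BPF_ANY);
	return 0;
}

SEC("fentry/tcp_recv_timestamp")
int BPF_PROG(filter_read, struct msghdr *msg, struct sock *sk)
{
	struct tcp_sock *tp = (struct tcp_sock *)sk;
	__u32 copied_seq = BPF_CORE_READ(tp, copied_seq);
	__u64 key = (__u64)sk;
	struct ooo_range *range;

	range = bpf_map_lookup_elem(&ooo_ranges, &key);
	if (range && range->active) {
		if (!seq_after(copied_seq, range->right_edge))
			return 0; /* possibly HOL-blocked, skip sample */
		range->active = false; /* read is past the ooo-range */
	}

	/* ...record the tcp-socket-read latency sample as usual... */
	return 0;
}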

There are some scenarios that may cause this ooo-filtering to fail.
- If the program is started in the middle of an ongoing flow, there
  may be ooo segments that arrived before we could record them in
  the tracked ooo-range. This may cause the latency for some initial
  HOL-blocked reads to be reported.

- If multiple reads are done on the socket concurrently, we may not
  correctly track the last read byte. The kernel does not hold a lock
  on the TCP socket at the time our hooked function
  tcp_recv_timestamp() runs. If two reads are done in parallel, it's
  therefore possible that for both reads we will check the last read
  byte (tcp_sock.copied_seq) after the second read has updated it. We
  may then incorrectly conclude that the first read was ahead of the
  ooo-range when it was not, and record its latency when we should
  have excluded it. In practice I believe this issue should be quite
  rare, as most applications will probably not attempt multiple
  concurrent reads on a single connected TCP socket (as you then
  cannot know which part of the payload each read will return).

- The tracked ooo-range may be concurrently updated at
  tcp_data_queue_ofo() and tcp_recv_timestamp(), leading to
  inconsistent state.

The last of these issues will be addressed in a later commit that
adds bpf spin locks.

Signed-off-by: Simon Sundberg <[email protected]>
Currently there is no synchronization between the updates and checks
of the per-socket ooo-range used to filter out potentially HOL-blocked
measurements. While the kernel holds a socket lock when the ooo-range
is updated in tcp_data_queue_ofo(), the kernel does not seem to hold a
socket lock by the time we check the ooo-range in
tcp_recv_timestamp(). To prevent reading the ooo-range while it is
concurrently updated (which could result in an inconsistent state),
add a bpf_spin_lock protecting both the updates and the checks of the
ooo-range.
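
As an illustration (reusing the made-up struct and seq_after() helper
from the previous sketch, so the names are assumptions rather than the
actual patch), the lock could live in the map value and be taken
around both the update and the check:

struct ooo_range {
	struct bpf_spin_lock lock;
	__u32 right_edge; /* max end_seq seen among ooo segments */
	bool active;
};

/* Called from the fentry hook on tcp_data_queue_ofo() with the
 * socket's existing map entry and the new segment's end sequence. */
static __always_inline void update_ooo_range(struct ooo_range *range,
					     __u32 end_seq)
{
	bpf_spin_lock(&range->lock);
	if (!range->active || seq_after(end_seq, range->right_edge)) {
		range->right_edge = end_seq;
		range->active = true;
	}
	bpf_spin_unlock(&range->lock);
}

/* Called from the fentry hook on tcp_recv_timestamp(); returns true
 * if the read may have been HOL-blocked and should be filtered out. */
static __always_inline bool check_ooo_range(struct ooo_range *range,
					    __u32 copied_seq)
{
	bool blocked = false;

	bpf_spin_lock(&range->lock);
	if (range->active) {
		if (seq_after(copied_seq, range->right_edge))
			range->active = false; /* read passed the range */
		else
			blocked = true;
	}
	bpf_spin_unlock(&range->lock);
	return blocked;
}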

Signed-off-by: Simon Sundberg <[email protected]>
To make netstacklat only report latency caused by the local network
stack and applications, it by default excludes TCP reads that may be
head-of-line (HOL) blocked by a missing segment.

However, in some scenarios, it may be desirable to still include these
HOL-blocked measurements. For example, in a data center where the full
network is under your control, high latency caused by drops in the
data center network could still be an issue you wish netstacklat to
detect. Therefore, add the -y/--include-tcp-hol-delay option, which
disables the mechanism used to filter out potentially HOL-blocked
reads.

An alternative design to simply disabling the filtering mechanism
would be to separate the potentially HOL-blocked measurements into
their own histogram (e.g. "tcp-socket-read-hol-blocked"). However,
that would further increase the data cardinality (which may be a
concern when using it together with ebpf-exporter). Furthermore, by
simply disabling the filtering instead we can reduce the overhead a
bit and also allow the option on systems where the
tcp_data_queue_ofo() function cannot be probed.
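
For illustration, the option could roughly be wired up as below; the
variable and skeleton names are assumptions for this sketch, not
necessarily what the patch uses. When the filter is disabled, the
rodata constant lets the verifier dead-code eliminate the check at
load time.

/* BPF side: rodata flag that user space may set before loading.
 * The name is an assumption for this sketch. */
const volatile bool include_tcp_hol_delay = false;

SEC("fentry/tcp_recv_timestamp")
int BPF_PROG(netstacklat_tcp_recv_timestamp, struct msghdr *msg,
	     struct sock *sk)
{
	/* may_be_hol_blocked() would wrap the ooo-range lookup and
	 * check from the earlier sketches */
	if (!include_tcp_hol_delay && may_be_hol_blocked(sk))
		return 0; /* skip potentially HOL-blocked sample */

	/* ...record the tcp-socket-read latency as before... */
	return 0;
}

/* User-space side, with a libbpf skeleton, before loading: */
obj = netstacklat_bpf__open();
obj->rodata->include_tcp_hol_delay = config.include_tcp_hol_delay;
err = netstacklat_bpf__load(obj);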

Signed-off-by: Simon Sundberg <[email protected]>
@simosund
Contributor Author

simosund commented Sep 29, 2025

In the following test I demonstrate the issue this PR attempts to fix and what the impact of this patch is. Here I'm fetching some 100 kB files over a link with 34 ms latency and packet loss varying between 0%, 0.1%, 0.5%, 1%, 2%, 5%, 10%, and finally back to 0% again.

Interactive versions of the graphs below, together with some additional data, are available in this Grafana snapshot (the "before" data corresponds to the earlier half of the snapshot, and "after" to the later half).

NOTE: "Before" is actually still from this PR, just with the HOL-blocking filter it introduces disabled, roughly corresponding to passing the new --include-tcp-hol-delay option. It sets a constant in the rodata, so the JIT compiler should dead-code eliminate the whole thing.

Impact on latency distribution

Before

[graph: netstacklat_tcpread_withoutholfilt]

As can be seen, as soon as there's some packet loss we start seeing a small fraction of significantly higher latencies, several of them starting in the 16ms+ bucket (that bucket ends just below the configured 34ms link latency). As the packet loss is increased, we see a roughly proportional growth in the fraction of high latency values caused by HOL-blocking. At higher packet loss rates we even see some latencies stretching into the 500ms+ range (likely from timeouts or the same segment needing to be retransmitted multiple times).

After

[graph: netstacklat_tcpread_withholfilt]

The filter successfully avoids recording any latency values that suffer from considerable HOL-blocking. We can still see a slight shift in the latency distribution for the non-HOL-blocked reads, though.

Number of measurements

NOTE: I've updated this section, as in the initial posting I accidentally used the wrong hook (tcp-socket-enqueued) and consequently found no difference in the number of measurements between before and after (which is to be expected, as this PR has no effect on the tcp-socket-enqueued hook). I've now updated this with the correct figures for tcp-socket-read.

Before:

[graph: netstacklat_tcpread_count_withoutholfilt]

After:

[graph: netstacklat_tcpread_count_withholfilt]

As expected, we see a small drop in the number of measurements when we exclude the potentially HOL-blocked reads. As the packet loss increases, the relative reduction in samples grows. Overall, the reduction is pretty small though, suggesting that we're not discarding a large number of valid samples.

eBPF program runtime

Before

[graph: netstacklat_runtime_withoutholfilt]

After

[graph: netstacklat_runtime_withholfilt]

The filter clearly adds a bit of overhead to the netstacklat_tcp_recv_timestamp program (which is the one recording the tcp-socket-read latency). When there's no packet loss, the difference appears to be quite small, with both before and after showing around 250ns per invocation. The overhead does, however, seem to grow as the packet loss increases, reaching around 400ns at 1% packet loss (vs around 280ns before). Past 1%, the overhead drops back down to around 330ns (vs 280-320ns before).

Clearly there is some variation in the measured runtime here and the numbers are not super exact. Overall though, I would say that the additional overhead is measurable but most of the time not too problematic. As this shows the average overhead, there is a bit of a tradeoff between the additional work of looking up the ooo-range and comparing it to the last read byte, and the lower overhead in the instances where we exclude the measurement and then don't have to do the hashmap lookups to update the histogram.

Here we can also see the netstacklat_tcp_queue_ofo() program, which is responsible for tracking the upper end of the ooo sequence range whenever there's an out-of-order packet. In the before case it's basically a NOP (but still measures in at 30ns here). In the after case it still only takes 70-100ns. As it only runs when there are out-of-order packets, the overall overhead from it is fairly low. In these runs, it only ran about 2.5% as frequently as netstacklat_tcp_recv_timestamp at 0.1% packet loss, and about 25% as frequently at 10% packet loss.

Total (estimated) eBPF CPU overhead

Before

[graph: netstacklat_estimatedcpu_withoutholfilt]

After

[graph: netstacklat_estimatedcpu_withholfilt]

This is the estimated CPU overhead for all eBPF programs from netstacklat (based on the cumulative runtime for each eBPF program invocation as measured by kernel.bpf_stats_enabled). The largest difference is at 0.1% packet loss, where the total CPU utilization increased by around 10% (from ~0.175% to ~0.195%). If we focus only on the affected programs, netstacklat_tcp_recv_timestamp and the new netstacklat_tcp_data_queue_ofo, this increases to about 35% (from ~0.033% to ~0.045%). For all the other cases the relative increase in overhead is lower, with around 25% at 1% packet loss, 10% at 10% loss, and 5% at 0% loss.

@simosund
Contributor Author

When you have time @netoptimizer, could you check if tcp_data_queue_ofo() is traceable on your systems, i.e. that you can attach an fentry program to it?

$ sudo bpftrace -lv 'fentry:tcp_data_queue_ofo'
kfunc:vmlinux:tcp_data_queue_ofo
    struct sock * sk
    struct sk_buff * skb

I discovered that on some systems I could not trace tcp_data_queue_ofo(), presumably because it has been inlined. If tcp_data_queue_ofo() is unavailable, I'm not really sure how to solve this issue. We could possibly kprobe tcp_data_queue() at the offset where tcp_data_queue_ofo() is called (the only call site), but kprobes with offsets seem cumbersome (based on Toke's blog post) and hardly portable.
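
One way to at least degrade gracefully on such systems (just a sketch,
not something this PR implements) could be for the loader to check
vmlinux BTF for the function and fall back to the
--include-tcp-hol-delay behaviour when it is missing. The skeleton
field names below are assumed:

#include <bpf/btf.h>
#include <bpf/libbpf.h>

/* Returns true if tcp_data_queue_ofo() is visible in vmlinux BTF,
 * i.e. fentry attachment has a chance of working. */
static bool tcp_data_queue_ofo_traceable(void)
{
	struct btf *vmlinux_btf = btf__load_vmlinux_btf();
	bool found;

	if (!vmlinux_btf)
		return false;

	found = btf__find_by_name_kind(vmlinux_btf, "tcp_data_queue_ofo",
				       BTF_KIND_FUNC) > 0;
	btf__free(vmlinux_btf);
	return found;
}

/* Before loading the skeleton: skip the ooo-tracking program and
 * disable the HOL filter if the function cannot be traced. */
if (!tcp_data_queue_ofo_traceable()) {
	bpf_program__set_autoload(obj->progs.netstacklat_tcp_data_queue_ofo,
				  false);
	obj->rodata->include_tcp_hol_delay = true;
}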
