阅读视图

TCP Requests Stuck After Connection Established（2024）

2024年6月26日 08:00

This post describes a kernel & BPF networking problem and the trouble shooting steps, which is an interesting case for delving into Linux kernel networking intricacies.

Fig. Phenomenon of a reported issue.

1 Trouble report

1.1 Phenomenon: `probabilistic health check failures`

Users reported intermittent failures of their pods, despite them run as usual with no exceptions.

The health check is a very simple HTTP probe over TCP: kubelet periodically (e.g. every 5s) sends GET requests to local pods, initiating a new TCP connection with each request.

Fig. Intermittent health check failures of pods.

Users suspect this is a network problem.

1.2 Scope: specific pods on specific nodes

This reported issue is confined to a new k8s cluster, with recently introduced OS and kernel:

OS: AliOS (AlibabaCloud OS)
Kernel: cloud-kernel 5.10.134-16.al8.x86_64 (a fork of Linux, gitee.com/anolis/cloud-kernel), which includes their upstream feature backports and self-maintanined changes, for example,
1. Intel AMX (Advanced Matrix Extensions) for AI workloads, offering a hardware acceleration alternative to GPUs in certain scenarios, such as inference for LLMs smaller than 13B. AMX support was first introduced in kernel 5.16, cloud-kernel backported the feature to its current version 5.10;
2. cloud-kernel includes un-upstreamed modifications like new kernel structure fields and new enums/types.

Other environment info:

Cilium: self-maintained v1.11.10
CNCF Case Study: How Trip.com Group switched to Cilium For Scalable and Cloud Native Networking, 2023

2 Networking fundamentals

Before starting our exploration, let’s outline our networking infra in this cluster.

2.1 Node network topology: `Cilium (with BPF)`

Internal networking topology of our k8s node is depicted as below:

Fig. Internal networking topology of a k8s node.

(k8s node) $ route -n
Destination  Gateway   Genmask           Use Iface
0.0.0.0      <GW-IP>   0.0.0.0           eth0
<Node-IP>    0.0.0.0   <Node-IP-Mask>    eth0
<Pod1-IP>    0.0.0.0   255.255.255.255   lxc-1
<Pod2-IP>    0.0.0.0   255.255.255.255   lxc-2
<Pod3-IP>    0.0.0.0   255.255.255.255   lxc-3

As shown in the picture and kernel routing table output, each pod has a dedicated routing entry. Consequently, all health check traffic is directed straight to the lxc device (the host-side device of the pod’s veth pair), subsequently entering the Pod. In another word, all the health check traffic is processed locally.

Cilium has a similar networking topology on AlibabaCloud as on AWS. For more information, refer to Cilium Network Topology and Traffic Path on AWS (2019), which may contain some stale information, but most of the content should still validate.

2.2 Kernel `5.10+`: `sockmap BPF` acceleration for `node2localPod` traffic

2.2.1 `sockops` BPF: bypass kernel stack for local traffic

How to use eBPF for accelerating Cloud Native applications offers a practical example of how sockops/sockmap BPF programs work.

Chinese readers can also refer to the following for more information,

2.2.2 `tcpdump`: only TCP `3-way/4-way` handshake packets can be captured

sockops acceleration is automatically enabled in kernel 5.10 + Cilium v1.11.10:

Fig. Socket-level acceleration in Cilium. Note that the illustration depicts local processes communicating via loopback, which differs from the scenario discussed here, just too lazy draw a new picture.

One big conceptual change is that when sockops BPF is enabled, you could not see request & response packets in tcpdump output, as in this setup, only TCP 3-way handshake and 4-way close procedure still go through kernel networking stack, all the payload will directly go through the socket-level (e.g. in tcp/udp send/receive message) methods.

A quick test to illustrate the idea: access a server in pod from the node:

(node) $ curl <pod ip>:<port>

The output of tcpdump:

(pod) $ tcpdump -nn -i eth0 host <node ip> and <port>
# TCP 3-way handshake
IP NODE_IP.36942 > POD_IP.8080: Flags [S]
IP POD_IP.8080   > NODE_IP.36942: Flags [S.]
IP NODE_IP.36942 > POD_IP.8080: Flags [.]

# requests & responses, no packets go through there, they are bypassed,
# payloads are transferred directly in socket-level TCP methods

# TCP 4-way close
IP POD_IP.8080   > NODE_IP.36942: Flags [F.]
IP NODE_IP.36942 > POD_IP.8080: Flags [.]
IP NODE_IP.36942 > POD_IP.8080: Flags [F.]
IP POD_IP.8080   > NODE_IP.36942: Flags [.]

2.3 Summary

Now we’ve got a basic undertanding about the problem and environment. It’s time to delve into practical investigation.

3 Quick narrow-down

3.1 Quick reproduction

First, check kubelet log,

$ grep "Timeout exceeded while awaiting headers" /var/log/kubernetes/kubelet.INFO
prober.go] Readiness probe for POD_XXX failed (failure):
  Get "http://POD_IP:PORT/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
...

Indeed, there are many readiness probe failures.

Since the probe is very simple HTTP request, we can do it manually on the node, this should be equivalent to the kubelet probe,

$ curl <POD_IP>:<PORT>/v1/health
OK
$ curl <POD_IP>:<PORT>/v1/health
OK
$ curl <POD_IP>:<PORT>/v1/health # stuck
^C

OK, we can easily reproduce it without relying on k8s facilities.

3.2 Narrow-down the issue

Now let’s perform some quick tests to narrow-down the problem.

3.2.1 `ping`: OK, exclude L2/L3 problem

ping PodIP from node always succeeds.

(node) $ ping <POD_IP>

This indicates L2 & L3 (ARP table, routing table, etc) connectivity functions well.

3.2.2 `telnet` connection test: OK, exclude TCP connecting problem

(node) $ telnet POD_IP PORT
Trying POD_IP...
Connected to POD_IP.
Escape character is '^]'.

Again, always succeeds, and the ss output confirms the connections always enter ESTABLISHED state:

(node) $ netstat -antp | grep telnet
tcp        0      0 NODE_IP:34316    POD_IP:PORT     ESTABLISHED 2360593/telnet

3.2.3 Remote-to-localPod `curl`: OK, exclude pod problem & vanilla kernel stack problem

Do the same health check from a remote node, always OK:

(node2) $ curl <POD_IP>:<PORT>/v1/health
OK
...
(node2) $ curl <POD_IP>:<PORT>/v1/health
OK

This rules out issues with the pod itself and the vanilla kernel stack.

3.2.4 Local pod-to-pod: OK, exclude some node-internal problems

(pod3) $ curl <POD2_IP>:<PORT>/v1/health
OK
...
(pod3) $ curl <POD2_IP>:<PORT>/v1/health
OK

Always OK. Rule out issues with the pod itself.

3.3 Summary: `only node-to-localPod TCP requests stuck probabilistically`

Fig. Test cases and results.

The difference of three cases:

Node-to-localPod: payload traffic is processed via sockops BPF;
Local Pod-to-Pod: BPF redirection (or kernel stack, based on your kernel version)
- Differentiate three types of eBPF redirections (2022)
RemoteNode-to-localPod: standard kernel networking stack

Combining these information, we guess with confidence that the problem have relationships with sockops BPF and kernel (because kernel does most of the job in sockops BPF scenarios).

From these observations, it is reasonable to deduce that the issue is likely related to sockops BPF and the kernel, given the kernel’s central role in sockops BPF scenarios.

4 Dig deeper

Now let’s explore the issue in greater depth.

4.1 Linux vs. AliOS kernel

As we’ve been using kernel 5.10.56 and cilium v1.11.10 for years and haven’t met this problem before, the first reasonable assumption is that AliOS cloud-kernel 5.10.134 may introduce some incompatible changes or bugs.

So we spent some time comparing AliOS cloud-kernel with the upstream Linux.

Note: cloud-kernel is maintained on gitee.com, which restricts most read privileges (e.g. commits, blame) without logging in, so in the remaining of this post we reference the Linux repo on github.com for discussion.

4.1.1 Compare BPF features

First, compare BPF features automatically detected by cilium-agent on the node. The result is written to a local file on the node: /var/run/cilium/state/globals/bpf_features.h,

$ diff <bpf_features.h from our 5.10.56 node> <bpf_features.h from AliOS node>

59c59
< #define NO_HAVE_XSKMAP_MAP_TYPE
---
> #define HAVE_XSKMAP_MAP_TYPE
71c71
< #define NO_HAVE_TASK_STORAGE_MAP_TYPE
---
> #define HAVE_TASK_STORAGE_MAP_TYPE
243c243
< #define BPF__PROG_TYPE_socket_filter__HELPER_bpf_ktime_get_coarse_ns 0
---
> #define BPF__PROG_TYPE_socket_filter__HELPER_bpf_ktime_get_coarse_ns 1
...

There are indeed some differences, but with further investigation, we didn’t find any correlation to the observed issue.

4.1.2 AliOS cloud-kernel specific changes

Then we spent some time to check AliOS cloud-kernel self-maintained BPF and networking modifications. Such as,

b578e4b8ed6e1c7608e07e03a061357fd79ac2dd ck: net: track the pid who created socks

In this commit, they added a pid_t pid field to the struct sock data structure.
ea0307caaf29700ff71467726b9617dcb7c0d084 tcp: make sure init the accept_queue’s spinlocks once

But again, we didn’t find any correlation to the problem.

4.2 Check detailed TCP connection stats

Without valuable information from code comparison, we redirected our focus to the environment, collecting some more detailed connection information.

ss (socket stats) is a powerful and convenient tool for socket/connection introspection:

-i/--info: show internal TCP information, including couple of TCP connection stats;
-e/--extended: show detailed socket information, including inode, uid, cookie.

4.2.1 Normal case: `ss` shows correct `segs_out/segs_in`

Initiate a connection with nc (netcat),

(node) $ nc POD_IP PORT

We intentionally not use telnet here, because telnet will close the connection immediately after a request is served successfully, which leaves us no time to check the connection stats in ss output. nc will leave the connection in CLOSE-WAIT state, which is good enough for us to check the connection send/receive stats.

Now the stats for this connection:

(node) $ ss -i | grep -A 1 50504
tcp    ESTAB      0         0         NODE_IP:50504          POD_IP:PORT
         cubic wscale:7,7 rto:200 rtt:0.059/0.029 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 1963.4Mbps lastsnd:14641 lastrcv:14641 lastack:14641 pacing_rate 3926.8Mbps delivered:1 rcv_space:14480 rcv_ssthresh:64088 minrtt:0.059

Send & receive stats: segs_out=2, segs_in=1.

Now let’s send a request to the server: type GET /v1/health HTTP/1.1\r\n then press Enter,

Actually you can type anything and just Enter, the server will most likely send you a 400 (Bad Request) response, but for our case, this 400 indicate the TCP send/receive path is perfectly OK!

(node) $ nc POD_IP PORT
GET /v1/health HTTP/1.1\r\n
<Response Here>

We’ll get the response and the connection will just entering CLOSE-WAIT state, we have some time to check it before it vanishing:

(node) $ ss -i | grep -A 1 50504
tcp     CLOSE-WAIT   0      0        NODE_IP:50504     POD_IP:http
         cubic wscale:7,7 rto:200 rtt:0.059/0.029 ato:40 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 bytes_received:1 segs_out:3 segs_in:2 send 1963.4Mbps lastsnd:24277 lastrcv:24277 lastack:4399 pacing_rate 3926.8Mbps delivered:1 rcv_space:14480 rcv_ssthresh:64088 minrtt:0.059

As expected, segs_out=3, segs_in=2.

4.2.2 Abnormal case: `ss` shows incorrect `segs_out/segs_in`

Repeat the above test to capture a failed one.

On connection established,

$ ss -i | grep -A 1 57424
tcp      ESTAB      0       0         NODE_IP:57424    POD_IP:webcache
         cubic wscale:7,7 rto:200 rtt:0.056/0.028 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 2068.6Mbps lastsnd:10686 lastrcv:10686 lastack:10686 pacing_rate 4137.1Mbps delivered:1 rcv_space:14480 rcv_ssthresh:64088 minrtt:0.056

After typing the request content and stroking Enter:

(node) $ ss -i | grep -A 1 57424
tcp      ESTAB      0       0         NODE_IP:57424    POD_IP:webcache
         cubic wscale:7,7 rto:200 rtt:0.056/0.028 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 2068.6Mbps lastsnd:21994 lastrcv:21994 lastack:21994 pacing_rate 4137.1Mbps delivered:1 rcv_space:14480 rcv_ssthresh:64088 minrtt:0.056

That segments sent/received stats remain unchanged (segs_out=2,segs_in=1), suggesting that the problem may reside on tcp {send,receive} message level.

4.3 Trace related call stack

Based on the above hypothesis, we captured kernel call stacks to compare failed and successful requests.

4.3.1 `trace-cmd`: trace kernel call stacks

Trace 10 seconds, filter by server process ID, save the calling stack graph,

# filter by process ID (PID of the server in the pod)
$ trace-cmd record -P 178501 -p function_graph sleep 10

Caution: avoid tracing in production to prevent large file generation and excessive disk IO.

During this period, send a request,

(node) $ curl POD_IP PORT

By default, it will save data to a local file in the current directory, the content looks like this:

$ trace-cmd report > report-1.graph
CPU  1 is empty
CPU  2 is empty
...
CPU 63 is empty
cpus=64
   <idle>-0     [022] 5376816.422992: funcgraph_entry:    2.441 us   |  update_acpu.constprop.0();
   <idle>-0     [022] 5376816.422994: funcgraph_entry:               |  switch_mm_irqs_off() {
   <idle>-0     [022] 5376816.422994: funcgraph_entry:    0.195 us   |    choose_new_asid();
   <idle>-0     [022] 5376816.422994: funcgraph_entry:    0.257 us   |    load_new_mm_cr3();
   <idle>-0     [022] 5376816.422995: funcgraph_entry:    0.128 us   |    switch_ldt();
   <idle>-0     [022] 5376816.422995: funcgraph_exit:     1.378 us   |  }
...

Use | as delimiter (this preserves the calling stack and the proper leading whitespaces) and save the last fields into a dedicated file:

$ awk -F'|' '{print $NF}' report-1.graph > stack-1.txt

Compare them with diff or vimdiff:

$ vimdiff stack-1.txt stack-2.txt

Here are two traces, the left is a trace of a normal request, and the right is a problematic one:

Fig. Traces (call stacks) of a normal request (left side) and a problematic request (right side).

As can be seen, for a failed request, kernel made a wrong function call: it should call tcp_bpf_recvmsg() but actually called tcp_recvmsg().

4.3.2 Locate the code: `inet_recvmsg -> {tcp_bpf_recvmsg, tcp_recvmsg}`

Calling into tcp_bpf_recvmsg or tcp_recvmsg from inet_recvmsg is a piece of concise code, illustrated below,

// https://github.com/torvalds/linux/blob/v5.10/net/ipv4/af_inet.c#L838
int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, int flags) {
    struct sock *sk = sock->sk;
    int addr_len = 0;
    int err;

    if (likely(!(flags & MSG_ERRQUEUE)))
        sock_rps_record_flow(sk);

    err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
                  sk, msg, size, flags & MSG_DONTWAIT,
                  flags & ~MSG_DONTWAIT, &addr_len);
    if (err >= 0)
        msg->msg_namelen = addr_len;
    return err;
}

sk_prot ("socket protocol") contains handlers to this socket. INDIRECT_CALL_2 line can be simplified into the following pseudocode:

if sk->sk_prot->recvmsg == tcp_recvmsg: // if socket protocol handler is tcp_recvmsg
    tcp_recvmsg()
else:
    tcp_bpf_recvmsg()

This suggests that when requests fail, the sk_prot->recvmsg pointer of the socket is likely incorrect.

4.3.3 Double check with bpftrace

While trace-cmd is a powerful tool, it may contain too much details distracting us, and may run out of your disk space if set improper filter parameters.

bpftrace is a another tracing tool, and it won’t write data to local file by default. Now let’s double confirm the above results with it.

Again, run several times of curl POD_IP:PORT, capture only tcp_recvmsg and tcp_bpf_recvmsg calls, print kernel calling stack:

$ bpftrace -e 'k:tcp_recvmsg /pid==178501/ { printf("%s\n", kstack);} k:tcp_bpf_recvmsg /pid==178501/ { printf("%s\n", kstack);} '
        tcp_bpf_recvmsg+1                   # <-- correspond to a successful request
        inet_recvmsg+233
        __sys_recvfrom+362
        __x64_sys_recvfrom+37
        do_syscall_64+48
        entry_SYSCALL_64_after_hwframe+97

        tcp_bpf_recvmsg+1                   # <-- correspond to a successful request
        inet_recvmsg+233
        __sys_recvfrom+362
        __x64_sys_recvfrom+37
        do_syscall_64+48
        entry_SYSCALL_64_after_hwframe+97

        tcp_recvmsg+1                       # <-- correspond to a failed request
        inet_recvmsg+78
        __sys_recvfrom+362
        __x64_sys_recvfrom+37
        do_syscall_64+48
        entry_SYSCALL_64_after_hwframe+97

You could also filter by client program name (comm field in kernel data structure), such as,
$ bpftrace -e 'k:tcp_bpf_recvmsg /comm=="curl"/ { printf("%s", kstack); }'

As seen above, successful requests were directed to tcp_bpf_recvmsg, while failed ones were routed to tcp_recvmsg.

4.3.4 Summary

tcp_recvmsg waits messages from kernel networking stack, In the case of sockops BPF, messages bypass kernel stack, which explains why some requests fail (timeout), yet TCP connecting always OK.

We reported the above findings to the cloud-kernel team, and they did some further investigations with us.

4.4 `recvmsg` handler initialization in kernel stack

For short,

Fig. sockops BPF: connection establishement and socket handler initialization.

According to the above picture, recvmsg handler will be incorrectly initialized if to-be-inserted entry already exists sockmap (the end of step 3.1).

What’s the two entries of a connection looks like in BPF map:

(cilium-agent) $ bpftool map dump id 122 | grep "0a 0a 86 30" -C 2 | grep "0a 0a 65 f9" -C 2 | grep -C 2 "db 78"
0a 0a 86 30 00 00 00 00  00 00 00 00 00 00 00 00
0a 0a 65 f9 00 00 00 00  00 00 00 00 00 00 00 00
01 00 00 00 1f 90 00 00  db 78 00 00
--
key:
--
0a 0a 65 f9 00 00 00 00  00 00 00 00 00 00 00 00
0a 0a 86 30 00 00 00 00  00 00 00 00 00 00 00 00
01 00 00 00 db 78 00 00  1f 90 00 00

We’ll explain these binary data later. Now let’s first confirm our above assumption.

4.5 Confirm stale entries in sockmap

4.5.1 `bpftrace` `tcp_bpf_get_prot()`: incorrect socket handler (sk_prot)

Two sequent function calls that holding sk_port:

tcp_bpf_get_prot(): where sk_prot is initialized;
tcp_bpf_recvmsg() or tcp_recvmsg(): where sk_prot is called to receive a message;

Trace these two methods and print the sk_prot variable (pointer).

Successful case:

tcp_bpf_get_proto: src POD_IP (8080), dst NODE_IP(59500), 2232440
tcp_bpf_get_proto: 0xffffffffacc65800                                     # <-- sk_prot pointer
tcp_bpf_recvmsg: src POD_IP (8080), dst NODE_IP(59500) 0xffffffffacc65800 # <-- same pointer

Bad case:

(node) $ ./tcp_bpf_get_proto.bt 178501
Attaching 6 probes...
tcp_bpf_get_proto: src POD_IP (8080), dst NODE_IP(53904), 2231203
tcp_bpf_get_proto: 0xffffffffacc65800                                    # <-- sk_prot pointer
tcp_recvmsg: src POD_IP (8080), dst NODE_IP(53904) 0xffffffffac257300    # <-- sk_prot is modified!!!

4.5.2 `bpftrace` `sk_psock_drop`

A succesful case, calling into sk_psock_drop when requests finish and connection was normally closed:

(node) $ ./sk_psock_drop.bt 178501
tcp_bpf_get_proto: src POD_IP (8080), dst NODE_IP(59500), 2232440
tcp_bpf_get_proto: 0xffffffffacc65800                                    # <-- sk_prot pointer
sk_psock_drop: src POD_IP (8080)， dst NODE_IP(44566)
    sk_psock_drop+1
    sock_map_remove_links+161
    sock_map_close+50
    inet_release+63
    sock_release+58
    sock_close+17
    fput+147
    task_work_run+89
    exit_to_user_mode_loop+285
    exit_to_user_mode_prepare+110
    syscall_exit_to_user_mode+18
    entry_SYSCALL_64_after_hwframe+97
tcp_bpf_recvmsg: src POD_IP (8080), dst NODE_IP(59500) 0xffffffffacc65800 # <-- same pointer

A failed case, calling into sk_psock_drop when the server side calls sock_map_update() and the to-be-inserted entry already exists:

(node) $ ./sk_psock_drop.bt 178501
tcp_bpf_get_proto: src POD_IP (8080), dst NODE_IP(53904), 2231203
tcp_bpf_get_proto: 0xffffffffacc65800                                    # <-- sk_prot pointer
sk_psock_drop: src POD_IP (8080)， dst NODE_IP(44566)
    sk_psock_drop+1
    sock_hash_update_common+789
    bpf_sock_hash_update+98
    bpf_prog_7aa9a870410635af_bpf_sockmap+831
    _cgroup_bpf_run_filter_sock_ops+189
    tcp_init_transfer+333                       // -> bpf_skops_established -> BPF_CGROUP_RUN_PROG_SOCK_OPS -> cilium sock_ops code
    tcp_rcv_state_process+1430
    tcp_child_process+148
    tcp_v4_rcv+2491
    ...
tcp_recvmsg: src POD_IP (8080), dst NODE_IP(53904) 0xffffffffac257300    # <-- sk_prot is modified!!!

// https://github.com/torvalds/linux/blob/v6.5/net/core/sock_map.c#L464

static int sock_map_update_common(struct bpf_map *map, u32 idx, struct sock *sk, u64 flags) {
    struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
    ...

    link = sk_psock_init_link();
    sock_map_link(map, sk);
    psock = sk_psock(sk);

    osk = stab->sks[idx];
    if (osk && flags == BPF_NOEXIST) {     // sockmap entries already exists
        ret = -EEXIST;
        goto out_unlock;                   // goto out_unlock, which will release psock
    } else if (!osk && flags == BPF_EXIST) {
        ret = -ENOENT;
        goto out_unlock;
    }

    sock_map_add_link(psock, link, map, &stab->sks[idx]);
    stab->sks[idx] = sk;
    if (osk)
        sock_map_unref(osk, &stab->sks[idx]);
    return 0;                              // <-- should return from here
out_unlock:                                // <-- actually hit here
    if (psock)
        sk_psock_put(sk, psock);           // --> further call sk_psock_drop
out_free:
    sk_psock_free_link(link);
    return ret;
}

This early release of psock leads to the sk->sk_prot->recvmsg to be initialized as tcp_recvmsg.

4.5.3 bpftool: confirm stale connection info in sockops map

Key and value format in the BPF map:

// https://github.com/cilium/cilium/blob/v1.11.10/pkg/maps/sockmap/sockmap.go#L20

// SockmapKey is the 5-tuple used to lookup a socket
// +k8s:deepcopy-gen=true
// +k8s:deepcopy-gen:interfaces=github.com/cilium/cilium/pkg/bpf.MapKey
type SockmapKey struct {
    DIP    types.IPv6 `align:"$union0"`
    SIP    types.IPv6 `align:"$union1"`
    Family uint8      `align:"family"`
    Pad7   uint8      `align:"pad7"`
    Pad8   uint16     `align:"pad8"`
    SPort  uint32     `align:"sport"`
    DPort  uint32     `align:"dport"`
}

// SockmapValue is the fd of a socket
// +k8s:deepcopy-gen=true
// +k8s:deepcopy-gen:interfaces=github.com/cilium/cilium/pkg/bpf.MapValue
type SockmapValue struct {
    fd uint32
}

Trip.com: Large Scale Cloud Native Networking & Security with Cilium/eBPF, 2022 shows how to decode the encoded entries of Cilium BPF map.

$ cat ip2hex.sh
echo $1 | awk -F. '{printf("%02x %02x %02x %02x\n",$1,$2,$3,$4);}'
$ cat hex2port.sh
echo $1 | awk '{printf("0x%s%s 0x%s%s\n", $1, $2, $5, $6) }' | sed 's/ /\n/g' | xargs -n1 printf '%d\n'

(node) $ ./ip2hex.sh "10.10.134.48"
0a 0a 86 30
(node) $ ./ip2hex.sh "10.10.101.249"
0a 0a 65 f9

(cilium-agent) $ bpftool map dump id 122 | grep "0a 0a 86 30" -C 2 | grep "0a 0a 65 f9" -C 2 | grep -C 2 "db 78"
0a 0a 86 30 00 00 00 00  00 00 00 00 00 00 00 00
0a 0a 65 f9 00 00 00 00  00 00 00 00 00 00 00 00
01 00 00 00 1f 90 00 00  db 78 00 00
--
key:
--
0a 0a 65 f9 00 00 00 00  00 00 00 00 00 00 00 00
0a 0a 86 30 00 00 00 00  00 00 00 00 00 00 00 00
01 00 00 00 db 78 00 00  1f 90 00 00

(node) $ ./hex2port.sh "1f 90 00 00  b6 8a 00 00"
8080
46730 # you can verify this connection in `ss` output

Almost all of the following entries are stale (because this is an empty, no node-to-pod traffic unless we do manually):

(cilium-agent) $ bpftool map dump /sys/fs/bpf/cilium_sock_ops | grep "0a 0a 86 30" | wc -l
7325
(cilium-agent) $ bpftool map dump /sys/fs/bpf/cilium_sock_ops | grep "0a 0a 8c ca" | wc -l
1288
(cilium-agent) $ bpftool map dump /sys/fs/bpf/cilium_sock_ops | grep "0a 0a 8e 40" | wc -l
191

5 Technical summary

5.1 Normal sockops/sockmap BPF workflow

Fig. sockops BPF: connection establishement and socket handler initialization.

Node client (e.g. kubelet) -> server: initiate TCP connection to the server
Kernel (and the BPF code in kernel): on listening on connection established
1. write two entries to sockmap
2. link entries to bpf handlers (tcp_bpf_{sendmsg, recvmsg})
Node client (e.g. kubelet) -> server: send & receive payload: BPF handlers were executed
Node client (e.g. kubelet) -> server: close connection: kernel removes entries from sockmap

5.2 Direct cause

The problem arises in step 4, for an unknown reason, some entries are not deleted when connections closed. This leads to incorrect handler initialization in new connections in step 2 (or section 3.1 in the picture). When hit a stale entry,

sender side uses BPF message handlers for transmission;
server side treats the the socket as standard, and waits for message via default message handler, then stucks there as no payload goes to default handler.

5.3 Root cause

The Alibaba cloud-kernel team digged further into the issue, and thanks for their efforts, they finally found that bpf, sockmap: Remove unhash handler for BPF sockmap usage was the root cause, which was introduced in Linux 5.10.58. The AliOS kernel we were using was 5.10.134 based, so it suffered from this.

Upstream patch bpf, sockmap: Fix sk->sk_forward_alloc warn_on in sk_stream_kill_queues has already fixed it, but it was only backported to 6.x series.

5.4 Quick restoration/remediation

If the issue already happened, you can use one of the following methods to restore:

Kernel restart: drain the node then restart it, thish will refresh the kernel state;
Manual clean with bpftool: with caution, avoid to remove valid entries.

5.5 Another issue with similar phenomenon

There is another issue with the similar phenomenon when sockops is enabled:

Local pod runs nginx (of recent versions, e.g. >= 1.18);
Sending http requests from node to the local pod, with a large enough cookie length (e.g. > 1024 Byte);

TCP connection will be OK, but requests will always stuck there.

Cilium issue:

ioctl FIONREAD returning incorrect value when sockops is enabled

nginx is reading the headers from the traefik request with a default value of 1024 (client_header_buffer_size 1k;) bytes and then (seemingly) asks via the ioctl how much data is left. Since the return is 0 the request is never fully read and does not proceed further.

Community solution:

deprecate –sockops-enable in v1.13, and remove the feature in v1.14

Appendix

bpftrace scripts used in this post

References

AliOS kernel (a Linux fork), gitee.com/anolis/cloud-kernel
Cilium Network Topology and Traffic Path on AWS (2019)
cilium v1.11.10, bpf_sockops.c
cilium v1.11.10, bpf sockops key & value definition
Differentiate three types of eBPF redirections
Trip.com: Large Scale Cloud Native Networking & Security with Cilium/eBPF, 2022

eBPF 介绍

酷壳 – CoolShell

陈皓

2022年12月10日 10:38

很早前就想写一篇关于eBPF的文章，但是迟迟没有动手，这两天有点时间，所以就来写一篇，这文章主要还是简单的介绍eBPF 是用来干什么的，并通过几个示例来介绍是怎么玩的，这个技术非常非常之强，Linux 操作系统的观测性实在是太强大了，并在 BCC 加持下变得一览无余。这个技术不是一般的运维人员或是系统管理员可以驾驭的，这个还是要有底层系统知识并有一定开发能力的技术人员才能驾驭的了的。我在这篇文章的最后给了个彩蛋。

介绍

eBPF（extened Berkeley Packet Filter）是一种内核技术，它允许开发人员在不修改内核代码的情况下运行特定的功能。eBPF 的概念源自于 Berkeley Packet Filter（BPF），后者是由贝尔实验室开发的一种网络过滤器，可以捕获和过滤网络数据包。

出于对更好的 Linux 跟踪工具的需求，eBPF 从 dtrace中汲取灵感，dtrace 是一种主要用于 Solaris 和 BSD 操作系统的动态跟踪工具。与 dtrace 不同，Linux 无法全面了解正在运行的系统，因为它仅限于系统调用、库调用和函数的特定框架。在Berkeley Packet Filter (BPF)（一种使用内核 VM 编写打包过滤代码的工具）的基础上，一小群工程师开始扩展 BPF 后端以提供与 dtrace 类似的功能集。 eBPF 诞生了。2014 年随 Linux 3.18 首次限量发布，充分利用 eBPF 至少需要 Linux 4.4 以上版本。

eBPF 比起传统的 BPF 来说，传统的 BPF 只能用于网络过滤，而 eBPF 则可以用于更多的应用场景，包括网络监控、安全过滤和性能分析等。另外，eBPF 允许常规用户空间应用程序将要在 Linux 内核中执行的逻辑打包为字节码，当某些事件（称为挂钩）发生时，内核会调用 eBPF 程序。此类挂钩的示例包括系统调用、网络事件等。用于编写和调试 eBPF 程序的最流行的工具链称为 BPF 编译器集合 (BCC)，它基于 LLVM 和 CLang。

eBPF 有一些类似的工具。例如，SystemTap 是一种开源工具，可以帮助用户收集 Linux 内核的运行时数据。它通过动态加载内核模块来实现这一功能，类似于 eBPF。另外，DTrace 是一种动态跟踪和分析工具，可以用于收集系统的运行时数据，类似于 eBPF 和 SystemTap。[1]

以下是一个简单的比较表格，可以帮助您更好地了解 eBPF、SystemTap 和 DTrace 这三种工具的不同之处：[1]

工具	eBPF	SystemTap	DTrace
定位	内核技术，可用于多种应用场景	内核模块	动态跟踪和分析工具
工作原理	动态加载和执行无损编译过的代码	动态加载内核模块	动态插接分析器，通过 probe 获取数据并进行分析
常见用途	网络监控、安全过滤、性能分析等	系统性能分析、故障诊断等	系统性能分析、故障诊断等
优点	灵活、安全、可用于多种应用场景	功能强大、可视化界面	功能强大、高性能、支持多种编程语言
缺点	学习曲线高，安全性依赖于编译器的正确性	学习曲线高，安全性依赖于内核模块的正确性	配置复杂，对系统性能影响较大

对比表格[1]

从上表可以看出，eBPF、SystemTap 和 DTrace 都是非常强大的工具，可以用于收集和分析系统的运行情况。[1]

用途

eBPF 是一种非常灵活和强大的内核技术，可以用于多种应用场景。下面是 eBPF 的一些常见用途：[1]

网络监控：eBPF 可以用于捕获网络数据包，并执行特定的逻辑来分析网络流量。例如，可以使用 eBPF 程序来监控网络流量，并在发现异常流量时进行警报。[1]
安全过滤：eBPF 可以用于对网络数据包进行安全过滤。例如，可以使用 eBPF 程序来阻止恶意流量的传播，或者在发现恶意流量时对其进行拦截。[1]
性能分析：eBPF 可以用于对内核的性能进行分析。例如，可以使用 eBPF 程序来收集内核的性能指标，并通过特定的接口将其可视化。这样，可以更好地了解内核的性能瓶颈，并进行优化。[1]
虚拟化：eBPF 可以用于虚拟化技术。例如，可以使用 eBPF 程序来收集虚拟机的性能指标，并进行负载均衡。这样，可以更好地利用虚拟化环境的资源，提高系统的性能和稳定性。[1]

总之，eBPF 的常见用途非常广泛，可以用于网络监控、安全过滤、性能分析和虚拟化等多种应用场景。[1]

工作原理

eBPF 的工作原理主要分为三个步骤：加载、编译和执行。

eBPF 需要在内核中运行。这通常是由用户态的应用程序完成的，它会通过系统调用来加载 eBPF 程序。在加载过程中，内核会将 eBPF 程序的代码复制到内核空间。

eBPF 程序需要经过编译和执行。这通常是由Clang/LLVM的编译器完成，然后形成字节码后，将用户态的字节码装载进内核，Verifier会对要注入内核的程序进行一些内核安全机制的检查,这是为了确保 eBPF 程序不会破坏内核的稳定性和安全性。在检查过程中，内核会对 eBPF 程序的代码进行分析，以确保它不会进行恶意操作，如系统调用、内存访问等。如果 eBPF 程序通过了内核安全机制的检查，它就可以在内核中正常运行了，其会通过通过一个JIT编译步骤将程序的通用字节码转换为机器特定指令集，以优化程序的执行速度。

下图是其架构图。

（图片来自：https://www.infoq.com/articles/gentle-linux-ebpf-introduction/）

在内核中运行时，eBPF 程序通常会挂载到一个内核钩子（hook）上，以便在特定的事件发生时被执行。例如，

系统调用——当用户空间函数将执行转移到内核时插入
函数进入和退出——拦截对预先存在的函数的调用
网络事件 – 在收到数据包时执行
Kprobes 和 uprobes – 附加到内核或用户函数的探测器

最后 eBPF Maps，允许eBPF程序在调用之间保持状态，以便进行相关的数据统计，并与用户空间的应用程序共享数据。一个eBPF映射基本上是一个键值存储，其中的值通常被视为任意数据的二进制块。它们是通过带有BPF_MAP_CREATE参数的bpf_cmd系统调用来创建的，和Linux世界中的其他东西一样，它们是通过文件描述符来寻址。与地图的交互是通过查找/更新/删除系统调用进行的

总之，eBPF 的工作原理是通过动态加载、执行和检查无损编译过的代码来实现的。[1]

示例

eBPF 可以用于对内核的性能进行分析。下面是一个基于 eBPF 的性能分析的 step-by-step 示例：

第一步：准备工作：首先，需要确保内核已经支持 eBPF 功能。这通常需要在内核配置文件中启用 eBPF 相关的选项，并重新编译内核。检查是否支持 eBPF，你可以用这两个命令查看 ls /sys/fs/bpf 和 lsmod | grep bpf

第二步：写 eBPF 程序：接下来，需要编写 eBPF 程序，用于收集内核的性能指标。eBPF 程序的语言可以选择 C 或者 Python，它需要通过特定的接口访问内核的数据结构，并将收集到的数据保存到指定的位置。

下面是一个Python 示例（其实还是C语言，用python来加载一段C程序到Linux内核）

#!/usr/bin/python3

from bcc import BPF
from time import sleep

# 定义 eBPF 程序
bpf_text = """
#include <uapi/linux/ptrace.h>

BPF_HASH(stats, u32);

int count(struct pt_regs *ctx) {
    u32 key = 0;
    u64 *val, zero=0;
    val = stats.lookup_or_init(&key, &zero);
    (*val)++;
    return 0;
}
"""

# 编译 eBPF 程序
b = BPF(text=bpf_text, cflags=["-Wno-macro-redefined"])

# 加载 eBPF 程序
b.attach_kprobe(event="tcp_sendmsg", fn_name="count")

name = {
  0: "tcp_sendmsg"
}
# 输出统计结果
while True:
    try:
        #print("Total packets: %d" % b["stats"][0].value)
        for k, v in b["stats"].items():
           print("{}: {}".format(name[k.value], v.value))
        sleep(1)
    except KeyboardInterrupt:
        exit()

这个 eBPF 程序的功能是统计网络中传输的数据包数量。它通过定义一个 BPF_HASH 数据结构来保存统计结果（eBPF Maps），并通过捕获 tcp_sendmsg 事件来实现实时统计。最后，它通过每秒输出一次统计结果来展示数据。这个 eBPF 程序只是一个简单的示例，实际应用中可能需要进行更复杂的统计和分析。

第三步：运行 eBPF 程序：接下来，需要使用 eBPF 编译器将 eBPF 程序编译成内核可执行的格式（这个在上面的Python程序里你可以看到——Python引入了一个bcc的包，然后用这个包，把那段 C语言的程序编译成字节码加载在内核中并把某个函数 attach 到某个事件上）。这个过程可以使用 BPF Compiler Collection（BCC）工具来完成。BCC 工具可以通过命令行的方式将 eBPF 程序编译成内核可执行的格式，并将其加载到内核中。

下面是运行上面的 Python3 程序的步骤：

sudo apt install python3-bpfcc

注：在Python3下请不要使用 pip3 install bcc （参看：这里）

如果你是 Ubuntu 20.10 以上的版本，最好通过源码安装（否则程序会有编译问题），参看：这里：

apt purge bpfcc-tools libbpfcc python3-bpfcc
wget https://github.com/iovisor/bcc/releases/download/v0.25.0/bcc-src-with-submodule.tar.gz
tar xf bcc-src-with-submodule.tar.gz
cd bcc/
apt install -y python-is-python3
apt install -y bison build-essential cmake flex git libedit-dev   libllvm11 llvm-11-dev libclang-11-dev zlib1g-dev libelf-dev libfl-dev python3-distutils
apt install -y checkinstall
mkdir build
cd build/
cmake -DCMAKE_INSTALL_PREFIX=/usr -DPYTHON_CMD=python3 ..
make
checkinstall

接下来，需要将上面的 Python 程序保存到本地，例如保存到文件 netstat.py。运行程序：最后，可以通过执行以下命令来运行 Python 程序：

$ chmod +x ./netstat.py
$ sudo ./netstat.py
tcp_sendmsg: 29
tcp_sendmsg: 216
tcp_sendmsg: 277
tcp_sendmsg: 379
tcp_sendmsg: 419
tcp_sendmsg: 468
tcp_sendmsg: 574
tcp_sendmsg: 645
tcp_sendmsg: 29

程序开始运行后，会在控制台输出网络数据包的统计信息。可以通过按 Ctrl+C 组合键来结束程序的运行。

下面我们再看一个比较复杂的示例，这个示例会计算TCP的发包时间（示例参考于Github上这个issue里的程序）：

#!/usr/bin/python3

from bcc import BPF
import time

# 定义 eBPF 程序
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <net/sock.h>
#include <net/inet_sock.h>
#include <bcc/proto.h>

struct packet_t {
    u64 ts, size;
    u32 pid;
    u32 saddr, daddr;
    u16 sport, dport;
};

BPF_HASH(packets, u64, struct packet_t);

int on_send(struct pt_regs *ctx, struct sock *sk, struct msghdr *msg, size_t size)
{
    u64 id = bpf_get_current_pid_tgid();
    u32 pid = id;

    // 记录数据包的时间戳和信息
    struct packet_t pkt = {}; // 结构体一定要初始化，可以使用下面的方法
                              //__builtin_memset(&pkt, 0, sizeof(pkt)); 
    pkt.ts = bpf_ktime_get_ns();
    pkt.size = size;
    pkt.pid = pid;
    pkt.saddr = sk->__sk_common.skc_rcv_saddr;
    pkt.daddr = sk->__sk_common.skc_daddr;
    struct inet_sock *sockp = (struct inet_sock *)sk;
    pkt.sport = sockp->inet_sport;
    pkt.dport = sk->__sk_common.skc_dport;

    packets.update(&id, &pkt);
    return 0;
}

int on_recv(struct pt_regs *ctx, struct sock *sk)
{
    u64 id = bpf_get_current_pid_tgid();
    u32 pid = id;

    // 获取数据包的时间戳和编号
    struct packet_t *pkt = packets.lookup(&id);
    if (!pkt) {
        return 0;
    }

    // 计算传输时间
    u64 delta = bpf_ktime_get_ns() - pkt->ts;

    // 统计结果
    bpf_trace_printk("tcp_time: %llu.%llums, size: %llu\\n", 
       delta/1000, delta%1000%100, pkt->size);

    // 删除统计结果
    packets.delete(&id);

    return 0;
}
"""

# 编译 eBPF 程序
b = BPF(text=bpf_text, cflags=["-Wno-macro-redefined"])

# 注册 eBPF 程序
b.attach_kprobe(event="tcp_sendmsg", fn_name="on_send")
b.attach_kprobe(event="tcp_v4_do_rcv", fn_name="on_recv")

# 输出统计信息
print("Tracing TCP latency... Hit Ctrl-C to end.")
while True:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
        print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))
    except KeyboardInterrupt:
        exit()

上面这个程序通过捕获每个数据包的时间戳来统计传输时间。在捕获 tcp_sendmsg 事件时，记录数据包的发送时间；在捕获 tcp_v4_do_rcv 事件时，记录数据包的接收时间；最后，通过比较两个时间戳来计算传输时间。

从上面的两个程序我们可以看到，eBPF 的一个编程的基本方法，这样的在Python里向内核的某些事件挂载一段 “C语言” 的方式就是 eBPF 的编程方式。实话实说，这样的代码很不好写，而且有很多非常诡异的东西，一般人是很难驾驭的（上面的代码我也不是很容易都能写通的，把 Google 都用了个底儿掉，读了很多晦涩的文档……）好在这样的代码已经有人写了，我们不必再写了，在 Github 上的 bcc 库下的 tools 目录有很多……

BCC（BPF Compiler Collection）是一套开源的工具集，可以在 Linux 系统中使用 BPF（Berkeley Packet Filter）程序进行系统级性能分析和监测。BCC 包含了许多实用工具，如：

bcc-tools：一个包含许多常用的 BCC 工具的软件包。
bpftrace：一个高级语言，用于编写和执行 BPF 程序。
tcptop：一个实时监控和分析 TCP 流量的工具。
execsnoop：一个用于监控进程执行情况的工具。
filetop：一个实时监控和分析文件系统流量的工具。
trace：一个用于跟踪和分析函数调用的工具。
funccount：一个用于统计函数调用次数的工具。
opensnoop：一个用于监控文件打开操作的工具。
pidstat：一个用于监控进程性能的工具。
profile：一个用于分析系统 CPU 使用情况的工具。

下面这张图你可能见过多次了，你可以看看他可以干多少事，内核里发生什么事一览无余。

彩蛋

最后来到彩蛋环节。因为最近 ChatGPT 很火，于是，我想通过 ChatGPT 来帮助我书写这篇文章，一开始我让ChatGPT 帮我列提纲，并根据提纲生成文章内容，并查找相关的资料，非常之顺利，包括生成的代码，我以为我们以很快地完成这篇文章。

但是，到了代码生成的时候，我发现，ChatGPT 生成的代码的思路和方法都是对的，但是是比较老的，而且是跑不起来的，出现了好些低级错误，如：使用了未声明的变量，没有引用完整的C语言的头文件，没有正确地初始化变量，错误地获取数据，类型没有匹配……等等，在程序调试上，挖了很多的坑，C语言本来就不好搞，挖的很多运行时的坑很难察觉，所以，耗费了我大量的时间来排除各种各样的问题，其中有环境上的问题，还有代码上的问题，这些问题即便是通过 Google 也不容易找到解决方案，我找到的解决方案都放在文章中了，尤其是第二个示例，让我调试了3个多小时，读了很多 bcc 上的issue和相关的晦涩的手册和文档，才让程序跑通。

到了文章收关的阶段，我让ChatGPT 给我几个延伸阅读，也是很好的，但是没有给出链接，于是我只得人肉 Google 了一下，然后让我吃惊的是，好多ChatGPT给出来的文章是根本不存在的，完全是它伪造的。我连让它干了两次都是这样，这个让我惊掉大牙。这让我开始怀疑它之前生成的内容，于是，我不得我返回仔细Review我的文章，尤其是“介绍”、“用途”和“工作原理”这三个章节，基本都是ChatGPT生成的，在Review完后，我发现了ChatGPT 给我生造了一个叫 “无损编译器”的术语，这个术语简直了，于是我开始重写我的文章。我把一些段落重写了，有一些没有，保留下来的我都标记上了 [1]，大家读的时候要小心阅读。

最后，我的结论是，ChatGPT只是一个不成熟的玩具，只能回答一些没有价值的日常聊天的问题，要说能取代Google，我觉得不可能，因为Google会基于基本的事实，而ChatGPT会基于内容生成的算法，在造假方面称得上是高手，可以列为电信诈骗的范畴了，我以后不会再使用ChatGPT生成文章内容或是作我的帮手了。StackOverflow把其ban了真是不能太赞了！

附件一：ChatGPT的造假载图和样本

ChatGPT 生成的样本一

Brendan Gregg 的《BPF Performance Tools: Linux System and Application Observability》一书是一个全面的指南，涵盖了 eBPF 的基础知识和实践应用。
Alexei Starovoitov 在 2015 年的 LPC 大会上发表的《eBPF and XDP: fast programmable datapath in the Linux kernel》一文，是一个介绍 eBPF 和 XDP（eXpress Data Path）的综述。
谷歌的《BPF: The Future of Linux Monitoring and Observability》一文介绍了 eBPF 的基本概念，并探讨了它对于 Linux 监控和可观察性的影响。
微软的《A Survey of Techniques for Hardware Acceleration of BPF》一文，介绍了 eBPF 在硬件加速方面的研究现状。
Daniel Borkmann 的《Linux kernel networking: implementation and theory》一书涵盖了 eBPF 的实现和理论。
Martin KaFai Lau 在 2015 年的 SOSP 大会上发表的《eBPF: a new execution environment for the Linux kernel》一文，介绍了 eBPF 在 Linux 内核中的新执行环境。
来自两名谷歌工程师的《eBPF: A New Frontier in Performance Analysis》一文，探讨了 eBPF 对性能分析的影响。
《BPF Performance Tools: Linux System and Application Observability》一书中的第三章，讨论了 eBPF 的编译器和运行时系统。
来自三名谷歌工程师的《eBPF: The New Solaris DTrace》一文，比较了 eBPF 和 Solaris DTrace 系统的区别。

ChatGPT 生成的样本二

eBPF 程序开发指南（https://cilium.io/doc/ebpf/ebpf-guide/）
eBPF 官方文档（https://www.kernel.org/doc/Documentation/networking/filter.txt）
Linux 内核 eBPF 介绍（https://www.ibm.com/developerworks/cn/linux/kernel/ebpf/index.html）
eBPF 性能优化技巧（https://www.infoq.com/cn/news/2018/08/ebpf-performance-tips/）
eBPF 实战经验分享（https://blog.cloudflare.com/ebpf-in-practice/）
eBPF 在网络安全中的应用（https://www.zdnet.com/article/ebpf-how-to-use-it-for-network-security/）
eBPF 实现 Linux 系统性能监控（https://www.percona.com/blog/2017/08/15/how-to-use-ebpf-to-monitor-linux-system-performance/）
eBPF 入门教程（https://sysdig.com/blog/ebpf-getting-started/）
eBPF 与 BPF 比较（https://lwn.net/Articles/724647/）
eBPF 提高课程（https://www.pluralsight.com/courses/ebpf-advanced）

附件二：发明的术语：无损编译器

（全文完）

（转载本站文章请注明作者和出处酷壳 – CoolShell ，请勿用于任何商业用途）

The post eBPF 介绍 first appeared on 酷壳 - CoolShell.