The Cilium Story: Remarkable Network Stability Improvement from a Minor Code Change
Introduction
A short while ago, I observed a Pull Request for the Cilium project from a former colleague.
bpf:nat: Restore ORG NAT entry if it's not found
The modification itself (excluding test code) is minimal, involving the addition of a single if statement block. However, the impact of this change is immense, and I personally find it fascinating how a simple idea can contribute significantly to system stability. Therefore, I intend to explain this case in a manner that is easily comprehensible even to individuals without specialized knowledge in the field of networking.
Background Knowledge
If there is one essential item for modern individuals as crucial as a smartphone, it would likely be the Wi-Fi router. A Wi-Fi router communicates with devices via the Wi-Fi communication standard and facilitates the sharing of its public IP address among multiple devices. A technical peculiarity arises here: how does it perform this "sharing"?
The technology employed for this purpose is Network Address Translation (NAT). NAT exploits the fact that TCP and UDP communication is identified by a combination of IP address and port, and it enables internal communication, which uses a private IP:Port, to reach the outside world by mapping it to a currently unused public IP:Port.
NAT
When an internal device attempts to access the external internet, a NAT device transforms the combination of that device's private IP address and port number into its own public IP address and an arbitrary, unused port number. This translation information is recorded in a NAT table within the NAT device.
For instance, consider a scenario where a smartphone inside a home (private IP: a.a.a.a, port: 50000) attempts to connect to a web server (public IP: c.c.c.c, port: 80).

Smartphone (a.a.a.a:50000) ==> Router (b.b.b.b) ==> Web Server (c.c.c.c:80)
Upon receiving the smartphone's request, the router will observe the following TCP packet:
# TCP packet received by the router, Smartphone => Router
| src ip  | src port | dst ip  | dst port |
-------------------------------------------
| a.a.a.a | 50000    | c.c.c.c | 80       |
If this packet were sent directly to the web server (c.c.c.c), the response would never return to the smartphone (a.a.a.a), which has only a private IP address. Therefore, the router first identifies an arbitrary port not currently involved in communication (e.g., 60000) and records it in its internal NAT table.
# Router's internal NAT table
| local ip | local port | global ip | global port |
-----------------------------------------------------
| a.a.a.a  | 50000      | b.b.b.b   | 60000       |
After recording the new entry in the NAT table, the router rewrites the source IP address and port number of the TCP packet received from the smartphone to its own public IP address (b.b.b.b) and the newly allocated port number (60000), then transmits it to the web server.
# TCP packet sent by the router, Router => Web Server
# SNAT performed
| src ip  | src port | dst ip  | dst port |
-------------------------------------------
| b.b.b.b | 60000    | c.c.c.c | 80       |
Now, the web server (c.c.c.c) sees this as a request originating from port 60000 of the router (b.b.b.b) and sends a response packet back to the router as follows:
# TCP packet received by the router, Web Server => Router
| src ip  | src port | dst ip  | dst port |
-------------------------------------------
| c.c.c.c | 80       | b.b.b.b | 60000    |
Upon receiving this response packet, the router looks up the original private IP address (a.a.a.a) and port number (50000) corresponding to the destination IP address (b.b.b.b) and port number (60000) in its NAT table, then rewrites the packet's destination to the smartphone.
# TCP packet sent by the router, Router => Smartphone
# DNAT performed
| src ip  | src port | dst ip  | dst port |
-------------------------------------------
| c.c.c.c | 80       | a.a.a.a | 50000    |
Through this process, the smartphone communicates with the web server as if it were connected to it directly, never noticing the translation. Thanks to NAT, multiple internal devices can simultaneously access the internet through a single public IP address.
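The round trip above can be sketched as a toy NAT table in Go. The Router type, its two maps, and the port allocation starting at 60000 are simplifications invented for this illustration, not how a real router is implemented:

```go
package main

import "fmt"

// FourTuple identifies one direction of a TCP/UDP flow.
type FourTuple struct {
	SrcIP   string
	SrcPort int
	DstIP   string
	DstPort int
}

// Router holds a minimal NAT table mapping private endpoints to public ports.
type Router struct {
	PublicIP string
	nextPort int
	snat     map[string]int    // "privateIP:port" -> allocated public port
	dnat     map[int]FourTuple // public port -> original outbound tuple
}

func NewRouter(publicIP string) *Router {
	return &Router{PublicIP: publicIP, nextPort: 60000,
		snat: map[string]int{}, dnat: map[int]FourTuple{}}
}

// Outbound rewrites a packet leaving the private network (SNAT).
func (r *Router) Outbound(p FourTuple) FourTuple {
	key := fmt.Sprintf("%s:%d", p.SrcIP, p.SrcPort)
	port, ok := r.snat[key]
	if !ok {
		port = r.nextPort // pick an unused public port
		r.nextPort++
		r.snat[key] = port
		r.dnat[port] = p
	}
	return FourTuple{SrcIP: r.PublicIP, SrcPort: port, DstIP: p.DstIP, DstPort: p.DstPort}
}

// Inbound rewrites a response arriving at the public side (the reverse of SNAT).
func (r *Router) Inbound(p FourTuple) (FourTuple, bool) {
	orig, ok := r.dnat[p.DstPort]
	if !ok {
		return FourTuple{}, false // no NAT entry: the packet is dropped
	}
	return FourTuple{SrcIP: p.SrcIP, SrcPort: p.SrcPort, DstIP: orig.SrcIP, DstPort: orig.SrcPort}, true
}

func main() {
	r := NewRouter("b.b.b.b")
	out := r.Outbound(FourTuple{"a.a.a.a", 50000, "c.c.c.c", 80})
	fmt.Println(out) // {b.b.b.b 60000 c.c.c.c 80}
	back, _ := r.Inbound(FourTuple{"c.c.c.c", 80, "b.b.b.b", out.SrcPort})
	fmt.Println(back) // {c.c.c.c 80 a.a.a.a 50000}
}
```

Note that the response is routed back purely by the table lookup: if the entry for port 60000 ever disappeared, Inbound would have no choice but to drop the packet.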
Kubernetes
Kubernetes possesses one of the most sophisticated and complex network architectures among recent technologies. Furthermore, NAT, as previously mentioned, is utilized in various contexts. The following are two prominent examples:
When a Pod initiates communication with the cluster's external environment.
Pods within a Kubernetes cluster are typically assigned private IP addresses that allow communication only within the cluster network. Therefore, when a Pod needs to communicate with the external internet, NAT is required for outbound traffic from the cluster. In this scenario, NAT is primarily performed on the Kubernetes node (each server in the cluster) where the Pod is running. When a Pod sends an outbound packet, it is first transmitted to the node where the Pod resides. The node then alters the source IP address of this packet (the Pod's private IP) to its own public IP address and appropriately modifies the source port before forwarding it externally. This process is analogous to the NAT process described earlier for Wi-Fi routers.
For example, assuming a Pod within a Kubernetes cluster (10.0.1.10, port: 40000) connects to an external API server (203.0.113.45, port: 443), the Kubernetes node will receive the following packet from the Pod:
# TCP packet received by the node, Pod => Node
| src ip    | src port | dst ip       | dst port |
---------------------------------------------------
| 10.0.1.10 | 40000    | 203.0.113.45 | 443      |
The node then records the following information:
# Node's internal NAT table (example)
| local ip  | local port | global ip   | global port |
---------------------------------------------------------
| 10.0.1.10 | 40000      | 192.168.1.5 | 50000       |
After performing SNAT as follows, the packet is sent externally.
# TCP packet sent by the node, Node => API Server
# SNAT performed
| src ip      | src port | dst ip       | dst port |
-----------------------------------------------------
| 192.168.1.5 | 50000    | 203.0.113.45 | 443      |
Subsequently, the process follows the same sequence as in the smartphone router example.
When communicating with a Pod from outside the cluster via NodePort.
One method of exposing services externally in Kubernetes is by utilizing NodePort services. A NodePort service opens a specific port (NodePort) on all nodes within the cluster and forwards traffic entering this port to the Pods associated with the service. External users can access the service via the cluster node's IP address and NodePort.
In this scenario, NAT plays a crucial role, and specifically, DNAT (Destination NAT) and SNAT (Source NAT) occur simultaneously. When traffic arrives at a specific node's NodePort from an external source, the Kubernetes network must ultimately forward this traffic to the Pod providing that service. During this process, DNAT occurs first, changing the packet's destination IP address and port number to the Pod's IP address and port number.
For example, assume an external user (203.0.113.10, port: 30000) accesses a service through a NodePort (30001) on a Kubernetes cluster node (192.168.1.5). This service internally points to a Pod with IP address 10.0.2.15 and port 8080.

External User (203.0.113.10:30000) ==> Kubernetes Node (External: 192.168.1.5:30001 / Internal: 10.0.1.1:42132) ==> Kubernetes Pod (10.0.2.15:8080)
Here, the Kubernetes node has both the externally accessible IP address 192.168.1.5 and the internally valid IP address 10.0.1.1 within the Kubernetes network. (Policies related to this vary depending on the CNI used, but this article is based on Cilium.)
When an external user's request arrives at the node, the node must forward this request to the Pod that will process it. At this point, the node applies the following DNAT rule to modify the packet's destination IP address and port number:
# TCP packet being prepared by the node to send to the Pod
# After DNAT application
| src ip       | src port | dst ip    | dst port |
---------------------------------------------------
| 203.0.113.10 | 30000    | 10.0.2.15 | 8080     |
A critical point here is that if the Pod replied to this request directly, the source IP address of its response would be its own address (10.0.2.15), and the destination would be the external user who sent the request (203.0.113.10). The external user would then receive a response from an IP address it never sent a request to, and would simply DROP that packet. Therefore, the Kubernetes node performs an additional SNAT on the request before handing it to the Pod, changing the packet's source IP address to one of the node's own addresses (either 192.168.1.5 or the internal network IP 10.0.1.1; in this case, 10.0.1.1 is used), so that the Pod's response flows back through the node.
# TCP packet being prepared by the node to send to the Pod
# After DNAT, SNAT application
| src ip   | src port | dst ip    | dst port |
---------------------------------------------------
| 10.0.1.1 | 42132    | 10.0.2.15 | 8080     |
Now, the Pod that receives this packet will respond to the node that initially received the request via NodePort, and the node will reverse the same DNAT and SNAT processes to return the information to the external user. During this process, each node will store the following information:
# Node's internal DNAT table
| original ip  | original port | destination ip | destination port |
------------------------------------------------------------------------
| 192.168.1.5  | 30001         | 10.0.2.15      | 8080             |

# Node's internal SNAT table
| original ip  | original port | destination ip | destination port |
------------------------------------------------------------------------
| 203.0.113.10 | 30000         | 10.0.1.1       | 42132            |
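The two rewrites on the NodePort path can be condensed into one function. The sketch below mirrors the addresses from the example above; nodePortRewrite and its hard-coded parameters are purely illustrative, not an actual Kubernetes or Cilium API:

```go
package main

import "fmt"

// Packet is a simplified TCP packet header.
type Packet struct {
	SrcIP   string
	SrcPort int
	DstIP   string
	DstPort int
}

// nodePortRewrite applies what the node does to an inbound NodePort request:
// DNAT points the packet at the backing Pod, then SNAT replaces the source
// with the node's internal IP so the Pod's reply comes back through the node.
func nodePortRewrite(p Packet, podIP string, podPort int, nodeInternalIP string, snatPort int) Packet {
	// DNAT: the destination becomes the Pod.
	p.DstIP, p.DstPort = podIP, podPort
	// SNAT: the source becomes the node's internal address.
	p.SrcIP, p.SrcPort = nodeInternalIP, snatPort
	return p
}

func main() {
	req := Packet{SrcIP: "203.0.113.10", SrcPort: 30000, DstIP: "192.168.1.5", DstPort: 30001}
	toPod := nodePortRewrite(req, "10.0.2.15", 8080, "10.0.1.1", 42132)
	fmt.Println(toPod) // {10.0.1.1 42132 10.0.2.15 8080}
}
```

On the return path the node simply applies the inverse of both rewrites, which is exactly why the DNAT and SNAT tables above must both stay intact for the connection to survive.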
Main Discussion
Typically, on Linux, these NAT processes are managed and operated by the conntrack subsystem via iptables, and other CNI projects such as Flannel and Calico rely on it to implement the NAT behavior described above. The problem is that Cilium, built on eBPF technology, bypasses this traditional Linux network stack entirely. 🤣
As a result, Cilium has chosen to directly implement only the functionalities required for Kubernetes scenarios among the tasks traditionally handled by the Linux network stack, as shown in the diagram above. Therefore, regarding the SNAT process mentioned earlier, Cilium directly manages the SNAT table in the form of an LRU Hash Map (BPF_MAP_TYPE_LRU_HASH).
# Cilium SNAT table
# !Example for easy explanation. Actual definition: https://github.com/cilium/cilium/blob/v1.18.0-pre.1/bpf/lib/nat.h#L149-L166
| src ip | src port | dst ip | dst port | protocol, conntrack, and other metadata |
-----------------------------------------------------------------------------------
|        |          |        |          |                                         |
And since it is a hash table, each entry needs a key for fast lookups: the combination of src ip, src port, dst ip, and dst port is used as the key.
Problem Identification
Phenomenon - 1: Lookup
Consequently, one problem arises: when a packet traverses eBPF, it must query the aforementioned Hash Table to verify whether it needs to perform an SNAT or DNAT process. As previously observed, there are two types of packets involved in the SNAT process: 1. outbound packets from internal to external, and 2. inbound packets from external to internal as a response. These two packets require transformation during the NAT process and are characterized by swapped source IP, port, and destination IP, port values.
Therefore, for fast lookups, either an additional entry with the source and destination swapped must be inserted into the hash table, or the same table must be queried twice for every packet, including packets unrelated to SNAT. Naturally, for better performance, Cilium adopted the approach of inserting the same data twice, with the swapped entry known as RevSNAT.
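The double insertion can be illustrated with a Go map keyed by the 4-tuple. The Tuple and Entry types here are invented for this sketch (the real key layout lives in bpf/lib/nat.h), but the idea is the same: insert once for the original direction and once for the reverse of the translated tuple, so either direction resolves in a single lookup:

```go
package main

import "fmt"

// Tuple is the 4-tuple lookup key (protocol omitted for brevity).
type Tuple struct {
	SrcIP   string
	SrcPort int
	DstIP   string
	DstPort int
}

// Entry holds the translated address for one direction of a flow.
type Entry struct {
	IP   string
	Port int
}

// reverse swaps source and destination, producing the key a reply packet will carry.
func (t Tuple) reverse() Tuple {
	return Tuple{SrcIP: t.DstIP, SrcPort: t.DstPort, DstIP: t.SrcIP, DstPort: t.SrcPort}
}

func main() {
	table := map[Tuple]Entry{}

	// Outbound flow: Pod 10.0.1.10:40000 -> API 203.0.113.45:443, SNATed to 192.168.1.5:50000.
	orig := Tuple{"10.0.1.10", 40000, "203.0.113.45", 443}
	table[orig] = Entry{"192.168.1.5", 50000} // SNAT entry: rewrite the source to this

	// RevSNAT entry: keyed by the reverse of the *translated* tuple, i.e. the
	// response as seen from outside, mapping back to the Pod's original address.
	resp := Tuple{"192.168.1.5", 50000, "203.0.113.45", 443}.reverse()
	table[resp] = Entry{"10.0.1.10", 40000}

	fmt.Println(table[orig]) // {192.168.1.5 50000}
	fmt.Println(table[resp]) // {10.0.1.10 40000}
}
```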
Phenomenon - 2: LRU
Furthermore, independent of the above issue, no hardware has infinite resources, and in kernel-level logic that demands high performance, dynamically growing data structures cannot be used. In such situations, when resources become scarce, existing data must be evicted. Cilium resolved this by using the LRU Hash Map, a basic data structure provided by the Linux kernel.
Phenomenon 1 + Phenomenon 2 = Connection Loss
https://github.com/cilium/cilium/issues/31643
This implies that for a single SNATed TCP (or UDP) connection:
- The same data is recorded twice in a single Hash Table for both outbound and inbound packets.
- Due to the LRU logic, either of these two data entries can be lost at any time.
If even one of the NAT information entries (hereinafter "entry") for an outbound or inbound packet is removed by LRU, NAT cannot be performed correctly, leading to the loss of the entire connection.
Solution
Here, the aforementioned Pull Requests come into play:
bpf:nat: restore a NAT entry if its REV NAT is not found
bpf:nat: Restore ORG NAT entry if it's not found
Previously, when a packet traversed eBPF, Cilium attempted a lookup in the SNAT table using a key composed of the source IP, source port, destination IP, and destination port. If the key did not exist, new NAT information was generated according to the SNAT rules and recorded in the table. For a new connection, this leads to normal communication. However, if the key had been unintentionally removed by the LRU, the new NAT would use a different port than the one used for the existing communication, the receiving end would reject the packet, and the connection would terminate with an RST packet.
The approach taken by the above PR is straightforward:
If a packet is observed in either direction, update the entry for its reverse direction as well.
When communication is observed in either direction, both entries are updated, moving them away from being priorities for eviction in the LRU logic. This reduces the possibility of a scenario where the entire communication collapses due to the deletion of only one entry.
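The effect of refreshing both entries can be modeled with a tiny LRU. The LRU type below is a deliberately minimal stand-in for BPF_MAP_TYPE_LRU_HASH, and the key names ("A-orig", "A-rev", ...) are invented for the illustration; none of this is Cilium code:

```go
package main

import (
	"container/list"
	"fmt"
)

// LRU is a minimal least-recently-used cache of string keys.
type LRU struct {
	cap   int
	order *list.List // front = most recently used
	items map[string]*list.Element
}

func NewLRU(capacity int) *LRU {
	return &LRU{cap: capacity, order: list.New(), items: map[string]*list.Element{}}
}

// Touch inserts or refreshes a key, evicting the least recently used key when full.
func (l *LRU) Touch(key string) {
	if el, ok := l.items[key]; ok {
		l.order.MoveToFront(el)
		return
	}
	if l.order.Len() == l.cap {
		oldest := l.order.Back()
		l.order.Remove(oldest)
		delete(l.items, oldest.Value.(string))
	}
	l.items[key] = l.order.PushFront(key)
}

func (l *LRU) Has(key string) bool { _, ok := l.items[key]; return ok }

func main() {
	// Before the fix: only the observed direction of flow A is refreshed.
	before := NewLRU(4)
	before.Touch("A-orig")
	before.Touch("A-rev")
	before.Touch("B-orig")
	before.Touch("B-rev")
	before.Touch("A-orig")           // flow A stays active, but only outbound is seen
	before.Touch("C-orig")           // a new flow forces an eviction
	fmt.Println(before.Has("A-rev")) // false: the active flow A lost its reverse entry

	// After the fix: observing one direction refreshes its peer entry too.
	after := NewLRU(4)
	after.Touch("A-orig")
	after.Touch("A-rev")
	after.Touch("B-orig")
	after.Touch("B-rev")
	after.Touch("A-orig")
	after.Touch("A-rev")            // the PR's idea: refresh the paired entry as well
	after.Touch("C-orig")           // the eviction now hits idle flow B instead
	fmt.Println(after.Has("A-rev")) // true: both halves of flow A stayed warm together
}
```

In the first run, the reverse entry of the *active* flow is the coldest key and gets evicted, breaking the connection; in the second, keeping the pair warm together pushes the eviction onto a genuinely idle flow.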
This may seem like a very simple approach and a straightforward idea, but it effectively resolved the problem of connection loss caused by the premature expiration of NAT information for response packets, significantly enhancing system stability. It can be considered an important improvement in terms of network stability.
Conclusion
I consider this Pull Request an excellent example that demonstrates how a fundamental understanding of computer science regarding NAT operations, combined with a simple idea, can bring about significant changes even within complex systems.
Of course, I did not directly present examples of complex systems in this article. However, to properly comprehend this PR, I pleaded with DeepSeek V3 0324 for nearly three hours, even adding the word 'Please', and as a result, I gained knowledge about Cilium +1 and obtained the following diagram. 😇
And after reading through the issues and PRs, I decided to write this article as a form of compensatory satisfaction for the ominous premonition that an issue might have arisen due to something I had created in the past.
Postscript - 1
Incidentally, there is a highly effective method to circumvent this issue. Since the root cause of the issue is insufficient NAT table space, one can simply increase the NAT table size. :-D
While someone else might have encountered the same issue, increased the NAT table size, and then left without reporting it, I admire and respect the passion of gyutaeb for thoroughly analyzing and understanding the problem, despite it not being directly related to him, and for contributing to the Cilium ecosystem with objective supporting data.
This was the motivation for my decision to write this article.
Postscript - 2
This topic is not directly aligned with Gosuda, which primarily focuses on the Go language. However, given the close relationship between the Go language and the cloud ecosystem, and the fact that Cilium contributors generally possess some proficiency in Go, I decided to bring content that could be posted on a personal blog to Gosuda.
Since I, one of the administrators, have given permission, I believe it should be acceptable.
If you believe it is not acceptable, you might want to save it as a PDF quickly, as it might be deleted at any time. ;)
Postscript - 3
This article was greatly assisted by Cline and Llama 4 Maverick. Although I began the analysis with Gemini and pleaded with DeepSeek, I ultimately received assistance from Llama 4. Llama 4 is excellent. You should definitely try it.