A Race Within A Race: Exploiting CVE-2025-38617 in Linux Packet Sockets

A step-by-step guide to exploiting a 20-year-old bug in the Linux kernel to achieve full privilege escalation and container escape, plus a cool bug-hunting heuristic.

Calif, Mar 03, 2026

Table of Contents

- Introduction
- Background
- Packet Sockets
- Ring Buffers and TPACKET_V3
- Extended Attributes and simple_xattr
- Slab Allocator vs Page Allocator
- Kernel Heap Mitigations
- The Vulnerability
- The Conditional Zeroing Bug
- The Race Window and UAF
- The Key Insight: Sleeping Mutex Holders Stretch Race Windows
- The Exploit
- Stage 0: Winning the Races
- Stage 1: Page Overflow Primitive (via xattr corruption)
- Stage 2: Heap Read/Write via pgv Overlap
- Stage 3: Arbitrary Page Read/Write via pgv Overlap
- Stage 4: KASLR Bypass via Pipe Buffer
- Stage 5: Privilege Escalation via Syscall Patching
- The Fix
- Takeaways

Introduction

CVE-2025-38617 is a use-after-free vulnerability in the Linux kernel’s packet socket subsystem, caused by a race condition between packet_set_ring() and packet_notifier(). The bug has existed since Linux 2.6.12 (2005) and was fixed in kernel version 6.16. It allows an unprivileged local attacker (needing only CAP_NET_RAW, obtainable through user namespaces) to achieve full privilege escalation and container escape.

The vulnerability and exploits were discovered and developed by Quang Le, a member of Calif, and submitted as part of Google’s kernelCTF program. Calif provides this complimentary write-up to offer additional background for educational purposes.

This article analyzes the vulnerability, the exploit submission, and the two-line fix. The exploit is notable for its sophistication: it defeats modern kernel mitigations including CONFIG_RANDOM_KMALLOC_CACHES and CONFIG_SLAB_VIRTUAL, builds exploit primitives through a chain of four increasingly powerful stages, and uses creative timing techniques to win two separate race conditions deterministically.
But perhaps the most interesting aspect is the bug-finding heuristic it demonstrates: when a mutex holder sleeps, the time window between lock release and the next critical operation becomes predictable and stretchable, turning otherwise unexploitable code sequences into reliable race conditions.

- Affected versions: Linux 2.6.12 through 6.15
- Affected component: net/packet/af_packet.c (packet socket subsystem)
- Root cause: Race condition leading to use-after-free
- Required capability: CAP_NET_RAW (available via unprivileged user namespaces)
- Fix commit: 01d3c8417b9c

Background

Packet Sockets

Linux packet sockets (AF_PACKET) provide raw access to network interfaces at the link layer. They’re used by tools like tcpdump and Wireshark to capture network traffic. When a packet arrives on a network interface, the kernel delivers a copy to any packet socket “hooked” to that interface through a registered protocol hook function.

Packet sockets have a lifecycle tied to network interface state:

- When the interface goes UP, the packet socket’s protocol hook is registered, and the socket enters the PACKET_SOCK_RUNNING state. It can now receive packets.
- When the interface goes DOWN, the hook is unregistered, and the socket stops receiving packets.

These transitions are managed by packet_notifier(), which handles NETDEV_UP and NETDEV_DOWN events.

Ring Buffers and TPACKET_V3

For high-performance packet processing, packet sockets support memory-mapped ring buffers. Instead of copying each packet through recvmsg(), the kernel writes packets directly into a shared memory region that userspace can mmap().

The ring buffer is configured through setsockopt() with PACKET_RX_RING (for receiving) or PACKET_TX_RING (for transmitting), which internally calls packet_set_ring(). The ring buffer consists of multiple “blocks,” each a contiguous allocation of kernel pages.
These blocks are tracked by an array of struct pgv pointers:

    struct pgv {
        char *buffer;  // pointer to one block of pages
    };

The alloc_pg_vec() function allocates this array and each block:

    // Simplified excerpt: allocation-failure handling is omitted.
    static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
    {
        unsigned int block_nr = req->tp_block_nr;
        struct pgv *pg_vec;
        int i;

        // One struct pgv slot per ring block, zero-initialized.
        pg_vec = kcalloc(block_nr, sizeof(struct pgv), GFP_KERNEL | __GFP_NOWARN);

        // Each slot points to a block of 2^order contiguous pages.
        for (i = 0; i < block_nr; i++) {
            pg_vec[i].buffer = alloc_one_pg_vec_page(order);
        }

        return pg_vec;
    }

How userspace accesses ring buffer blocks: mmap(). When userspace calls mmap() on a packet socket file descriptor, the kernel’s packet_mmap() handler walks the pgv array and maps each block’s pages into the calling process’s virtual address space as a single contiguous region. Block 0’s pages appear first, followed by block 1’s pages, and so on. The result is that userspace gets a pointer to a memory region where offset 0 is the start of block 0, offset block_size is the start of block 1, etc. Reads and writes to this region go directly to the kernel pages backing the ring buffer, with no syscall overhead.

This mapping is based on what pgv[N].buffer points to a...
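The ring configuration and mmap() layout described above can be sketched from userspace roughly as follows. This is a minimal illustration, not code from the exploit: the helper names (make_ring_req, map_rx_ring, ring_block_index, ring_block_start) and the 100 ms block timeout are assumptions made for this example. Opening the socket requires CAP_NET_RAW, so map_rx_ring() returns MAP_FAILED when run without it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <arpa/inet.h>        /* htons */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* TPACKET_V3, PACKET_RX_RING, struct tpacket_req3 */

/* Hypothetical helper: index of the ring block containing a byte offset
 * of the mmap()ed region (block N starts at N * block_size). */
size_t ring_block_index(size_t offset, size_t block_size)
{
    assert(block_size > 0);
    return offset / block_size;
}

/* Hypothetical helper: byte offset at which block `block` starts. */
size_t ring_block_start(size_t block, size_t block_size)
{
    return block * block_size;
}

/* Hypothetical helper: build a TPACKET_V3 ring request.
 * tp_frame_nr must equal frames-per-block times tp_block_nr. */
struct tpacket_req3 make_ring_req(unsigned int block_size,
                                  unsigned int block_nr,
                                  unsigned int frame_size)
{
    struct tpacket_req3 req;

    memset(&req, 0, sizeof(req));
    req.tp_block_size = block_size;
    req.tp_block_nr = block_nr;
    req.tp_frame_size = frame_size;
    req.tp_frame_nr = (block_size / frame_size) * block_nr;
    req.tp_retire_blk_tov = 100; /* assumed block timeout, in milliseconds */
    return req;
}

/* Hypothetical helper: create a packet socket, configure an RX ring
 * (this setsockopt() is what reaches packet_set_ring() in the kernel),
 * and map all blocks as one contiguous region.
 * Returns MAP_FAILED on any error; on success stores the socket in *fd_out. */
void *map_rx_ring(const struct tpacket_req3 *req, int *fd_out)
{
    int version = TPACKET_V3;
    size_t len = (size_t)req->tp_block_size * req->tp_block_nr;
    void *ring;
    int fd;

    fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return MAP_FAILED;

    if (setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version)) < 0 ||
        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req)) < 0) {
        close(fd);
        return MAP_FAILED;
    }

    ring = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) {
        close(fd);
        return MAP_FAILED;
    }

    *fd_out = fd;
    return ring;
}
```

With a request built by make_ring_req(4096, 4, 2048), a successful map_rx_ring() call yields a 16 KiB region in which block 1 begins at (char *)ring + ring_block_start(1, 4096). The offset helpers are kept separate from the socket code so the layout arithmetic can be checked without any privileges.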
CVE-2025-38617 is a use-after-free vulnerability in the Linux kernel's packet socket subsystem (AF_PACKET), exploitable via a race condition between `packet_set_ring()` and `packet_notifier()` that allows an attacker with CAP_NET_RAW to achieve privilege escalation and container escape. It has a CVSS 3.1 score of 4.7 (MEDIUM). Affected versions are Linux kernel 2.6.13 through 5.4.296, 5.5 through 5.10.240, 5.11 through 5.15.189, 5.16 through 6.1.147, 6.2 through 6.6.101, and 6.12 through 6.15.9, with fixes available in versions 5.4.297, 5.10.241, 5.15.190, 6.1.148, 6.6.102, 6.12.42, 6.15.10, and 6.16.1.