3 Kernel Prerequisites

This chapter does not try to teach the Linux kernel from scratch. It only builds the pieces needed to follow the Copy Fail path. Section 1.3 lists background material for deeper study, notably Linux Kernel Programming, which explains in detail how the kernel works under the hood.

3.1 File-backed Execution and Page Cache

3.1.1 File-backed Executable Mappings

A Linux process does not see raw physical RAM directly. Instead, it runs inside its own virtual address space (VAS), which is divided into several mappings. A mapping is just a virtual memory region with:

  • a backing source
  • and a set of permissions

This is basic Linux internals; Linux Kernel Programming includes a well-crafted diagram of the process VAS. The relevant part here is that the executable text region and shared libraries appear as mappings inside the process address space.

Some mappings are anonymous, meaning they are not backed by any file, such as heap growth or freshly allocated memory. Others are file-backed, meaning their contents come from a file mapped into memory. Executable code is usually reached through file-backed mappings: the program's .text region and shared libraries (e.g., libc.so.6) are mapped into the process VAS rather than copied byte-for-byte into a private buffer.

To understand Copy Fail, we need a deeper dive beyond the generic VAS view. A file-backed executable mapping is not "just memory"; it is a userspace virtual range that the kernel ties back to a file object.

 userspace                                    kernel
═════════════════════════════════════════════════════════════════


 virtual address 0x5... inside .text

        │ process executes from mapped ELF region

┌─────────────────────────────┐
│ VMA / userspace mapping     │
│ r-x file-backed region      │ e.g.
│ e.g. /usr/bin/su .text      │ 555555554000-555555556000 r--p /usr/bin/su
└──────────────┬──────────────┘ 555555556000-55555557a000 r-xp /usr/bin/su

               │ page-table lookup
               │ virtual → physical translation

               └────────────────────────────────┐

                                  ┌─────────────────────────────┐
                                  │     page-table entries      │ 
        0x5555... → phys 0x1ab000 │ virtual → physical mapping  │ 
                                  └──────────────┬──────────────┘

                                                 │ points to

                                  ┌─────────────────────────────┐
                                  │       struct page           │
                   e.g. PFN 0x1ab │     physical page frame     │
                                  └──────────────┬──────────────┘

                                                 │ represents cached bytes from

                                  ┌─────────────────────────────┐
                                  │       page cache page       │
          e.g. file offset 0x2000 │    /usr/bin/su + 0x2000     │
                                  └──────────────┬──────────────┘

                                                 │ tracked by

                                  ┌─────────────────────────────┐
                                  │     inode(/usr/bin/su)      │
                                  │  address_space cache tree   │
                                  └──────────────┬──────────────┘

                                                 │ originally loaded from

                                  ┌─────────────────────────────┐
                                  │        file on disk         │
                                  │         /usr/bin/su         │
                                  └─────────────────────────────┘

The exact page-fault-handling details can wait until later in 3.1.3. For now, the important idea is narrower: a file-backed executable mapping still points back to a real file and file offset, rather than to an anonymous private buffer.

3.1.2 ELF Loading Through Execve

When a program is launched, the kernel handles it through execve(). For a normal ELF binary, the kernel parses the ELF metadata and creates a new process image whose mappings correspond to the ELF's loadable segments.

From the kernel's perspective:

 userspace                             kernel
════════════════════════════════════════════════════════

 process calls syscall
 execve("/usr/bin/su")
        │               mode switch -> kernel mode
        └────────────────────────────────┐


                            ┌─────────────────────────┐
                            │ handle execve syscall   │
                            └────────────┬────────────┘


                            ┌─────────────────────────┐
                            │ read file /usr/bin/su   │
                            │ parse ELF headers       │
                            │ inspect PT_LOAD entries │
                            └────────────┬────────────┘


                            ┌─────────────────────────┐
                            │ create userspace VMAs   │
                            │ backed by /usr/bin/su   │
                            │                         │
                            │ .text   → r-x           │
                            │ .rodata → r--           │
                            │ .data   → rw-           │
                            └────────────┬────────────┘


                            ┌─────────────────────────┐
                            │  [!] connect VMAs to    │
                            │   page-cache pages      │
                            └────────────┬────────────┘

                                         │ returns to user mode
        ┌────────────────────────────────┘


process begins execution
at ELF entry point using mapped .text pages

The code-level evidence is in the ELF loader. During load_elf_binary(), the kernel maps each PT_LOAD segment through elf_map(), and elf_map() ultimately calls vm_mmap(filep, ...) on the executable file:

C
static unsigned long elf_map(struct file *filep, unsigned long addr,
                            const struct elf_phdr *eppnt, int prot, int type,
                            unsigned long total_size)
{
    ...
    map_addr = vm_mmap(filep, addr, size, prot, type, off);
    ...
}

So for an executable like /usr/bin/su, execve() does not first construct a private in-memory image of .text. It creates VMAs whose backing file is still /usr/bin/su; later execution faults resolve against that file-backed mapping.
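The PT_LOAD entries that load_elf_binary() walks can also be listed from userspace. The following is a minimal sketch, assuming a little-endian ELF64 image; load_segments() and _demo_image() are hypothetical helper names, and the demo image is a hand-built two-segment header with synthetic sizes, not the real /usr/bin/su.

```python
import struct

PT_LOAD = 1
PF_X, PF_W, PF_R = 1, 2, 4

EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # Elf64_Ehdr, little-endian (64 bytes)
PHDR = struct.Struct("<IIQQQQQQ")          # Elf64_Phdr, little-endian (56 bytes)


def load_segments(image: bytes):
    """Return (perms, file_offset, vaddr, filesz) for every PT_LOAD entry."""
    fields = EHDR.unpack_from(image, 0)
    ident, e_phoff, e_phentsize, e_phnum = fields[0], fields[5], fields[9], fields[10]
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("not a little-endian ELF64 image")
    segments = []
    for i in range(e_phnum):
        p_type, p_flags, p_offset, p_vaddr, _, p_filesz, _, _ = PHDR.unpack_from(
            image, e_phoff + i * e_phentsize)
        if p_type != PT_LOAD:
            continue
        perms = (("r" if p_flags & PF_R else "-")
                 + ("w" if p_flags & PF_W else "-")
                 + ("x" if p_flags & PF_X else "-"))
        segments.append((perms, p_offset, p_vaddr, p_filesz))
    return segments


def _demo_image() -> bytes:
    """Synthetic header: one read-only and one read+execute PT_LOAD entry."""
    ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)
    ehdr = EHDR.pack(ident, 3, 62, 1, 0x3f80, EHDR.size, 0, 0,
                     EHDR.size, PHDR.size, 2, 0, 0, 0)
    ro = PHDR.pack(PT_LOAD, PF_R, 0x0, 0x0, 0x0, 0x3000, 0x3000, 0x1000)
    rx = PHDR.pack(PT_LOAD, PF_R | PF_X, 0x3000, 0x3000, 0x3000, 0x7000, 0x7000, 0x1000)
    return ehdr + ro + rx


segments = load_segments(_demo_image())
for perms, offset, vaddr, filesz in segments:
    print(f"PT_LOAD {perms} file_off={offset:#x} vaddr={vaddr:#x} filesz={filesz:#x}")
```

Pointing load_segments() at the bytes of a real binary, for example open("/usr/bin/su", "rb").read(), should list the same segments that later show up as separate lines in the process mapping table.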

For Copy Fail, the important segment is .text: the machine code that the CPU will actually execute. Under normal conditions, this region is mapped read-only and executable.

axura @ labyrinth :~
axura@pwnlab:~$ objdump -h /usr/bin/su | head -n5 \
&& objdump -h /usr/bin/su | grep text -A1

/usr/bin/su:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
 15 .text         00005d02  0000000000003f80  0000000000003f80  00003f80  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, CODE

The READONLY and CODE flags are the properties that matter for security: the section is meant to be executed, never written. At runtime, the corresponding /usr/bin/su executable mapping appears as r-xp:

axura @ labyrinth :~
pwndbg> vmmap
LEGEND: STACK | HEAP | CODE | DATA | WX | RODATA
             Start                End Perm     Size  Offset File (set vmmap-prefer-relpaths on)
    0x555f8bd7e000     0x555f8bd81000 r--p     3000       0 /usr/bin/su
    0x555f8bd81000     0x555f8bd88000 r-xp     7000    3000 /usr/bin/su
    0x555f8bd88000     0x555f8bd8a000 r--p     2000    a000 /usr/bin/su
    0x555f8bd8a000     0x555f8bd8b000 r--p     1000    c000 /usr/bin/su
    0x555f8bd8b000     0x555f8bd8c000 rw-p     1000    d000 /usr/bin/su
    ...

Here, p marks a private mapping: writes, where permitted, are handled through copy-on-write instead of being written back to the file as they would be for a shared writable mapping. The pathname /usr/bin/su shows that the VMA is backed by an executable file on disk.
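The same fields can be read programmatically from /proc/<pid>/maps, which pwndbg's vmmap mirrors. This is a small sketch: parse_maps() and Mapping are hypothetical names, and SAMPLE is an illustrative reconstruction (the device, inode and [heap] address are made-up placeholders in the usual maps format), not captured output.

```python
from typing import NamedTuple, Optional


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    path: Optional[str]

    @property
    def private(self) -> bool:
        return self.perms[3] == "p"          # 'p' = private (COW), 's' = shared

    @property
    def file_backed(self) -> bool:
        return self.path is not None and self.path.startswith("/")


def parse_maps(text: str):
    """Parse lines of the form: start-end perms offset dev inode [path]."""
    out = []
    for line in text.splitlines():
        parts = line.split(None, 5)
        if len(parts) < 5:
            continue
        start, end = (int(x, 16) for x in parts[0].split("-"))
        path = parts[5].strip() if len(parts) == 6 else None
        out.append(Mapping(start, end, parts[1], int(parts[2], 16), path))
    return out


SAMPLE = """\
555f8bd81000-555f8bd88000 r-xp 00003000 08:02 2383706                    /usr/bin/su
555f8bd8b000-555f8bd8c000 rw-p 0000d000 08:02 2383706                    /usr/bin/su
555f8c000000-555f8c021000 rw-p 00000000 00:00 0                          [heap]
"""

maps = parse_maps(SAMPLE)
for m in maps:
    kind = "file-backed" if m.file_backed else "anonymous"
    print(f"{m.start:#x} {m.perms} off={m.offset:#x} {kind} {m.path or ''}")
```

On a live system, parse_maps(open("/proc/self/maps").read()) applies the same split to the calling process.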

This is the part that can feel counterintuitive to a userspace pwner: the r-xp permission only describes the userspace VMA. It does not mean the underlying file-backed cache page is impossible to corrupt through a kernel bug.

If an attacker can modify those backing bytes, the mapping may still look read-only from userspace, while the instructions fetched from the page cache have already changed.

Once that boundary is broken, the most dangerous target class is obvious: root-owned SUID binaries such as /usr/bin/su. If we can hijack the page-cache-backed .text bytes of an SUID executable, the next execution may fetch attacker-controlled instructions from cache while the process runs with elevated privileges.

So the real question becomes:

Can an unprivileged user corrupt page-cache-backed executable bytes for a read-only file?

3.1.3 Linux Page Cache

The page cache is the kernel's in-memory cache for file-backed data. When a process reads, maps, or executes a file, Linux can serve the bytes from cached pages in RAM instead of fetching them from disk every time.

For Copy Fail, the page cache is the critical target:

file on disk

    │ read / mmap / execve

page-cache page

    │ mapped into process as file-backed memory

userspace VMA

3.1.3.1 Runtime Cache Inspection

Linux exposes system-wide file-cache accounting through /proc/meminfo:

axura @ labyrinth :~
axura@pwnlab:~$ grep -E 'Cached|Buffers|Active\\(file\\)|Inactive\\(file\\)' /proc/meminfo
Buffers:           42512 kB
Cached:          1090328 kB
SwapCached:            0 kB
axura@pwnlab:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           7.7Gi       1.3Gi       5.6Gi        44Mi       1.1Gi       6.4Gi
Swap:          3.8Gi          0B       3.8Gi

In this lab, both commands reported about 1.1 GiB in cache usage, confirming that the kernel is already holding a substantial amount of file-backed data in memory.

We can then narrow that view to a single file:

axura @ labyrinth :~
axura@pwnlab:~$ vmtouch -v /usr/bin/su
/usr/bin/su
[OOOOOOOOOOOOOO] 14/14

           Files: 1
     Directories: 0
  Resident Pages: 14/14  56K/56K  100%
         Elapsed: 6.4e-05 seconds

This shows that /usr/bin/su currently has resident file-backed pages in page cache. However, it does not yet prove how executable .text mappings reach those pages.

3.1.3.2 Page-cache Object Model

To reason about page-cache corruption, we need the kernel object model. On Linux, executable mappings are ultimately backed by regular files on disk. At the filesystem layer, each regular file is represented by an inode, which owns the file's page-cache mapping:

axura @ labyrinth :~
axura@pwnlab:~$ stat /usr/bin/su
  File: /usr/bin/su
  Size: 55680           Blocks: 112        IO Block: 4096   regular file
Device: 8,2     Inode: 2383706     Links: 1
Access: (4755/-rwsr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2026-05-13 12:32:47.267708839 +0800
Modify: 2026-03-07 00:00:54.000000000 +0800
Change: 2026-05-13 11:44:17.032955828 +0800
Birth: 2026-05-13 11:44:16.979947665 +0800
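The Size field lines up with the 14 resident pages vmtouch reported. A quick arithmetic check, assuming 4 KiB pages (consistent with vmtouch's 56K for 14 pages):

```python
PAGE_SIZE = 4096   # assumption: 4 KiB pages
size = 55680       # "Size:" from the stat output above

pages = -(-size // PAGE_SIZE)               # ceiling division
tail = size - (pages - 1) * PAGE_SIZE       # bytes used in the last page

print(pages, pages * PAGE_SIZE // 1024, tail)
```

The last page is only partly filled by file data, but it still occupies a whole page-cache page.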

For page-cache purposes, the important members of struct inode are i_mapping and the embedded i_data:

C
struct inode {
    ...
    struct address_space *i_mapping;  // pointer to page-cache mapping for this file
    ...
    struct address_space i_data;      // stores metadata about the file's cached pages
    ...
};

So each inode links to the file's page cache, represented by a per-file struct address_space:

C
struct address_space {
    struct inode            *host;
    struct xarray           i_pages;
    ...
    struct rb_root_cached   i_mmap;
    ...
};

The fields that matter here are:

  • host: which file this cache belongs to
  • i_pages: the xarray that stores cached file folios/pages by file offset
  • i_mmap: the set of user VMAs mapping this file's cached contents

For /usr/bin/su, the corresponding inode owns an address_space object; that address_space stores cached file data in i_pages and tracks user mappings in i_mmap.

Conceptually:

file on disk: /usr/bin/su


      ┌───────────────┐
      │     inode     │
      └───────┬───────┘
              │ owns

      ┌─────────────────────────────┐
      │       address_space         │
      │                             │
      │ host    -> this inode       │
      │ i_pages -> cached folios    │────▶ cached executable page, e.g. file offset 0x2000
      │ i_mmap  -> mapped VMAs      │────▶ VMA in proc A: /usr/bin/su r-xp
      └─────────────────────────────┘      VMA in proc B: /usr/bin/su r-xp
                                           VMA in proc C: /usr/bin/su r-xp
                                           ...

The same address_space object, which is this file's page cache, is then reused by later processes that map the same file.

3.1.3.3 Executable Fault Path

The missing link noted in 3.1.3.1 is the page-fault path: when an executable mapping (.text) needs bytes from /usr/bin/su, where does the kernel fetch them from?

For file-backed VMAs on most filesystems, the fault handler is filemap_fault(). Its opening lines expose the chain directly:

C
vm_fault_t filemap_fault(struct vm_fault *vmf)
{
    // file mapped by the faulting VMA 
    struct file *file = vmf->vma->vm_file;

    // page-cache mapping associated with the file
    struct address_space *mapping = file->f_mapping;

    // inode owning this page-cache mapping
    struct inode *inode = mapping->host;

    // page-cache index for the faulting offset
    pgoff_t index = vmf->pgoff;
    ...

    /*
     * Do we have something in the page cache already?
     */
    // try to find the cached folio/page for this file offset
    folio = filemap_get_folio(mapping, index);
    ...
}

This shows that executable bytes are resolved through the file's page-cache mapping. That is the core idea of Copy Fail: if the cached page has been corrupted, every read-only VMA backed by it serves the corrupted bytes, because the corruption sits at the source.
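The index used for the lookup is plain arithmetic on the faulting address. The kernel derives vmf->pgoff as ((address - vm_start) >> PAGE_SHIFT) + vm_pgoff; the sketch below replays that with the numbers from the earlier objdump and vmmap listings. It assumes 4 KiB pages and that both listings describe the same PIE binary, loaded with its first mapping at 0x555f8bd7e000; fault_pgoff() is a hypothetical helper name.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def fault_pgoff(address: int, vm_start: int, vm_file_offset: int) -> int:
    """Page-cache index for a fault at `address` inside a file-backed VMA."""
    return ((address - vm_start) >> PAGE_SHIFT) + (vm_file_offset >> PAGE_SHIFT)


load_base = 0x555f8bd7e000                   # first /usr/bin/su line in vmmap (offset 0)
vm_start, vm_off = 0x555f8bd81000, 0x3000    # the r-xp line in vmmap
text_addr = load_base + 0x3f80               # .text VMA from objdump

index = fault_pgoff(text_addr, vm_start, vm_off)
in_page = text_addr & ((1 << PAGE_SHIFT) - 1)
file_offset = (index << PAGE_SHIFT) + in_page

print(hex(text_addr), index, hex(in_page), hex(file_offset))
```

The recovered file offset matches the "File off" column that objdump printed for .text, so the first instruction bytes of .text live in the cached page at index 3 of the file's mapping.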

3.1.3.4 Cache Lookup

As the flow of filemap_fault() shows, when a process faults on a file-backed executable VMA, the kernel starts from vmf->vma->vm_file, follows file->f_mapping, and then looks up the corresponding cached folio/page in the file's page-cache mapping.

The final lookup operation is done by filemap_get_folio(mapping, index), which reaches __filemap_get_folio():

C
struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
                                fgf_t fgp_flags, gfp_t gfp)
{
    struct folio *folio;

repeat:

    // lookup cached folio/page by file offset 
    folio = filemap_get_entry(mapping, index);

    ...

    if (!folio)
        goto no_page;

    ...

    return folio;

no_page:

    // on a cache miss, allocate and insert a new folio
    // if the caller requested creation
    if (!folio && (fgp_flags & FGP_CREAT)) {

        ...

        folio = filemap_alloc_folio(alloc_gfp, order);

        ...

        err = filemap_add_folio(mapping, folio, index, gfp);

        ...
    }

    return folio;
}

That gives the cache behavior:

  • cache hit → reuse existing folio for this file offset
  • cache miss → allocate folio → insert into file mapping → later accesses may reuse it

This is why page-cache corruption is powerful. Once a file-backed folio is modified in memory, later reads, mappings, or executions of the same file offset may observe the modified cached bytes until the cache is invalidated or reloaded.
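The hit/miss behavior and its consequence can be mimicked with a toy model. This is only an illustration: AddressSpace and get_folio() are hypothetical Python stand-ins, a dict replaces the xarray, and the disk read is folded into the miss path for brevity (the kernel fills a new folio through separate readahead/read_folio steps).

```python
class AddressSpace:
    """Toy stand-in for struct address_space: page index -> cached folio."""

    def __init__(self, disk: bytes, page_size: int = 4096):
        self.disk = disk
        self.page_size = page_size
        self.i_pages = {}          # models the i_pages xarray
        self.disk_reads = 0

    def get_folio(self, index: int, create: bool = True):
        folio = self.i_pages.get(index)           # cache hit?
        if folio is None and create:              # miss, caller asked for creation
            start = index * self.page_size
            folio = bytearray(self.disk[start:start + self.page_size])
            self.disk_reads += 1
            self.i_pages[index] = folio           # insert into the file mapping
        return folio


disk = bytes(range(256)) * 32                     # two 4 KiB "pages" of file data
mapping = AddressSpace(disk)

first = mapping.get_folio(1)                      # miss: loaded from "disk"
second = mapping.get_folio(1)                     # hit: same object reused
first[0] = 0xCC                                   # in-memory corruption of the cached copy
seen_later = mapping.get_folio(1)[0]

print(first is second, mapping.disk_reads, hex(seen_later), hex(disk[4096]))
```

Every later lookup of index 1 returns the modified bytes while the backing data stays untouched, which is the property the rest of the chapter relies on.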

3.1.3.5 Buffered Read Path

The same cache object is also used by ordinary buffered file reads. generic_file_read_iter() dispatches into filemap_read(), whose comment states the model directly:

C
/**
 * filemap_read - Read data from the page cache.
 * @iocb: The iocb to read.
 * @iter: Destination for the data.
 * @already_read: Number of bytes already read by the caller.
 *
 * Copies data from the page cache.  If the data is not currently present,
 * uses the readahead and read_folio address_space operations to fetch it.
 *
 * ...
 */
ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
        ssize_t already_read) { ... }

3.1.3.6 Page-cache Takeaway

Execution faults and buffered reads both converge on the same model: the page cache is the shared, kernel-resident copy of file data consulted by file-backed mappings and ordinary file reads.

system boot
══════════════════════════════════════════════════════

 initialize memory manager
 initialize page allocator
 initialize page-cache infrastructure

                │ no file pages cached yet

         page cache starts empty


runtime
══════════════════════════════════════════════════════

process executes, reads, or maps /usr/bin/su

                │ execve() / read() / mmap()

    kernel requests file-backed data

                │ load from disk on cache miss

    ┌──────────────────────────────┐
    │ page cache                   │
    │ cached /usr/bin/su pages     │
    │ executable file data in RAM  │
    └──────────────┬───────────────┘

                   │ reused by later access

          proc A → /usr/bin/su
          proc B → /usr/bin/su
          proc C → /usr/bin/su

So if we manage to corrupt the cached page for a file offset, later consumers of that same file-backed data may observe the corrupted bytes even when the read-only on-disk file remains unchanged. That is the page-cache side of Copy Fail.
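The sharing itself can be observed legitimately on a scratch file that we own. In this sketch, bytes written through a shared mapping are read back through a second, read-only descriptor. The flush() call (msync) is there only to keep the example portable; the temporary file and variable names are arbitrary.

```python
import mmap
import os
import tempfile

# Scratch file we legitimately own, exactly one page long.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"A" * mmap.PAGESIZE)
    path = tmp.name

rw_fd = os.open(path, os.O_RDWR)      # writer: has write authority over the file
ro_fd = os.open(path, os.O_RDONLY)    # independent read-only consumer
try:
    shared = mmap.mmap(rw_fd, mmap.PAGESIZE, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
    before = os.pread(ro_fd, 4, 0)
    shared[0:4] = b"BBBB"             # write through the shared mapping
    shared.flush()                    # msync(), for portability
    after = os.pread(ro_fd, 4, 0)     # buffered read through the other descriptor
    shared.close()
finally:
    os.close(rw_fd)
    os.close(ro_fd)
    os.unlink(path)

print(before, after)
```

Here the writer holds a writable descriptor, so nothing is violated. Copy Fail is about reaching the same shared state without that authority.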

3.1.4 Page-cache Write Boundary

Userspace can read file-backed pages, execute them, and even obtain private writable copies through MAP_PRIVATE mappings (the p in permissions such as rw-p). The security boundary is narrower:

An unprivileged path must not mutate the shared page-cache page behind a file unless it has legitimate write authority over that file-backed state.

This distinction matters because executable file pages are often shared kernel-resident state:

  • a private write must create an anonymous copy
  • a shared write must require write permission to the file

Direct mutation of a cached executable page through an unrelated kernel path violates that boundary.

The mmap() path encodes this boundary before a file-backed VMA is created. In do_mmap(), a writable shared mapping is rejected unless the file was opened writable:

C
case MAP_SHARED:
case MAP_SHARED_VALIDATE:

        // shared writable mmap requires writable file descriptor
        if (prot & PROT_WRITE) {
                if (!(file->f_mode & FMODE_WRITE))
                        return -EACCES;
                ...
        }

        ...

        // remove VMA write/share capability if fd is read-only
        if (!(file->f_mode & FMODE_WRITE))
                vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

The VMA permission bits are defined in include/linux/mm.h:

  • VM_WRITE means the mapping is currently writable.
  • VM_MAYWRITE means the mapping may later become writable through mprotect().

The page-fault path preserves the same boundary. sanitize_fault_flags() treats a write fault on a mapping that lacks VM_MAYWRITE as invalid and returns VM_FAULT_SIGSEGV. For private writable mappings, do_wp_page() resolves the write fault through wp_page_copy(), creating a private anonymous copy instead of modifying the shared file-backed page.

The normal model is:

Mapping / Operation                      Behavior
═════════════════════════════════════════════════════════════════════════════════
Read/execute file-backed page            Allowed, backed directly by page cache
Private writable mapping (MAP_PRIVATE)   Allowed through Copy-on-Write (COW)
Shared writable mapping (MAP_SHARED)     Requires FMODE_WRITE
Write into read-only page cache          Forbidden
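The flag handling in the do_mmap() excerpt can be restated as a small decision function. This is a simplified model, not kernel code: vma_flags() is a hypothetical name, flags are plain strings, PROT_READ is assumed to be requested, and everything unrelated to the write boundary is left out.

```python
import errno


def vma_flags(shared: bool, prot_write: bool, fd_writable: bool) -> set:
    """Simplified model of the write-related VMA flags chosen by do_mmap()."""
    flags = {"VM_READ", "VM_MAYREAD", "VM_MAYWRITE"}    # defaults before the checks
    if prot_write:
        flags.add("VM_WRITE")
    if shared:
        if prot_write and not fd_writable:
            raise PermissionError(errno.EACCES, "shared writable mmap needs FMODE_WRITE")
        flags |= {"VM_SHARED", "VM_MAYSHARE"}
        if not fd_writable:
            # read-only fd: the mapping can never become a shared writable one
            flags -= {"VM_MAYWRITE", "VM_SHARED"}
    return flags


ro_shared = vma_flags(shared=True, prot_write=False, fd_writable=False)
ro_private_w = vma_flags(shared=False, prot_write=True, fd_writable=False)
try:
    vma_flags(shared=True, prot_write=True, fd_writable=False)
    denied = False
except PermissionError as exc:
    denied = exc.errno == errno.EACCES

print(sorted(ro_shared), sorted(ro_private_w), denied)
```

The first case shows why a later mprotect(PROT_WRITE) cannot upgrade a shared mapping created from a read-only descriptor: VM_MAYWRITE has already been stripped.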

The normal boundary is visible from userspace. A shared writable mapping (MAP_SHARED | PROT_WRITE) of /usr/bin/su through a read-only file descriptor is denied as expected:

axura @ labyrinth :~
axura@pwnlab:~$ sudo python3 - <<'PY'
import mmap, os

fd = os.open("/usr/bin/su", os.O_RDONLY)
try:
    mmap.mmap(fd, 4096, prot=mmap.PROT_READ | mmap.PROT_WRITE,
              flags=mmap.MAP_SHARED)
    print("unexpected: shared writable mapping succeeded")
except OSError as e:
    print(f"MAP_SHARED|PROT_WRITE failed as expected: {e}")
finally:
    os.close(fd)
PY
MAP_SHARED|PROT_WRITE failed as expected: [Errno 13] Permission denied

The same file can still be mapped privately with write permission. In that case, the write is process-local: it updates the private mapping and leaves the file-backed page-cache contents unchanged.

axura @ labyrinth :~
axura@pwnlab:~$ python3 - <<'PY'
import mmap, os

path = "/usr/bin/su"
fd = os.open(path, os.O_RDONLY)
before = open(path, "rb").read(1)

m = mmap.mmap(fd, 4096, prot=mmap.PROT_READ | mmap.PROT_WRITE,
              flags=mmap.MAP_PRIVATE)
m[0:1] = b"X"
after_file = open(path, "rb").read(1)

print("private mapping byte:", m[0:1])
print("file byte unchanged:", after_file == before)

m.close()
os.close(fd)
PY
private mapping byte: b'X'
file byte unchanged: True

The two results match the kernel-side invariant:

  • MAP_SHARED | PROT_WRITE → denied with EACCES without file write authority
  • MAP_PRIVATE | PROT_WRITE → allowed, but writes go to a private COW page, so b'X' appears only in the process-local view and the file byte is unchanged

This design is sound for normal writable file mappings. Copy Fail becomes security-critical because it reaches the cached executable page from a different subsystem: the attacker does not need a writable file mapping if the crypto path can be tricked into writing into that shared page directly.

Next we will dive into that subsystem, the kernel crypto path, which can be tricked into writing a tiny piece of page cache. That is why we call it a scratch write primitive.

3.2.1 Pipe Buffer Model

At the userspace API level, a pipe behaves like a byte stream: write() pushes bytes in, and read() pulls bytes out.

The classic model looks like this:

C
int fd[2];
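pipe(fd);                  // fd[0] = read end, fd[1] = write end

write(fd[1], "hello", 5);  // push 5 bytes into the pipe

char buf[16];
read(fd[0], buf, 5);       // pull the same 5 bytes back out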

That abstraction is correct for ordinary programming, but it is not the model that matters for Copy Fail. Inside the kernel, a pipe is not one flat byte array. It is a ring of buffer descriptors, and each descriptor points to the memory backing part of the stream.

That distinction matters because splice() can make a pipe carry references to existing pages instead of freshly copied anonymous data.

The kernel pipe object is struct pipe_inode_info. The fields relevant here are head, tail, ring_size, and bufs:

C
struct pipe_inode_info {
    ...
    unsigned int head;
    unsigned int tail;
    unsigned int ring_size;
    struct pipe_buffer *bufs;  // ring of pipe_buffer descriptors
    ...
};

So the internal shape is:

pipe_inode_info
    └── bufs[] ring
            ├── pipe_buffer
            ├── pipe_buffer
            └── ...

Each slot is a struct pipe_buffer:

C
struct pipe_buffer {
    struct page *page;                     // backing page
    unsigned int offset, len;              // where the data starts in that page, and how many bytes are valid
    const struct pipe_buf_operations *ops; // rules for handling this buffer
    unsigned int flags;                    // buffer state flags
    unsigned long private;                 // private data owned by ops
};

The important point is that a pipe slot does not just mean "here are some bytes." It means: "the current data lives in this page, starting at this offset, for this length, under these buffer-specific rules".
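A toy model makes the ring shape concrete. This is a sketch, not kernel code: Pipe and PipeBuffer are hypothetical Python stand-ins, head and tail are free-running counters masked with ring_size - 1 (so ring_size is assumed to be a power of two), and ops is reduced to a descriptive string.

```python
from dataclasses import dataclass


@dataclass
class PipeBuffer:
    page: bytearray      # backing page
    offset: int          # where the data starts inside that page
    len: int             # how many bytes are valid
    ops: str             # stand-in for the pipe_buf_operations pointer


class Pipe:
    def __init__(self, ring_size: int = 16):   # assumed to be a power of two
        self.ring_size = ring_size
        self.head = 0                           # next slot to fill
        self.tail = 0                           # next slot to consume
        self.bufs = [None] * ring_size

    def push(self, buf: PipeBuffer) -> None:
        if self.head - self.tail >= self.ring_size:
            raise BufferError("pipe full")
        self.bufs[self.head & (self.ring_size - 1)] = buf
        self.head += 1

    def pop(self) -> bytes:
        if self.head == self.tail:
            raise BufferError("pipe empty")
        buf = self.bufs[self.tail & (self.ring_size - 1)]
        self.tail += 1
        return bytes(buf.page[buf.offset:buf.offset + buf.len])


page = bytearray(b"..hello pipe....")
pipe = Pipe(ring_size=2)
pipe.push(PipeBuffer(page, 2, 5, "anon_pipe_buf_ops"))
pipe.push(PipeBuffer(page, 8, 4, "anon_pipe_buf_ops"))
first = pipe.pop()
pipe.push(PipeBuffer(page, 2, 10, "anon_pipe_buf_ops"))   # wraps around to slot 0
rest = [pipe.pop(), pipe.pop()]

print(first, rest)
```

The reader only ever sees byte strings, yet each one is produced by resolving a (page, offset, len) descriptor, and several descriptors can point into the same page.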

The ordinary pipe_write() path often copies userspace bytes into pipe-owned pages through copy_page_from_iter(). But the data structure itself is more general:

A pipe_buffer can also describe pages supplied by other kernel paths.

At the userspace level, however, the pipe still looks like a normal byte stream:

axura @ labyrinth :~
axura@pwnlab:~$ python3 - <<'PY'
import os
r, w = os.pipe()
os.write(w, b"hello pipe")
print(os.read(r, 10))
os.close(r)
os.close(w)
PY
b'hello pipe'

The output is just b'hello pipe'; that is the abstraction userspace sees. Copy Fail depends on what this abstraction hides: the pipe internally advances through pipe_buffer descriptors, and those descriptors may later refer to file-backed pages supplied by other kernel subsystems.

3.2.2 Page References in Pipe Buffers

As noted above, struct pipe_buffer is not a flat stream; it is a structured object, much like an instance of a class in Python or Java. It does not just say which bytes are visible through the pipe (page, offset, len); through ops and flags, it also tells later kernel code what kind of backing page this is and how that page may be handled.

                    struct pipe_inode_info
            +--------------------------------------+
            |      head / tail / ring_size         |
            |                                      |
            |            bufs[] ring               |
            |   tail                 head          |
            |     v                    v           |
            |  +------+------+------+------+---+   |
            |  | buf0 | buf1 | buf2 | buf3 |...|   |
            |  +------+------+------+------+---+   |
            |  indices advance modulo ring_size    |
            +--------------------------------------+
                      |       |       |
                      v       v       v

                struct pipe_buffer descriptors
                +-----------------------------+
                | page   -> backing page      |
                | offset -> start in page     |
                | len    -> visible bytes     |
                | ops    -> buffer operations |
                | flags  -> buffer provenance |
                +---------------+-------------+
                                |
                +---------------+---------------+
                |                               |
                v                               v
        anonymous page               file-backed page-cache page
         from ordinary write()        from splice()

That last split is the reason this structure matters for Copy Fail. A pipe slot can describe ordinary pipe-owned memory, but it can also describe a file-backed page that came from another kernel path.

The policy side is expressed through struct pipe_buf_operations:

C
struct pipe_buf_operations {
    int  (*confirm)(...);    // validate buffer before use
    void (*release)(...);    // drop references / cleanup
    bool (*try_steal)(...);  // transfer page ownership if possible
    bool (*get)(...);        // acquire another page reference
};

To understand how pipe buffers connect to their backing pages, and to set up a comparison with the splice movement in 3.2.4, we can observe how an ordinary write() works.

First, it reaches pipe_write() and installs anon_pipe_buf_ops:

C
static const struct pipe_buf_operations anon_pipe_buf_ops = {
    .release       = anon_pipe_buf_release,
    .try_steal     = anon_pipe_buf_try_steal,
    .get           = generic_pipe_buf_get,
};

...

buf->ops = &anon_pipe_buf_ops;  // dispatch pipe operations

From that point onward, later pipe operations dispatch through buf->ops:

pipe_buffer
    └── ops
          └── anon_pipe_buf_ops
                ├── .get       -> generic_pipe_buf_get()
                ├── .release   -> anon_pipe_buf_release()
                └── .try_steal -> anon_pipe_buf_try_steal()

The callbacks operate on the backing struct page, the first member of struct pipe_buffer.

generic_pipe_buf_get() acquires another reference to the same page:

C
bool generic_pipe_buf_get(struct pipe_inode_info *pipe,
                          struct pipe_buffer *buf)
{
    return try_get_page(buf->page);
}

generic_pipe_buf_release(), the generic counterpart of the anon_pipe_buf_release() callback installed above, later drops that reference:

C
void generic_pipe_buf_release(struct pipe_inode_info *pipe,
                              struct pipe_buffer *buf)
{
    put_page(buf->page);
}

generic_pipe_buf_try_steal() shows the same page-reference model from the ownership side:

C
bool generic_pipe_buf_try_steal(struct pipe_inode_info *pipe,
                                struct pipe_buffer *buf)
{
    struct page *page = buf->page;

    // Only steal if the pipe holds the last remaining page reference
    if (page_count(page) == 1) {
        lock_page(page);     // caller now receives the locked page
        return true;         // ownership transfer succeeded
    }

    return false;                // still referenced elsewhere; cannot steal
}

The takeaway is simple:

pipe_buffer = page reference + byte range + handling policy

For ordinary write(), the referenced page is usually an anonymous pipe-owned page. For splice(), the referenced page may be a file-backed page-cache page. That is the bridge Copy Fail needs: later consumers may think they are reading a pipe byte stream (pipe_buffer), but the pipe slot may point to cached file data.
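The reference-counting rule behind the generic helpers can be modelled in a few lines. This is a toy: Page, buf_get(), buf_release() and buf_try_steal() are hypothetical Python stand-ins for the generic callbacks shown above, and the "page cache holds one reference" setup is a simplification.

```python
class Page:
    """Toy struct page: only the reference count and lock bit are modelled."""

    def __init__(self):
        self.refcount = 1        # the creator holds the first reference
        self.locked = False


def buf_get(page: Page) -> bool:
    page.refcount += 1           # like try_get_page(), overflow check omitted
    return True


def buf_release(page: Page) -> None:
    page.refcount -= 1           # like put_page()


def buf_try_steal(page: Page) -> bool:
    if page.refcount == 1:       # the pipe holds the only reference
        page.locked = True
        return True
    return False                 # still referenced elsewhere


anon = Page()                    # anonymous pipe page: the pipe is the sole owner
cached = Page()                  # page-cache page: the cache holds one reference...
buf_get(cached)                  # ...and the pipe buffer takes a second one

steal_anon = buf_try_steal(anon)
steal_cached = buf_try_steal(cached)

print(steal_anon, steal_cached, cached.refcount)
```

Under this rule a page that something else still references stays shared: the pipe merely co-owns it, which is exactly the situation for a page that also lives in a file's page cache.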

3.2.3 Zero-copy Transfer Model

The term zero-copy sounds more magical than it really is, and it supplies the first, ironic half of the name "Copy Fail".

Zero-copy does not mean "no kernel work happens." It means the kernel avoids copying bytes into a fresh intermediate buffer when it can pass around a reference to an existing page.

The previous section established the key abstraction: a pipe_buffer can name an existing page and attach policy to it. Once a pipe buffer can point at an existing page, a producer does not always need to allocate a new anonymous pipe page and copy bytes into it. It can sometimes install a reference to a page that already exists elsewhere.

ordinary copied path
════════════════════════════════════════

  source bytes

        │ copy_page_from_iter()

    anonymous pipe-owned page


pipe_buffer -> new page


zero-copy style path
════════════════════════════════════════

  existing page
e.g. page-cache page

        │ install page reference

pipe_buffer -> existing page

The contrast is visible in the ordinary pipe_write() path introduced above. There, the kernel allocates or reuses an anonymous pipe page and then copies userspace bytes into it with copy_page_from_iter():

C
copied = copy_page_from_iter(page, offset, chars, from);
buf->ops = &anon_pipe_buf_ops;

That is not zero-copy: the bytes are materially copied into an anonymous, pipe-owned page. By contrast, the page-reference model from the previous section allows another path to populate buf->page with an already existing page and then let later consumers operate on that same page through buf->ops.

The contrast looks like this:

    ordinary write()              splice()-style path
════════════════════════════════════════════════════════

    userspace buffer             file-backed page cache
  +-------------------+          +--------------------+
  | e.g. "hello pipe" |          | existing file page |
  +----------+--------+          +----------+---------+
             |                              |
             | copy_page_from_iter()        | reference page
             v                              v
 ┌─────────────────────┐          ┌─────────────────────┐
 │ anonymous pipe page │          │  file-backed page   │
 └───────────┬─────────┘          └─────────┬───────────┘
             |                              |
             | copied                       | zero-copy
             v                              v    
        +-----------------------------------------+
        |             struct pipe_buffer          |
        |-----------------------------------------|
        |                buf->page                |
        +-----------------------------------------+

So in practice, zero-copy means page-backed state plus reference management via the pointer struct page *page. The bytes stay where they already are; what moves between kernel subsystems is the metadata that grants access to that page.
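The difference between the two paths shows up as soon as someone writes through the result. In this userspace analogy (not kernel code), a bytearray copy plays the anonymous pipe page and a memoryview plays the installed page reference; the variable names are arbitrary.

```python
file_page = bytearray(b"original file bytes")   # stands in for a page-cache page

copied = bytearray(file_page)        # ordinary path: bytes duplicated into a new page
referenced = memoryview(file_page)   # zero-copy style: same storage, new handle

copied[0:8] = b"SCRATCH!"            # a write into the copy stays local
after_copy_write = bytes(file_page[:8])

referenced[0:8] = b"SCRATCH!"        # a write through the reference hits the original
after_ref_write = bytes(file_page[:8])

print(after_copy_write, after_ref_write)
```

With a copy, a misbehaving consumer can only damage its own buffer. With a reference, the same mistake lands in whatever page the reference names.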

So the next question is what performs that trick. The answer is splice(): instead of copying file bytes into an anonymous pipe page, it makes the pipe buffer reference the file's existing page-cache page directly.

3.2.4 Splice-backed Page Movement

With the pipe-buffer model in place, splice() stops looking like a quirky read/write shortcut. For Copy Fail, it is the zero-copy mechanism that moves page references across kernel subsystems.

There are two stages to keep separate:

  1. file → pipe: a file-backed page-cache page is installed into a pipe buffer.
  2. pipe → next consumer: that same pipe buffer can be forwarded into another kernel subsystem, such as a socket send path.

So the exploit-relevant shape is:

file-backed page cache

        │ splice(file -> pipe)

pipe_buffer references file page

        │ splice(pipe -> consumer)

next kernel consumer receives page-backed data

3.2.4.1 File -> Pipe

At the syscall layer, splice() first resolves the userspace file descriptors into kernel struct file objects, then dispatches into the internal splice engine:

C
SYSCALL_DEFINE6(splice, 
        int, fd_in, loff_t __user *, off_in,  // was userspace fd
        int, fd_out, loff_t __user *, off_out,
        size_t, len, unsigned int, flags)
        /* splice(int fd_in, ..., int fd_out, ...) */
{
    struct fd in, out;
    ssize_t error;

    ...

    in = fdget(fd_in);
    if (in.file) {
        out = fdget(fd_out);
        if (out.file) {
            /* 
             * transition 1
             * fd_in / fd_out are now kernel struct file objects
             */
            error = __do_splice(in.file, off_in, out.file, off_out,
                        len, flags);
            fdput(out);
        }
        fdput(in);
    }
    return error;
}

So the syscall boundary converts:

  • userspace fd_in → source struct file
  • userspace fd_out → destination struct file

Those objects enter __do_splice(), where the kernel checks whether either endpoint is a pipe:

C
static ssize_t __do_splice(struct file *in, loff_t __user *off_in,
                           struct file *out, loff_t __user *off_out,
                           size_t len, unsigned int flags)
{
    struct pipe_inode_info *ipipe;
    struct pipe_inode_info *opipe;
    loff_t offset, *__off_in = NULL, *__off_out = NULL;
    ssize_t ret;

    // Detect whether the input or output endpoint is a pipe
    ipipe = get_pipe_info(in, true);
    opipe = get_pipe_info(out, true);

    ...

    // transition 2
    return do_splice(in, __off_in, out, __off_out, len, flags);
}

The direction is selected in do_splice():

C
ssize_t do_splice(struct file *in, loff_t *off_in,
                  struct file *out, loff_t *off_out,
                  size_t len, unsigned int flags)
{
    ...

    ipipe = get_pipe_info(in, true);
    opipe = get_pipe_info(out, true);

    // if both are pipes
    if (ipipe && opipe) {
        ret = splice_pipe_to_pipe(ipipe, opipe, len, flags);

    // if only input is pipe
    } else if (ipipe) {
        ret = do_splice_from(ipipe, out, &offset, len, flags);

    // if only output is pipe
    } else if (opipe) {
        /*
         * transition 3
         *
         * Copy Fail first-stage direction:
         *
         *     regular file -> pipe
         */
        ret = splice_file_to_pipe(in, opipe, &offset, len, flags);

    } else {
        ret = -EINVAL;
    }

    ...

    return ret;
}

For the first Copy Fail stage:

  • in.file = regular file (e.g. /usr/bin/su)
  • out.file = pipe

so the opipe branch dispatches into splice_file_to_pipe():

C
ssize_t splice_file_to_pipe(struct file *in,
                            struct pipe_inode_info *opipe,
                            loff_t *offset,
                            size_t len, unsigned int flags)
{
    ssize_t ret;

    pipe_lock(opipe);
    ret = wait_for_space(opipe, flags);
    if (!ret)
        // transition 4
        ret = do_splice_read(in, offset, opipe, len, flags);
    pipe_unlock(opipe);
    ...
}

At the transition point, do_splice_read() decides whether the operation can use the file's page cache or must fall back to a copied path:

C
static ssize_t do_splice_read(struct file *in, loff_t *ppos,
                              struct pipe_inode_info *pipe, size_t len,
                              unsigned int flags)
{

    ...

    /*
     * O_DIRECT and DAX don't deal with the pagecache, so we allocate a
     * buffer, copy into it and splice that into the pipe.
     */
    if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host))
        // O_DIRECT and DAX bypass normal page cache
        // so kernel cannot perform normal page-cache-based splice operations
        // but perform an explicit buffered copy instead
        return copy_splice_read(in, ppos, pipe, len, flags);

    /* 
     * transition 5
     * normal buffered files use the file's splice_read handler
     */
    return in->f_op->splice_read(in, ppos, pipe, len, flags);
}

The important branch is at transition 5:

C
return in->f_op->splice_read(...);  

For normal buffered files, execution continues through the file's splice_read operation:

C
struct file_operations {
    ...

    ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
    ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
    ...

} __randomize_layout;

For page-cache-backed files, this last dispatch is the critical step. The execution flow so far is:

splice(fd_file, ..., fd_pipe, ...)
        |
        v
  do_splice()
        |
        v
  splice_file_to_pipe()
        |
        v
  do_splice_read()
        |
        v
  in->f_op->splice_read(...)

The generic read-only file operations table wires filemap_splice_read() as the splice_read handler:

C
const struct file_operations generic_ro_fops = {
    .read_iter    = generic_file_read_iter,
    .mmap         = generic_file_readonly_mmap,
    .splice_read  = filemap_splice_read,        // [!]
};

The comment above filemap_splice_read() states the behavior directly:

C
/*
 * filemap_splice_read -  Splice data from a file's pagecache into a pipe
 *
 * This function gets folios from a file's pagecache and splices them into the
 * pipe.
 */

Internally, it first retrieves cached folios from the file mapping, then inserts those folios into pipe buffers:

C
ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
                            struct pipe_inode_info *pipe,
                            size_t len, unsigned int flags)
{
    struct folio_batch fbatch;
    ...

    do {

        // retrieve page-cache folios from the file mapping
        error = filemap_get_pages(&iocb, len, &fbatch, true);

        ...

        for (i = 0; i < folio_batch_count(&fbatch); i++) {
            struct folio *folio = fbatch.folios[i];
            size_t n;

            ...

            /*
             * transition 6
             * insert the selected cached folio into the pipe
             */            
            n = splice_folio_into_pipe(pipe, folio, *ppos, n); 
            ...
        }

        folio_batch_release(&fbatch);
    } while (len);

    ...

    return total_spliced ? total_spliced : error;
}

The exact insertion happens in splice_folio_into_pipe():

C
/*
 * Splice subpages from a folio into a pipe.
 */
size_t splice_folio_into_pipe(struct pipe_inode_info *pipe,
                          struct folio *folio, loff_t fpos, size_t size)
{
    struct page *page;
    size_t spliced = 0, offset = offset_in_folio(folio, fpos);

    ...

    while (spliced < size &&
           !pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
        // retrieve the next pipe_buffer slot inside the pipe ring buffer
        struct pipe_buffer *buf = pipe_head_buf(pipe);
        size_t part = min_t(size_t, PAGE_SIZE - offset, size - spliced);

        // [!] insertion: attach file-backed page -> retrieved pipe buffer
        *buf = (struct pipe_buffer) {
            .ops    = &page_cache_pipe_buf_ops,  // [!] page cache marker
            .page   = page,
            .offset = offset,
            .len    = part,
        };
        ...
    }

    return spliced;
}

The ops field is also important. The buffer is marked with page_cache_pipe_buf_ops, not the anonymous pipe-buffer operations used by the ordinary write() path introduced earlier in Section 3.2.2:

C
// install operations for page-cache buffer
const struct pipe_buf_operations page_cache_pipe_buf_ops = {
    .confirm    = page_cache_pipe_buf_confirm,
    .release    = page_cache_pipe_buf_release,
    .try_steal  = page_cache_pipe_buf_try_steal,
    .get        = generic_pipe_buf_get,
};

At this stage, the pipe buffer directly references a file-backed page-cache page:

file on disk (e.g. /usr/bin/su)


page cache folio/page

        │  filemap_splice_read()

pipe_buffer {
    .ops  = page_cache_pipe_buf_ops
    .page = file-backed page-cache page [!]
}

This is the first bridge Copy Fail needs. After splice(file -> pipe), the pipe is not carrying copied file bytes in anonymous pipe memory. It is carrying a pipe_buffer that still references the original cached file page.

3.2.4.2 Pipe -> Socket

The second stage is that the pipe layer can forward the same pipe_buffer into another kernel subsystem. For Copy Fail, the relevant destination is a socket send path.

The generic helper __splice_from_pipe() describes the model directly: it walks pipe buffers and lets an actor move each buffer to the destination.

C
/**
 * __splice_from_pipe - splice data from a pipe to given actor
 *
 * Description:
 *    This function does little more than loop over the pipe and call
 *    @actor to do the actual moving of a single struct pipe_buffer to
 *    the desired destination. See pipe_to_file, pipe_to_sendmsg, or
 *    pipe_to_user.
 */

For socket destinations, one concrete actor path is splice_to_socket():

C
/**
 * splice_to_socket - splice data from a pipe to a socket
 *
 * Description:
 *    Will send @len bytes from the pipe to a network socket. No data copying
 *    is involved.
 */

The comment gives the high-level idea, but the important detail is visible in the implementation: the socket payload is built from the page already carried by the pipe buffer.

C
struct bio_vec bvec[16];
struct msghdr msg = {};
...

struct pipe_buffer *buf = pipe_buf(pipe, tail);

...

bvec_set_page(
    &bvec[bc++],
    buf->page,     // page referenced by pipe_buffer 
    seg,           // byte length
    buf->offset    // offset inside that page
);

Here, struct bio_vec is a small descriptor for a page-backed byte range:

C
struct bio_vec {
    struct page *bv_page;
    unsigned int    bv_len;
    unsigned int    bv_offset;
};

So this conversion is straightforward:

pipe_buffer
    page   = cached file page
    offset = offset inside that page
    len    = visible byte range
        |
        | bvec_set_page()
        v
bio_vec
    bv_page   = same cached file page
    bv_offset = same offset
    bv_len    = selected segment length

Then splice_to_socket() wraps those bio_vec entries into the socket message iterator:

C
// make msg_iter walk over bio_vec-backed pages
iov_iter_bvec(
    &msg.msg_iter,  // describes the payload being sent
    ITER_SOURCE,
    bvec,
    bc,
    len
);

msg.msg_flags = MSG_SPLICE_PAGES;  // [!] mark this send as splice-backed page data

ret = sock_sendmsg(sock, &msg);

The flag MSG_SPLICE_PAGES tells the socket send path this payload came from spliced pages rather than from an ordinary copied userspace buffer.

That gives the handoff shape:

pipe_buffer
    |
    | buf->page / buf->offset / buf->len
    v
  bio_vec
    |
    | iov_iter_bvec()
    v
msghdr.msg_iter
    |
    | MSG_SPLICE_PAGES
    v
sock_sendmsg()

The key point is that the socket consumer is not receiving a freshly copied anonymous buffer. It receives a socket message whose iterator still describes page-backed data derived from the original pipe buffer.

For Copy Fail, the chain now looks like this:

file-backed cached page
        |
        v
   pipe_buffer
        |
        v
     bio_vec
        |
        v
msg.msg_iter + MSG_SPLICE_PAGES
        |
        v
socket-side consumer

At this point, the generic primitive is established: a file-backed cached page can move from a pipe into a socket send path while still being represented as page-backed data.

3.3 AF_ALG Crypto Request Pipeline

The previous section stopped at a pipe buffer being forwarded into a socket send path. For Copy Fail, the socket consumer that matters is AF_ALG: the Linux kernel crypto socket interface.

This is where the data changes identity:

file-backed page
    |
    v
pipe buffer
    |
    v
socket message
    |
    v
AF_ALG crypto request buffer

Once the page-backed data enters AF_ALG, it can be represented as part of a crypto request scatterlist rather than as ordinary executable file data.

If you are not familiar with the kernel socket implementation, the following learning resource is a good starting point:
LinuxNetworkProgramming: A comprehensive guide for Linux Network (Socket) programming

3.3.1 AF_ALG Socket Interface

AF_ALG, socket family 38, exposes part of the kernel crypto API through the socket interface.

From userspace, it does not look like a privileged ioctl or a special device node. It follows a normal socket-style workflow:

socket() 
    |
    v 
bind algorithm type/name  
    |
    v
setsockopt configuration  
    |
    v
accept operation socket  
    |
    v
sendmsg input  
    |
    v
recvmsg output

3.3.1.1 Userspace Call Pattern

A minimal AF_ALG hash example looks like this:

C
#include <stdio.h>
#include <sys/socket.h>
#include <linux/if_alg.h>
#include <unistd.h>
#include <string.h>

int main(void)
{
    /* create AF_ALG control socket */
    int tfmfd = socket(AF_ALG, SOCK_SEQPACKET, 0);

    // [!] selected algorithm
    struct sockaddr_alg sa = {
        .salg_family = AF_ALG,
        .salg_type   = "hash",
        .salg_name   = "sha256",
    };

    /* bind algorithm type/name */
    bind(
        tfmfd,                     // AF_ALG socket fd 
        (struct sockaddr *)&sa,    // sockaddr_alg 
        sizeof(sa)
    );

    /* create operation socket */
    int opfd = accept(tfmfd, NULL, 0);

    /* submit input buffer */
    send(opfd, "AAAA", 4, 0);

    /* receive resulting digest */
    unsigned char digest[32];

    recv(opfd, digest, sizeof(digest), 0);

    /* print digest */
    printf("sha256(\"AAAA\") = ");

    for (int i = 0; i < 32; i++)
        printf("%02x", digest[i]);

    printf("\n");

    close(opfd);
    close(tfmfd);

    return 0;
}

The sample computes the SHA256 digest of the input string "AAAA":

axura @ labyrinth :~
axura@pwnlab:/tmp$ gcc -o test_alg test_alg.c
axura@pwnlab:/tmp$ ./test_alg
sha256("AAAA") = 63c1dd951ffedf6f7fd968ad4efa39b8ed584f162f46e715114ee184f8de9201

3.3.1.2 Algorithm Selection

The algorithm selection itself is carried through the AF_ALG-specific socket address structure, struct sockaddr_alg:

C
struct sockaddr_alg {
    __u16 salg_family;
    __u8  salg_type[14];
    __u32 salg_feat;
    __u32 salg_mask;
    __u8  salg_name[64];
};

This is the AF_ALG-specific socket address format. It plays the same structural role as:

  • sockaddr_in for IPv4
  • sockaddr_in6 for IPv6
  • sockaddr_un for Unix sockets

But instead of carrying an IP address or filesystem path, it carries crypto selection data.

The two important user-controlled strings are:

  • salg_type: which AF_ALG interface family should handle this socket
  • salg_name: which concrete crypto algorithm inside that family should be instantiated

In the Copy Fail path, userspace supplies this pair:

C
.salg_type = "aead",
.salg_name = "authencesn(hmac(sha256),cbc(aes))",

The first string decides which AF_ALG family handles the socket. The second decides which kernel crypto algorithm is instantiated inside that family.

userspace

    │ socket(AF_ALG, SOCK_SEQPACKET, 0)

    │ bind(
    │   type = "aead",
    │   name = "authencesn(hmac(sha256),cbc(aes))"
    │ )

kernel AF_ALG layer

    │ resolve family + algorithm

AEAD operation socket

    │ sendmsg() / recvmsg()

selected AEAD implementation

The authencesn decrypt path becomes important later in 3.4. For now, the key point is that an ordinary userspace bind() routes the socket into the AEAD family and selects the exact transform used by the exploit.

3.3.1.3 AF_ALG Family Resolution

At this point, userspace has already created an AF_ALG control socket and calls bind() with an initialized struct sockaddr_alg:

C
struct sockaddr_alg sa = {
    .salg_family = AF_ALG,
    .salg_type   = "aead",
    .salg_name   = "authencesn(hmac(sha256),cbc(aes))",
};

bind(tfmfd, (struct sockaddr *)&sa, sizeof(sa));

The generic socket layer eventually reaches __sys_bind(), which dispatches through the socket operations table:

C
err = READ_ONCE(sock->ops)->bind(
    sock,
    (struct sockaddr *)&address,
    addrlen
);

For an AF_ALG transform socket, that handler is alg_bind(). This is where the generic AF_ALG layer receives the algorithm type and name from the sockaddr_alg object prepared in the previous step:

C
type = alg_get_type(sa->salg_type);  // resolve AF_ALG interface family

...

private = type->bind(
    sa->salg_name,
    sa->salg_feat,
    sa->salg_mask
);

For Copy Fail:

C
sa->salg_type = "aead"

so the kernel resolves the registered AF_ALG family named "aead".

The resolved object is a struct af_alg_type, the family dispatch table:

C
struct af_alg_type {
    /* instantiate concrete algorithm by name */
    void *(*bind)(const char *name, u32 type, u32 mask);

    /* release family-private algorithm state */
    void (*release)(void *private);

    /* configure key material */
    int (*setkey)(void *private, const u8 *key, unsigned int keylen);

    /* configure AEAD authentication tag size */
    int (*setauthsize)(void *private, unsigned int authsize);

    /* create accepted operation socket */
    int (*accept)(void *private, struct sock *sk);
    int (*accept_nokey)(void *private, struct sock *sk);

    /* socket operations exposed by accepted operation socket */
    struct proto_ops *ops;
    struct proto_ops *ops_nokey;

    struct module *owner;

    /* AF_ALG family name, e.g. "aead" */
    char name[14];
};

For the AEAD family used in Copy Fail, this resolves to algif_type_aead:

C
static const struct af_alg_type algif_type_aead = {
    .bind          = aead_bind,
    .release       = aead_release,

    .setkey        = aead_setkey,
    .setauthsize   = aead_setauthsize,

    .accept        = aead_accept_parent,
    .accept_nokey  = aead_accept_parent_nokey,

    .ops           = &algif_aead_ops,
    .ops_nokey     = &algif_aead_ops_nokey,

    .name          = "aead",
    .owner         = THIS_MODULE
};

Now the earlier generic bind call becomes concrete:

type->bind(...)
        |
        v
aead_bind("authencesn(hmac(sha256),cbc(aes))", ...)

So the resolution chain is:

bind(tfmfd, sockaddr_alg)
        |
        v
   alg_bind()
        |
        | salg_type = "aead"
        v
alg_get_type("aead")
        |
        v
algif_type_aead
        |
        | salg_name = "authencesn(hmac(sha256),cbc(aes))"
        v
aead_bind(...)

The same algif_type_aead object also routes the later operations:

setsockopt(..., ALG_SET_KEY, ...)
        |
        v
aead_setkey()


setsockopt(..., ALG_SET_AEAD_AUTHSIZE, ...)
        |
        v
aead_setauthsize()


accept(tfmfd, ...)
        |
        v
aead_accept_parent()


sendmsg(opfd, ...) / recvmsg(opfd, ...)
        |
        v
algif_aead_ops

The last entry matters because the accepted operation socket exposes the AEAD-specific socket operations table, algif_aead_ops:

C
static struct proto_ops algif_aead_ops = {
    ...
    .sendmsg = aead_sendmsg,
    .recvmsg = aead_recvmsg,
    ...
};

So the full bridge is:

userspace bind()
        |
        v
generic AF_ALG bind handler
        |
        v
resolve family by salg_type
        |
        v
instantiate algorithm by salg_name
        |
        v
install family-specific behavior for later 
setsockopt(), accept(), sendmsg(), recvmsg()

For Copy Fail, this means a normal userspace socket setup is enough to route later sendmsg() and recvmsg() calls into the AEAD request path selected by authencesn(hmac(sha256),cbc(aes)).

3.3.1.4 Control Socket and Operation Socket Split

AF_ALG separates transform configuration from request I/O.

The socket returned by socket(AF_ALG, ...) is the control socket:

C
// request AF_ALG socket returning control socket tfmfd
tfmfd = socket(AF_ALG, SOCK_SEQPACKET, 0);

bind(tfmfd, ...);
setsockopt(tfmfd, ...);

Then the bind(tfmfd,...) and setsockopt(tfmfd,...) calls select and configure the corresponding crypto transform.

Actual request traffic happens on a second socket created by accept():

C
int opfd = accept(tfmfd, NULL, 0);  // accept() returns operation socket

Inside the kernel, af_alg_accept() creates this operation socket and installs the operation table from the resolved struct af_alg_type described in the previous section:

C
const struct af_alg_type *type;

/*
 * newsock->ops assigned here to allow type->accept call to override
 * them when required.
 */
newsock->ops = type->ops;

For the AEAD family, type->ops points to algif_aead_ops:

C
static struct proto_ops algif_aead_ops = {
    ...
    .sendmsg = aead_sendmsg,
    .recvmsg = aead_recvmsg,
    ...
};

So after accept() returns, the fd used by userspace is already wired to AEAD-specific I/O:

control socket
════════════════════════════════════════

tfmfd
  ├─ bind()       -> select algorithm
  └─ setsockopt() -> configure transform


operation socket
════════════════════════════════════════

accept(tfmfd)


opfd
  ├─ sendmsg() -> aead_sendmsg()
  └─ recvmsg() -> aead_recvmsg()

For Copy Fail, the vulnerable data path is reached through the operation socket.

3.3.1.5 AEAD Request Submission

The minimal hash example in 3.3.1.1 can use plain send() and recv() because a hash request is simple:

Bash
input bytes -> digest bytes

AEAD requests carry more structure. A single request needs:

  • operation direction: encrypt or decrypt
  • IV: nonce / initialization vector
  • AAD length: associated-data boundary
  • payload bytes: plaintext or ciphertext
  • tag: the authentication tag

That is why the AEAD path uses sendmsg(): the payload travels through msg->msg_iter, while request metadata travels through control messages attached to the same msghdr.

The AEAD operation table routes sendmsg(opfd, ...) into aead_sendmsg():

C
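static int aead_sendmsg(
    struct socket *sock,
    struct msghdr *msg,
    size_t size)
{
    ...
    /* IV size comes from the bound AEAD transform */
    unsigned int ivsize = crypto_aead_ivsize(tfm);

    return af_alg_sendmsg(sock, msg, size, ivsize);
}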

aead_sendmsg() is only the AEAD wrapper. The shared request builder is af_alg_sendmsg().

First, it parses control messages from the user-supplied msghdr:

C
int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
                   unsigned int ivsize)
{
    struct af_alg_control con = {};
    bool enc = false;
    bool init = false;

    ...

    if (msg->msg_controllen) {
        /* collect ALG_SET_OP / ALG_SET_IV / ALG_SET_AEAD_ASSOCLEN into con */
        err = af_alg_cmsg_send(msg, &con);
        if (err)
            return err;

        init = true;

        /* encrypt/decrypt selector */
        switch (con.op) {
        case ALG_OP_ENCRYPT:
            enc = true;
            break;
        case ALG_OP_DECRYPT:
            enc = false;
            break;
        default:
            return -EINVAL;
        }

        ...
    }

    ...

    if (init) {
        ctx->enc = enc;

        ...

        /*
         * AEAD associated-data (AAD) length: number of bytes at the
         * start of the input that should be treated as AAD.
         */
        ctx->aead_assoclen = con.aead_assoclen;
    }

    ...

This introduces a term used frequently from here on: AAD, the associated data of an AEAD request. So when a kernel variable is named like aead_assoclen, it refers to the length of the AAD.

Then it converts the payload from msg->msg_iter into scatter-gather state:

C
iov_iter_extract_will_pin(&msg->msg_iter);
len = af_alg_make_sg(&ctx->sgl, &msg->msg_iter, 0, len);

The request assembly looks like this:

sendmsg(opfd, msghdr)


aead_sendmsg()


af_alg_sendmsg()

        ├─ control messages
        │      ├─ operation direction
        │      ├─ IV
        │      └─ AAD length

        └─ msg_iter payload


        scatter-gather buffers

So by the end of sendmsg(), the request is no longer just userspace socket input. It has been converted into kernel-side crypto state:

  • metadata → stored in ctx
  • payload → represented by scatter-gather buffers

That is the bridge Copy Fail needs. The later bug depends on what backs those scatter-gather buffers (introduced in 3.3.3): ordinary userspace memory, or page-backed data imported through the zero-copy splice path (see 3.2.3).

3.3.2 AEAD Buffer Contract

Before we reach the final scatter-gather (scatterlist) representation, we need to understand the logical AEAD buffer contract. The kernel does not immediately treat the queued bytes as an arbitrary scatterlist; it first interprets them as an AEAD request with a strict layout:

AAD || payload || optional tag

This is a logical layout, not a structure that has to sit in contiguous memory. It defines the buffer boundaries that matter for the memory corruption.

Therefore, after sendmsg() queues request bytes and records ctx->aead_assoclen, recvmsg() has to interpret those bytes according to that AEAD contract:

  • AAD is authenticated but not encrypted
  • payload is plaintext for encryption, ciphertext for decryption
  • tag is produced during encryption and consumed during decryption

So the layout depends on the request direction:

encrypt:
    input  = AAD || plaintext
    output = AAD || ciphertext || tag

decrypt:
    input  = AAD || ciphertext || tag
    output = AAD || plaintext

That split appears directly in _aead_recvmsg():

C
static int _aead_recvmsg(
    struct socket *sock,
    struct msghdr *msg,
    size_t ignored,
    int flags)
{
    ...

    /* AEAD authentication tag size */
    unsigned int as = crypto_aead_authsize(tfm);

    ...

    /*
     * Total bytes queued earlier through sendmsg().
     *
     * Encrypt input:
     *     AAD || plaintext
     *
     * Decrypt input:
     *     AAD || ciphertext || tag
     */
    used = ctx->used;

    /*
     * outlen is the size of the recvmsg-side output buffer needed.
     *
     * Encrypt:
     *     output = AAD || ciphertext || tag
     *     outlen = input + tag
     *
     * Decrypt:
     *     output = AAD || plaintext
     *     outlen = input - tag
     */
    if (ctx->enc)
        outlen = used + as;  // 1 encryption
    else
        outlen = used - as;  // 2 decryption

    /*
     * Rebase "used" from total input length to crypto payload length.
     *
     * AAD is not encrypted/decrypted, so it is removed here and stored
     * separately as req->assoclen through aead_request_set_ad().
     *
     * After this:
     *
     * Encrypt:
     *     used = plaintext length
     *
     * Decrypt:
     *     used = ciphertext || tag length
     */
    used -= ctx->aead_assoclen;

    ...
}

The comments give the two pieces of bookkeeping we care about:

  • outlen → recvmsg-side output size
  • used → AEAD crypto payload length after removing AAD

For Copy Fail, the decrypt case is the critical one:

AEAD decrypt contract
════════════════════════════════════════════════════════════

input queued through sendmsg():
┌────────────┬────────────────────┬────────────┐
│    AAD     │     ciphertext     │    tag     │
└────────────┴────────────────────┴────────────┘
0        assoclen          assoclen+ctlen     +authsize

                  decrypts to      verifies only
                       │                 │
                       ▼                 ×

output prepared for recvmsg():
┌────────────┬────────────────────┐
│    AAD     │     plaintext      │
└────────────┴────────────────────┘
0        assoclen          assoclen+ptlen


                         valid output boundary

The tag is part of the decrypt input because the AEAD algorithm needs it for authentication. But it is not part of the decrypt output.

That boundary matters later in 3.3.4.3: the valid decrypt output ends after AAD || plaintext, while the original decrypt input still had a tag after AAD || ciphertext. Copy Fail becomes possible when the request construction preserves that tag tail after the output region and a later algorithm-side write crosses into it.

3.3.3 Scatterlist Through Scatterwalk

At the AEAD level, the logical buffer boundary is now clear. The next question is why a helper can read or write past one backing region and continue into another.

The answer is that the crypto layer does not require one flat contiguous buffer. The walker (scatterwalk) commonly operates on a scatterlist chain: a list of page-backed byte ranges treated as one continuous logical stream.

3.3.3.1 Logical View

The crypto layer may see one logical buffer:

logical crypto buffer
════════════════════════════════════════════════════════════

0                                                        end
│               [logical byte stream]                     │
▼                                                         ▼
┌───────────────┬──────────────────┬──────────────────────┐
│      AAD      │    ciphertext    │         tag          │
└───────────────┴──────────────────┴──────────────────────┘

But underneath, that stream can be backed by several independent scatterlist entries:

scatterlist backing
════════════════════════════════════════════════════════════

┌───────────────┐ ┌───────────────┐ ┌──────────────────────┐
│ sg entry #0   │ │ sg entry #1   │ │ sg entry #2          │
│ user page     │ │ user page     │ │ file-backed page     │
└───────┬───────┘ └──────┬────────┘ └────────┬─────────────┘
        │                │                   │
        ▼                ▼                   ▼
┌───────────────┬──────────────────┬──────────────────────┐
│      AAD      │    ciphertext    │         tag          │
└───────────────┴──────────────────┴──────────────────────┘

An "sg entry" means one struct scatterlist entry. The critical idea is that the crypto layer can treat these separate backing regions as one logical byte stream.

3.3.3.2 Scatterlist Entry Layout

The core object is struct scatterlist:

C
struct scatterlist {
    unsigned long page_link;  // encoded page pointer + flags
    unsigned int  offset;     // byte offset inside the page
    unsigned int  length;     // number of bytes available
    dma_addr_t    dma_address;
    ...
};

Conceptually, one scatterlist entry means:

this logical range lives at: page + offset for length bytes

So a scatterlist entry is not a malloc-style buffer. It is a descriptor for a byte range inside a page.

The page attachment is made explicit by sg_set_page():

C
/**
 * sg_set_page - Set sg entry to point at given page
 * @sg:      SG entry
 * @page:    The page
 * @len:     Length of data
 * @offset:  Offset into page
 *
 * Description:
 *   Use this function to set an sg entry pointing at a page, never assign
 *   the page directly. We encode sg table information in the lower bits
 *   of the page pointer. See sg_page() for looking up the page belonging
 *   to an sg entry.
 *
 **/
static inline void sg_set_page(struct scatterlist *sg, struct page *page,
                               unsigned int len, unsigned int offset)
{
    sg_assign_page(sg, page);  // attach backing page
    sg->offset = offset;       // start offset inside page
    sg->length = len;          // valid byte length
}

3.3.3.3 Scatterlist Chaining

Multiple entries can be connected through sg_chain() reaching __sg_chain():

C
#define SG_CHAIN    0x01UL
#define SG_END      0x02UL


static inline void sg_chain(
    struct scatterlist *prv,
    unsigned int prv_nents,
    struct scatterlist *sgl)
{
    __sg_chain(&prv[prv_nents - 1], sgl);
}


static inline void __sg_chain(struct scatterlist *chain_sg,
                              struct scatterlist *sgl)
{
    /*
     * offset and length are unused for chain entry. Clear them.
     */
    chain_sg->offset = 0;
    chain_sg->length = 0;

    /*
     * Set lowest bit to indicate a link pointer, and make sure to clear
     * the termination bit if it happens to be set.
     */
    chain_sg->page_link = ((unsigned long) sgl | SG_CHAIN) & ~SG_END;
}

The resulting shape is:

scatterlist chain
═══════════════════════════════════

┌────────────────────────────┐
│ sg #0                      │
│ page   = user page A       │
│ offset = 0x120             │
│ length = 0x300             │
└──────────────┬─────────────┘
               │ sg_next()

┌────────────────────────────┐
│ sg #1                      │
│ page   = user page B       │
│ offset = 0x000             │
│ length = 0x700             │
└──────────────┬─────────────┘
               │ sg_next()

┌────────────────────────────┐
│ sg #2                      │
│ page   = file-backed page  │
│ offset = 0x200             │
│ length = 0x010             │
└────────────────────────────┘

A consumer of this chain can walk from sg #0 into sg #1, then into sg #2, as if all entries formed one continuous buffer, even though the backing pages are actually scattered.
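
To make the "one continuous buffer" view concrete, here is a small userspace sketch that translates a logical stream offset into an entry index and a page offset. It is only a model: struct model_sg, struct model_pos, and model_resolve() are hypothetical names, the page field is a text label rather than a real struct page pointer, and the three entries reuse the offsets and lengths from the diagram.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one scatterlist entry: a byte range inside a page. */
struct model_sg {
    const char  *page;    /* label standing in for struct page * */
    unsigned int offset;  /* start offset inside the page */
    unsigned int length;  /* bytes covered by this entry */
};

/* Where a logical stream offset lands. */
struct model_pos {
    size_t       index;        /* which entry */
    unsigned int page_offset;  /* offset inside that entry's page */
};

/*
 * Translate a logical offset into (entry, page offset).
 * Returns 0 on success, -1 if the offset is at or past the end of the chain.
 */
static int model_resolve(const struct model_sg *sgl, size_t nents,
                         unsigned int logical, struct model_pos *pos)
{
    assert(sgl != NULL && pos != NULL);

    for (size_t i = 0; i < nents; i++) {
        if (logical < sgl[i].length) {
            pos->index = i;
            pos->page_offset = sgl[i].offset + logical;
            return 0;
        }
        logical -= sgl[i].length;  /* skip this entry entirely */
    }
    return -1;
}

/* The three entries from the diagram. */
static const struct model_sg model_chain[3] = {
    { "user page A",      0x120, 0x300 },
    { "user page B",      0x000, 0x700 },
    { "file-backed page", 0x200, 0x010 },
};
```

With these values, logical offset 0x2ff is the last byte of sg #0, 0x300 is the first byte of sg #1, and 0xa00 is the first byte of the file-backed entry. The caller never sees a seam between them.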

3.3.3.4 Scatterwalk Mechanics

That is where scatterwalk enters the picture.

scatterwalk_map_and_copy() copies bytes into or out of a scatterlist chain starting at a logical offset:

C
void scatterwalk_map_and_copy(void *buf, struct scatterlist *sg,
                      unsigned int start, unsigned int nbytes, int out)
{
    struct scatter_walk walk;
    struct scatterlist tmp[2];

    // jump to the sg entry containing logical offset "start"
    sg = scatterwalk_ffwd(tmp, sg, start);

    // start walking from that entry
    scatterwalk_start(&walk, sg);

    // copy nbytes across one or more sg entries
    scatterwalk_copychunks(buf, &walk, nbytes, out);

    scatterwalk_done(&walk, out, 0);
}

The helper does not operate against one flat allocation. It fast-forwards to a logical offset, then copies across the scatterlist stream.

When the current entry is exhausted, the walker can continue into the next entry. The transition happens through scatterwalk_done(), which calls scatterwalk_pagedone():

C
static inline void scatterwalk_done(struct scatter_walk *walk, int out,
                                    int more)
{
    /*
     * Finish the current page/chunk when:
     *
     * - there is no more data to copy, or
     * - the current scatterlist entry is exhausted, or
     * - the walk reached a page boundary
     */
    if (!more ||
        walk->offset >= walk->sg->offset + walk->sg->length ||
        !(walk->offset & (PAGE_SIZE - 1)))
        scatterwalk_pagedone(walk, out, more);
}

static inline void scatterwalk_pagedone(struct scatter_walk *walk, int out,
                                        unsigned int more)
{
    /*
     * If data was written into the backing page,
     * flush the data cache for coherency.
     */
    if (out) {
        struct page *page;

        page = sg_page(walk->sg) +
               ((walk->offset - 1) >> PAGE_SHIFT);

        flush_dcache_page(page);
    }

    /*
     * If more bytes remain and the current scatterlist entry
     * has been fully consumed, move to the next entry:
     *
     *     current sg entry -> sg_next(current sg entry)
     */
    if (more &&
        walk->offset >= walk->sg->offset + walk->sg->length)
        scatterwalk_start(walk, sg_next(walk->sg));  // [!]
}

The important line is:

C
scatterwalk_start(walk, sg_next(walk->sg));

Meaning:

current sg entry

        │ sg_next()

next sg entry

        │ scatterwalk_start()

copy continues there

So if one entry ends and another entry is chained after it, scatterwalk can keep moving.

sg #0                         sg #1
┌────────────────────────────┐ ┌────────────────────────────┐
│ AAD || plaintext           │ │ tag / scratch / next bytes │
└────────────────────────────┘ └────────────────────────────┘

                             └── sg_next(sg #0) 
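
The fast-forward-then-copy behavior can be reproduced in a few lines of userspace C. This is a simplified model under stated assumptions, not the kernel helper: struct model_sg and model_map_and_copy() are hypothetical names, entries live in a plain array instead of being linked through page_link, and page mapping and cache flushing are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for one scatterlist entry. */
struct model_sg {
    unsigned char *base;    /* backing memory, standing in for page + offset */
    size_t         length;  /* bytes covered by this entry */
};

/*
 * Copy nbytes between buf and the chain, starting at logical offset start.
 * out = 0: chain -> buf (read); out = 1: buf -> chain (write).
 * Returns the number of bytes actually copied.
 */
static size_t model_map_and_copy(void *buf, struct model_sg *sgl, size_t nents,
                                 size_t start, size_t nbytes, int out)
{
    unsigned char *p = buf;
    size_t copied = 0;
    size_t i = 0;

    assert(buf != NULL && sgl != NULL);

    /* Fast-forward: skip whole entries that lie before start. */
    while (i < nents && start >= sgl[i].length) {
        start -= sgl[i].length;
        i++;
    }

    /* Copy chunk by chunk, stepping to the next entry when one runs out. */
    while (i < nents && copied < nbytes) {
        size_t room = sgl[i].length - start;
        size_t chunk = nbytes - copied < room ? nbytes - copied : room;

        if (out)
            memcpy(sgl[i].base + start, p + copied, chunk);
        else
            memcpy(p + copied, sgl[i].base + start, chunk);

        copied += chunk;
        start = 0;  /* later entries are entered at their beginning */
        i++;        /* the model's sg_next() */
    }
    return copied;
}
```

Writing 4 bytes at logical offset 2 into two 4-byte entries puts the first 2 bytes at the end of entry 0 and the remaining 2 bytes at the start of entry 1. Nothing in the copy loop asks what kind of memory backs the second entry.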

3.3.3.5 Output-boundary Crossing

This is the property Copy Fail needs.

From the helper's point of view, the target is only:

target = (logical offset, length) inside a scatterlist chain

It does not inherently know that one entry is legitimate decrypt output while the next entry may be a borrowed pipe/page-cache-backed page.

The dangerous shape is:

destination scatterlist
══════════════════════════════════════════════════════════════

valid output area                  sg_next() chained entry
┌────────────────────────────┐    ┌────────────────────────────┐
│ sg #0                      │    │ sg #1                      │
│ legitimate AEAD output     │───▶│ page-cache-backed file page│
│ AAD || plaintext           │    │ must not receive writes    │
└────────────────────────────┘    └────────────────────────────┘


                   valid output ends here

For AEAD decryption, the valid output boundary is:

AAD length (assoclen) + plaintext length

The authentication tag belongs to the input:

AAD || ciphertext || tag

but not to the output:

AAD || plaintext

So if crypto code writes past the valid decrypt output boundary, scatterwalk can mechanically follow sg_next() into the next scatterlist entry.

That is why scatterlists matter here: they turn separate backing regions into one logical byte stream, and the walker follows that stream mechanically.

So if we want to turn that logical buffer into an overwrite, the question becomes:

Does any kernel code that writes through these scatterlist primitives write past its valid output boundary?

If yes, and if the next chained entry is page-cache-backed, that write can stop being ordinary buffer handling and become an overflow-style page-cache corruption primitive.

3.3.4 AEAD Request Scatterlist Construction

3.3.4.1 Socket SGLs and Crypto Request SGLs

The previous section established that scatterwalk can traverse a scatterlist chain. The missing link is the builder:

Which code constructs the scatterlist chain consumed by the AEAD implementation?

That bridge is _aead_recvmsg(), reached through the AEAD operation socket created by accept().

The kernel source file algif_aead.c describes the model in terms of two socket-side scatterlists, TX and RX:

C
/*
 * ...
 *
 * The following concept of the memory management is used:
 *
 * The kernel maintains two SGLs, the TX SGL and the RX SGL. The TX SGL is
 * filled by user space with the data submitted via sendmsg (maybe with
 * MSG_SPLICE_PAGES).  Filling up the TX SGL does not cause a crypto operation
 * -- the data will only be tracked by the kernel. Upon receipt of one recvmsg
 * call, the caller must provide a buffer which is tracked with the RX SGL.
 *
 * During the processing of the recvmsg operation, the cipher request is
 * allocated and prepared. As part of the recvmsg operation, the processed
 * TX buffers are extracted from the TX SGL into a separate SGL.
 *
 * ...
 */

The socket-side meaning is:

  • TX SGL:
    • transmit-side scatterlist
    • data queued through sendmsg()
    • may include MSG_SPLICE_PAGES-backed entries (see 3.2.4.2)
  • RX SGL:
    • receive-side scatterlist
    • destination buffer supplied by recvmsg() for the operation result

During recvmsg(), _aead_recvmsg() translates those socket buffers into a crypto-layer request object struct aead_request:

C
struct aead_request {
    struct crypto_async_request base;

    unsigned int assoclen;    /* AAD length */
    unsigned int cryptlen;    /* payload length */

    u8 *iv;                   /* IV / nonce */

    struct scatterlist *src;  /* input scatterlist */
    struct scatterlist *dst;  /* output scatterlist */

    void *__ctx[] CRYPTO_MINALIGN_ATTR;
};

The naming transition is:

AF_ALG socket layer                  crypto request layer
───────────────────                  ────────────────────

TX SGL  ─────────────────────────▶   req->src
data submitted by sendmsg()          input scatterlist

RX SGL  ─────────────────────────▶   req->dst
buffer supplied to recvmsg()         output scatterlist

So _aead_recvmsg() is the conversion point:

sendmsg() data    ─►  socket TX SGL  ─►  req->src
recvmsg() buffer  ─►  socket RX SGL  ─►  req->dst

The decrypt path is the dangerous one (see 3.3.4.3). The authentication tag must remain available as input through req->src, but it is not part of the valid decrypt output in req->dst. That mismatch makes the later scatterlist chaining security-sensitive.

3.3.4.2 Encryption Request Layout

For encryption, the AEAD contract is:

input  = AAD || plaintext  
output = AAD || ciphertext || tag

Inside _aead_recvmsg(), the encryption branch first copies the queued TX input into the RX destination scatterlist:

C
if (ctx->enc) {
    /*
     * Encryption operation - The in-place cipher operation is
     * achieved by the following operation:
     *
     * TX SGL: AAD  ||  PT
     *          |       |
     *          | copy  |
     *          v       v
     * RX SGL: AAD  ||  PT  ||  Tag
     */

    /*
     * Step 1:
     * Copy AAD || plaintext from TX into RX.
     *
     * At this point, RX contains the input material and has
     * enough space for the tag that will be produced later.
     */
    err = crypto_aead_copy_sgl(
        null_tfm,
        tsgl_src,                        // source: queued TX input
        areq->first_rsgl.sgl.sgt.sgl,    // destination: RX output
        processed                        // bytes copied into RX
    );

    ...

    /*
     * Step 2:
     * Consume the TX entries that were copied.
     */
    af_alg_pull_tsgl(sk, processed, NULL, 0);

} else ...

At this stage, RX contains AAD || plaintext and also has room for the authentication tag. The tag is not generated by af_alg_pull_tsgl() at step 2; it is produced later when the prepared request is submitted through crypto_aead_encrypt():

C
/*
 * Later:
 * submit the prepared AEAD request.
 *
 * In the encryption case, the algorithm transforms plaintext into
 * ciphertext and writes the authentication tag into the RX destination.
 */
crypto_aead_encrypt(&areq->cra_u.aead_req);

Conceptually:

encryption sgl setup
══════════════════════════════════════════════════════

TX SGL                           RX SGL
┌─────────┬───────────┐  copy   ┌───────┬───────────┬───────────┐
│   AAD   │ plaintext │ ──────▶ │  AAD  │ plaintext │ tag space │
└─────────┴───────────┘         └───────┴───────────┴───────────┘
                                   │           later becomes
                                   │          │           │ 
                                   │          │           │ written by
                                   │          │ encrypt   │ crypto_aead_encrypt()
                                   │          ▼           ▼
                                   │         ciphertext  tag
                                   │                       ▲   
                                   ▼                       │
                                  AAD                      │
                                   │                       │
                                   └──── authenticated ────┘

So the encryption path is straightforward: copy AAD || plaintext from TX into RX, consume the copied TX entries, then let the AEAD algorithm produce ciphertext || tag in the RX destination.

3.3.4.3 Decryption Request Layout

Decryption is the security-sensitive case.

For decryption, the input layout is:

AAD || ciphertext || tag

but the legitimate output layout is only:

AAD || plaintext

The tag is required for authentication, but it must not be emitted as output. That creates the layout problem: the crypto request still needs the tag as input, while the destination should end before the tag.

The decrypt branch in _aead_recvmsg() solves this by copying the output-sized head into RX, preserving the TX tag tail, and chaining that tag tail after RX:

C
} else {
    /*
     * Decryption operation - To achieve an in-place cipher
     * operation, the following SGL structure is used:
     *
     * TX SGL: AAD  ||  CT || Tag
     *          |       |      ^
     *          | copy  |      | Create SGL link.
     *          v       v      |
     * RX SGL: AAD  ||  CT ----+
     */

    /*
     * Step 1:
     * Copy only the decrypt output-sized prefix from TX into RX.
     *
     * TX contains: AAD || CT || Tag
     * RX receives only: AAD || CT
     *
     * The tag is intentionally not copied into RX output.
     */
    err = crypto_aead_copy_sgl(
        null_tfm,
        tsgl_src,                        // source: queued TX input
        areq->first_rsgl.sgl.sgt.sgl,    // destination: RX output 
        outlen                           // copy only AAD || CT 
    );

    if (err)
        goto free;

    /*
     * Step 2:
     * Count how many TX scatterlist entries cover the tag tail.
     *
     * processed      = total TX input length
     * processed - as = start offset of authentication tag
     */
    areq->tsgl_entries = af_alg_count_tsgl(
        sk,
        processed,
        processed - as
    );

    ...

    /*
     * Step 3:
     * Preserve the tag tail from the TX.
     *
     * The tag remains available as decrypt input, but it is not
     * copied into the RX output buffer.
     */
    af_alg_pull_tsgl(
        sk,
        processed,
        areq->tsgl,     // stores preserved TX tag entries
        processed - as  // start of tag region
    );

    ...

    /*
     * Step 4:
     * Chain the preserved TX tag entries after the RX scatterlist.
     *
     * Before:
     *
     *     RX SGL: [ AAD || CT ]
     *
     *     TX SGL: [ AAD || CT || Tag ]
     *                            ^
     *                            |
     *                            preserved tag tail
     *
     * After sg_chain():
     *
     *     request SGL:
     *
     *     [ RX: AAD || CT ] ---> [ TX: Tag ]
     *
     * The tag remains input material, but it is now linked after
     * the RX-side scatterlist as part of one logical chain.
     */
    sg_chain(
        sg,                       // RX scatterlist segment to extend
        sgl_prev->sgt.nents + 1,  // number of RX entries including chain slot
        areq->tsgl                // preserved TX tag scatterlist
    );
}

The three relevant values are:

processed = total queued TX input length
as        = AEAD authentication tag size
outlen    = processed - as

So for decryption:

outlen = AAD length + ciphertext length
       = length of the output-sized prefix

The preserved tag begins at:

processed - as

That means _aead_recvmsg() builds two pieces:

RX head
    recvmsg() destination
    contains output-sized prefix:
    AAD || ciphertext

TX tag tail
    preserved sendmsg() tail
    contains authentication tag
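
The length split can be checked with a few lines of arithmetic. This is a hedged sketch, not kernel code: struct model_decrypt_layout and model_split() are hypothetical names, and the numbers used in the note below (8 bytes of AAD, 16 bytes of ciphertext, a 32-byte tag) are illustrative values rather than the values used by the exploit.

```c
#include <assert.h>

/* Hypothetical summary of the decrypt-side length split. */
struct model_decrypt_layout {
    unsigned int outlen;     /* AAD || ciphertext, copied into RX */
    unsigned int tag_start;  /* logical offset where the tag begins in TX */
    unsigned int tag_len;    /* bytes preserved as the TX tag tail */
};

/*
 * processed: total queued TX input (AAD || ciphertext || tag)
 * as:        authentication tag size
 */
static struct model_decrypt_layout model_split(unsigned int processed,
                                               unsigned int as)
{
    struct model_decrypt_layout l;

    /* The input must at least contain the tag. */
    assert(processed >= as);

    l.outlen = processed - as;     /* size of the RX head */
    l.tag_start = processed - as;  /* the tag begins where the head ends */
    l.tag_len = as;
    return l;
}
```

For processed = 8 + 16 + 32 = 56 and as = 32, the RX head is 24 bytes long and the preserved tag occupies logical offsets 24 through 55. The end of the output and the start of the tag are the same number, which is exactly the boundary the rest of this chapter cares about.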

After sg_chain(), those pieces are no longer isolated. The RX scatterlist is extended so that walking past its end reaches the preserved TX tag entries:

decrypt request scatterlist
═════════════════════════════════════════════════════════════════════

valid output-sized head                 preserved input tail
┌────────────────────────────┐         ┌────────────────────────────┐
│ RX SGL                     │         │ TX tag SGL                 │
│ recvmsg() destination      │ ──────▶ │ authentication tag bytes   │
│ AAD || ciphertext          │ sg_next │ from sendmsg()             │
└────────────────────────────┘         └────────────────────────────┘

During actual decryption, the AAD || ciphertext region becomes the valid AAD || plaintext output. The chained TX tag tail remains input material for authentication:

semantic decrypt layout
═════════════════════════════════════════════════════════════════════

valid decrypt output                   preserved input-only tag
┌────────────────────────────┐         ┌────────────────────────────┐
│ AAD || plaintext           │ ──────▶ │ tag                        │
│ belongs to RX output       │ sg_next │ belongs to TX input        │
└────────────────────────────┘         └────────────────────────────┘

This design is intentional: AEAD decryption still needs the tag, but the tag is not part of the output returned to userspace.

The security-sensitive part is the chain boundary. Once RX and the preserved TX tag tail are chained, a scatterlist walker can cross from the valid output region into the tag entry:

            valid decrypt output ends


┌────────────────────┐         ┌──────────────────────────┐
│ RX SGL             │───────▶ │ preserved TX tag SGL     │
│ AAD || plaintext   │ sg_next │ authentication tag bytes │
└────────────────────┘         └──────────────────────────┘
                        ^
                      can we overflow from here?

At this point, chaining alone is not the bug. The chain becomes dangerous only if later code performs a destination-side write at the boundary where the valid decrypt output ends. The AEAD implementation that does so is authencesn, covered in Section 3.4.

3.3.4.4 Final AEAD Request Wiring

Before entering the selected AEAD implementation, _aead_recvmsg() stores the prepared scatterlist layout into the final struct aead_request.

Two helpers do the wiring.

aead_request_set_crypt() assigns the input/output scatterlists, payload length, and IV:

C
static inline void aead_request_set_crypt(struct aead_request *req,
                                          struct scatterlist *src,
                                          struct scatterlist *dst,
                                          unsigned int cryptlen, u8 *iv)
{
    req->src = src;
    req->dst = dst;
    req->cryptlen = cryptlen;
    req->iv = iv;
}

aead_request_set_ad() stores the AAD length:

C
static inline void aead_request_set_ad(struct aead_request *req,
                                       unsigned int assoclen)
{
    req->assoclen = assoclen;
}

In _aead_recvmsg(), the final request is initialized like this:

C
/* Initialize the crypto operation */
aead_request_set_crypt(
    &areq->cra_u.aead_req,          // AEAD request object being prepared

    rsgl_src,                       // source SGL:
                                    // logical AEAD input
                                    // aka: AAD || ciphertext || tag

    areq->first_rsgl.sgl.sgt.sgl,   // destination SGL:
                                    // recvmsg-side output buffer
                                    // aka: AAD || plaintext

    used,                           // cryptlen:
                                    // crypto payload length excluding AAD
                                    // aka: ciphertext || tag length

    ctx->iv                         // IV / nonce for AEAD operation
);

aead_request_set_ad(
    &areq->cra_u.aead_req,          // same AEAD request object
    ctx->aead_assoclen              // assoclen: length of AAD prefix
);

So the assignment is:

&areq->cra_u.aead_req         -> req

rsgl_src                      -> req->src
areq->first_rsgl.sgl.sgt.sgl  -> req->dst

used                          -> req->cryptlen
ctx->iv                       -> req->iv
ctx->aead_assoclen            -> req->assoclen

_aead_recvmsg() builds an in-place decrypt request:

  • req->src:
    • AEAD input
    • AAD || ciphertext || tag
  • req->dst:
    • AEAD output
    • AAD || plaintext

To make that work, _aead_recvmsg() reuses the RX head and links the preserved TX tag tail after it:

AEAD decrypt request
════════════════════════════════════════════════════════════

                  shared RX head                 preserved TX tail
             ┌──────────────────────┐          ┌────────────────┐
req->src ──▶ │ AAD || ciphertext    │ ───────▶ │      tag       │
             └──────────────────────┘          └────────────────┘


req->dst ────────────────┘
             writes output here:
             AAD || plaintext

So the same RX head participates in two views:

req->src reads:
    RX head || TX tag tail
    AAD || ciphertext || tag

req->dst writes:
    RX head only
    AAD || plaintext

Visually, the valid decrypt output ends within the RX head:

RX head
┌────────────┬────────────────────┐
│    AAD     │     plaintext      │
└────────────┴────────────────────┘


                       valid output boundary

But the chained request still has a next entry after that boundary:

RX head                         preserved TX tail
┌────────────┬────────────────┐ ┌────────────┐
│    AAD     │ plaintext area │ │    tag     │
└────────────┴────────────────┘ └────────────┘
                              ▲      ▲
                              │      │
                valid output ends   reachable through sg_next()

That is the key idea: _aead_recvmsg() hands the next AEAD layer a normal-looking struct aead_request, but internally the request carries a chained scatterlist layout. The valid decrypt output is supposed to stop at AAD || plaintext, while the authentication tag remains reachable after that boundary as preserved input material.

This layout is not automatically a bug. It becomes dangerous only if a later AEAD implementation performs a destination-side write at the decrypt output boundary. In that case, scatterwalk may follow the chain into the preserved tag entry. If that entry is backed by a pipe/page-cache page, the write can become a page-cache overwrite primitive.

That is why the next layer matters. The selected AEAD implementation, authencesn, decides whether this chained request layout stays harmless or turns into a boundary-crossing scratch write.

3.4 Authencesn Decrypt Path

Now we know the AEAD request has two views:

  • semantic output view: output stops after AAD || plaintext
  • scatterlist stream view: RX head is followed by the preserved TX tag tail

At this point, the layout itself is not the bug. The critical question is whether the selected AEAD implementation performs a destination-side write at the boundary where the valid decrypt output ends.

From the attacker's perspective, that means looking for an out-of-bounds (OOB) write on the logical buffer.

For Copy Fail, userspace selected:

C
.salg_type = "aead",
.salg_name = "authencesn(hmac(sha256),cbc(aes))",

So the next step is to follow the prepared decrypt request into the selected authencesn implementation.

3.4.1 AEAD Decrypt Callback Dispatch

After _aead_recvmsg() prepares the AEAD request, it submits the operation according to the direction recorded earlier through sendmsg():

C
ctx->enc ? crypto_aead_encrypt(&areq->cra_u.aead_req) :
           crypto_aead_decrypt(&areq->cra_u.aead_req);

For Copy Fail, the relevant branch is decryption through crypto_aead_decrypt():

C
/**
 * crypto_aead_decrypt() - decrypt ciphertext
 * @req: reference to the aead_request handle that holds all information
 *     needed to perform the cipher operation
 *
 * Decrypt ciphertext data using the aead_request handle. That data structure
 * and how it is filled with data is discussed with the aead_request_*
 * functions.
 *
 * ...
 */
int crypto_aead_decrypt(struct aead_request *req);

crypto_aead_decrypt() is still generic crypto API glue. It resolves the concrete AEAD transform from the request, finds the registered algorithm callbacks, and dispatches into the selected decrypt implementation:

C
int crypto_aead_decrypt(struct aead_request *req)
{
    /*
     * Resolve the concrete AEAD transform from the request.
     *
     * In this case, the transform was selected earlier through:
     *     salg_name = "authencesn(hmac(sha256),cbc(aes))"
     */
    struct crypto_aead *aead = crypto_aead_reqtfm(req);

    /*
     * Resolve the algorithm implementation behind that transform.
     *
     * This gives access to the registered callbacks:
     *     alg->encrypt
     *     alg->decrypt
     */
    struct aead_alg *alg = crypto_aead_alg(aead);

    ...

    /*
     * Dispatch into the concrete AEAD decrypt implementation.
     *
     * For authencesn(hmac(sha256),cbc(aes)), this becomes:
     *     crypto_authenc_esn_decrypt(req)
     */
    else
        ret = alg->decrypt(req);  // [!] transition

    return ret;
}

For the selected transform:

C
.salg_name = "authencesn(hmac(sha256),cbc(aes))"

the crypto core instantiates the authencesn AEAD template. During setup, crypto_authenc_esn_create() wires the callbacks:

C
inst->alg.encrypt = crypto_authenc_esn_encrypt;
inst->alg.decrypt = crypto_authenc_esn_decrypt;

So the generic dispatch collapses into:

crypto_aead_decrypt(req)
        |
        v
alg->decrypt(req)
        |
        v
crypto_authenc_esn_decrypt(req)
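
The same callback-table pattern can be shown in miniature. This is a userspace sketch under stated assumptions: struct model_alg, struct model_req, and the model_* functions are hypothetical stand-ins, and the real crypto API resolves the algorithm through the transform object rather than through a pointer stored directly in the request.

```c
#include <assert.h>
#include <stddef.h>

struct model_req;  /* forward declaration */

/* Hypothetical stand-in for struct aead_alg: a table of callbacks. */
struct model_alg {
    const char *name;
    int (*encrypt)(struct model_req *req);
    int (*decrypt)(struct model_req *req);
};

/* Hypothetical stand-in for struct aead_request: remembers its algorithm. */
struct model_req {
    const struct model_alg *alg;
    int last_op;  /* set by the callbacks so the dispatch is observable */
};

static int model_esn_encrypt(struct model_req *req)
{
    req->last_op = 'E';
    return 0;
}

static int model_esn_decrypt(struct model_req *req)
{
    req->last_op = 'D';
    return 0;
}

/* Wired once at setup time, like inst->alg.encrypt / inst->alg.decrypt. */
static const struct model_alg model_authencesn = {
    .name    = "authencesn(hmac(sha256),cbc(aes))",
    .encrypt = model_esn_encrypt,
    .decrypt = model_esn_decrypt,
};

/* Generic glue: knows nothing about the algorithm except its callbacks. */
static int model_aead_decrypt(struct model_req *req)
{
    assert(req != NULL && req->alg != NULL && req->alg->decrypt != NULL);
    return req->alg->decrypt(req);
}
```

The generic entry point never names the concrete algorithm. Whichever implementation was wired into the table at setup time receives the request, including its chained scatterlists, unchanged.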

The full path from the operation socket is:

recvmsg(opfd, ...)
        |
        v
_aead_recvmsg()
        |
        | prepared struct aead_request
        |   req->src
        |   req->dst
        |   req->cryptlen
        |   req->assoclen
        |   req->iv
        v
crypto_aead_decrypt(req)
        |
        v
alg->decrypt(req)
        |
        v
crypto_authenc_esn_decrypt(req)

No overwrite has happened yet. This section only proves the dispatch path:

AF_ALG recvmsg()
        |
        v
generic AEAD request submission
        |
        v
authencesn decrypt callback

With the AEAD request dissected in 3.3.4.4, the target is now precise: inspect what crypto_authenc_esn_decrypt(req) does with req->src, req->dst, req->assoclen, and req->cryptlen.

3.4.2 The Destination-Side Scratch Writes

Inside crypto_authenc_esn_decrypt(), the decrypt path reads from and writes into the request scatterlists through scatterwalk_map_and_copy() (for walker mechanism see 3.3.3.4).

The relevant part of crypto_authenc_esn_decrypt() calls the scatterlist walker repeatedly:

C
static int crypto_authenc_esn_decrypt(struct aead_request *req)
{
    struct crypto_aead *authenc_esn = crypto_aead_reqtfm(req);
    unsigned int authsize = crypto_aead_authsize(authenc_esn);
    unsigned int assoclen = req->assoclen;
    unsigned int cryptlen = req->cryptlen;
    struct scatterlist *dst = req->dst;    // [!] dst is a chained scatterlist
    u32 tmp[2];

    /*
     * req->cryptlen originally includes:
     *
     *     ciphertext || tag
     *
     * After this subtraction, cryptlen means:
     *
     *     ciphertext length only
     */
    cryptlen -= authsize;

    ...

    /*
     * Read the authentication tag from:
     *
     *     req->src at logical offset assoclen + cryptlen
     *
     * Direction flag out = 0 means:
     *     scatterlist -> local buffer
     */
    scatterwalk_map_and_copy(
        ihash,
        req->src,
        assoclen + cryptlen,
        authsize,
        0
    );

    /*
     * Read the first 8 bytes of dst into tmp.
     */
    scatterwalk_map_and_copy(
        tmp,
        dst,
        0,
        8,
        0  // read flag: dst -> tmp
    );

    /*
     * Scratch write #1:
     * write tmp[0] into dst at logical offset 4.
     */
    scatterwalk_map_and_copy(
        tmp,
        dst,
        4,
        4,
        1  // write flag: tmp -> dst @ 4
    );

    /*
     * Scratch write #2:
     * write tmp[1] into dst at logical offset:
     *
     *     assoclen + cryptlen
     */
    scatterwalk_map_and_copy(
        tmp + 1,
        dst,
        assoclen + cryptlen,
        4,
        1  // write flag: tmp+1 -> dst @ assoclen+cryptlen
    );

    ...
}

The last argument to scatterwalk_map_and_copy() is the direction flag:

  • out = 0 → read from scatterlist into local buffer
  • out = 1 → write from local buffer into scatterlist

So these two calls are destination-side writes:

C
scatterwalk_map_and_copy(tmp, dst, 4, 4, 1);
scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1);  // [!] bug

The value flow is already visible here:

first 8 bytes of dst
    dst[0:8]
        │
        ▼
tmp[0] = AAD[0:4]
tmp[1] = AAD[4:8]
        │
        ▼
tmp + 1 becomes the source of scratch write #2

Because dst starts at the RX head, and the RX head begins with AAD, this becomes:

tmp[0] = AAD[0:4]
tmp[1] = AAD[4:8]

So AAD[4:8] is not magic. It is first read into tmp[1], then later written back through the tmp + 1 scatterwalk write.

Before treating that second write as dangerous, we need to understand why authencesn performs this shuffle at all.

3.4.3 Authencesn ESN Shuffle

authencesn exists for IPsec Extended Sequence Number handling. In this mode, the associated data begins with sequence-number material split into two 32-bit halves:

AAD prefix
┌────────────┬────────────┐
│ seqno_hi   │ seqno_lo   │
│ 4 bytes    │ 4 bytes    │
└────────────┴────────────┘

During decrypt, authencesn temporarily rearranges this material before authentication. The three scatterwalk_map_and_copy() calls from the previous section implement that shuffle:

C
/* Move high-order bits of sequence number to the end. */
scatterwalk_map_and_copy(tmp, dst, 0, 8, 0);                        // read
scatterwalk_map_and_copy(tmp, dst, 4, 4, 1);                        // scratch write #1
scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1);  // scratch write #2 [!]

Read as operations:

  1. read 8 bytes from logical offset 0 in dst into tmp
  2. write 4 bytes back into dst at logical offset 4
  3. write 4 bytes into dst at logical offset assoclen + cryptlen
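
These three operations can be replayed on a flat buffer to see the byte movement in isolation. This is a minimal sketch, assuming a single contiguous buffer with 4 spare bytes after AAD || ciphertext. model_esn_shuffle() is a hypothetical name, and memcpy() stands in for the scatterlist walker.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Model of the three copies on a flat buffer.
 * buf must have room for assoclen + cryptlen + 4 bytes.
 */
static void model_esn_shuffle(unsigned char *buf, size_t assoclen,
                              size_t cryptlen)
{
    uint32_t tmp[2];

    /* The shuffle reads 8 bytes of AAD, so the AAD must be that long. */
    assert(assoclen >= 8);

    memcpy(tmp, buf, 8);                            /* read dst[0:8] */
    memcpy(buf + 4, tmp, 4);                        /* scratch write #1: tmp[0] */
    memcpy(buf + assoclen + cryptlen, tmp + 1, 4);  /* scratch write #2: tmp[1] */
}
```

Starting from "hhhh" "llll" "cccc" followed by 4 spare bytes, with assoclen = 8 and cryptlen = 4, the buffer ends up as "hhhh" "hhhh" "cccc" "llll": the second AAD word has moved to the position immediately after the ciphertext. On a flat buffer that position is harmless scratch space.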

Visually the read/write operations are:

ESN scratch-write layout
════════════════════════════════════════════════════════════

Step 1: read first 8 bytes from dst into tmp

dst starts at RX head
┌────────────┬────────────┬────────────────────┐
│ AAD[0:4]   │ AAD[4:8]   │ ciphertext / data  │
└────────────┴────────────┴────────────────────┘
      │            │
      ▼            ▼
┌────────────┬────────────┐
│ tmp[0]     │ tmp[1]     │
│ AAD[0:4]   │ AAD[4:8]   │
└────────────┴────────────┘

             │ as the source of write



Scratch write #1
════════════════

scatterwalk_map_and_copy(tmp, dst, 4, 4, 1)

dst
┌────────────┬────────────┬────────────────────┐
│ AAD[0:4]   │ write here │ ciphertext / data  │
└────────────┴────────────┴────────────────────┘


     logical offset 4
     local AAD-area write tmp[0] == AAD[0:4] 


Scratch write #2
════════════════

scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1)

dst logical stream
┌────────────┬────────────────────┐ ┌────────────────────┐
│    AAD     │ plaintext/output   │ │ chained tag entry  │
└────────────┴────────────────────┘ └────────────────────┘


                       offset assoclen + cryptlen
                       write tmp[1] == AAD[4:8]

└─────────────── sgl chained by sg_next() ────────────────┘

The first write is local, staying near the AAD/ESN area:

tmp[0] -> dst @ 4  

The second write is the pivot:

tmp[1] -> dst @ assoclen + cryptlen  

Here tmp[1] comes from AAD[4:8], but the destination is no longer a fixed ESN-local offset. It is a calculated AEAD boundary.

If dst were one flat private output buffer, this might still look like ordinary scratch space. But dst is a scatterlist chain, and _aead_recvmsg() may place a preserved TX tag entry after the valid RX output head.

So the value side is already clear:

AAD[4:8] -> tmp[1] -> scratch-write source

The next question is the destination side — the chained tag sg entry:

Where does logical offset assoclen + cryptlen inside dst actually land?

That boundary is what we dissect next.

3.4.4 Decrypt-boundary Write Offset

That final scratch write above is the critical one:

C
scatterwalk_map_and_copy(
    tmp + 1,              // source: AAD[4:8] staged in tmp[1]
    dst,                  // target scatterlist
    assoclen + cryptlen,  // logical destination offset
    4,                    // 4-byte write
    1                     // write flag: tmp+1 -> scatterlist
);

The important detail is that crypto_authenc_esn_decrypt() first removes the authentication tag from cryptlen:

cryptlen -= authsize;

So by the time the scratch write runs, cryptlen no longer means ciphertext || tag. It means ciphertext length only.

Therefore, the start-offset argument of the second scratch write:

assoclen + cryptlen

points to the boundary immediately after:

AAD || ciphertext

For AEAD decryption, ciphertext and plaintext have the same length, so this is also the boundary after the legitimate output:

assoclen + ciphertext_len == assoclen + plaintext_len  
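The arithmetic can be checked with concrete numbers. The helper below is a hypothetical illustration, not a kernel function, and the sizes in the usage note are assumed values chosen only to make the subtraction visible.

```c
#include <stddef.h>

/*
 * Hypothetical helper: models how the second scratch-write offset is
 * derived once the tag length has been removed from cryptlen.
 * cryptlen_with_tag is the request's cryptlen on entry (ciphertext || tag).
 */
static size_t scratch_write_offset(size_t assoclen,
                                   size_t cryptlen_with_tag,
                                   size_t authsize)
{
    size_t cryptlen = cryptlen_with_tag - authsize; /* cryptlen -= authsize */

    return assoclen + cryptlen; /* first byte after AAD || ciphertext */
}
```

For example, with an 8-byte AAD, 16 bytes of ciphertext, and a 16-byte tag, the request carries cryptlen = 32 on entry, and scratch_write_offset(8, 32, 16) returns 24: the first byte after the legitimate output, and the first byte of the tag region.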

The boundary looks like this:

AEAD decrypt boundary
══════════════════════════════════════════════════════════════

decrypt input:
┌────────────┬────────────────────┬────────────┐
│    AAD     │     ciphertext     │    tag     │
└────────────┴────────────────────┴────────────┘
0        assoclen         assoclen+cryptlen


                                   preserved tag
                                        │
                                        │
                                        │
valid decrypt output:                   ▼
┌────────────┬────────────────────┐ ┌────────────┐
│    AAD     │     plaintext      │ │ chained tag│
└────────────┴────────────────────┘ └────────────┘
0        assoclen          assoclen+plaintext_len
                                   ▲
                                   │
                         valid output ends here
                                   │
                                   │
                    authencesn 4-byte scratch write starts here

└────────── sgl chained by sg_next() ───────────┘

As established in 3.3.4.4, the final decrypt request can be built as a scatterlist chain: a valid RX output head followed by a preserved TX tag tail.

Those entries are not physically contiguous, and they are not adjacent virtual-memory regions in the normal userspace sense. But to scatterwalk, they form one logical byte stream. Once the walker reaches the end of one scatterlist entry, it can continue through sg_next() into the next entry.

That is why the scratch write has an overflow-style shape:

  • value: AAD[4:8]
  • write source: tmp + 1
  • write destination: dst @ assoclen + cryptlen
  • semantic meaning: exact end of valid decrypt output
  • scatterlist meaning: continue into the next chained entry if one exists

More precisely:

authencesn writes AAD[4:8] as 4 bytes at the end of the valid decrypt output region, and scatterwalk can carry that write into the next chained scatterlist entry.
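The carry-over into the next entry can be modeled in userspace. This is a sketch under stated assumptions: struct toy_sg and toy_copy_to_sg() are hypothetical stand-ins for the kernel's scatterlist and scatterwalk copy, the next pointer plays the role of sg_next(), and the buffer sizes are illustrative.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for one scatterlist entry; NOT the kernel's struct scatterlist. */
struct toy_sg {
    unsigned char *buf;
    size_t len;
    struct toy_sg *next; /* plays the role of sg_next() */
};

/*
 * Copy nbytes from src into the logical byte stream formed by the chain,
 * starting at logical offset `offset`. Returns the number of bytes written.
 */
static size_t toy_copy_to_sg(struct toy_sg *sg, size_t offset,
                             const void *src, size_t nbytes)
{
    const unsigned char *p = src;
    size_t done = 0;

    /* Skip whole entries that lie before the logical offset. */
    while (sg && offset >= sg->len) {
        offset -= sg->len;
        sg = sg->next;
    }

    /* Write, continuing into the following entry when one runs out. */
    while (sg && done < nbytes) {
        size_t room = sg->len - offset;
        size_t n = (nbytes - done < room) ? nbytes - done : room;

        memcpy(sg->buf + offset, p + done, n);
        done += n;
        offset = 0;
        sg = sg->next;
    }
    return done;
}

/* Returns 1 if the 4-byte write landed in the tail entry and left head untouched. */
static int toy_demo(void)
{
    unsigned char head[24] = {0}; /* 8-byte AAD + 16-byte output (assumed sizes) */
    unsigned char tail[16] = {0}; /* stands in for the chained tag entry */
    unsigned char zero[24] = {0};
    struct toy_sg tag = { tail, sizeof(tail), NULL };
    struct toy_sg out = { head, sizeof(head), &tag };
    const unsigned char aad_4_8[4] = { 0xde, 0xad, 0xbe, 0xef };

    toy_copy_to_sg(&out, 8 + 16, aad_4_8, 4); /* offset = assoclen + cryptlen */

    return memcmp(tail, aad_4_8, 4) == 0 && memcmp(head, zero, sizeof(head)) == 0;
}
```

In this model the logical offset 24 equals the length of the head entry, so the skip loop consumes the head entirely and the write begins at byte 0 of the tail. If the chain ended at the head, the same call would write nothing and return 0.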

At this point, we have proven the algorithm-side write. We have not yet proven that the next chained entry is page-cache-backed. That bridge comes from the splice path: a file-backed page can enter the AEAD TX side through the pipe-to-socket handoff (see 3.2.4).

Chapter 4 puts those pieces together into the full exploit chain (the details start in 4.3).