I keep referencing the page cache in writeups about kernel bugs (most recently Copy Fail), and every time I do, I run into the same problem. The page cache sounds like it should be a side effect of the filesystem layer, something that "happens automatically." It is not. It is one of the central data structures in the kernel: almost every read on Linux goes through it, and a surprising amount of security depends on understanding how it works.
So this is the writeup of the page cache I wanted to read. Anatomy, the read path, the write path, the kernel's physmap alias, eviction, the tools for inspecting it, and at the end the reason it keeps showing up in CVE writeups.
If you want only the headline: the page cache is the thing that makes "free memory" on a healthy Linux box look almost full. It is RAM that the kernel has filled with pieces of files because the alternative is going back to disk every time anyone reads anything. It is shared across every process on the host. It has no awareness of containers, namespaces, or who originally caused a page to be cached. And the kernel can write to any page in it at any time, regardless of what permissions userspace has on the underlying file.
That last sentence is most of the security story.
1. The 30-second version
Every file the kernel has been asked to read recently is partially or wholly resident in RAM. The mechanism that holds those bytes is the page cache. It is keyed on (superblock, inode, offset). When userspace reads a file, the kernel looks up the page in this cache. If it is there, it is returned without touching disk. If it is not there, the kernel allocates a new physical page, asks the block layer to fill it from disk, and inserts it into the cache before returning. Subsequent reads of the same offset, by the same process or any other process, hit the cache.
The page cache is not a copy of disk. It is the canonical state of a file as far as anything running on the system is concerned. If the cache disagrees with disk, the cache wins until it is evicted or written back. This is true even when the disagreement was caused by a kernel bug.
```
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi       4.5Gi       1.2Gi       300Mi        25Gi        26Gi
```
That 25Gi of "buff/cache" is mostly page cache. It is not unused memory. It is RAM holding files. It will be reclaimed if anything else needs it.
2. Anatomy
Three structures do most of the work. None of them are exotic, but the way they fit together is worth spelling out.
struct page. One of these exists for every 4 KiB physical page in the system. The kernel keeps them in an array (mem_map) so it can go from physical frame number to struct page in O(1). The struct page carries the page's flags, refcount, mapping pointer, and a few overloaded fields that mean different things depending on what the page is being used for at the moment. For our purposes, the important fields are mapping (back-pointer to the inode that owns this cached page) and index (the offset within the file, in PAGE_SIZE units).
struct address_space. One of these per inode. It is the per-file index of cached pages. It contains an xarray (formerly a radix tree) keyed on file offset that maps offsets to struct page *. It also has the dirty-page tracking, writeback hooks, and pointers to the filesystem's address_space_operations (which define how to do readpage, writepage, etc.).
LRU lists. Per memory node (per zone in older kernels), the kernel keeps "active" and "inactive" lists, maintained separately for file-backed and anonymous pages. Pages get moved between them based on access patterns. When the kernel needs to reclaim memory, it walks the inactive list, kicks pages out, and frees them.
That is the steady-state picture for a single open file: pages indexed by the inode's address_space, each carrying a back-pointer to it, all threaded onto the LRU lists. Multiply by every open inode on the system and you have the page cache.
One nuance worth flagging if you're reading current kernel source. Over the last few cycles, Matthew Wilcox has been leading a migration from struct page to struct folio. A folio represents a contiguous, naturally aligned run of pages (usually one 4 KiB page, but cleanly handling transparent huge pages and compound pages). The xarray inside address_space now stores struct folio * rather than struct page * for the subsystems that have been converted. The migration is incremental and not every code path is folio-native yet, so both terms appear in the tree depending on which file you open. For our purposes nothing in this writeup changes. Wherever I say struct page, mentally substitute "the folio that contains this page" if the surrounding code has been folio-converted. The structural picture (per-inode index, LRU eviction, physmap alias) is identical.
3. The read path · cache miss
A read() syscall on a file that nothing has touched recently is the long path. End to end, it looks like this.
The process calls read(fd, buf, n). The kernel resolves fd to a struct file, then to the inode, then to the inode's address_space. It computes which file offsets the read covers, in PAGE_SIZE units, and for each one looks up mapping->i_pages[offset]. None of the entries are present, because nothing has cached this file yet. Cache miss.
For each missing offset, the kernel allocates a fresh physical page via the buddy allocator. It puts the new struct page into the inode's xarray at the right offset, with PG_locked set so other readers will wait. Then it calls the filesystem's readpage (or readahead) operation, which queues a block I/O request to read the right sectors off disk into that page.
The block layer translates "read these sectors" into actual driver-level commands. For NVMe that is a submission-queue entry pointing at a DMA address. For SATA it is a SCSI command through the libata translation layer. Either way, the disk eventually DMAs bytes into the physical page the kernel allocated. The device raises a completion interrupt, the kernel notices the I/O has finished, marks the page PG_uptodate, clears PG_locked, and wakes any process waiting on it.
Now the kernel can finally do the syscall's actual work. It maps the page (briefly, on 32-bit) or just uses the physmap address (on 64-bit), calls copy_to_user to copy bytes into the user's buffer, and returns the byte count to the caller.
This is the slow path. It happens once per offset per file, until the page gets evicted.
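Condensed into pseudocode, the miss path looks roughly like this (names approximate the entry points in mm/filemap.c; locking details, readahead, and error handling are all elided):

```
filemap_read(file, user_buf, offset, n):
    mapping = file->f_inode->i_mapping
    for each PAGE_SIZE index covered by [offset, offset + n):
        page = xa_load(&mapping->i_pages, index)
        if page == NULL:                          # cache miss
            page = alloc_page()                   # buddy allocator
            lock_page(page)                       # PG_locked: later readers wait
            xa_store(&mapping->i_pages, index, page)
            mapping->a_ops->readpage(file, page)  # filesystem queues block I/O
            wait_on_page_locked(page)             # completion handler unlocks
        copy_to_user(user_buf, page_contents, chunk_len)
```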
4. The read path · cache hit
The hot path is much shorter, and it's where the page cache earns its keep.
Same syscall, same lookup, same mapping->i_pages[offset] query. This time the entry is present and PG_uptodate. The kernel does not allocate, does not call the filesystem, does not touch the block layer at all. It calls copy_to_user from the page-cache page directly into the user's buffer. Done.
On a warm cache, read() is essentially memcpy plus syscall overhead. Hundreds of nanoseconds, not tens of microseconds. Multiply by the millions of read()s a busy server does per second and the page cache is most of why Linux is fast.
The hit path is also why "free memory" on a healthy Linux box hovers near zero. The kernel has no incentive to keep RAM unused. Empty RAM is wasted RAM. Fill it with cached pages, and if anything genuinely needs memory later, evict.
5. mmap vs read
These two syscalls look like they do the same thing but they don't. The difference matters for both performance and security.
read(fd, buf, n) calls copy_to_user to copy bytes from the page-cache page into the user's buffer. There is a copy. The user gets their own bytes that are independent of the cache page from then on.
mmap(addr, n, PROT_READ, MAP_PRIVATE, fd, off) does not copy. It installs page table entries (PTEs) in the calling process's address space that point at the page cache page directly. When the process reads from the mapping, the CPU translates the virtual address through the page table to the page cache's physical page and the user is reading the cache directly.
This is a big deal:
- `mmap` reads do not pay a copy cost.
- Multiple processes that `mmap` the same file share the same physical pages. RSS does not double.
- If the page cache page changes underneath them (because the kernel modified it, or because writeback brought in fresh bytes), every `mmap`ed reader sees the change immediately. There is nothing to invalidate.
Point 3 is the property kernel bugs like Copy Fail abuse. If the kernel can be tricked into writing to a page-cache page, every mmap of that file sees the corruption instantly, without any explicit invalidation step.
splice() is a third sibling worth mentioning. It moves page references between file descriptors without copying bytes. When you splice() from a file fd into a pipe, the pipe ends up holding pointers to the same struct pages the file's page cache holds. Splicing those out the other end of the pipe into a different fd hands those same struct pages to the destination. This is how zero-copy networking and a number of fast-path syscalls work. It is also how Copy Fail gets page-cache pages into the kernel crypto API.
6. The write path
When userspace writes, the page cache complicates things. Now there is a question of when the write hits disk.
A write(fd, buf, n) looks up the target pages the same way read does. If a target page is not present, the kernel may need to read it from disk first (read-modify-write for partial writes), or it may allocate a new uninitialized page (full-page writes). It then copies the user's bytes into the page-cache page with copy_from_user, marks the page dirty (PG_dirty), and updates the inode's writeback tracking.
Disk is not touched yet.
The write returns to userspace as soon as the page cache has the bytes. From userspace's point of view, the write is done. From disk's point of view, nothing has happened.
Eventually one of three things flushes the dirty page back to disk:
- Periodic writeback. `kworker` threads (successors to the old `pdflush`) wake up at regular intervals, walk the dirty-page lists, and queue writeback I/O for pages older than a threshold (`/proc/sys/vm/dirty_expire_centisecs`).
- Pressure. If the system is filling up with dirty pages (past `/proc/sys/vm/dirty_ratio`), the kernel forces writeback synchronously to keep the dirty count under control.
- Explicit sync. `fsync(fd)`, `msync(addr)`, or `sync()` from userspace forces writeback for the named file or globally.
Until one of those happens, your write is sitting in RAM. If the machine loses power, the data is gone. This is why databases and journaling filesystems care so much about fsync.
7. The kernel's view: physmap
Userspace sees a file's pages through PTEs. Those PTEs carry whatever protections the userspace mapping declared, typically read-only (PROT_READ, plus PROT_EXEC for text) for executable files. The kernel sees those exact same physical pages through a completely different mapping.
On 64-bit Linux, the kernel maintains a "direct map" or "physmap" that covers all of physical RAM in kernel virtual address space at a fixed offset. On x86_64 it lives at PAGE_OFFSET = 0xffff888000000000 (give or take, KASLR moves it). Every physical page has a kernel virtual address you can compute as phys + PAGE_OFFSET. The protection is PAGE_KERNEL: read-write from kernel mode, and on modern kernels non-executable (the direct map is mapped NX).
This means the kernel always has a writable alias of every physical page, including every page-cache page, including pages backing executable files that userspace can only see read-only. Whenever a kernel function does memcpy(dst, src, n) and dst is computed from a struct page via kmap_local_page or the physmap, the kernel is writing through that always-RW alias. No userspace permission is ever consulted, because there is no userspace involvement.
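The kernel-side idiom, sketched below, is illustrative rather than a specific mainline function; kmap_local_page is the current interface, and on 64-bit it simply returns the page's physmap address:

```c
/* Sketch of the idiom: any kernel code holding a struct page can get
 * a writable kernel virtual address for it, regardless of how
 * userspace maps the same physical page. */
static void fill_cache_page(struct page *page, const void *src, size_t len)
{
    void *vaddr = kmap_local_page(page); /* on 64-bit: the physmap address */
    memcpy(vaddr, src, len);             /* writes through the RW alias;   */
    kunmap_local(vaddr);                 /* no userspace PTE is consulted  */
}
```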
This is structurally fine. The kernel has to be able to write to physical RAM somehow. The page cache could not work otherwise: writeback, page filling, copy-on-write, all of them require the kernel to be able to mutate pages. The trust boundary is "kernel code that touches a page-cache page knows what it's doing."
Bugs that violate that trust boundary are the interesting kernel CVEs.
8. Eviction
Pages do not stay cached forever. The kernel needs to be able to give RAM back when something else asks for it. The mechanism is reclaim, and the policy is roughly LRU.
Every cached page sits on either the active or the inactive LRU list. When a page is accessed, it gets marked young (the accessed bit in the PTE, or the PG_referenced flag on the page itself). Periodically the kernel walks the lists, ages pages, and moves them between active and inactive based on whether they were touched recently.
When something needs free memory (the buddy allocator can't satisfy a request, or a watermark is breached), reclaim kicks in. The reclaim path walks the inactive list, picks pages, and evicts them. For a clean page (not dirty), eviction is just removing it from the xarray and freeing the physical page back to the buddy allocator. For a dirty page, the kernel has to write it back first, then evict.
The user-visible knobs are:

- `/proc/sys/vm/swappiness`: how aggressively to evict anonymous pages vs file-backed cache.
- `/proc/sys/vm/vfs_cache_pressure`: bias toward evicting dentry/inode caches vs page cache.
- `/proc/sys/vm/drop_caches`: write `1` to drop clean page cache, `2` to drop slab dentries/inodes, `3` for both. Test environments only, not for production.
For our security story the relevant fact is: dropping caches makes any page-cache corruption disappear. The next read of the file is a cache miss, so it goes back to disk, and disk has the original bytes. Reboot does the same thing for free.
9. Inspecting it from userspace
You can see most of the page cache without root, and all of it with root. The useful tools:
```
# How big is the page cache right now?
grep -E "^(Cached|Buffers):" /proc/meminfo

# Which pages of a specific file are currently cached?
sudo apt install vmtouch
vmtouch /usr/bin/su

# Force a specific file into the page cache
vmtouch -t /usr/bin/su

# Pin a file into the cache (won't be evicted while held)
vmtouch -l /usr/bin/su

# Drop everything (root only, dev box only)
echo 3 | sudo tee /proc/sys/vm/drop_caches
```
The mincore() syscall is the underlying mechanism. Given a virtual address range, it returns a byte per page indicating whether that page is currently resident. vmtouch is essentially a wrapper around mmap plus mincore.
For per-process visibility, /proc/<pid>/smaps shows each VMA and how much of it is resident. /proc/<pid>/pagemap (root only) maps a process's virtual pages to physical frame numbers, and from there you can correlate with /proc/kpageflags to see what flags each physical page has set.
For a system-wide view, slabtop shows kernel slab caches (dentries, inodes, buffer_head structs that the page cache uses), and vmstat 1 lets you watch reclaim activity in real time (si/so for swap, bi/bo for block I/O).
10. Why this matters for kernel security
Most of the alarming kernel CVEs of the last decade have been page cache bugs in some way. The shape is consistent.
The page cache is shared. A process's read or mmap of a file reaches the same physical pages as every other process's read or mmap of that file. There is no copy-per-consumer. Container namespaces do not partition it. The kernel's writable alias of those pages is always present.
So any kernel code path that ends in "and then memcpy into a page-cache page" is potentially a write primitive against every reader of that file, on the host or in any sibling container. If an attacker can drive that code path with controlled bytes and an attacker-chosen target page, they have an LPE, and depending on whether the target file is host-shared or container-local, they have a container escape.
A short list:
- CVE-2016-5195 (Dirty Cow): a copy-on-write race in get_user_pages that lets an unprivileged user's write through a private mapping of a read-only file land on the shared page-cache page instead of the private COW copy. Page-cache write primitive via mm/gup.
- CVE-2022-0847 (Dirty Pipe): pipe buffer code reused without clearing a flag, letting splice-supplied page-cache pages inherit pipe-buffer write permissions. Page-cache write primitive via the pipe layer.
- CVE-2026-31431 (Copy Fail): AEAD output scatterlist aliased to a splice-supplied source scatterlist, so the AEAD transform writes its scratch into page-cache pages. Page-cache write primitive via the kernel crypto API.
Three different subsystems, one shape. The bug is always "the kernel writes to a page-cache page on behalf of an unprivileged caller, and that caller can influence which page and what bytes." The page cache itself is doing exactly what it's supposed to. The bug is whoever asks the page cache to hold attacker-influenced content.
The defensive lesson, for system operators, is that "the file is read-only, so it's safe" is wrong reasoning on Linux. As long as the file can be cached, any kernel bug that writes to a page-cache page can change what that file looks like to every process that reads or executes it in the next few hundred milliseconds. dm-verity helps for the boot chain. IMA-appraise helps for files that have been measured. AIDE/Tripwire help for after-the-fact detection if the corruption persists past a writeback. None of these catch a page-cache-only mutation that gets exploited and then evicted.
The defensive lesson, for kernel developers, is that the seam between subsystems is where these bugs live. The page cache is fine. The crypto API is fine. splice() is fine. The combination is what produced Copy Fail. Reviewers see one subsystem at a time. Reviewing across seams is a different skill, and is increasingly something AI-assisted code review is good at, which is why the disclosure cadence for this exact bug shape is going to keep accelerating.
Closing
The page cache is a cache the way a refrigerator is a kitchen. It is not a side feature. It is most of why Linux feels fast. It is the primary residence of every file the system has touched. It is shared across every process and every container on the host. The kernel can write to any page in it for legitimate reasons every microsecond, and there is no capability check between that write and the userspace mappings of the same page.
Once you know that, kernel CVE writeups read differently. Most of the time the bug isn't really in the subsystem the disclosure names. It's in how that subsystem ends up holding the kernel's pen over a page-cache page that an attacker shouldn't be able to influence.