Exploiting CVE-2021-3490 for Container Escapes

Published 04/05/2023

Originally published by CrowdStrike.

Today, containers are the preferred approach for deploying software or creating build environments in CI/CD lifecycles. However, since the emergence of container solutions and environments like Docker and Kubernetes, security researchers have consistently found ways to escape from containers once they are compromised. Most attacks are based on configuration errors. But it is also possible to escalate privileges and escape to the container’s host system by exploiting vulnerabilities in the host’s operating system.

This blog shows how to modify an existing Linux kernel exploit in order to use it for container escapes.

Original Technique

Before we outline the modifications required to turn the exploit into a container escape, we first look at what the original exploit achieved.

Valentina Palmiotti published a full exploit for CVE-2021-3490 that can be used to locally escalate privileges to root on affected systems. The vulnerability was rooted in the eBPF subsystem of the Linux kernel and fixed in version 5.10.37. eBPF allows user space processes to load custom programs into the kernel and attach them to so-called events, thus giving user space the ability to observe kernel internals and, in specifically supported cases, to implement custom logic for networking, access control and other tasks. These eBPF programs have to pass a verifier before being loaded, which is supposed to guarantee that the code does not contain loops and does not write to memory outside of its dedicated area. This step should ensure that eBPF programs terminate and are not able to manipulate kernel memory, which would potentially allow attackers to escalate privileges. However, this verifier contained several vulnerabilities in the past. CVE-2021-3490 is one of them and can ultimately be used to achieve a kernel read and write primitive.

Building on the kernel read primitive, it is possible to leak a kernel pointer. eBPF programs can communicate with processes running in user space using so-called “eBPF maps.” Every eBPF map is described by a struct bpf_map object, which contains a field ops pointing to a struct bpf_map_ops. That struct contains several function pointers for working with the eBPF map. eBPF maps come in different kinds with different definitions for ops stored at known offsets. For array maps, ops will be set to point to the kernel symbol array_map_ops. The exploit will leak that address and then use it as a starting point to further scan the kernel’s memory space and read pointers from the kernel’s symbol table.

The kernel exports pointers to certain variables, objects and functions in a symbol table to make them accessible by kernel modules. This table is called ksymtab. In order to look up the actual name of a stored symbol address, a second table, called kstrtab, is utilized. A pointer to the string in kstrtab that contains the name is stored as part of every ksymtab entry, right after the pointer to the symbol itself. To find the address of a kernel symbol, the exploit first reads memory from kernel space starting at the leaked address of array_map_ops using the arbitrary read primitive. This is done until the string containing the symbol name of interest is found in kstrtab.

Because kstrtab is mapped after ksymtab, the previously read memory region should contain the pointer to the string in kstrtab. Therefore, the exploit then proceeds to search for that pointer, and the pointer to the actual symbol is stored right before it.

One clarification has to be made about the above code excerpt: There are two different formats of ksymtab. Which one is used is decided when the Linux kernel is built from source by the configuration parameter HAVE_ARCH_PREL32_RELOCATIONS. In one case, the actual addresses of the symbols and kstrtab entries are stored. However, in many kernel builds, it does not actually contain pointers but offsets such that the address where the offset is stored plus the offset itself is the symbol’s address or the string’s address in kstrtab.

Nevertheless, using this technique, the exploit identifies the address of init_pid_ns. This object is a struct pid_namespace and describes the default process ID namespace new processes are started in.

Namespaces have become a fundamental feature of Linux and are crucial to the idea of container environments. They allow separating system resources between different processes such that one process can observe a completely different set than others. For example, mount namespaces control observable filesystem mount points such that two processes can have different views of the filesystem. This allows a container’s filesystem to have a different root directory than the host. Process ID namespaces on the other side give processes a completely unique process tree. The first process in a process ID namespace always has the identifier (PID) 1. It is considered as the init process that initializes the operating system and from which new processes originate. Therefore, if this process is stopped, all other processes in the particular process ID namespace are stopped as well.

By identifying init_pid_ns, it is possible to enumerate all struct task objects of the processes running in that namespace as those are stored in a traversable radix tree in the field idr. The exploit identifies the correct task object by its PID. Those task objects store a pointer to a struct cred object that contains the UID and GID (user and group identifier) associated with the process and therefore holds the granted permissions. By overwriting the cred object of a process, it is possible to escalate privileges by setting the UID and GID to 0, which is associated with the root user.

However, this approach does not work if a container was compromised and the attacker’s goal is to escape into the container’s host environment.

Why This Doesn’t Work in Containers

Linux kernel exploits are an alternative method to escape container environments to the host in case no mistakes in the container configuration were made. They can be used because containers share the host’s kernel and therefore its vulnerabilities, regardless of the Linux distribution the container is based on. However, exploit developers have to pay attention to some obstacles compared to privilege escalation outside of container environments.

First, container solutions are able to restrict the capabilities of processes running inside a container. For example, the capability SYS_ADMIN is normally not granted to processes running in containers, which can therefore not mount file systems or execute various other privileged actions. Moreover, it is possible to restrict the set of syscalls a userland process can call by utilizing seccomp. For example, in the default configuration of Docker, an exploit would not be able to use eBPF at all. Nevertheless, in the default Kubernetes configuration, seccomp does not restrict the available syscalls at all. For the remainder of this post, though, we will assume that the container is configured such that eBPF could be used by userland processes.

Second, on a more practical note, the techniques of the original exploit described above will not work out-of-the-box. As already described, containers rely heavily on namespaces. Because containers typically have their own associated process ID namespace, it is not as straightforward to identify the exploit process running in the container by its PID, because, for example, the exploit may have PID 42 from the container’s perspective but PID 1337 from the host’s perspective. However, the parent namespace can still observe all processes running in child namespaces. Therefore, those processes have a PID in both parent and child namespace. Ultimately, the initial process ID namespace described by init_pid_ns can observe any process running on a particular system. Nevertheless, even if we identify the task structure of our exploit process within a container, overwriting its cred object as described previously will simply elevate privileges within the container but not allow container escape.

Changes for Container Escapes

It is possible to modify the exploit so that a container escape is conducted and privileges are escalated to root on the host. To easily find the exploit process in a container, an exploit can search for the symbol current_task and pcpu_base_addr symbols in ksymtab. current_task stores the offset to the running process’s task object based on the address stored in pcpu_base_addr. Because pcpu_base_addr is unique per CPU core, the process must be pinned before on one core using the sched_setaffinity() system call.

Using this technique, it is possible to identify the correct task object without traversing the radix tree of all processes stored in init_pid_ns.

This allows the attacker to overwrite the correct cred object and therefore obtain root privileges. Due to the usage of namespaces, the observable file system is still that of the container, though. Nevertheless, it is possible to overcome that obstacle as well. The task object contains a pointer to a struct fs_struct object. This object contains information about the observable file system, i.e., which directory is considered as the processes’ file system root. Using the leaked pointer to init_pid_ns, it is possible to traverse the process radix tree and identify the host’s init process, which has PID 1. Next, it is possible to retrieve the fs pointer from this process’s task object. Lastly, while overwriting the cred object of the exploit process, the fs pointer must be overwritten as well using the init process’s fs pointer. The exploit process can then observe the complete host file system.

One last addition must be made. As stated above, containers normally have limited capabilities. Capabilities are used to restrict the permissions of processes running in containers. To obtain full privileges, the exploit also has to overwrite the capabilities mask of the exploit’s process in the task object. How exactly the values must be set to obtain full capabilities without any restrictions can be investigated in the definition of the init process’ credentials.

The technique described in this blog to identify the task object of the exploit’s process only works on the Linux kernel version before 5.15, as pcpu_base_addr is no longer exported as a symbol to ksymtab. Nevertheless, alternative methods exist to to find the correct task object, e.g., by traversing the radix tree of all processes from init_pid_ns and matching on features of the exploit process other than the PID, such as the comm member of struct task that contains the executable name.

Container Escape Mitigations

Detecting this and similar exploits is very hard as they are data-only and misuse only legitimate system calls. As a defense-in-depth strategy, the following steps can be taken to harden Linux hosts and container environments to prevent exploitation of CVE-2021-3490 and future attacks.

Upgrade the kernel version. With a critical kernel vulnerability like CVE-2021-3490, it is paramount that available fixes are applied by upgrading the kernel version.
Provide only required capabilities to the container. By limiting the capabilities of the container, the root account of the container becomes limited in its capabilities, which significantly reduces the chances of container escape and exploitation of kernel vulnerabilities. For example, to exploit the CVE-2021-3490 using the described technique, the attacker needs CAP_BPF or CAP_SYS_ADMIN granted. Note that privileged containers have those capabilities. Therefore, you should monitor your environment for such containers with Cloud Workload Protection.
Use a seccomp profile. While Kubernetes does not apply a seccomp profile without configuration, Docker’s default seccomp profiles protect against a number of dangerous system calls that can help attackers to break out of the container environment. Correct seccomp profiles can help significantly reduce the container attack surface. CVE-2021-3490 requires the bpf system call to exploit the vulnerability, which is blocked in Docker’s default seccomp profile. Hence, exploitation of CVE-2021-3490 in a container environment using a strong seccomp profile would fail.
Monitor the host and containerized environment for a breach. In case a privileged workload or a host is compromised by attackers, the organization needs state-of-the-art monitoring and detection capabilities to prevent and detect advanced persistent threats (APTs), eCrime and nation-state actors.

Conclusion

Container technology is a good solution to separate and fine-tune resources to different processes. However, while existing solutions add another layer of security due to the restriction of capabilities and available syscalls, the available attack surface inside a container still contains the host’s kernel. Every eased restriction — for example, allowing the use of eBPF — will increase the attack surface. If a threat actor is able to take advantage of a vulnerability inside the host’s kernel and an exploit is available, the host can be compromised, regardless of other security layers and restrictions such as namespaces.

This blog showed exactly that: Not much effort is needed to turn a full exploit chain for a local privilege escalation into one that is able to escape containers as well. The basic rules of network hygiene (patch early and often) not only apply to containers but to the hosts that deploy those in a cloud environment as well. Moreover, solutions such as Docker and Kubernetes can reduce the attack surface drastically if configured properly.

Application Containers and Microservices Common Vulnerability and Exposures Vulnerabilities