[Misc] Efficient Text Analysis: Regular Expression Matching with Hyperscan

Posted: 2024-01-16 09:07:38


Date: 2024.01.05

Author: H4y0

Summary: Compile and install Hyperscan under WSL on Windows, then use Hyperscan to extract sensitive data.

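0x00 Preface

Data-processing challenge tasks often rely on regular expressions to extract data. When handling large volumes of data, Hyperscan performs far better than other regular expression engines, but it is more troublesome to use. This article records the pitfalls encountered while setting up a Hyperscan environment and presents a few test demos.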

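0x01 Environment Setup

Hyperscan can be compiled and installed on Windows, but I ran into problems using it there and wanted to do the work in Ubuntu instead. Hyperscan depends closely on the CPU, however, and a CPU virtualized by VMware can affect its use, so WSL turned out to be a relatively good choice.

1.1 WSL Configuration

Installing WSL itself is not covered here. WSL1 is recommended: it can coexist with VMware without affecting VMware's performance (consult the relevant documentation for the reasons). After WSL is installed, configure it as follows:

Install Ubuntu: download Ubuntu 20.04 LTS from the official page https://learn.microsoft.com/en-au/windows/wsl/install-manual. The system can be placed on a non-system drive to reduce resource usage.

After downloading, rename the .appx file to .zip and extract it; ubuntu.exe can then be run directly.
Connect to WSL with VS Code: install the WSL extension in VS Code and open a remote window from the lower-left corner. This connects to WSL, and other extensions can then be used for writing programs.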

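1.2 Compiling and Installing Hyperscan

For the build and installation steps, see the official documentation or search for guides on installing Hyperscan on Ubuntu. The problems I ran into, and how I solved them, are listed below.

PCRE version mismatch:

This happens because the system ships its own libpcre. Its version can be checked with the following command.

pkg-config --modversion libpcre

Manually installing a newer PCRE library resolves the mismatch; for the latest Hyperscan, pcre-8.45 can be installed. If the command above still reports the old libpcre version after installation, copy the relevant file over the old one manually.

cp /usr/lib/pkgconfig/libpcre.pc /usr/lib/x86_64-linux-gnu/pkgconfig/libpcre.pc

Missing library:

If you see the message checking for module 'xxxx' package 'xxxx' not found, install the library at the version the message asks for. If the error persists after installation, copy the installed .pc file into the pkgconfig directory in the same way as above.

Header cannot be included:

After Hyperscan has been compiled and installed successfully, a program that uses #include <hs.h> may fail to build with hs.h: No such file or directory. This can be solved by specifying the header path at compile time, as follows. Since the header sits under /usr/local/include/hs/ in this setup, writing #include <hs/hs.h> should also work.

gcc xxx.c -I/usr/local/include/hs/ -L/usr/local/lib -lhs -o xxx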

0x02 Use Cases

You now have the best sword in the village. Go and defeat the demon king.


2.1 helloworld

This demo matches hello in the string hello world and shows the basic flow of regular expression matching with Hyperscan. See the code comments for details.

#include <hs.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

// Match event handler (callback). Hyperscan always reports matches through a callback.
static int eventHandler(unsigned int id, unsigned long long from,
                        unsigned long long to, unsigned int flags, void *ctx) {
    printf("At offset %llu, matched pattern \"%u\"\n", to, id);
    return 0; // return 0 to continue matching
}

/*
id      : identifier of the expression, assigned at compile time
from    : start offset of the match; only valid if start-of-match reporting was enabled at compile time
to      : end offset of the match
flags   : reserved
context : the pointer supplied by the user to hs_scan(), hs_scan_vector() or hs_scan_stream()
*/

int main() {
    hs_database_t *database;
    hs_compile_error_t *compile_err;
    hs_scratch_t *scratch = NULL;

    // Compile the regular expression; change the hs_compile arguments to apply different rules
    if (hs_compile("hello", HS_FLAG_DOTALL, HS_MODE_BLOCK, NULL, &database, &compile_err) != HS_SUCCESS) {
        fprintf(stderr, "Error: unable to compile pattern \"%s\": %s\n", "hello", compile_err->message);
        hs_free_compile_error(compile_err);
        return 1;
    }

    /*
    Prototype:
    hs_error_t hs_compile(
        const char *expression,
        unsigned int flags,
        unsigned int mode,
        const hs_platform_info_t *platform,
        hs_database_t **db,
        hs_compile_error_t **error);

    expression : the regular expression string to compile.
    flags      : flags that modify the behaviour of the expression, such as HS_FLAG_CASELESS
                 (case-insensitive matching) and HS_FLAG_DOTALL (. matches any character,
                 including newlines).
    mode       : compile mode: HS_MODE_BLOCK (block scanning), HS_MODE_STREAM (stream scanning)
                 or HS_MODE_VECTORED (vectored scanning).
    platform   : pointer to an hs_platform_info_t describing the target platform. If NULL,
                 the database is built for the current host platform.
    db         : pointer to an hs_database_t pointer. On success it points to the newly
                 created Hyperscan database.
    error      : on failure, an hs_compile_error_t is returned here; it holds a message
                 string and the index of the failing expression.
    */

    // Allocate scratch space
    if (hs_alloc_scratch(database, &scratch) != HS_SUCCESS) {
        fprintf(stderr, "Error: unable to allocate scratch space. Exiting.\n");
        hs_free_database(database);
        return 1;
    }

    // Run the scan
    const char *to_scan = "hello world";
    if (hs_scan(database, to_scan, strlen(to_scan), 0, scratch, eventHandler, NULL) != HS_SUCCESS) {
        fprintf(stderr, "Error: unable to scan the input buffer. Exiting.\n");
        hs_free_scratch(scratch);
        hs_free_database(database);
        return 1;
    }

    /*
    Prototype:
    hs_error_t hs_scan(
        const hs_database_t *db,
        const char *data,
        unsigned int length,
        unsigned int flags,
        hs_scratch_t *scratch,
        match_event_handler onEvent,
        void *context);

    db      : pointer to a Hyperscan database compiled with hs_compile() or a related function.
    data    : the data to scan.
    length  : length in bytes of the data pointed to by data.
    flags   : flags controlling the scan; normally set to 0.
    scratch : pointer to scratch space created by hs_alloc_scratch() or hs_clone_scratch().
              It holds state during the scan.
    onEvent : function pointer called on every match; it must have the match_event_handler
              prototype.
    context : pointer to arbitrary user data, passed through to onEvent.
    */

    // Clean up
    hs_free_scratch(scratch);
    hs_free_database(database);

    return 0;
}

Run the test program:

./demo
At offset 5, matched pattern "0"
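The callback's return value and its context pointer are the two levers the demo leaves unused. The following is a minimal sketch, not taken from the original article, of a handler that stores matches in a caller-supplied list instead of printing them, and asks the scanner to stop once the list is full. The names match_t, match_list_t, MAX_MATCHES and collect_handler are hypothetical; only the parameter list mirrors the callback shown above, so the block compiles without the Hyperscan headers.

```c
#include <stddef.h>

/* Hypothetical fixed-size container for collected matches (not part of Hyperscan). */
#define MAX_MATCHES 4

typedef struct {
    unsigned int id;
    unsigned long long from;
    unsigned long long to;
} match_t;

typedef struct {
    match_t items[MAX_MATCHES];
    size_t count;
} match_list_t;

/* Same parameter list as eventHandler in the demo above. */
int collect_handler(unsigned int id, unsigned long long from,
                    unsigned long long to, unsigned int flags, void *ctx) {
    match_list_t *list = (match_list_t *)ctx;
    (void)flags; /* reserved, unused */

    if (list->count < MAX_MATCHES) {
        list->items[list->count].id = id;
        list->items[list->count].from = from;
        list->items[list->count].to = to;
        list->count++;
    }
    /* A non-zero return value asks the scanner to stop; do so once the list is full. */
    return list->count >= MAX_MATCHES ? 1 : 0;
}
```

To use it, a match_list_t would be zero-initialised and its address passed as the last argument of hs_scan(), with collect_handler as the callback. When a callback stops a scan this way, hs_scan() returns HS_SCAN_TERMINATED rather than HS_SUCCESS, so an error check like the one in the demo would need to treat that value as a normal outcome.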

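2.2 Block Mode, Single File

With the basic matching flow understood, programs can be written to suit the task at hand. The example below extracts ID card numbers and mobile phone numbers: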

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <hs.h>

// Match event handler
static int event_handler(unsigned int id, unsigned long long from,
                         unsigned long long to, unsigned int flags, void *context) {
    char *string = (char *)context;

    // Print the matched text; from and to are byte offsets into the buffer
    printf("Matched pattern \"%u\", from %llu to %llu: %.*s\n", id, from, to, (int)(to - from), string + from);

    return 0;
}

int main() {
    hs_database_t *database;
    hs_compile_error_t *compile_err;
    hs_scratch_t *scratch = NULL;

    // Regular expressions: mobile phone number and ID card number
    const char *expressions[] = {"\\b1[3-9]\\d{9}\\b", "\\b\\d{18}\\b"};
    unsigned flags[] = {HS_FLAG_DOTALL, HS_FLAG_DOTALL};
    unsigned ids[] = {1, 2};

    // Compile the regular expressions
    if (hs_compile_multi(expressions, flags, ids, 2, HS_MODE_BLOCK, NULL, &database, &compile_err) != HS_SUCCESS) {
        fprintf(stderr, "Hyperscan compilation failed: %s\n", compile_err->message);
        hs_free_compile_error(compile_err);
        return 1;
    }

    // Prepare scratch space for matching
    if (hs_alloc_scratch(database, &scratch) != HS_SUCCESS) {
        fprintf(stderr, "Unable to allocate scratch space. Exiting.\n");
        hs_free_database(database);
        return 1;
    }

    // Read the file (binary mode keeps the byte offsets exact)
    const char *filename = "test.txt"; // replace with your file name
    FILE *file = fopen(filename, "rb");
    if (file == NULL) {
        perror("Unable to open file");
        hs_free_scratch(scratch);
        hs_free_database(database);
        return 1;
    }

    fseek(file, 0, SEEK_END);
    long length = ftell(file);
    fseek(file, 0, SEEK_SET);
    char *buffer = malloc(length);
    if (buffer == NULL || fread(buffer, 1, length, file) != (size_t)length) {
        fprintf(stderr, "Unable to read file.\n");
        free(buffer);
        fclose(file);
        hs_free_scratch(scratch);
        hs_free_database(database);
        return 1;
    }
    fclose(file);

    // Run the scan
    if (hs_scan(database, buffer, (unsigned int)length, 0, scratch, event_handler, buffer) != HS_SUCCESS) {
        fprintf(stderr, "Hyperscan scan failed.\n");
    }

    // Clean up
    free(buffer);
    hs_free_scratch(scratch);
    hs_free_database(database);

    return 0;
}

The contents of test.txt are as follows (the Chinese text is kept verbatim because the byte offsets in the results depend on it):

手机号测试13111111111，这个长度不对1878565321，这个也不对2222555566，这个突然对了15888888888。还有身份证号371511111111111111。

Result:

./demo
Matched pattern "1", from 0 to 26: 手机号测试13111111111
Matched pattern "1", from 0 to 117: 手机号测试13111111111，这个长度不对1878565321，这个也不对2222555566，这个突然对了15888888888
Matched pattern "2", from 0 to 156: 手机号测试13111111111，这个长度不对1878565321，这个也不对2222555566，这个突然对了15888888888。还有身份证号371511111111111111

The run reveals a problem: from is always 0 in the callback, so extracting data with to - from prints everything from the start of the buffer. Looking back at the description of from: it is the start offset of the match and is only valid if start-of-match reporting was enabled at compile time. The HS_FLAG_SOM_LEFTMOST flag therefore has to be set in flags at the compile stage (hs_compile_multi() here) to obtain the start offset.

 unsigned flags[] = {HS_FLAG_DOTALL | HS_FLAG_SOM_LEFTMOST, HS_FLAG_DOTALL | HS_FLAG_SOM_LEFTMOST};

Compile and run again; the result is now:

./demo
Matched pattern "1", from 15 to 26: 13111111111
Matched pattern "1", from 106 to 117: 15888888888
Matched pattern "2", from 138 to 156: 371511111111111111

The offsets are byte offsets, not character counts: each Chinese character in this UTF-8 file occupies 3 bytes, so the first phone number, which follows 5 characters, starts at byte 15.
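Once from and to are both meaningful, the matched bytes can be copied out of the buffer for further processing rather than printed straight from the callback. The sketch below is an illustration under assumptions, not code from the original article: copy_match is a hypothetical helper that treats from and to as byte offsets into the scanned buffer and refuses ranges that fall outside it or do not fit the destination.

```c
#include <stddef.h>
#include <string.h>

/* Copy the bytes in [from, to) of buf into out as a NUL-terminated string.
 * Returns the number of bytes copied, or -1 if the range is invalid
 * or does not fit into out (including the terminating NUL). */
long copy_match(const char *buf, size_t buf_len,
                unsigned long long from, unsigned long long to,
                char *out, size_t out_size) {
    size_t n;

    if (from > to || to > buf_len) {
        return -1; /* offsets do not describe a range inside buf */
    }
    n = (size_t)(to - from);
    if (n >= out_size) {
        return -1; /* no room for the bytes plus the NUL terminator */
    }
    memcpy(out, buf + from, n);
    out[n] = '\0';
    return (long)n;
}
```

With the buffer "tel:13111111111;" and the offsets 4 and 15, the helper copies the 11 digits of the phone number. Inside a match callback, buf and buf_len would have to reach the handler through the context pointer, for example in a small struct.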

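2.3 Stream Mode, Multiple Files

Stream mode is useful when a file is very large or when data is produced continuously as a stream. It allows data to be processed in chunks without loading the whole file at once, and is typically applied to large log files or continuous data streams. The program here only demonstrates usage; in practice, choose the mode and rules that suit the actual situation.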

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <hs.h>

// Match event handler. The context is the file buffer; "string + from" is valid
// here only because each file is passed to hs_scan_stream() in a single call.
static int event_handler(unsigned int id, unsigned long long from,
                         unsigned long long to, unsigned int flags, void *context) {
    char *string = (char *)context;
    printf("Match for pattern \"%u\" from %llu to %llu: %.*s\n", id, from, to, (int)(to - from), string + from);
    return 0;
}

// Scan a single file
void scan_file(const char *filename, hs_database_t *database, hs_scratch_t *scratch) {
    printf("Scanning file: %s\n", filename);
    FILE *file = fopen(filename, "rb");
    if (!file) {
        perror("Unable to open file");
        return;
    }

    // Get the file size
    fseek(file, 0, SEEK_END);
    long length = ftell(file);
    fseek(file, 0, SEEK_SET);

    // Read the file contents
    char *buffer = (char *)malloc(length);
    if (!buffer) {
        perror("Memory allocation failed");
        fclose(file);
        return;
    }
    if (fread(buffer, 1, length, file) != (size_t)length) {
        perror("Unable to read file");
        free(buffer);
        fclose(file);
        return;
    }
    fclose(file);

    // Open a new stream for each file
    hs_stream_t *stream;
    if (hs_open_stream(database, 0, &stream) != HS_SUCCESS) {
        fprintf(stderr, "Failed to open stream\n");
        free(buffer);
        return;
    }

    // Run the scan
    if (hs_scan_stream(stream, buffer, (unsigned int)length, 0, scratch, event_handler, buffer) != HS_SUCCESS) {
        fprintf(stderr, "Hyperscan scan failed.\n");
    }

    // Close the stream; matches anchored to the end of the stream may be reported here
    hs_close_stream(stream, scratch, event_handler, buffer);

    // Free memory
    free(buffer);
}

int main() {
    hs_database_t *database;
    hs_compile_error_t *compile_err;
    hs_scratch_t *scratch = NULL;

    // Regular expressions: mobile phone number and ID card number
    const char *expressions[] = {"\\b1[3-9]\\d{9}\\b", "\\b\\d{18}\\b"};
    unsigned flags[] = {HS_FLAG_DOTALL | HS_FLAG_SOM_LEFTMOST, HS_FLAG_DOTALL | HS_FLAG_SOM_LEFTMOST};
    unsigned ids[] = {1, 2};

    // Compile the regular expressions
    if (hs_compile_multi(expressions, flags, ids, 2,
                         HS_MODE_STREAM | HS_MODE_SOM_HORIZON_LARGE, NULL,
                         &database, &compile_err) != HS_SUCCESS) {
        fprintf(stderr, "Hyperscan compilation failed: %s\n", compile_err->message);
        hs_free_compile_error(compile_err);
        return 1;
    }

    // Prepare scratch space for matching
    if (hs_alloc_scratch(database, &scratch) != HS_SUCCESS) {
        fprintf(stderr, "Unable to allocate scratch space. Exiting.\n");
        hs_free_database(database);
        return 1;
    }

    // Walk the folder and scan every file
    const char *folder = "./test"; // replace with your folder path
    DIR *d = opendir(folder);
    struct dirent *dir;
    if (d) {
        while ((dir = readdir(d)) != NULL) {
            if (dir->d_type == DT_REG) { // regular files only
                char filepath[1024];
                snprintf(filepath, sizeof(filepath), "%s/%s", folder, dir->d_name);
                scan_file(filepath, database, scratch);
            }
        }
        closedir(d);
    }

    // Clean up
    hs_free_scratch(scratch);
    hs_free_database(database);
    return 0;
}

The test files are the file used in the single-file example plus a test file from a data-processing challenge. Result:

./demo

Scanning file: ./test/1.txt
Match for pattern "2" from 37 to 55: 341023198923223406
Match for pattern "2" from 216 to 234: 652900201918066250
Match for pattern "2" from 317 to 335: 350583198620191288
Match for pattern "2" from 408 to 426: 370705196216108653
Match for pattern "2" from 499 to 517: 330784195316240110
Match for pattern "2" from 610 to 628: 320282192816222458
Match for pattern "2" from 736 to 754: 152200194024042955
Match for pattern "2" from 1048 to 1066: 360726201822048906
Match for pattern "2" from 1136 to 1154: 621200194524227149
Match for pattern "2" from 1252 to 1270: 513301195713161348
Match for pattern "2" from 1360 to 1378: 371622192023032860
Match for pattern "2" from 1462 to 1480: 321203192719178358
Match for pattern "2" from 1700 to 1718: 610523201519119291
Match for pattern "2" from 1908 to 1926: 530681192919028409
Match for pattern "2" from 2017 to 2035: 441621202115175226
Match for pattern "2" from 2108 to 2126: 640502196519185029
Match for pattern "2" from 2226 to 2244: 533122194014197894
Match for pattern "2" from 2316 to 2334: 520325198022067460
Scanning file: ./test/2.txt
Match for pattern "1" from 15 to 26: 13111111111
Match for pattern "1" from 106 to 117: 15888888888
Match for pattern "2" from 138 to 156: 371511111111111111
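The demo hands each file to hs_scan_stream() in one call, so it does not yet exercise the chunked processing that stream mode makes possible. The loop below is a minimal sketch of that reading pattern, under assumptions: feed_chunks, chunk_sink, chunk_stats and count_sink are hypothetical names, and the sink is a stand-in for the place where a real program would call hs_scan_stream() on the current chunk. It is kept free of Hyperscan calls so that it compiles on its own.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical sink type: receives one chunk; returns 0 to continue, non-zero to stop. */
typedef int (*chunk_sink)(const char *data, size_t len, void *ctx);

/* Read an open file in pieces of at most chunk_size bytes and pass each piece to sink.
 * Returns 0 when the whole file was consumed, -1 on invalid arguments,
 * on a read error, or when the sink asked to stop. */
int feed_chunks(FILE *file, char *chunk, size_t chunk_size, chunk_sink sink, void *ctx) {
    size_t n;

    if (file == NULL || chunk == NULL || chunk_size == 0 || sink == NULL) {
        return -1;
    }
    while ((n = fread(chunk, 1, chunk_size, file)) > 0) {
        if (sink(chunk, n, ctx) != 0) {
            return -1;
        }
    }
    return ferror(file) ? -1 : 0;
}

/* Example sink that only counts what it receives. In a Hyperscan program the
 * sink would instead pass data and len to hs_scan_stream(). */
typedef struct {
    size_t chunks;
    size_t bytes;
} chunk_stats;

int count_sink(const char *data, size_t len, void *ctx) {
    chunk_stats *stats = (chunk_stats *)ctx;
    (void)data; /* contents are not inspected in this sketch */
    stats->chunks++;
    stats->bytes += len;
    return 0;
}
```

One consequence of scanning in chunks: the from and to values reported to the match callback are offsets within the whole stream, not within the current chunk, so the string + from expression used in the demo's callback would no longer point at the right bytes. A chunked program would need to keep enough recent data around to recover the matched text, or report offsets only.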

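Note that the mode argument includes HS_MODE_SOM_HORIZON_LARGE. In Hyperscan's stream mode (HS_MODE_STREAM), using the start-of-match (SOM) expression flag HS_FLAG_SOM_LEFTMOST requires a SOM horizon mode to be specified as well. Hyperscan offers HS_MODE_SOM_HORIZON_SMALL, HS_MODE_SOM_HORIZON_MEDIUM and HS_MODE_SOM_HORIZON_LARGE, which trade stream-state memory against SOM precision. SMALL, MEDIUM and LARGE describe how far back from the end of a match the start offset can still be reported accurately: according to the Hyperscan documentation, within 2^16 bytes for SMALL, within 2^32 bytes for MEDIUM, and with full precision for LARGE. The larger the horizon, the more resources are used.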
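Both demos in this section load each file into memory with fseek, ftell and fread before scanning. A minimal sketch of a reusable loader with explicit checks is given below, as an assumption-laden illustration rather than the article's own code: read_whole_file is a hypothetical helper. It rejects files whose size does not fit in an unsigned int, because the length parameter of hs_scan() has that type.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into a malloc'd, NUL-terminated buffer.
 * On success returns the buffer and stores its length in *len_out;
 * the caller frees the buffer. Returns NULL on any failure. */
char *read_whole_file(const char *path, unsigned int *len_out) {
    FILE *file = fopen(path, "rb"); /* binary mode keeps byte offsets exact */
    char *buffer = NULL;
    long size = 0;

    if (file == NULL) {
        return NULL;
    }
    if (fseek(file, 0, SEEK_END) != 0 || (size = ftell(file)) < 0 ||
        fseek(file, 0, SEEK_SET) != 0 || (unsigned long)size > UINT_MAX) {
        fclose(file);
        return NULL;
    }
    /* One extra byte so that an empty file still yields a valid buffer. */
    buffer = malloc((size_t)size + 1);
    if (buffer != NULL && fread(buffer, 1, (size_t)size, file) != (size_t)size) {
        free(buffer); /* short read: treat as failure */
        buffer = NULL;
    }
    fclose(file);
    if (buffer != NULL) {
        buffer[size] = '\0';
        *len_out = (unsigned int)size;
    }
    return buffer;
}
```

The returned buffer and length could be passed directly as the data and length arguments of hs_scan() or hs_scan_stream(), with the same buffer as the callback context.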

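0x03 Summary

Hyperscan offers a powerful and flexible way to handle regular expression matching quickly and efficiently, especially when processing large numbers of small files. In log analysis, data extraction and security monitoring alike, Hyperscan has proven its value as a high-performance regular expression matching tool.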


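Disclaimer: this article is intended only for security research and discussion. Use for illegal purposes is strictly prohibited, and violators bear the consequences themselves.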


Originally published on the WeChat official account 宸极实验室: [Misc] Efficient Text Analysis: Regular Expression Matching with Hyperscan
