[When AI Meets Security] 13. Threat Intelligence Entity Recognition (3): Building a CNN-BiLSTM-ATT-CRF Entity Recognition Model with Keras


The "When AI Meets Security" series introduces papers and hands-on work at the intersection of artificial intelligence and security, sharing cases that cover malicious-code detection, malicious-request identification, intrusion detection, adversarial examples, and more. The goal is simply to help beginners and to share new knowledge more systematically. The series will be more focused, more academic, and more in-depth; it is also a record of the author's gradual growth. Changing fields is genuinely hard, and system security is a tough nut to crack, but I will give it a try and see how far I can take it over the next four years. It is a long march, and I am heading straight for the tiger's den. Enjoy the process, and let's keep at it together!

The previous post covered LSTM-based malicious-request identification. This post explains in detail how to build an attention-based CNN-BiLSTM-ATT-CRF model with Keras and TensorFlow, applies it to Chinese entity recognition, and discusses common mistakes made when building the attention mechanism. It is an introductory article; I hope it helps, and please bear with any errors or shortcomings. Read and cherish!

  • Versions: Python 3.7, TensorFlow 2.2.0, Keras 2.3.1, bert4keras 0.11.5, keras-contrib 2.0.8


Contents

  • I. ATT&CK Data Collection
  • II. Data Preprocessing
  • III. Environment Setup
    • 1. Installing keras-contrib
    • 2. Installing Keras and TensorFlow
  • IV. Building the CNN-BiLSTM-ATT-CRF Model
  • V. Complete Code and Experimental Results
  • VI. Building the Attention Layer and Compatibility Issues
  • VII. Summary

As a beginner in network security, the author shares these basic self-study tutorials, mainly as online notes, and hopes you enjoy them. Even more, I hope you will work through them and improve together with me; later posts will dig deeper into AI security and system security and share the related experiments. In short, I hope the series helps fellow readers. Writing is not easy, so please be kind. If these articles help you, that is the greatest motivation for my writing; likes, comments, and private messages are all welcome. Let's keep going!


The author's GitHub resources:

  • https://github.com/eastmountyxz/AI-Security-Paper

  • https://github.com/eastmountyxz/When-AI-meet-Security

I. ATT&CK Data Collection

Readers familiar with threat intelligence will know MITRE's ATT&CK site. An earlier post in this series described how to crawl the attack tactics and techniques of APT groups from it. The site is:

  • http://attack.mitre.org


Step 1: analyze the ATT&CK page source to locate the APT group names, then crawl them systematically.


Install the BeautifulSoup package first; the code for this part is shown below:


01-get-aptentity.py

#encoding: utf-8
#By: Eastmount CSDN
import re
import requests
from lxml import etree
from bs4 import BeautifulSoup
import urllib.request

#-------------------------------------------------------------------------------------------
#Fetch the APT group names and links
#Browser headers (a dictionary)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}
url = 'https://attack.mitre.org/groups/'

#Send the request
r = requests.get(url=url, headers=headers).text

#Parse the DOM tree
html_etree = etree.HTML(r)
names = html_etree.xpath('//*[@class="table table-bordered table-alternate mt-2"]/tbody/tr/td[2]/a/text()')
print(names)
print(len(names), names[0])
filename = []
for name in names:
    filename.append(name.strip())
print(filename)

#Links
urls = html_etree.xpath('//*[@class="table table-bordered table-alternate mt-2"]/tbody/tr/td[2]/a/@href')
print(urls)
print(len(urls), urls[0])
print("\n")

The output includes the APT group names and their corresponding URLs.


Step 2: visit each APT group's URL and crawl its detailed description (the body text).


Step 3: crawl the corresponding TTPs (tactics, techniques, and procedures) from the techniques table in the page source.


Step 4: write the code that completes the threat-intelligence crawl. The complete code of 01-spider-mitre.py is as follows:

#encoding: utf-8
#By: Eastmount CSDN
import re
import requests
from lxml import etree
from bs4 import BeautifulSoup
import urllib.request

#-------------------------------------------------------------------------------------------
#Fetch the APT group names and links
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}
url = 'https://attack.mitre.org/groups/'
r = requests.get(url=url, headers=headers).text
html_etree = etree.HTML(r)
names = html_etree.xpath('//*[@class="table table-bordered table-alternate mt-2"]/tbody/tr/td[2]/a/text()')
print(names)
print(len(names), names[0])
urls = html_etree.xpath('//*[@class="table table-bordered table-alternate mt-2"]/tbody/tr/td[2]/a/@href')
print(urls)
print(len(urls), urls[0])
print("\n")

#-------------------------------------------------------------------------------------------
#Fetch the detailed information
k = 0
while k < len(names):
    filename = str(names[k]).strip() + ".txt"
    url = "https://attack.mitre.org" + urls[k]
    print(url)

    #Fetch the page
    page = urllib.request.Request(url, headers=headers)
    page = urllib.request.urlopen(page)
    contents = page.read()
    soup = BeautifulSoup(contents, "html.parser")

    #Extract the description
    content = ""
    for tag in soup.find_all(attrs={"class": "description-body"}):
        contents = tag.find_all("p")
        for con in contents:
            content += con.get_text().strip() + "###\n"   #Mark sentence end (used for splitting in part 2)

    #Extract the techniques table
    for tag in soup.find_all(attrs={"class": "table techniques-used table-bordered mt-2"}):
        contents = tag.find("tbody").find_all("tr")
        for con in contents:
            value = con.find("p").get_text()    #Rows have 4 or 5 columns, so read the <p> value
            content += value.strip() + "###\n"  #Mark sentence end (used for splitting in part 2)

    #Strip reference brackets [n] from the text
    result = re.sub(r"\[.*?]", "", content)
    print(result)

    #Write to file
    filename = "Mitre//" + filename
    print(filename)
    f = open(filename, "w", encoding="utf-8")
    f.write(result)
    f.close()

    k += 1

In total, profiles of 100 APT groups were collected.


Each output file stores the group's description and TTP sentences, one sentence per line, each ending with the ### marker used later for sentence splitting.


The data is annotated in a brute-force fashion: we define the entity types of interest and label the text with the BIO scheme (strictly, the BIOES variant with additional S- and E- tags used later in this post). Labels are assigned based on the ATT&CK tactics and techniques; manual correction can follow, and more entity types can be defined later. A small tagging sketch follows the table below.

  • BIO annotation

Entity type        Count   Examples
APT group          128     APT32, Lazarus Group
Vulnerability       56     CVE-2009-0927
Region              72     America, Europe
Target industry     34     companies, finance
Attack technique    65     C&C, RAT, DDoS
Software used       48     7-Zip, Microsoft
Operating system    10     Linux, Windows
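As a minimal illustration of the tagging scheme (the entity types and spans here are hypothetical, chosen only for demonstration), the following sketch converts labeled spans into the BIOES tags used throughout this post:

#Minimal BIOES tagging sketch; entity types/spans are hypothetical
def bioes_tags(tokens, spans):
    """spans: list of (start, end_exclusive, entity_type) tuples."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = "S-" + etype          #single-token entity
        else:
            tags[start] = "B-" + etype          #entity beginning
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + etype          #entity inside
            tags[end - 1] = "E-" + etype        #entity end
    return tags

tokens = ["APT32", "used", "a", "RAT", "against", "finance"]
print(bioes_tags(tokens, [(0, 1, "APT"), (3, 4, "WAY")]))
#['S-APT', 'O', 'O', 'S-WAY', 'O', 'O']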

For more on annotation and preprocessing, please see the earlier post:

  • [When AI Meets Security] 10. Threat Intelligence Entity Recognition: an in-depth guide to BiLSTM-CRF-based entity recognition

Common data-annotation tools:

  • Image annotation: labelme, LabelImg, Labelbox, RectLabel, CVAT, VIA
  • Semi-automatic OCR annotation: PPOCRLabel
  • NLP annotation: Label Studio

A friendly reminder:
The site's layout keeps changing and being optimized, so readers should master the basics of crawling and DOM-tree locating in order to cope with whatever comes. You can also try crawling every section, and even the content behind redirected URLs; a slightly more change-tolerant selector sketch is shown below. Please experiment and extend this on your own!
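For instance, a more change-tolerant way to locate the groups table is to match a partial class name with BeautifulSoup instead of hard-coding the full class string used in the XPath above. This is only a sketch under that assumption; the selector may still need adjusting as the site evolves:

#Hedged sketch: locate the groups table by partial class match
import requests
from bs4 import BeautifulSoup

html = requests.get("https://attack.mitre.org/groups/",
                    headers={"User-Agent": "Mozilla/5.0"}).text
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="table-bordered")   #matches any table carrying this class
for row in table.tbody.find_all("tr"):
    link = row.find("a")
    if link:
        print(link.get_text(strip=True), link["href"])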

II. Data Preprocessing

Assume we already have a crawled and annotated Chinese dataset, usually split character by character (char-level). Readers can experiment with the People's Daily NER corpus, available at the address below; Chinese threat intelligence is handled the same way.

  • http://s3.bmio.net/kashgari/china-people-daily-ner-corpus.tar.gz


You can of course also build your own dataset, including the threat-intelligence corpus described above. The example used in this post is an ancient-Chinese-text corpus, again split character by character; Chinese threat intelligence works the same way.


The dataset is divided into training, validation, and test sets; a split sketch is given below.

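If you build the corpus yourself, the split can be done at the sentence level. Below is a minimal sketch; the input file name, the blank-line sentence delimiter, and the 8:1:1 ratio are all assumptions to adapt to your own data:

#Hypothetical sketch: split sentence-level samples into train/val/test (8:1:1)
import random

with open("corpus.txt", encoding="utf-8") as f:          #assumed input file
    sentences = [s for s in f.read().split("\n\n") if s.strip()]

random.seed(42)
random.shuffle(sentences)
n = len(sentences)
splits = {"train": sentences[:int(0.8 * n)],
          "val":   sentences[int(0.8 * n):int(0.9 * n)],
          "test":  sentences[int(0.9 * n):]}
for name, part in splits.items():
    with open(name + ".txt", "w", encoding="utf-8") as f:
        f.write("\n\n".join(part) + "\n")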

III. Environment Setup

1. Installing keras-contrib

For the CRF layer, the author installed keras-contrib.

Step 1: installing directly with "pip install keras-contrib" may fail, and the remote install below may fail as well:

  • pip install git+https://www.github.com/keras-team/keras-contrib.git

It can even fail with: ModuleNotFoundError: No module named 'keras_contrib'.


Step 2: the author downloaded the repository from GitHub and installed it locally.

  • https://github.com/keras-team/keras-contrib

  • keras-contrib version: 2.0.8

git clone https://www.github.com/keras-team/keras-contrib.git
cd keras-contrib
python setup.py install

After a successful installation, the quick check below confirms it:

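A minimal sanity check: import the CRF layer and its loss/metric to verify the package works.

#Quick import check for keras-contrib
import keras_contrib
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy
print(keras_contrib.__version__)   #expect 2.0.8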

Readers can download the code and packages from my repository:

  • https://github.com/eastmountyxz/When-AI-meet-Security

2. Installing Keras and TensorFlow

You also need the Keras and TensorFlow packages.


If TensorFlow downloads too slowly, set the Tsinghua University mirror; version 2.2 is what was actually installed:

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install tensorflow==2.2

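To confirm the installed versions match the combination used in this post, a quick check:

import tensorflow as tf
import keras
print(tf.__version__)      #expect 2.2.0
print(keras.__version__)   #expect 2.3.1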

IV. Building the CNN-BiLSTM-ATT-CRF Model

Step 1: import the packages.

import re
import os
import csv
import sys
import numpy as np
import tensorflow as tf
import keras
from keras.models import Model
from keras.layers import LSTM, GRU, Activation, Dense, Dropout, Input, Embedding, Permute
from keras.layers import Convolution1D, MaxPool1D, Flatten, TimeDistributed, Masking
from keras.layers import Lambda, RepeatVector   #needed by the attention block when SINGLE_ATTENTION_VECTOR is True
from keras.optimizers import RMSprop
from keras.layers import Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential
from keras.layers.merge import concatenate
from keras import backend as K
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy

Step 2: data preprocessing and parameter setup.

train_data_path = "data/train.csv"
test_data_path = "data/test.csv"
val_data_path = "data/val.csv"
char_vocab_path = "char_vocabs_.txt"   #Vocabulary file (generated once, then only read)
special_words = ['<PAD>', '<UNK>']     #Special tokens
final_words = []                       #Vocabulary (unique)
final_labels = []                      #Labels (unique)

#BIOES labels; 'O' is mapped to 0
label2idx = {'O': 0,
             'S-LOC': 1, 'B-LOC': 2,  'I-LOC': 3,  'E-LOC': 4,
             'S-PER': 5, 'B-PER': 6,  'I-PER': 7,  'E-PER': 8,
             'S-TIM': 9, 'B-TIM': 10, 'E-TIM': 11, 'I-TIM': 12
             }
print(label2idx)
#{'O': 0, 'S-LOC': 1, 'B-LOC': 2, ..., 'I-TIM': 12}

#Index-to-label mapping
idx2label = {idx: label for label, idx in label2idx.items()}
print(idx2label)
#{0: 'O', 1: 'S-LOC', 2: 'B-LOC', ..., 12: 'I-TIM'}

#Read the character vocabulary
with open(char_vocab_path, "r", encoding="utf8") as fo:
    char_vocabs = [line.strip() for line in fo]
char_vocabs = special_words + char_vocabs
print(char_vocabs)
#['<PAD>', '<UNK>', '晉', '樂', '王', '鮒', '曰', ':', '小', '旻', ...]

#Character/index mappings
idx2vocab = {idx: char for idx, char in enumerate(char_vocabs)}
vocab2idx = {char: idx for idx, char in idx2vocab.items()}
print(idx2vocab)
#{0: '<PAD>', 1: '<UNK>', 2: '晉', 3: '樂', ...}
print(vocab2idx)
#{'<PAD>': 0, '<UNK>': 1, '晉': 2, '樂': 3, ...}

Step 3: define the data-reading function.

def read_corpus(corpus_path, vocab2idx, label2idx):
    datas, labels = [], []
    with open(corpus_path, encoding='utf-8') as fr:
        lines = fr.readlines()
    sent_, tag_ = [], []
    for line in lines:
        line = line.strip()
        if line != '':          #Inside a sentence
            value = line.split(",")
            word, label = value[0], value[4]
            #Append characters and tags one by one, e.g. ['晉', '樂'] ['S-LOC', 'B-PER']
            sent_.append(word)
            tag_.append(label)
        else:                   #Blank line = sentence boundary; vocab2idx[0] => <PAD>
            sent_ids = [vocab2idx[char] if char in vocab2idx else vocab2idx['<UNK>'] for char in sent_]
            tag_ids = [label2idx[label] if label in label2idx else 0 for label in tag_]
            datas.append(sent_ids)   #Append sentence by sentence
            labels.append(tag_ids)
            sent_, tag_ = [], []
    return datas, labels

#Raw data
train_datas_, train_labels_ = read_corpus(train_data_path, vocab2idx, label2idx)
test_datas_, test_labels_ = read_corpus(test_data_path, vocab2idx, label2idx)
val_datas_, val_labels_ = read_corpus(val_data_path, vocab2idx, label2idx)

#Sanity check (the fifth sentence)
print(len(train_datas_), len(train_labels_), len(test_datas_),
      len(test_labels_), len(val_datas_), len(val_labels_))
print(train_datas_[5])
print([idx2vocab[idx] for idx in train_datas_[5]])
print(train_labels_[5])
print([idx2label[idx] for idx in train_labels_[5]])
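Note that read_corpus assumes comma-separated rows with the character in column 0 and the BIOES tag in column 4, plus a blank line between sentences. The exact column layout depends on your own preprocessing; the middle columns below (col1-col3) are hypothetical placeholders:

晉,col1,col2,col3,S-LOC
樂,col1,col2,col3,B-PER
王,col1,col2,col3,I-PER
鮒,col1,col2,col3,E-PER
曰,col1,col2,col3,O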

Step 4: sequence padding and one-hot encoding.

MAX_LEN = 100
VOCAB_SIZE = len(vocab2idx)
CLASS_NUMS = len(label2idx)

#Padding
print('padding sequences')
train_datas = sequence.pad_sequences(train_datas_, maxlen=MAX_LEN)
train_labels = sequence.pad_sequences(train_labels_, maxlen=MAX_LEN)
test_datas = sequence.pad_sequences(test_datas_, maxlen=MAX_LEN)
test_labels = sequence.pad_sequences(test_labels_, maxlen=MAX_LEN)
print('x_train shape:', train_datas.shape)
print('x_test shape:', test_datas.shape)
#(15362, 100) (1919, 100)

#One-hot encode the labels
train_labels = keras.utils.to_categorical(train_labels, CLASS_NUMS)
test_labels = keras.utils.to_categorical(test_labels, CLASS_NUMS)
print('trainlabels shape:', train_labels.shape)
print('testlabels shape:', test_labels.shape)
#(15362, 100, 13) (1919, 100, 13)
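Note that pad_sequences pre-pads by default, so the real tokens occupy the last positions of each padded row; this is exactly why the prediction step later slices with [-len(test_datas_[5]):]. A minimal check:

from keras.preprocessing import sequence
print(sequence.pad_sequences([[7, 8, 9]], maxlen=5))
#[[0 0 7 8 9]]  -> zeros are padded in front; real tokens sit at the end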

Step 5: build the attention mechanism.

K.clear_session()

SINGLE_ATTENTION_VECTOR = False
def attention_3d_block(inputs):
    #inputs.shape = (batch_size, time_steps, input_dim)
    input_dim = int(inputs.shape[2])
    a = inputs
    a = Dense(input_dim, activation='softmax')(a)
    if SINGLE_ATTENTION_VECTOR:
        a = Lambda(lambda x: K.mean(x, axis=1), name='dim_reduction')(a)
        a = RepeatVector(input_dim)(a)
    a_probs = Permute((1, 2), name='attention_vec')(a)
    #output_attention_mul = merge([inputs, a_probs], name='attention_mul', mode='mul')
    output_attention_mul = concatenate([inputs, a_probs])
    return output_attention_mul
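For reference, a more conventional attention that still preserves the time dimension (and therefore remains CRF-compatible) weights each timestep element-wise instead of concatenating. This is a hedged alternative sketch, not the code used in this post:

#Alternative sketch: multiplicative attention that keeps the output 3D
from keras.layers import Dense, Multiply

def attention_3d_block_mul(inputs):
    #inputs: (batch_size, time_steps, input_dim)
    input_dim = int(inputs.shape[2])
    a_probs = Dense(input_dim, activation='softmax')(inputs)   #per-timestep weights
    return Multiply()([inputs, a_probs])                       #shape preserved: (batch, time, dim)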

Step 6: build the ATT+CNN-BiLSTM+CRF model.

EPOCHS = 2
EMBED_DIM = 128
HIDDEN_SIZE = 64
MAX_LEN = 100
VOCAB_SIZE = len(vocab2idx)
CLASS_NUMS = len(label2idx)

#Build the model
inputs = Input(shape=(MAX_LEN,), dtype='int32')
x = Masking(mask_value=0)(inputs)
x = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=False)(x)  #mask disabled (False)

#CNN
cnn1 = Convolution1D(64, 3, padding='same', strides=1, activation='relu')(x)
cnn1 = MaxPool1D(pool_size=1)(cnn1)
cnn2 = Convolution1D(64, 4, padding='same', strides=1, activation='relu')(x)
cnn2 = MaxPool1D(pool_size=1)(cnn2)
cnn3 = Convolution1D(64, 5, padding='same', strides=1, activation='relu')(x)
cnn3 = MaxPool1D(pool_size=1)(cnn3)
cnn = concatenate([cnn1, cnn2, cnn3], axis=-1)

#BiLSTM
bilstm = Bidirectional(LSTM(64, return_sequences=True))(cnn)  #keep the output 3-dimensional
layer = Dense(64, activation='relu')(bilstm)
layer = Dropout(0.3)(layer)

#Attention
attention_mul = attention_3d_block(layer)  #(None, 100, 128)

x = TimeDistributed(Dense(CLASS_NUMS))(attention_mul)
outputs = CRF(CLASS_NUMS)(x)
model = Model(inputs=inputs, outputs=outputs)
model.summary()

Step 7: train the model and predict.

flag = "train"
if flag == "train":
    #Train the model
    model.compile(loss=crf_loss, optimizer='adam', metrics=[crf_viterbi_accuracy])
    model.fit(train_datas, train_labels, epochs=EPOCHS, verbose=1, validation_split=0.1)
    score = model.evaluate(test_datas, test_labels, batch_size=256)
    print(model.metrics_names)
    print(score)
    model.save("att_cnn_crf_bilstm_ner_model.h5")
elif flag == "test":
    #Load the trained model and predict
    char_vocab_path = "char_vocabs_.txt"                #Vocabulary file
    model_path = "att_cnn_crf_bilstm_ner_model.h5"      #Model file
    ner_labels = label2idx
    special_words = ['<PAD>', '<UNK>']
    MAX_LEN = 100

    #Predict
    model = load_model(model_path, custom_objects={'CRF': CRF}, compile=False)
    y_pred = model.predict(test_datas)
    y_labels = np.argmax(y_pred, axis=2)         #Predicted label indices
    z_labels = np.argmax(test_labels, axis=2)    #Ground-truth label indices
    word_labels = test_datas                     #Input characters

    k = 0
    final_y = []       #Predicted labels
    final_z = []       #Ground-truth labels
    final_word = []    #Corresponding characters
    while k < len(y_labels):
        y = y_labels[k]
        for idx in y:
            final_y.append(idx2label[idx])
        z = z_labels[k]
        for idx in z:
            final_z.append(idx2label[idx])
        word = word_labels[k]
        for idx in word:
            final_word.append(idx2vocab[idx])
        k += 1
    print("Total size:", len(final_y), len(final_z))  #191900 191900

    n = 0
    numError = 0
    numRight = 0
    while n < len(final_y):
        if final_y[n] != final_z[n] and final_z[n] != 'O':
            numError += 1
        if final_y[n] == final_z[n] and final_z[n] != 'O':
            numRight += 1
        n += 1
    print("Wrong predictions:", numError)
    print("Right predictions:", numRight)
    print("Acc:", numRight * 1.0 / (numError + numRight))
    print(y_pred.shape, len(test_datas_), len(test_labels_))
    print("Characters:", [idx2vocab[idx] for idx in test_datas_[5]])
    print("Ground truth:", [idx2label[idx] for idx in test_labels_[5]])
    print("Prediction:", [idx2label[idx] for idx in y_labels[5]][-len(test_datas_[5]):])

    #Save the results
    fw = open("Final_ATT_CNN_BiLSTM_CRF_Result.csv", "w", encoding="utf8", newline='')
    fwrite = csv.writer(fw)
    fwrite.writerow(['pre_label', 'real_label', 'word'])
    n = 0
    while n < len(final_y):
        fwrite.writerow([final_y[n], final_z[n], final_word[n]])
        n += 1
    fw.close()

V. Complete Code and Experimental Results

The complete code is as follows:

# encoding: utf-8
# By: Eastmount 2024-03-29
# keras-contrib=2.0.8  Keras=2.3.1  tensorflow=2.2.0  tensorflow-gpu=2.2.0  bert4keras=0.11.5
import re
import os
import csv
import sys
import numpy as np
import tensorflow as tf
import keras
from keras.models import Model
from keras.layers import LSTM, GRU, Activation, Dense, Dropout, Input, Embedding, Permute
from keras.layers import Convolution1D, MaxPool1D, Flatten, TimeDistributed, Masking
from keras.layers import Lambda, RepeatVector   #needed when SINGLE_ATTENTION_VECTOR is True
from keras.optimizers import RMSprop
from keras.layers import Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential
from keras.layers.merge import concatenate
from keras import backend as K
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy

#------------------------------------------------------------------------
#Step 1: data preprocessing
#------------------------------------------------------------------------
train_data_path = "data/train.csv"
test_data_path = "data/test.csv"
val_data_path = "data/val.csv"
char_vocab_path = "char_vocabs_.txt"   #Vocabulary file (generated once, then only read)
special_words = ['<PAD>', '<UNK>']     #Special tokens
final_words = []                       #Vocabulary (unique)
final_labels = []                      #Labels (unique)

#BIOES labels; 'O' is mapped to 0
label2idx = {'O': 0,
             'S-LOC': 1, 'B-LOC': 2,  'I-LOC': 3,  'E-LOC': 4,
             'S-PER': 5, 'B-PER': 6,  'I-PER': 7,  'E-PER': 8,
             'S-TIM': 9, 'B-TIM': 10, 'E-TIM': 11, 'I-TIM': 12
             }
print(label2idx)
#{'O': 0, 'S-LOC': 1, 'B-LOC': 2, ..., 'I-TIM': 12}

#Index-to-label mapping
idx2label = {idx: label for label, idx in label2idx.items()}
print(idx2label)
#{0: 'O', 1: 'S-LOC', 2: 'B-LOC', ..., 12: 'I-TIM'}

#Read the character vocabulary
with open(char_vocab_path, "r", encoding="utf8") as fo:
    char_vocabs = [line.strip() for line in fo]
char_vocabs = special_words + char_vocabs
print(char_vocabs)
#['<PAD>', '<UNK>', '晉', '樂', '王', '鮒', '曰', ':', '小', '旻', ...]

#Character/index mappings
idx2vocab = {idx: char for idx, char in enumerate(char_vocabs)}
vocab2idx = {char: idx for idx, char in idx2vocab.items()}
print(idx2vocab)
#{0: '<PAD>', 1: '<UNK>', 2: '晉', 3: '樂', ...}
print(vocab2idx)
#{'<PAD>': 0, '<UNK>': 1, '晉': 2, '樂': 3, ...}

#------------------------------------------------------------------------
#Step 2: read the data
#------------------------------------------------------------------------
def read_corpus(corpus_path, vocab2idx, label2idx):
    datas, labels = [], []
    with open(corpus_path, encoding='utf-8') as fr:
        lines = fr.readlines()
    sent_, tag_ = [], []
    for line in lines:
        line = line.strip()
        if line != '':          #Inside a sentence
            value = line.split(",")
            word, label = value[0], value[4]
            #Append characters and tags one by one, e.g. ['晉', '樂'] ['S-LOC', 'B-PER']
            sent_.append(word)
            tag_.append(label)
        else:                   #Blank line = sentence boundary; vocab2idx[0] => <PAD>
            sent_ids = [vocab2idx[char] if char in vocab2idx else vocab2idx['<UNK>'] for char in sent_]
            tag_ids = [label2idx[label] if label in label2idx else 0 for label in tag_]
            datas.append(sent_ids)   #Append sentence by sentence
            labels.append(tag_ids)
            sent_, tag_ = [], []
    return datas, labels

#Raw data
train_datas_, train_labels_ = read_corpus(train_data_path, vocab2idx, label2idx)
test_datas_, test_labels_ = read_corpus(test_data_path, vocab2idx, label2idx)
val_datas_, val_labels_ = read_corpus(val_data_path, vocab2idx, label2idx)

#Sanity check (the fifth sentence)
print(len(train_datas_), len(train_labels_), len(test_datas_),
      len(test_labels_), len(val_datas_), len(val_labels_))
print(train_datas_[5])
print([idx2vocab[idx] for idx in train_datas_[5]])
print(train_labels_[5])
print([idx2label[idx] for idx in train_labels_[5]])

#------------------------------------------------------------------------
#Step 3: padding and one-hot encoding
#------------------------------------------------------------------------
MAX_LEN = 100
VOCAB_SIZE = len(vocab2idx)
CLASS_NUMS = len(label2idx)

#Padding
print('padding sequences')
train_datas = sequence.pad_sequences(train_datas_, maxlen=MAX_LEN)
train_labels = sequence.pad_sequences(train_labels_, maxlen=MAX_LEN)
test_datas = sequence.pad_sequences(test_datas_, maxlen=MAX_LEN)
test_labels = sequence.pad_sequences(test_labels_, maxlen=MAX_LEN)
print('x_train shape:', train_datas.shape)
print('x_test shape:', test_datas.shape)
#(15362, 100) (1919, 100)

#One-hot encode the labels
train_labels = keras.utils.to_categorical(train_labels, CLASS_NUMS)
test_labels = keras.utils.to_categorical(test_labels, CLASS_NUMS)
print('trainlabels shape:', train_labels.shape)
print('testlabels shape:', test_labels.shape)
#(15362, 100, 13) (1919, 100, 13)

#------------------------------------------------------------------------
#Step 4: attention mechanism
#------------------------------------------------------------------------
K.clear_session()

SINGLE_ATTENTION_VECTOR = False
def attention_3d_block(inputs):
    #inputs.shape = (batch_size, time_steps, input_dim)
    input_dim = int(inputs.shape[2])
    a = inputs
    a = Dense(input_dim, activation='softmax')(a)
    if SINGLE_ATTENTION_VECTOR:
        a = Lambda(lambda x: K.mean(x, axis=1), name='dim_reduction')(a)
        a = RepeatVector(input_dim)(a)
    a_probs = Permute((1, 2), name='attention_vec')(a)
    #output_attention_mul = merge([inputs, a_probs], name='attention_mul', mode='mul')
    output_attention_mul = concatenate([inputs, a_probs])
    return output_attention_mul

#------------------------------------------------------------------------
#Step 5: build the ATT+CNN-BiLSTM+CRF model
#------------------------------------------------------------------------
EPOCHS = 2
EMBED_DIM = 128
HIDDEN_SIZE = 64
MAX_LEN = 100
VOCAB_SIZE = len(vocab2idx)
CLASS_NUMS = len(label2idx)
print(VOCAB_SIZE, CLASS_NUMS)  #3319 13

#Build the model
inputs = Input(shape=(MAX_LEN,), dtype='int32')
x = Masking(mask_value=0)(inputs)
x = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=False)(x)  #mask disabled (False)

#CNN
cnn1 = Convolution1D(64, 3, padding='same', strides=1, activation='relu')(x)
cnn1 = MaxPool1D(pool_size=1)(cnn1)
cnn2 = Convolution1D(64, 4, padding='same', strides=1, activation='relu')(x)
cnn2 = MaxPool1D(pool_size=1)(cnn2)
cnn3 = Convolution1D(64, 5, padding='same', strides=1, activation='relu')(x)
cnn3 = MaxPool1D(pool_size=1)(cnn3)
cnn = concatenate([cnn1, cnn2, cnn3], axis=-1)
print(cnn.shape)   #(None, 100, 192)

#BiLSTM
bilstm = Bidirectional(LSTM(64, return_sequences=True))(cnn)  #keep the output 3-dimensional
layer = Dense(64, activation='relu')(bilstm)
layer = Dropout(0.3)(layer)
print(layer.shape)  #(None, 100, 64)

#Attention
attention_mul = attention_3d_block(layer)  #(None, 100, 128)
print(attention_mul.shape)

x = TimeDistributed(Dense(CLASS_NUMS))(attention_mul)
print(x.shape)       #(None, 100, 13)
outputs = CRF(CLASS_NUMS)(x)
print(outputs.shape) #(None, 100, 13)
print(inputs.shape)  #(None, 100)
model = Model(inputs=inputs, outputs=outputs)
model.summary()

#------------------------------------------------------------------------
#Step 6: train and predict
#------------------------------------------------------------------------
flag = "train"
if flag == "train":
    #Train the model
    model.compile(loss=crf_loss, optimizer='adam', metrics=[crf_viterbi_accuracy])
    model.fit(train_datas, train_labels, epochs=EPOCHS, verbose=1, validation_split=0.1)
    score = model.evaluate(test_datas, test_labels, batch_size=256)
    print(model.metrics_names)
    print(score)
    model.save("att_cnn_crf_bilstm_ner_model.h5")
elif flag == "test":
    #Load the trained model and predict
    char_vocab_path = "char_vocabs_.txt"                #Vocabulary file
    model_path = "att_cnn_crf_bilstm_ner_model.h5"      #Model file
    ner_labels = label2idx
    special_words = ['<PAD>', '<UNK>']
    MAX_LEN = 100

    #Predict
    model = load_model(model_path, custom_objects={'CRF': CRF}, compile=False)
    y_pred = model.predict(test_datas)
    y_labels = np.argmax(y_pred, axis=2)         #Predicted label indices
    z_labels = np.argmax(test_labels, axis=2)    #Ground-truth label indices
    word_labels = test_datas                     #Input characters

    k = 0
    final_y = []       #Predicted labels
    final_z = []       #Ground-truth labels
    final_word = []    #Corresponding characters
    while k < len(y_labels):
        y = y_labels[k]
        for idx in y:
            final_y.append(idx2label[idx])
        z = z_labels[k]
        for idx in z:
            final_z.append(idx2label[idx])
        word = word_labels[k]
        for idx in word:
            final_word.append(idx2vocab[idx])
        k += 1
    print("Total size:", len(final_y), len(final_z))  #191900 191900

    n = 0
    numError = 0
    numRight = 0
    while n < len(final_y):
        if final_y[n] != final_z[n] and final_z[n] != 'O':
            numError += 1
        if final_y[n] == final_z[n] and final_z[n] != 'O':
            numRight += 1
        n += 1
    print("Wrong predictions:", numError)
    print("Right predictions:", numRight)
    print("Acc:", numRight * 1.0 / (numError + numRight))
    print(y_pred.shape, len(test_datas_), len(test_labels_))
    print("Characters:", [idx2vocab[idx] for idx in test_datas_[5]])
    print("Ground truth:", [idx2label[idx] for idx in test_labels_[5]])
    print("Prediction:", [idx2label[idx] for idx in y_labels[5]][-len(test_datas_[5]):])

    #Save the results
    fw = open("Final_ATT_CNN_BiLSTM_CRF_Result.csv", "w", encoding="utf8", newline='')
    fwrite = csv.writer(fw)
    fwrite.writerow(['pre_label', 'real_label', 'word'])
    n = 0
    while n < len(final_y):
        fwrite.writerow([final_y[n], final_z[n], final_word[n]])
        n += 1
    fw.close()

The summary of the constructed model is as follows:

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, 100)          0
__________________________________________________________________________________________________
masking_1 (Masking)             (None, 100)          0           input_1[0][0]
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 100, 128)     971904      masking_1[0][0]
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 100, 64)      24640       embedding_1[0][0]
__________________________________________________________________________________________________
conv1d_2 (Conv1D)               (None, 100, 64)      32832       embedding_1[0][0]
__________________________________________________________________________________________________
conv1d_3 (Conv1D)               (None, 100, 64)      41024       embedding_1[0][0]
__________________________________________________________________________________________________
max_pooling1d_1 (MaxPooling1D)  (None, 100, 64)      0           conv1d_1[0][0]
__________________________________________________________________________________________________
max_pooling1d_2 (MaxPooling1D)  (None, 100, 64)      0           conv1d_2[0][0]
__________________________________________________________________________________________________
max_pooling1d_3 (MaxPooling1D)  (None, 100, 64)      0           conv1d_3[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 100, 192)     0           max_pooling1d_1[0][0]
                                                                 max_pooling1d_2[0][0]
                                                                 max_pooling1d_3[0][0]
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (None, 100, 128)     131584      concatenate_1[0][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 100, 64)      8256        bidirectional_1[0][0]
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 100, 64)      0           dense_1[0][0]
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 100, 64)      4160        dropout_1[0][0]
__________________________________________________________________________________________________
attention_vec (Permute)         (None, 100, 64)      0           dense_2[0][0]
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (None, 100, 128)     0           dropout_1[0][0]
                                                                 attention_vec[0][0]
__________________________________________________________________________________________________
time_distributed_1 (TimeDistrib (None, 100, 13)      1677        concatenate_2[0][0]
__________________________________________________________________________________________________
crf_1 (CRF)                     (None, 100, 13)      377         time_distributed_1[0][0]
==================================================================================================
Total params: 1,216,454
Trainable params: 1,216,454
Non-trainable params: 0
__________________________________________________________________________________________________

Partial output, including the training process:

Using TensorFlow backend.
{'O': 0, 'S-LOC': 1, 'B-LOC': 2, 'I-LOC': 3, 'E-LOC': 4, 'S-PER': 5, 'B-PER': 6, 'I-PER': 7, 'E-PER': 8, 'S-TIM': 9, 'B-TIM': 10, 'E-TIM': 11, 'I-TIM': 12}
{0: 'O', 1: 'S-LOC', 2: 'B-LOC', 3: 'I-LOC', 4: 'E-LOC', 5: 'S-PER', 6: 'B-PER', 7: 'I-PER', 8: 'E-PER', 9: 'S-TIM', 10: 'B-TIM', 11: 'E-TIM', 12: 'I-TIM'}
['齊', '、', '衛', '、', '陳', '大', '夫', '其', '不', '免', '乎', '!']
[1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
['S-LOC', 'O', 'S-LOC', 'O', 'S-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Epoch 1/2
   32/13825 [..............................] - ETA: 5:54 - loss: 2.6212 - crf_viterbi_accuracy: 6.2500e-04
   64/13825 [..............................] - ETA: 3:20 - loss: 2.5952 - crf_viterbi_accuracy: 0.0112
   96/13825 [..............................] - ETA: 2:45 - loss: 2.5627 - crf_viterbi_accuracy: 0.0517
  128/13825 [..............................] - ETA: 2:37 - loss: 2.5237 - crf_viterbi_accuracy: 0.0862
  ...
13792/13825 [============================>.] - ETA: 0s - loss: 0.0227 - crf_viterbi_accuracy: 0.9934
13824/13825 [============================>.] - ETA: 0s - loss: 0.0227 - crf_viterbi_accuracy: 0.9934
13825/13825 [==============================] - 171s 12ms/step - loss: 0.0227 - crf_viterbi_accuracy: 0.9934 - val_loss: 0.0208 - val_crf_viterbi_accuracy: 0.9938

The final prediction results are as follows:

Wrong predictions: 1004
Right predictions: 3395
Acc: 0.7717663105251193
Characters: ['冬', ',', '楚', '公', '子', '罷', '如', '晉', '聘', ',', '且', '涖', '盟', '。']
Ground truth: ['O', 'O', 'B-PER', 'I-PER', 'I-PER', 'E-PER', 'O', 'S-LOC', 'O', 'O', 'O', 'O', 'O', 'O']
Prediction: ['O', 'O', 'S-LOC', 'B-PER', 'I-PER', 'E-PER', 'O', 'S-LOC', 'O', 'O', 'O', 'O', 'O', 'O']

The predictions are also saved to a CSV file (Final_ATT_CNN_BiLSTM_CRF_Result.csv).

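Token-level accuracy tends to overstate NER performance, so an entity-level precision/recall/F1 over the saved label sequences gives a fairer picture. Below is a minimal sketch assuming BIOES tag lists such as final_y and final_z above (it matches B-/E- pairs by position and ignores rare malformed spans):

#Hypothetical sketch: entity-level P/R/F1 from BIOES tag sequences
def extract_entities(tags):
    """Return a set of (start, end, type) spans from a BIOES tag list."""
    spans, start = set(), None
    for i, tag in enumerate(tags):
        if tag.startswith("S-"):
            spans.add((i, i, tag[2:]))
        elif tag.startswith("B-"):
            start = i
        elif tag.startswith("E-") and start is not None:
            spans.add((start, i, tag[2:]))
            start = None
    return spans

def entity_f1(pred_tags, true_tags):
    p, t = extract_entities(pred_tags), extract_entities(true_tags)
    correct = len(p & t)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(t) if t else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(entity_f1(["S-LOC", "B-PER", "I-PER", "E-PER"],
                ["S-LOC", "B-PER", "I-PER", "E-PER"]))   #(1.0, 1.0, 1.0)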

VI. Building the Attention Layer and Compatibility Issues

The attention mechanism in the code above differs slightly from the usual implementation. Why?

The reason is that the CRF layer requires a 3D sequence input and must not lose the time dimension, whereas traditional attention collapses it, reducing the output by one dimension (see the shape comparison below). That mismatch triggers assorted errors and ultimately keeps the CRF from running. Common errors include:

  • AttributeError: 'NoneType' object has no attribute '_inbound_nodes'
  • AttributeError: 'tuple' object has no attribute 'layer'
  • AttributeError: 'Node' object has no attribute 'output_masks'
  • AttributeError: 'InputLayer' object has no attribute 'outbound_nodes'
  • TypeError: The added layer must be an instance of class Layer.

In addition, since Keras 2.0 the same layers can also be imported through tensorflow.keras, and mixing the two import styles causes further errors. In the end, the model was implemented with the shape-preserving attention shown above. Frankly, TensorFlow/Keras version conflicts are a constant headache; I recommend switching to PyTorch going forward, and later posts will migrate accordingly.

Current approach (shape-preserving attention):
(None, 100, 192) -> (None, 100, 64) -> (None, 100, 128) -> (None, 100, 13) -> (None, 100, 13)

Traditional approach (attention collapses the time dimension):
(None, 100, 192) -> (None, 100, 64) -> (None, 64) -> (None, 13) -> (None, 13)


The traditional attention code is as follows:

# Hierarchical Model with Attention
from keras import initializers
from keras import constraints
from keras import activations
from keras import regularizers
from keras import backend as K
from keras.layers import Layer   #keras.engine.topology was removed in Keras 2.3; keras.layers.Layer works here

K.clear_session()

class AttentionLayer(Layer):
    def __init__(self, attention_size=None, **kwargs):
        self.attention_size = attention_size
        super(AttentionLayer, self).__init__(**kwargs)

    def get_config(self):
        config = super().get_config()
        config['attention_size'] = self.attention_size
        return config

    def build(self, input_shape):
        assert len(input_shape) == 3
        self.time_steps = input_shape[1]
        hidden_size = input_shape[2]
        if self.attention_size is None:
            self.attention_size = hidden_size
        self.W = self.add_weight(name='att_weight', shape=(hidden_size, self.attention_size),
                                 initializer='uniform', trainable=True)
        self.b = self.add_weight(name='att_bias', shape=(self.attention_size,),
                                 initializer='uniform', trainable=True)
        self.V = self.add_weight(name='att_var', shape=(self.attention_size,),
                                 initializer='uniform', trainable=True)
        super(AttentionLayer, self).build(input_shape)

    def call(self, inputs):
        V = K.reshape(self.V, (-1, 1))   #use a local reshape instead of overwriting the weight
        H = K.tanh(K.dot(inputs, self.W) + self.b)
        score = K.softmax(K.dot(H, V), axis=1)
        outputs = K.sum(score * inputs, axis=1)   #sums over time_steps: the output becomes 2D
        return outputs

    def compute_output_shape(self, input_shape):
        return input_shape[0], input_shape[2]

att = AttentionLayer(attention_size=50)(layer)   #output is (batch, hidden): the CRF can no longer follow

VII. Summary

This is where the post ends; I hope it helps. The follow-up will bring in the classic BERT. 2024 has been truly hectic with projects, proposals, papers, graduation, and work; once things settle I will write a few solid security posts. Thanks for the support and companionship, especially my family's encouragement. Keep going!

  • Keras code download: https://github.com/eastmountyxz/AI-for-Keras

Commonly used packages for BERT-based NER in Keras include the following (a minimal bert4keras skeleton is sketched after the list):

  • bert4keras

    – from bert4keras.models import build_transformer_model
    – from bert4keras.tokenizers import Tokenizer
    – from bert4keras.layers import ConditionalRandomField

  • kashgari

    – from kashgari.embeddings import BERTEmbedding
    – from kashgari.tasks.seq_labeling import BLSTMCRFModel

  • keras_bert

    – from keras_bert import Tokenizer

  • bert_serving

    – from bert_serving.server import BertServer
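As a preview of the BERT follow-up, a typical bert4keras BERT-CRF skeleton looks like the sketch below. This is hedged: the paths are placeholders, num_labels and the learning rates are assumptions, and the calls follow bert4keras 0.11.x conventions:

#Hypothetical bert4keras BERT-CRF skeleton; paths are placeholders
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer
from bert4keras.layers import ConditionalRandomField
from keras.layers import Dense
from keras.models import Model
from keras.optimizers import Adam

config_path = "bert/bert_config.json"       #placeholder
checkpoint_path = "bert/bert_model.ckpt"    #placeholder
dict_path = "bert/vocab.txt"                #placeholder
num_labels = 13                             #assumption: matches the label set above

tokenizer = Tokenizer(dict_path, do_lower_case=True)
bert = build_transformer_model(config_path, checkpoint_path)
output = Dense(num_labels)(bert.output)
crf = ConditionalRandomField(lr_multiplier=100)   #larger effective LR for the CRF layer
output = crf(output)

model = Model(bert.input, output)
model.compile(loss=crf.sparse_loss, optimizer=Adam(2e-5),
              metrics=[crf.sparse_accuracy])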

Life is a string of crossroads, one gamble after another, made up of struggles, gains, and losses. Different choices, different kinds of wonderful. Busy and tired as I am, seeing little Luoluo makes it all worthwhile; thanks to my family for their company. May he grow up happy and healthy. Love you all. Back to work, keep going!

(By: Eastmount, 2024-04-09, written at night in Guiyang)
