Model Ensembling in Machine Learning

admin

July 13, 2023, 15:43:12

01

What is model ensembling

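Building and combining multiple learners to complete a learning task is called model ensembling (model fusion), or ensemble learning. Different models have different strengths, and ensembling exploits that diversity: a set of relatively weak models (learners) is combined through some strategy to obtain a stronger model (learner).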

02

Advantages of model ensembling

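• Lowers the risk that comes from choosing the wrong hypothesis

• Raises the chance of capturing the true pattern in the data

• Raises the chance of obtaining better generalization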

03

Common ensembling methods

3.1 Voting

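The method used to combine individual learners is called the combination strategy. For classification, voting selects the class predicted by the most learners. For regression, the outputs of the individual regressors are averaged.

In other words, several models classify each sample, and the class that receives the most votes becomes the final prediction.

The official sklearn demo for VotingClassifier: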

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.ensemble import RandomForestClassifier, VotingClassifier
>>> clf1 = LogisticRegression(random_state=1)
>>> clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
>>> clf3 = GaussianNB()
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> eclf1 = VotingClassifier(estimators=[
... ('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')
>>> eclf1 = eclf1.fit(X, y)
>>> print(eclf1.predict(X))
[1 1 1 2 2 2]
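To make the two combination strategies concrete, here is a minimal sketch that applies them by hand with NumPy. The prediction arrays are hypothetical values invented for illustration, not outputs of the models above.

```python
import numpy as np

# Hypothetical predictions from three classifiers for four samples.
class_preds = np.array([
    [0, 0, 1, 1],   # model A
    [0, 1, 1, 1],   # model B
    [0, 1, 0, 1],   # model C
])
# Classification: for each sample (column), pick the most frequent label.
majority = np.array([np.bincount(col).argmax() for col in class_preds.T])
print(majority)  # [0 1 1 1]

# Hypothetical predictions from three regressors for the same four samples.
reg_preds = np.array([
    [2.0, 3.0, 5.0, 7.0],
    [2.5, 2.0, 4.0, 8.0],
    [1.5, 4.0, 6.0, 9.0],
])
# Regression: average the individual predictions.
average = reg_preds.mean(axis=0)
print(average)  # [2. 3. 5. 8.]
```

With this helper, a tie between labels goes to the smallest label, because `argmax` returns the first maximum. A real implementation may want a different tie-breaking rule.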

3.2 Bagging

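Bagging stands for Bootstrap Aggregating. It can improve the stability and accuracy of statistical classifiers and regressors, and it also helps reduce overfitting.

Bagging acts on the training data rather than on the model itself.

Basic idea: the core of bagging is sampling the original dataset with replacement to build S new datasets, each of which trains one of S classifiers that are then combined. The data used to train each model may therefore contain duplicate rows. Each draw can be implemented with a random number, and the ratio of subset size to original dataset size can be fixed in advance, which determines the number of draws.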

>>> from sklearn.svm import SVC
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=100, n_features=4,
... n_informative=2, n_redundant=0,
... random_state=0, shuffle=False)
>>> clf = BaggingClassifier(estimator=SVC(),
... n_estimators=10, random_state=0).fit(X, y)
>>> clf.predict([[0, 0, 0, 0]])
array([1])
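BaggingClassifier hides the resampling step, so the following sketch spells it out by hand. It is an illustration under assumptions, not how sklearn implements bagging internally: the decision-tree base learner, S = 10, the ratio of 1.0 and the synthetic dataset are all choices made here for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
rng = np.random.default_rng(0)
S = 10                          # number of bootstrap datasets / classifiers
ratio = 1.0                     # subset size relative to the original dataset
n_draws = int(ratio * len(X))   # the ratio fixes the number of draws

models = []
for _ in range(S):
    # Sampling with replacement: the same row may be drawn several times.
    idx = rng.integers(0, len(X), size=n_draws)
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate the S classifiers by majority vote.
all_preds = np.array([m.predict(X) for m in models])   # shape (S, n_samples)
bagged = np.array([np.bincount(col).argmax() for col in all_preds.T])
print("training accuracy:", (bagged == y).mean())
```

Each tree sees a different resampled view of the data, and the vote averages out their individual quirks, which is where the variance reduction comes from.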

3.3 Boosting

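Basic idea: boosting works serially. The individual learners depend on one another during training and must be trained one after another, with each later model correcting the predictions of the earlier ones. The weights of the samples that the previous base learner misclassified during training are increased, so the next base learner pays more attention to those samples and tries to correct the errors. This continues until the required T base learners have been produced, and boosting then combines the T learners with weights to form the final ensemble.

Essential difference from bagging: boosting lowers prediction error mainly by reducing bias, whereas bagging does so by reducing variance.

Take sklearn's AdaBoostClassifier as an example.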
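The reweighting loop can be sketched by hand as follows. This is a minimal version of discrete AdaBoost with depth-1 decision trees (stumps), written for illustration. The dataset, T = 20 and the -1/+1 relabeling are assumptions of this sketch, and it is not claimed to match sklearn's internal implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, n_features=4, n_informative=2,
                             n_redundant=0, random_state=0)
y = 2 * y01 - 1                      # relabel the classes as -1 / +1
T = 20                               # number of base learners
w = np.full(len(X), 1 / len(X))      # start from uniform sample weights
stumps, alphas = [], []

for _ in range(T):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    # Weighted error of this learner, clipped to keep the logarithm finite.
    err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)    # weight of this learner
    # y * pred is -1 for misclassified samples, so their weights grow.
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the weighted sum of the base learners' outputs.
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
boosted = np.sign(scores)
print("first stump accuracy:", (stumps[0].predict(X) == y).mean())
print("boosted accuracy:    ", (boosted == y).mean())
```

Every stump is a weak learner on its own. The weighted combination is what fits the regions that a single split cannot, which is the bias reduction mentioned above.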

>>> from sklearn.ensemble import AdaBoostClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=1000, n_features=4,
... n_informative=2, n_redundant=0,
... random_state=0, shuffle=False)
>>> clf = AdaBoostClassifier(n_estimators=100, random_state=0)
>>> clf.fit(X, y)
AdaBoostClassifier(n_estimators=100, random_state=0)
>>> clf.predict([[0, 0, 0, 0]])
array([1])
>>> clf.score(X, y)
0.983...

3.4 Stacking

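Basic idea: instead of aggregating the predictions of all models in the ensemble with a simple rule such as hard voting, why not train a model to do the aggregation? In stacking, several base learners are first trained on the original training data, and their predictions then form a new training set on which a new learner is trained. In other words, the predictions of the different models are themselves modeled.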


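First, all the training data is used to run k-fold cross-validation for each of the first-layer models, so every model produces an out-of-fold prediction for each training sample. These predictions are then used as new features to train the second-layer model. Compared with blending, both layers of stacking use all the training data.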

>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.svm import LinearSVC
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.ensemble import StackingClassifier
>>> X, y = load_iris(return_X_y=True)
>>> estimators = [
... ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
... ('svr', make_pipeline(StandardScaler(),
... LinearSVC(random_state=42)))
... ]
>>> clf = StackingClassifier(
... estimators=estimators, final_estimator=LogisticRegression()
... )
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, stratify=y, random_state=42
... )
>>> clf.fit(X_train, y_train).score(X_test, y_test)
0.9...
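StackingClassifier runs the cross-validation internally. To show the two layers explicitly, the sketch below builds the out-of-fold features by hand with `cross_val_predict`. The choice of base models, 5 folds and class probabilities as features are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

base_models = [RandomForestClassifier(n_estimators=10, random_state=42),
               GaussianNB()]

# Layer 1: 5-fold out-of-fold class probabilities on the training set,
# so every training row is predicted by a model that never saw it.
meta_train = np.hstack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")
    for m in base_models])

# Refit each base model on all training data to build test-set features.
meta_test = np.hstack([
    m.fit(X_train, y_train).predict_proba(X_test) for m in base_models])

# Layer 2: the meta-learner is trained on the base models' predictions.
meta_model = LogisticRegression(max_iter=1000).fit(meta_train, y_train)
print("stacked test accuracy:", meta_model.score(meta_test, y_test))
```

Using out-of-fold predictions matters here: if the base models predicted rows they were trained on, the second layer would learn from overly optimistic features.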

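3.5 Blending

Blending is largely the same as stacking. The main difference is that the features for the second-stage model are not produced by K-fold CV predictions on the training set. Instead, a holdout set is set aside, for example 10% of the training data, and the second-stage stacker model is fit on the first-stage models' predictions for that 10%. Put simply, blending replaces the K-fold CV in the stacking procedure with holdout CV.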

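Advantages of blending:

1. It is simpler than stacking, because no k-fold cross-validation is needed to obtain the stacker features.

2. It avoids an information leak: the generalizers and the stacker use different datasets.

Disadvantages:

1. It uses much less data, since the second stage sees only the holdout split rather than CV predictions for the whole training set.

2. The blender may overfit, most likely as a consequence of the first point.

3. It is less robust than stacking, which relies on multiple CV folds.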
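The holdout procedure described above could be sketched as follows. sklearn has no dedicated blending estimator that this example relies on, so the two stages are written out by hand. The base models, the `stage1_features` helper and the 30% holdout are assumptions of this sketch. A 10% holdout would leave only about 11 rows of this small dataset for the blender, which illustrates the first disadvantage listed above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Set a holdout aside from the training data (30% is an arbitrary choice).
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_train, y_train, test_size=0.3, stratify=y_train, random_state=42)

# Stage 1: the base models are trained on the non-holdout part only.
base_models = [RandomForestClassifier(n_estimators=10, random_state=42),
               GaussianNB()]
for m in base_models:
    m.fit(X_fit, y_fit)

def stage1_features(X_part):
    """Hypothetical helper: concatenate base-model class probabilities."""
    return np.hstack([m.predict_proba(X_part) for m in base_models])

# Stage 2: the blender is fit on the predictions for the holdout set.
blender = LogisticRegression(max_iter=1000)
blender.fit(stage1_features(X_hold), y_hold)
print("blended test accuracy:", blender.score(stage1_features(X_test), y_test))
```

Because the base models never see the holdout rows, the blender's training features are honest predictions, at the cost of fitting it on far fewer samples than stacking would provide.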


Originally published on the WeChat official account 山石网科安全技术研究院: Model Ensembling in Machine Learning
