10 Machine Learning Tips for Python Developers

admin

September 13, 2020, 01:45:20

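As data scientists, we sometimes lose sight of our original purpose. We are developers first, researchers second, and only then, perhaps, mathematicians. Our first responsibility is to quickly deliver solutions that are free of bugs.

Being able to build models does not make us gods. It is no excuse for writing garbage code.

I have made many mistakes since I started learning machine learning, so I want to share the skills I consider most commonly needed in machine learning engineering. In my view, they are also the skills the industry currently lacks most.

Here are my tips.

Learn to write abstract classes

Once you start writing abstract classes, you will see the benefits they bring. An abstract class forces its subclasses to use the same methods and method names. When many people work on the same project and each defines different methods, the result is needless confusion.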

import os
from abc import ABCMeta, abstractmethod


class DataProcessor(metaclass=ABCMeta):
    """Base processor to be used for all preparation."""
    def __init__(self, input_directory, output_directory):
        self.input_directory = input_directory
        self.output_directory = output_directory

    @abstractmethod
    def read(self):
        """Read raw data."""

    @abstractmethod
    def process(self):
        """Processes raw data. This step should create the raw dataframe with all the required features. Shouldn't implement statistical or text cleaning."""

    @abstractmethod
    def save(self):
        """Saves processed data."""


class Trainer(metaclass=ABCMeta):
    """Base trainer to be used for all models."""

    def __init__(self, directory):
        self.directory = directory
        self.model_directory = os.path.join(directory, 'models')

    @abstractmethod
    def preprocess(self):
        """This takes the preprocessed data and returns clean data. This is more about statistical or text cleaning."""

    @abstractmethod
    def set_model(self):
        """Define model here."""

    @abstractmethod
    def fit_model(self):
        """This takes the vectorised data and returns a trained model."""

    @abstractmethod
    def generate_metrics(self):
        """Generates metric with trained model and test data."""

    @abstractmethod
    def save_model(self, model_name):
        """This method saves the model in our required format."""


class Predict(metaclass=ABCMeta):
    """Base predictor to be used for all models."""

    def __init__(self, directory):
        self.directory = directory
        self.model_directory = os.path.join(directory, 'models')

    @abstractmethod
    def load_model(self):
        """Load model here."""

    @abstractmethod
    def preprocess(self):
        """This takes the raw data and returns clean data for prediction."""

    @abstractmethod
    def predict(self):
        """This is used for prediction."""


class BaseDB(metaclass=ABCMeta):
    """Base database class to be used for all DB connectors."""

    @abstractmethod
    def get_connection(self):
        """This creates a new DB connection."""

    @abstractmethod
    def close_connection(self):
        """This closes the DB connection."""
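
To see what "forces subclasses to use the same methods" means in practice, here is a minimal sketch. BaseReader, ListReader, BrokenReader, and try_instantiate are hypothetical names made up for this illustration; they are not part of the base classes above. A subclass that implements every abstract method can be instantiated, while one that forgets a method fails as soon as you try to create it.

```python
from abc import ABCMeta, abstractmethod


class BaseReader(metaclass=ABCMeta):
    """Hypothetical, cut-down base class with a single required method."""

    @abstractmethod
    def read(self):
        """Return the raw records."""


class ListReader(BaseReader):
    """Implements every abstract method, so it can be instantiated."""

    def __init__(self, records):
        self.records = records

    def read(self):
        return list(self.records)


class BrokenReader(BaseReader):
    """Forgets to implement read(), so Python refuses to instantiate it."""


def try_instantiate(cls, *args):
    """Return a new instance, or the TypeError message if cls is still abstract."""
    try:
        return cls(*args)
    except TypeError as exc:
        return str(exc)


reader = try_instantiate(ListReader, [1, 2, 3])
error = try_instantiate(BrokenReader)
print(reader.read())
print(error)
```

The failure happens at instantiation time rather than deep inside a training run, which is what makes the contract useful on a shared project.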

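Fix the random seed

Reproducibility of experiments is very important, and random seeds are our enemy. Pay special attention to setting them; otherwise the train/test split will differ between runs and neural network weights will be initialized differently, which ultimately leads to inconsistent results.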

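import random

import numpy as np
import torch


def set_seed(args):
    # args is expected to carry .seed and .n_gpu (e.g. an argparse namespace)
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)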
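
The effect of a fixed seed on a train/test split can be checked with the standard library alone. The sketch below is only an illustration: split_train_test and its parameters are hypothetical names, and it uses a private random.Random instance instead of the global seeding shown above.

```python
import random


def split_train_test(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of items with a private, seeded RNG and split it."""
    rng = random.Random(seed)  # local generator, leaves global state alone
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))
train_a, test_a = split_train_test(data, seed=42)
train_b, test_b = split_train_test(data, seed=42)

# The same seed reproduces exactly the same split
print(train_a == train_b and test_a == test_b)
```

Passing the seed explicitly keeps the split reproducible even if some other part of the program reseeds the global generator.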

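Load a small amount of data first

If your dataset is very large and you are still writing later-stage code such as data cleaning or modeling, use nrows to avoid loading the full dataset every time. This is useful when you only want to test the code rather than run the whole program.

It is a good fit when your local machine cannot handle the full data volume but you like developing in Jupyter, VS Code, or Atom.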

import pandas as pd

# Read only the first 1000 rows while developing
df_train = pd.read_csv('train.csv', nrows=1000)

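Anticipate failure (the mark of a mature developer)

Always check your data for NA (missing values), because they can cause problems later. Even if your current data has none, that does not mean they will not show up in a future retraining cycle. So keep an eye out regardless.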

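# df is assumed to be an existing pandas DataFrame
print(len(df))
print(df.isna().sum())  # missing values per column
df = df.dropna()        # dropna returns a new frame, so reassign it
print(len(df))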

Show processing progress

When processing large data, it is very helpful to see the current progress and how much time remains.

Option 1: tqdm

from tqdm import tqdm
import time

# Register progress_apply on pandas objects
tqdm.pandas()

# df is assumed to be an existing DataFrame with a numeric column 'col'
df['col'] = df['col'].progress_apply(lambda x: x**2)

text = ""
for char in tqdm(["a", "b", "c", "d"]):
    time.sleep(0.25)
    text = text + char

Option 2: fastprogress

from fastprogress.fastprogress import master_bar, progress_bar
from time import sleep

mb = master_bar(range(10))
for i in mb:
    for j in progress_bar(range(100), parent=mb):
        sleep(0.01)
        mb.child.comment = 'second bar stat'
    # Older fastprogress releases call this attribute first_bar
    mb.main_bar.comment = 'first bar stat'
    mb.write(f'Finished loop {i}.')

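Fix slow pandas

If you have used pandas, you know how slow it can be at times, especially on groupby operations. Rather than racking your brain for ways to speed it up, change one line of code and use modin.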

import modin.pandas as pd

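Record function execution time

Not all functions are created equal.

Even if all your code runs correctly, that does not mean you have written good code. Some soft bugs actually slow your code down, so it is worth finding them. Use this decorator to log how long a function takes.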

import time
from functools import wraps


def timing(f):
    """Decorator for timing functions
    Usage:
    @timing
    def function(a):
        pass
    """

    @wraps(f)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = f(*args, **kwargs)
        end = time.time()
        print('function:%r took: %2.2f sec' % (f.__name__, end - start))
        return result
    return wrapper
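
A short usage sketch follows. It repeats a compact version of the decorator so that it runs on its own; slow_square is a hypothetical function invented for the demonstration. Because of wraps, the decorated function keeps its original name, which keeps log lines and tracebacks readable.

```python
import time
from functools import wraps


def timing(f):
    """Print how long each call to f takes."""
    @wraps(f)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = f(*args, **kwargs)
        end = time.time()
        print('function:%r took: %2.2f sec' % (f.__name__, end - start))
        return result
    return wrapper


@timing
def slow_square(x):
    """Hypothetical slow function used only for this demonstration."""
    time.sleep(0.1)
    return x * x


value = slow_square(3)
print(value, slow_square.__name__)
```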

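Don't burn money in the cloud

Nobody likes an engineer who wastes cloud resources.

Some of our experiments can run for hours. Keeping track of them and shutting down the cloud instance when they finish is hard. I have made this mistake myself, and I have seen people leave instances running for days.

This often happens when we start something on Friday, leave it running, and only realize it when we come back on Monday.

Just call this function at the end of execution and you will never get burned again!

Wrap the main function in try and except so that the server also stops running if an exception occurs. I have dealt with cases like this myself.

Let's be a bit more responsible and cut our carbon footprint.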

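import math
import os


def run_command(cmd):
    return os.system(cmd)


def shutdown(seconds=0, os_name='linux'):
    """Shutdown system after seconds given. Useful for shutting EC2 to save costs."""
    # The parameter is named os_name so it does not shadow the os module
    if os_name == 'linux':
        # Linux shutdown takes its delay in minutes, so round the seconds up
        run_command('sudo shutdown -h +%d' % math.ceil(seconds / 60))
    elif os_name == 'windows':
        # Windows shutdown takes -t in seconds
        run_command('shutdown -s -t %s' % seconds)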
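
One way to combine the shutdown call with error handling is sketched below. Note the swap: it uses try/finally rather than the try/except mentioned above, so the machine is shut down and the original exception still surfaces. run_then_shutdown, fake_shutdown, and failing_main are hypothetical names; the fake stands in for a real shutdown function so the example is safe to run anywhere.

```python
def run_then_shutdown(main, shutdown):
    """Run main(), then always call shutdown(), even if main() raised."""
    try:
        return main()
    finally:
        shutdown()


calls = []


def fake_shutdown():
    # Stand-in for a real shutdown call; it only records that it ran
    calls.append("shutdown")


def failing_main():
    raise RuntimeError("training crashed")


try:
    run_then_shutdown(failing_main, fake_shutdown)
except RuntimeError as exc:
    calls.append(f"error: {exc}")

print(calls)
```

In a real job you would pass your own entry point and a function that actually powers the instance off.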

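Create and save reports

Past a certain point in modeling, all the real insight comes from analyzing errors and metrics. Make sure you create and save well-formatted reports for yourself and your manager.

Management loves reports anyway, doesn't it?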

import json
import os

from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, fbeta_score)


def get_metrics(y, y_pred, beta=2, average_method='macro', y_encoder=None):
    if y_encoder:
        y = y_encoder.inverse_transform(y)
        y_pred = y_encoder.inverse_transform(y_pred)
    return {
        'accuracy': round(accuracy_score(y, y_pred), 4),
        'f1_score_macro': round(f1_score(y, y_pred, average=average_method), 4),
        # beta must be passed by keyword in recent scikit-learn releases
        'fbeta_score_macro': round(fbeta_score(y, y_pred, beta=beta, average=average_method), 4),
        'report': classification_report(y, y_pred, output_dict=True),
        # Text report with Windows-style line endings
        'report_csv': classification_report(y, y_pred, output_dict=False).replace('\n', '\r\n')
    }


def save_metrics(metrics: dict, model_directory, file_name):
    path = os.path.join(model_directory, file_name + '_report.txt')
    # classification_report_to_csv is the author's own helper; it is not shown in this article
    classification_report_to_csv(metrics['report_csv'], path)
    metrics.pop('report_csv')
    path = os.path.join(model_directory, file_name + '_metrics.json')
    with open(path, 'w') as f:
        json.dump(metrics, f, indent=4)
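
The listing calls classification_report_to_csv but never defines it. The stand-in below is purely an assumption about what such a helper might do, namely write the text report to disk unchanged; the author's real helper may work differently. The sample report text is made up for the demonstration.

```python
import os
import tempfile


def classification_report_to_csv(report_text, path):
    """Stand-in (assumption): write the text report to path unchanged."""
    # newline='' stops Python from translating the '\r\n' already in the text
    with open(path, 'w', newline='') as f:
        f.write(report_text)
    return path


# Demonstration with a made-up two-line report
with tempfile.TemporaryDirectory() as tmp_dir:
    report_path = os.path.join(tmp_dir, 'demo_report.txt')
    classification_report_to_csv('precision recall\r\n0.90 0.80\r\n', report_path)
    with open(report_path, newline='') as f:
        saved = f.read()

print(repr(saved))
```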

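Write a good API

If the ending is bad, everything is bad.

You can do great data cleaning and modeling and still make a huge mess at the end. My experience working with people tells me that many are unclear on how to write good APIs, documentation, and server setups. I will write another article on this soon, but let me briefly share part of it first.

The approach below works for classical machine learning and deep learning deployments under moderate load (say, 1000 requests/min).

Meet this combination: FastAPI + uvicorn + gunicorn

Fastest: Write the API with fastapi, because it is the fastest option; see this article for the reasons.
Documentation: Writing the API in fastapi gives us free documentation and test endpoints at http:url/docs, which fastapi generates and updates automatically whenever we change the code.
Workers: Deploy the API with the gunicorn server, because gunicorn can start more than one worker, and you should keep at least 2 workers.

Run these commands to deploy with 4 workers. The number of workers can be tuned through load testing.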

pip install fastapi uvicorn gunicorn
gunicorn -w 4 -k uvicorn.workers.UvicornH11Worker main:app

Original article: http://suo.im/5MoQTN
