Build a Natural Language Processing Model in Python and Deploy It with Flask

May 18, 2019 · Source: 不谈风月_0eb8

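So far we have developed many machine learning models, generated numerical predictions on test data, and evaluated the results. In practice, generating predictions is only one part of a machine learning project, although I consider it the most important part. Today we will build a natural language processing model for document classification and spam filtering, using machine learning to detect spam SMS text messages. Our ML system workflow is: offline training -> serving the model as a service -> online prediction.

1. A classifier is trained offline on spam and non-spam messages.

2. The trained model is deployed as a service that serves users.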


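When we develop a machine learning model, we need to think about how to deploy it, that is, how to make the model available to other users. Kaggle and data science bootcamps are great for learning how to build and optimize models, but they do not teach engineers how to put those models in front of other users. There is a significant gap between building a model and actually delivering a product or service to people.

In this article we focus on building a machine learning model that classifies spam SMS messages, and then on using Flask (a Python micro-framework for building web applications) to create an API for the model. The API lets users call the prediction function through HTTP requests. Let's get started!

Building the ML model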

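The data is a collection of SMS messages labeled as spam or ham (legitimate); the original post links to its source. We first use this dataset to build a predictive model that accurately classifies which texts are spam. Naive Bayes classifiers are a popular statistical technique for email filtering, and they typically use bag-of-words features to identify spam. We will therefore build a simple message classifier based on Bayes' theorem, that is, a Naive Bayes classifier.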
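To make the idea concrete, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier over bag-of-words counts with add-one (Laplace) smoothing. It is an illustration only, not the scikit-learn implementation used in this article; the function names (tokenize, train, predict) and the four toy messages are made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Split a message into lowercase words (a very crude tokenizer)."""
    return text.lower().split()


def train(messages, labels):
    """Count words per class (bag of words) and how often each class occurs."""
    word_counts = {0: Counter(), 1: Counter()}  # 0 = ham, 1 = spam
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict(text, word_counts, class_counts, vocab):
    """Return the class with the highest log posterior for the message."""
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total)  # log prior
        n_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the score
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for this sketch
messages = [
    "win a free prize now",
    "free cash prize",
    "are we meeting for lunch",
    "see you at lunch tomorrow",
]
labels = [1, 1, 0, 0]
model = train(messages, labels)

print(predict("free prize", *model))
print(predict("lunch tomorrow", *model))
```

CountVectorizer and MultinomialNB play the roles of the counting step and the scoring step here, with far better tokenization and sparse-matrix storage.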

<pre>
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Load the data; assumes spam.csv has 'class' and 'message' columns
# plus three empty 'Unnamed' columns, which are dropped
df = pd.read_csv('spam.csv', encoding='latin-1')
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)

# Features and labels: ham -> 0, spam -> 1
df['label'] = df['class'].map({'ham': 0, 'spam': 1})
X = df['message']
y = df['label']

# Extract bag-of-words features with CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(X)  # Fit the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Naive Bayes Classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the test split
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
</pre>


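The Naive Bayes classifier is not only easy to implement, it also performs very well. After training a model, we want a way to keep it for future use without retraining. To do this, we save the model to a .pkl file for later use.

The saved model can then be loaded and used.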
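The save-and-load round trip can be sketched with the standard-library pickle module. To keep the sketch runnable without the dataset, it uses a hypothetical stand-in for the trained model (a plain dict of per-word weights) and a temporary directory; neither comes from the original article.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: per-word spam weights
model = {"free": 2.0, "prize": 1.5, "meeting": -1.0}

# Save (persist) the model to a .pkl file
model_path = os.path.join(tempfile.mkdtemp(), "NB_spam_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later, possibly in another process: load it back without retraining
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)


def predict(message, weights):
    """Return 1 (spam) if the summed word weights are positive, else 0 (ham)."""
    score = sum(weights.get(word, 0.0) for word in message.lower().split())
    return int(score > 0)


print(predict("free prize inside", loaded_model))
```

With the classifier from this article, the equivalent calls are joblib.dump(clf, 'NB_spam_model.pkl') and joblib.load('NB_spam_model.pkl'), which appear as commented-out lines in app.py. The fitted CountVectorizer has to be saved as well, since new messages must be transformed with the same vocabulary.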

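The process above is called "persisting the model in a standard format": the model is stored persistently in a format specific to the development language. The next step is to serve the model from a microservice whose public endpoint receives requests from clients.

Turning the spam classifier into a web application

With the code for classifying SMS messages ready from the previous section, we now develop a web application. It consists of a simple web page with a form field where we can enter a message. After the message is submitted, the application renders a new page that tells us whether the message is spam.

First, we create a folder for this project named SMS-Message-Spam-Detector. The directory tree of that folder is shown here, and we explain each file next.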

[Image: directory tree of the SMS-Message-Spam-Detector folder]

The templates subdirectory is where Flask looks for the HTML files it renders for the web browser. In our case there are two HTML files: home.html and result.html.

app.py

The app.py file contains the main code that the Python interpreter executes to run the Flask web application, along with the ML code for classifying SMS messages:

<pre>
from flask import Flask, render_template, request
import pandas as pd
# sklearn.externals.joblib was removed from recent scikit-learn releases;
# joblib is only needed for the optional save/load lines below
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

app = Flask(__name__)

@app.route('/')
def home():
    return render_template('home.html')

@app.route('/predict', methods=['POST'])
def predict():
    df = pd.read_csv('spam.csv', encoding='latin-1')
    df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)
    # Features and Labels
    df['label'] = df['class'].map({'ham': 0, 'spam': 1})
    X = df['message']
    y = df['label']

    # Extract Feature With CountVectorizer
    cv = CountVectorizer()
    X = cv.fit_transform(X)  # Fit the Data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    # Naive Bayes Classifier
    clf = MultinomialNB()
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)

    # Alternative Usage of Saved Model
    # joblib.dump(clf, 'NB_spam_model.pkl')
    # clf = joblib.load('NB_spam_model.pkl')

    if request.method == 'POST':
        message = request.form['message']
        data = [message]
        vect = cv.transform(data).toarray()
        my_prediction = clf.predict(vect)
    return render_template('result.html', prediction=my_prediction)

if __name__ == '__main__':
    app.run(debug=True)
</pre>

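1. We run the application as a single module, so we initialize a new Flask instance with the argument __name__. This lets Flask know that it can find the HTML template folder (templates) in the same directory as the module.

2. Next, we use the route decorator (@app.route('/')) to specify the URL that triggers the home function. Our home function simply renders the home.html file, which is located in the templates folder.

3. Inside the predict function, we load the spam dataset, preprocess the text, train the classifier, and can optionally store the model. We then read the new message entered by the user and use the model to predict its label.

4. We use the POST method to send the form data to the server in the body of the request. Setting debug=True in the app.run call also activates Flask's debugger.

5. Finally, the run function starts the application on the server only when the script is executed directly by the Python interpreter, which we ensure with the statement if __name__ == '__main__':.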
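The routing and form handling described in these steps can be tried in isolation with the sketch below. It is not the article's app.py: the classifier is replaced by a hypothetical keyword rule (SPAM_WORDS and classify are invented names), and the pages are returned as inline strings so that no templates folder or dataset is needed.

```python
from flask import Flask, request

app = Flask(__name__)

# Hypothetical stand-in for the trained classifier. In the real app this
# would be the fitted CountVectorizer and MultinomialNB pair.
SPAM_WORDS = {"free", "winner", "prize"}


def classify(message):
    """Return 1 for spam and 0 for ham, using a crude keyword rule."""
    words = set(message.lower().split())
    return int(bool(words & SPAM_WORDS))


@app.route("/")
def home():
    # Inline form so the sketch needs no templates folder
    return (
        '<form action="/predict" method="POST">'
        '<textarea name="message"></textarea>'
        '<input type="submit" value="predict">'
        "</form>"
    )


@app.route("/predict", methods=["POST"])
def predict():
    # request.form holds the fields sent in the body of the POST request
    message = request.form["message"]
    label = classify(message)
    return "Spam" if label == 1 else "Not a Spam (It is a Ham)"
```

Calling app.run(debug=True) under an if __name__ == '__main__': guard starts the development server, as in app.py. Note that app.py retrains the model on every request; training or loading it once at module level, where classify is defined in this sketch, is a common alternative.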

home.html

Below is the content of home.html, which renders a text form where the user can enter a message:

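<pre>
<!DOCTYPE html>
<html>
<head>
<title></title>
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='css/styles.css') }}">
</head>
<body>
<!-- Assumption: the lines up to </header> mirror result.html -->
<header>
<div class="container">
<div id="brandname">
ML App
</div>
<h2>Spam Detector For SMS Messages</h2>
</div>
</header>
<div class="ml-container">
    <form action="{{ url_for('predict')}}" method="POST">
    <p>Enter Your Message Here</p>
    <!-- <input type="text" name="comment"/> -->
    <textarea name="message" rows="4" cols="50"></textarea>
    <br/>
    <input type="submit" class="btn-info" value="predict">

</form>

</div>

</body>
</html>
</pre>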

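styles.css

In the head section of home.html we load the styles.css file. A CSS file determines the look and feel of an HTML document. styles.css must be saved under a subdirectory named static (here, static/css/styles.css), which is the default directory where Flask looks for static files such as CSS.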

<pre>
body{
    font:15px/1.5 Arial, Helvetica,sans-serif;
    padding: 0px;
    background-color:#f4f3f3;
}
.container{
    width:100%;
    margin: auto;
    overflow: hidden;
}
header{
    background:#03A9F4;
    border-bottom:#448AFF 3px solid;
    height:120px;
    width:100%;
    padding-top:30px;
}
.main-header{
    text-align:center;
    background-color: blue;
    height:100px;
    width:100%;
    margin:0px;
}

/* id selector: matches <div id="brandname"> in the templates */
#brandname{
    float:left;
    font-size:30px;
    color: #fff;
    margin: 10px;
}
header h2{
    text-align:center;
    color:#fff;
}
.btn-info {
    background-color: #2196F3; /* Blue */
    height:40px;
    width:100px;
}
.btn-info:hover {background: #0b7dda;}
/* matches <div class="results"> in result.html */
.results{
    border-radius: 15px 50px;
    background: #345fe4;
    padding: 20px;
    width: 200px;
    height: 150px;
}
</pre>


result.html

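We create a result.html file, which is rendered by the line return render_template('result.html', prediction=my_prediction) in the predict function that we defined in app.py. It displays the classification of the text the user submitted through the text field. The result.html file contains the following: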

<pre>
<!DOCTYPE html>
<html>
<head>
<title></title>
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='css/styles.css') }}">
</head>
<body>
<header>
<div class="container">
<div id="brandname">
ML App
</div>
<h2>Spam Detector For SMS Messages</h2>
</div>
</header>
<p style="color:blue;font-size:20px;text-align: center;"><b>Results for Comment</b></p>
<div class="results">

{% if prediction == 1%}
<h2 style="color:red;">Spam</h2>
{% elif prediction == 0%}
<h2 style="color:blue;">Not a Spam (It is a Ham)</h2>
{% endif %}
</div>

</body>
</html>
</pre>


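In result.html we can see some code that uses syntax not normally found in HTML files, for example {% if prediction == 1%}, {% elif prediction == 0%}, and {% endif %}. This is Jinja syntax; it is used inside the HTML file to access the prediction returned for the request.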
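The conditional can be tried outside Flask with the jinja2 package, which Flask uses to render templates. The sketch below puts the same if/elif block into an inline template; RESULT_TEMPLATE and render_result are illustrative names, not part of the article's code.

```python
from jinja2 import Template

# The same conditional block as in result.html, as an inline template
RESULT_TEMPLATE = Template(
    "{% if prediction == 1 %}Spam"
    "{% elif prediction == 0 %}Not a Spam (It is a Ham)"
    "{% endif %}"
)


def render_result(prediction):
    """Render the template with the given prediction value."""
    return RESULT_TEMPLATE.render(prediction=prediction)


print(render_result(1))
print(render_result(0))
```

Flask's render_template does the same thing with a file from the templates folder, passing keyword arguments such as prediction into the template.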

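We are almost done!

Once all of the above is in place, you can start the API by double-clicking app.py or by running it from the terminal (for example, python app.py).

You should get the following output: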

[Image: terminal output]

Now you can open a web browser and navigate to http://127.0.0.1:5000/ , where you should see a simple website like this:

[Image: screenshot of the web page]
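Besides the browser form, the prediction endpoint can be called from code. The following sketch builds the same form-encoded POST request with the standard library; build_predict_request and classify_remote are hypothetical helper names, and the base URL assumes the app is running locally on Flask's default port.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_predict_request(message, base_url="http://127.0.0.1:5000"):
    """Build a form-encoded POST request for the /predict endpoint."""
    body = urlencode({"message": message}).encode("utf-8")
    return Request(
        base_url + "/predict",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def classify_remote(message):
    """Send the message to the running app and return the result page HTML."""
    with urlopen(build_predict_request(message)) as response:
        return response.read().decode("utf-8")
```

classify_remote only works while the Flask app is running; the returned string is the rendered result.html page, so a caller would look for "Spam" or "Not a Spam" in it.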

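Congratulations! We have now built an end-to-end machine learning (NLP) application at zero cost. Looking back, the whole process is not complicated at all. With a little patience and the drive to learn, anyone can do it, and open-source tools made every step possible.

More importantly, we were able to extend our knowledge of machine learning theory into a useful, practical web application!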

    Original author: 不谈风月_0eb8
    Original URL: https://www.jianshu.com/p/e71c5092d61e