Machine Learning: LSTM-Based Sentiment Analysis (Code Walkthrough)

May 5, 2019, source: vortex

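I previously wrote an article on LSTM-based sentiment analysis, but it focused mostly on theory and contained little code. This article walks through the implementation, following the order in which the code executes.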

1. Data preprocessing

The first step is to preprocess the data. Let's start with the format of the data:

[Figure: dataset A, hotel review data]

[Figure: dataset B, book review data]

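There are two datasets: dataset A contains hotel reviews and dataset B contains book reviews. Each dataset consists of four txt files.

Pos-train.txt : positive reviews in the training set
Neg-train.txt : negative reviews in the training set
Pos-test.txt : positive reviews in the test set
Neg-test.txt : negative reviews in the test set

In each txt file, every line is one review.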

[Figure: sample of the data]

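Now that we know the data, let's look at the code.

Execution starts at line 122 of the script; everything before line 122 is package imports and function definitions.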

timeA=time.time()
word2vec_path = 'word2vec/word2vec.model'
model=gensim.models.Word2Vec.load(word2vec_path)
dimsh=model.vector_size
MAX_SIZE=25
stopWord = makeStopWord()

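This block starts a timer, timeA; a second timestamp, timeB, is taken when the program finishes, and the difference between the two is the total running time. Next, the word vectors are loaded. dimsh is the dimensionality of the word vectors, which is 200 here. Reviews vary in length, but the data fed to the neural network must have a consistent shape. So I define MAX_SIZE: when a review is converted to a matrix, a review with fewer than 25 words is padded with zeros, and a review with more than 25 words has everything after the 25th word discarded. stopWord holds the stop words. Some words only convey tone and carry no real meaning; these are collectively called stop words and should be removed before training.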
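The padding and truncation rule described above can be sketched in a few lines of NumPy. This is only an illustration: pad_or_truncate is a hypothetical helper that does not appear in the article's code, the toy sizes (4 dimensions, 3 words) stand in for the real 200 and 25, and the sketch ignores the handling of words missing from the word2vec vocabulary.

```python
import numpy as np

def pad_or_truncate(vectors, max_size, dim):
    """Return a (max_size, dim) matrix and the number of real rows in it."""
    kept = vectors[:max_size]              # drop everything past max_size
    matrix = np.zeros((max_size, dim), dtype=np.float32)
    if kept:
        matrix[:len(kept)] = np.asarray(kept, dtype=np.float32)
    return matrix, len(kept)

# Toy example: dim=4 instead of 200, max_size=3 instead of 25.
short = [np.ones(4)] * 2                                    # 2 words: needs padding
long = [np.full(4, i, dtype=np.float32) for i in range(5)]  # 5 words: gets truncated

short_matrix, short_len = pad_or_truncate(short, max_size=3, dim=4)
long_matrix, long_len = pad_or_truncate(long, max_size=3, dim=4)
print(short_matrix.shape, short_len)   # (3, 4) 2
print(long_matrix.shape, long_len)     # (3, 4) 3
```

The second return value plays the role of the per-review length that is later passed to the RNN.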

trainData, trainSteps, trainLabels = makeData('data/B/Pos-train.txt',
                                              'data/B/Neg-train.txt')
testData, testSteps, testLabels = makeData('data/B/Pos-test.txt',
                                           'data/B/Neg-test.txt')
trainLabels = np.array(trainLabels)

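This block calls makeData to build the training set and the test set. Don't worry about the internals of makeData yet; first look at its parameters and return values.

makeData takes two parameters, posPath and negPath, the paths to the positive and negative reviews. It returns three values: Data, Steps and Labels. Data is the array built from the reviews and the word vectors, with shape (length, MAX_SIZE, dimsh), where length is the size of the dataset (the number of reviews) and dimsh is the word-vector dimensionality (200). Steps holds the length of each review. Some reviews are zero-padded, but the padding should not be taken into account during training, and TensorFlow's built-in RNN accepts the length of each sequence, which is why Steps is returned. Labels holds the label of each review.

Note: in the previous article, the review vector (shape (1, 200)) was obtained by summing the word vectors. This article no longer does that; instead, every word of every review is fed in as input (shape (1, 25, 200)).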
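To make the difference between the two input shapes concrete, here is a small NumPy sketch. The word vectors are random stand-ins (real ones would come from the word2vec model), and the review length of 7 words is an arbitrary assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
words_in_review, dim, max_size = 7, 200, 25

# Stand-in word vectors for one review.
word_vectors = rng.normal(size=(words_in_review, dim))

# Earlier approach: add the word vectors up, which loses word order.
summed = word_vectors.sum(axis=0, keepdims=True)

# This article: keep one row per word, zero-padded to max_size.
sequence = np.zeros((1, max_size, dim))
sequence[0, :words_in_review] = word_vectors

print(summed.shape)    # (1, 200)
print(sequence.shape)  # (1, 25, 200)
```

The sequence form keeps the order of the words, which is the information an LSTM can exploit and a plain sum cannot.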

def makeData(posPath,negPath):
    # Get the words; the result has the form [[word1, word2, ...], [word1, word2, ...], ...]
    pos = getWords(posPath)
    print("The positive data's length is :",len(pos))
    neg = getWords(negPath)
    print("The negative data's length is :",len(neg))
    # Convert the reviews to matrices; the result is a NumPy array
    posArray, posSteps = words2Array(pos)
    negArray, negSteps = words2Array(neg)
    # Mix the positive and negative data together, shuffle, and build the dataset
    Data, Steps, Labels = convert2Data(posArray, negArray, posSteps, negSteps)
    return Data, Steps, Labels

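Now for the makeData function itself. Internally it runs in three steps:

Get the words. The result has the form [[word1, word2, ...], [word1, word2, ...], ...], where each inner list is one review.
Convert each review to its corresponding matrix.
Mix the positive and negative data together, shuffle, and build the dataset.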

def getWords(file):
    wordList = []
    trans = []
    lineList = []
    with open(file,'r',encoding='utf-8') as f:
        lines = f.readlines()
    for line in lines:
        trans = jieba.lcut(line.replace('\n',''), cut_all = False)
        for word in trans:
            if word not in stopWord:
                wordList.append(word)
        lineList.append(wordList)
        wordList = []
    return lineList

def words2Array(lineList):
    linesArray=[]
    wordsArray=[]
    steps = []
    for line in lineList:
        t = 0
        p = 0
        for i in range(MAX_SIZE):
            if i<len(line):
                try:
                    wordsArray.append(model.wv.word_vec(line[i]))
                    p = p + 1
                except KeyError:
                    t=t+1
                    continue
            else:
                wordsArray.append(np.array([0.0]*dimsh))
        for i in range(t):
            wordsArray.append(np.array([0.0]*dimsh))
        steps.append(p)
        linesArray.append(wordsArray)
        wordsArray = []
    linesArray = np.array(linesArray)
    steps = np.array(steps)
    return linesArray, steps

def convert2Data(posArray, negArray, posStep, negStep):
    randIt = []
    data = []
    steps = []
    labels = []
    for i in range(len(posArray)):
        randIt.append([posArray[i], posStep[i], [1,0]])
    for i in range(len(negArray)):
        randIt.append([negArray[i], negStep[i], [0,1]])
    shuffle(randIt)
    for i in range(len(randIt)):
        data.append(randIt[i][0])
        steps.append(randIt[i][1])
        labels.append(randIt[i][2])
    data = np.array(data)
    steps = np.array(steps)
    return data, steps, labels

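The three functions are listed here for reference; I won't go through them in detail. If reading them gives you a headache, feel free to skip them. Knowing what each function does is enough.

That completes the data preprocessing. Here is some basic output showing the data format.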
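For readers who skip the listings, the labeling and shuffling idea behind convert2Data can be sketched on toy data. label_and_shuffle and its seed parameter are hypothetical names used only for this illustration, and the sketch leaves out the per-review lengths that the article's function also carries along.

```python
import random

def label_and_shuffle(pos_items, neg_items, seed=None):
    """Attach one-hot labels, shuffle, and split back into parallel lists."""
    rows = [(item, [1, 0]) for item in pos_items]   # [1, 0] = positive
    rows += [(item, [0, 1]) for item in neg_items]  # [0, 1] = negative
    random.Random(seed).shuffle(rows)               # shuffle items and labels together
    items = [item for item, _ in rows]
    labels = [label for _, label in rows]
    return items, labels

items, labels = label_and_shuffle(["good", "great"], ["bad"], seed=0)
for item, label in zip(items, labels):
    print(item, label)
```

Shuffling the rows as a unit is what keeps each review aligned with its label after the positive and negative sets are mixed.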

[Figure: data format after preprocessing]

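2. Building the model with TensorFlow

With preprocessing done, the next step is to build the model in TensorFlow. TensorFlow runs on a computation graph: we build the graph first, and on each run we use the run function to feed data into the graph and fetch values from it.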

graph = tf.Graph()
with graph.as_default():

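The first line creates a computation graph. The second makes it the default graph, so that the operations defined inside the with block are added to it.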

    tf_train_dataset = tf.placeholder(tf.float32,shape=(batch_size,MAX_SIZE,dimsh))
    tf_train_steps = tf.placeholder(tf.int32,shape=(batch_size,))
    tf_train_labels = tf.placeholder(tf.float32,shape=(batch_size,output_size))

    tf_test_dataset = tf.constant(testData,tf.float32)
    tf_test_steps = tf.constant(testSteps,tf.int32)

    lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units = num_nodes,
                                             state_is_tuple=True)

    w1 = tf.Variable(tf.truncated_normal([num_nodes,num_nodes // 2], stddev=0.1))
    b1 = tf.Variable(tf.truncated_normal([num_nodes // 2], stddev=0.1))

    w2 = tf.Variable(tf.truncated_normal([num_nodes // 2, 2], stddev=0.1))
    b2 = tf.Variable(tf.truncated_normal([2], stddev=0.1))

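The training data is passed in through placeholders, one batch (batch_size examples) at a time. Besides the training data itself, we also need to feed the labels and the length of each review.

The test data lives in the graph as constants.

Note that the LSTM here uses TensorFlow's built-in implementation, which is far more convenient than reinventing the wheel and also runs much faster than a hand-written version. I recommend using it rather than writing your own.

Using TensorFlow's built-in RNN still requires defining its parameters first. Here I define lstm_cell, which needs its output dimensionality; I pass num_nodes, which I set to 128.

w1, b1, w2 and b2 are ordinary weight matrices and bias vectors. The LSTM output, after being transformed by w1, b1, w2 and b2, gives the final output.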
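The shapes involved in those two transformations can be checked with a small NumPy sketch. It is a stand-in, not the TensorFlow code: the weights are drawn from a plain normal distribution rather than a truncated one, h is random data in place of the real LSTM output, and the batch size of 4 is an arbitrary toy value.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_nodes = 4, 128   # batch_size is an arbitrary toy value

h = rng.normal(size=(batch_size, num_nodes))   # stand-in for the LSTM output
w1 = rng.normal(scale=0.1, size=(num_nodes, num_nodes // 2))
b1 = rng.normal(scale=0.1, size=(num_nodes // 2,))
w2 = rng.normal(scale=0.1, size=(num_nodes // 2, 2))
b2 = rng.normal(scale=0.1, size=(2,))

hidden = h @ w1 + b1        # (4, 128) -> (4, 64)
logits = hidden @ w2 + b2   # (4, 64)  -> (4, 2)
print(hidden.shape, logits.shape)   # (4, 64) (4, 2)
```

Each row of logits holds two unnormalized scores, one per class. Since no activation is applied between the two layers, the pair behaves like a single linear map from 128 dimensions to 2.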

    def model(dataset, steps):
        outputs, last_states = tf.nn.dynamic_rnn(cell = lstm_cell,
                                                 dtype = tf.float32,
                                                 sequence_length = steps,
                                                 inputs = dataset)
        hidden = last_states[-1]

        hidden = tf.matmul(hidden, w1) + b1
        logits = tf.matmul(hidden, w2) + b2
        return logits
    train_logits = model(tf_train_dataset, tf_train_steps)
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels,
                                                logits=train_logits))
    optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    test_prediction = tf.nn.softmax(model(tf_test_dataset, tf_test_steps))

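Now for the actual computation. train_logits is the final output, computed from tf_train_dataset and tf_train_steps. tf.nn.dynamic_rnn is TensorFlow's built-in function for running the RNN computation. It takes the lstm_cell we defined, sequence_length (the length of each review) and inputs (the data). Note that the RNN produces an output at every time step; here we only use the output at the last valid step of each review. Because state_is_tuple=True, last_states is a tuple (c, h) rather than a single tensor: c is the cell state and h is the output at the last step, each with shape (batch_size, num_nodes). We use last_states[-1] to get h. After the w1 and w2 transformations we obtain logits as the final output.

tf.nn.softmax_cross_entropy_with_logits performs the softmax and the cross-entropy computation in one call, and tf.reduce_mean then averages the result over the batch to give the loss. optimizer is the optimizer: gradient descent that minimizes loss, with a learning rate of 0.01.

test_prediction is the prediction on the test set.

At this point the computation graph is essentially complete; what remains is running it.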
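The effect of passing the review lengths can be illustrated without TensorFlow. The sketch below uses a toy tanh RNN rather than an LSTM, and run_rnn, w_x and w_h are hypothetical names. It mimics the behavior documented for sequence_length, where the state is copied through once a sequence has ended, so the final state of a zero-padded review matches that of the unpadded review.

```python
import numpy as np

def run_rnn(inputs, lengths, w_x, w_h):
    """Toy tanh RNN that freezes each row's state once its length is reached."""
    batch, max_steps, _ = inputs.shape
    h = np.zeros((batch, w_h.shape[0]))
    for t in range(max_steps):
        candidate = np.tanh(inputs[:, t] @ w_x + h @ w_h)
        active = (t < lengths)[:, None]      # True while the review still has words
        h = np.where(active, candidate, h)   # past the end: keep the old state
    return h

rng = np.random.default_rng(0)
dim, units, max_steps = 3, 5, 6
w_x = rng.normal(size=(dim, units))
w_h = rng.normal(size=(units, units))

review = rng.normal(size=(4, dim))       # a review with 4 real words
padded = np.zeros((1, max_steps, dim))   # the same review padded to 6 steps
padded[0, :4] = review

h_padded = run_rnn(padded, np.array([4]), w_x, w_h)
h_exact = run_rnn(review[None], np.array([4]), w_x, w_h)
print(np.allclose(h_padded, h_exact))    # True
```

Without the length information, the two padding steps would keep updating the state, and the final state would no longer describe the last real word.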
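As a rough illustration of what the loss computes, here is a NumPy sketch of softmax followed by cross-entropy and a mean over the batch. softmax_cross_entropy is a hypothetical helper written for this example, not the TensorFlow function, and the logits and labels are made-up toy values.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy between one-hot labels and softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

logits = np.array([[2.0, 0.0],    # fairly confident in class 0
                   [0.0, 0.0]])   # undecided between the two classes
labels = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

per_example = softmax_cross_entropy(logits, labels)
loss = per_example.mean()         # the role played by tf.reduce_mean
print(per_example, loss)
```

The undecided second example contributes log 2 to the loss, while the correctly confident first example contributes much less.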

num_steps = 20001
summary_frequency = 500


with tf.Session(graph = graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        offset = (step * batch_size) % (len(trainLabels)-batch_size)
        feed_dict={tf_train_dataset:trainData[offset:offset + batch_size],
                   tf_train_labels:trainLabels[offset:offset + batch_size],
                   tf_train_steps:trainSteps[offset:offset + batch_size]}
        _, l = session.run([optimizer,loss],
                           feed_dict = feed_dict)
        mean_loss += l
        if step >0 and step % summary_frequency == 0:
            mean_loss = mean_loss / summary_frequency
            print("The step is: %d"%(step))
            print("In train data,the loss is:%.4f"%(mean_loss))
            mean_loss = 0
            acrc = 0
            prediction = session.run(test_prediction)
            for i in range(len(prediction)):
                if prediction[i][testLabels[i].index(1)] > 0.5:
                    acrc = acrc + 1
            print("In test data,the accuracy is:%.2f%%"%((acrc/len(testLabels))*100))

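The code above runs the computation graph. Have a look; I won't go through it in detail.

Below is the output printed by the program. The best result on the test set was 97.49%.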
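Two details of the loop may be easier to see on toy numbers: how the batch offset wraps around the training set, and how accuracy is counted. batch_offsets and accuracy below are hypothetical helpers that restate the loop's arithmetic in plain Python; the data is made up.

```python
def batch_offsets(num_examples, batch_size, num_steps):
    """Start index of each mini-batch, wrapping around the training set."""
    return [(step * batch_size) % (num_examples - batch_size)
            for step in range(num_steps)]

def accuracy(predictions, one_hot_labels):
    """Fraction of examples whose true class gets probability above 0.5."""
    hits = sum(1 for probs, label in zip(predictions, one_hot_labels)
               if probs[label.index(1)] > 0.5)
    return hits / len(one_hot_labels)

# Toy numbers: 10 training examples, batches of 4.
print(batch_offsets(num_examples=10, batch_size=4, num_steps=5))  # [0, 4, 2, 0, 4]

preds = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]]
labels = [[1, 0], [0, 1], [0, 1]]
print(accuracy(preds, labels))   # 2 of the 3 predictions count as correct
```

Because the modulus is the dataset size minus batch_size, every slice stays inside the training set; a side effect of this formula is that the very last training example is never selected.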

The step is: 500
In train data,the loss is:0.6195
In test data,the accuracy is:67.50%
The step is: 1000
In train data,the loss is:0.5254
In test data,the accuracy is:73.02%
The step is: 1500
In train data,the loss is:0.4686
In test data,the accuracy is:76.23%
The step is: 2000
In train data,the loss is:0.4113
In test data,the accuracy is:82.10%
The step is: 2500
In train data,the loss is:0.3963
In test data,the accuracy is:83.55%
The step is: 3000
In train data,the loss is:0.3667
In test data,the accuracy is:84.85%
The step is: 3500
In train data,the loss is:0.3327
In test data,the accuracy is:86.01%
The step is: 4000
In train data,the loss is:0.3304
In test data,the accuracy is:88.01%
The step is: 4500
In train data,the loss is:0.2981
In test data,the accuracy is:88.52%
The step is: 5000
In train data,the loss is:0.2922
In test data,the accuracy is:90.57%
The step is: 5500
In train data,the loss is:0.2666
In test data,the accuracy is:89.92%
The step is: 6000
In train data,the loss is:0.2503
In test data,the accuracy is:92.28%
The step is: 6500
In train data,the loss is:0.2426
In test data,the accuracy is:86.66%
The step is: 7000
In train data,the loss is:0.2183
In test data,the accuracy is:93.58%
The step is: 7500
In train data,the loss is:0.2106
In test data,the accuracy is:89.07%
The step is: 8000
In train data,the loss is:0.1884
In test data,the accuracy is:92.18%
The step is: 8500
In train data,the loss is:0.1864
In test data,the accuracy is:95.09%
The step is: 9000
In train data,the loss is:0.1671
In test data,the accuracy is:95.59%
The step is: 9500
In train data,the loss is:0.1490
In test data,the accuracy is:94.23%
The step is: 10000
In train data,the loss is:0.1470
In test data,the accuracy is:95.64%
The step is: 10500
In train data,the loss is:0.1292
In test data,the accuracy is:95.29%
The step is: 11000
In train data,the loss is:0.1179
In test data,the accuracy is:94.73%
The step is: 11500
In train data,the loss is:0.1080
In test data,the accuracy is:88.87%
The step is: 12000
In train data,the loss is:0.0969
In test data,the accuracy is:95.19%
The step is: 12500
In train data,the loss is:0.0888
In test data,the accuracy is:94.78%
The step is: 13000
In train data,the loss is:0.0731
In test data,the accuracy is:95.99%
The step is: 13500
In train data,the loss is:0.0671
In test data,the accuracy is:96.44%
The step is: 14000
In train data,the loss is:0.0531
In test data,the accuracy is:93.08%
The step is: 14500
In train data,the loss is:0.0566
In test data,the accuracy is:96.39%
The step is: 15000
In train data,the loss is:0.0453
In test data,the accuracy is:96.84%
The step is: 15500
In train data,the loss is:0.0456
In test data,the accuracy is:97.14%
The step is: 16000
In train data,the loss is:0.0531
In test data,the accuracy is:95.59%
The step is: 16500
In train data,the loss is:0.0477
In test data,the accuracy is:95.64%
The step is: 17000
In train data,the loss is:0.0313
In test data,the accuracy is:96.79%
The step is: 17500
In train data,the loss is:0.0265
In test data,the accuracy is:96.74%
The step is: 18000
In train data,the loss is:0.0230
In test data,the accuracy is:97.04%
The step is: 18500
In train data,the loss is:0.0210
In test data,the accuracy is:96.94%
The step is: 19000
In train data,the loss is:0.0180
In test data,the accuracy is:96.89%
The step is: 19500
In train data,the loss is:0.0157
In test data,the accuracy is:97.49%
The step is: 20000
In train data,the loss is:0.0171
In test data,the accuracy is:96.64%
time cost: 809

The code and data have been uploaded to my GitHub for anyone interested.

GitHub: vortexJCH/-LSTM-

    Original author: vortex
    Original URL: https://zhuanlan.zhihu.com/p/35133737