This article introduces Phase 3 of the public opinion analysis model for our innovation training project; hopefully it offers some reference value to developers facing similar problems.
This phase, running from June 21 to June 29, focused mainly on the LSTM. In the end we did decide to use an LSTM, since it really does suit this scenario.
LSTM
There is already plenty of material about LSTMs online; this write-up, for example, is reasonably good and worth a look:
https://www.jianshu.com/p/9dc9f41f0b29
First attempt
For the first attempt I wrote a very simple little model. After testing, it ran into the same problem as the deep forest: the data is too small and the model overfits. Here is the code from my first attempt.
import numpy
import random
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
import pandas as pd
import sklearn.model_selection
import os
from keras.models import Sequential, load_model
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from treeinterpreter import treeinterpreter as ti

dataframe = pd.read_csv('./dataraw.csv', usecols=[2], engine='python', encoding='utf-8')
dataset = dataframe.values
# convert the integer values to float
dataset1 = dataset.astype('float32')

scaler = MinMaxScaler(feature_range=(-2, 2))
dataset = scaler.fit_transform(dataset1)

def create_dataset(dataset, look_back, sum):
    # look_back here plays the same role as the timestep
    dataX, dataY = [], []
    # sliding windows: each look_back-long slice predicts the next look_back values
    for i in range(len(dataset) - 2 * look_back):
        a = dataset[i:(i + look_back)]
        dataX.append(a)
        dataY.append(dataset[i + look_back:i + 2 * look_back])
    # augment up to `sum` samples by drawing random, sorted index subsets
    for h in range(sum - len(dataX)):
        num = range(0, len(dataset))
        nums = random.sample(num, 2 * look_back)
        nums.sort()
        dataX.append(dataset[nums[:look_back]])
        dataY.append(dataset[nums[look_back:]])
    return numpy.array(dataX), numpy.array(dataY)
# the training data is very small, so look_back cannot be too large
look_back = 7
datapreX,datapreY = create_dataset(dataset,look_back,5000)
num = int(len(datapreX)*0.8)
trainX, testX, trainY, testY = sklearn.model_selection.train_test_split(datapreX, datapreY, test_size=0.25, shuffle=True)
trainX = numpy.reshape(trainX, (-1, look_back))
testX = numpy.reshape(testX, (-1,look_back))
trainY = numpy.reshape(trainY, (-1,look_back))
testY = numpy.reshape(testY, (-1, look_back))

rf = RandomForestRegressor(n_estimators=1000, oob_score=True)
rf.fit(trainX, trainY)
print(rf.score(trainX,trainY))
print(rf.score(testX,testY))
testPredict = rf.predict(numpy.reshape(dataset[-7:], (-1,look_back)))
testPredict = numpy.reshape(testPredict, (-1, 1))
testPredict = scaler.inverse_transform(testPredict)
l = list(dataset1) + list(testPredict)

dataframe1 = pd.read_csv('dataraw.csv', usecols=[0])
dfd = dataframe1.values
topic = []
date = dfd[-1][0]
date1 = []
pre = []
pla = []
tes = []
testPredict = list(testPredict)
for jk in range(0, len(testPredict)):
    tes.append(testPredict[jk][0])
    # date1.append(date[:-2] + str(int(date[-2:]) + jk + 1))
    date1.append(str(jk + 1))
    # date1.append(localtime)
    topic.append('trump')
    pre.append('1')
    pla.append('1')
data = {'ds':date1,'emotion_val':tes,'topic':topic,'predict':pre,'platform':pla}
df = pd.DataFrame(data)
df.to_csv('Result.csv',index=False)
plt.plot(l)
plt.show()
As you can see, the model is very simple, and that is exactly what produced the overfitting problem. Its output looks like this.
Clearly, the predicted part on the right is unusable.
Second attempt
For the second attempt, the material I gathered suggested that to get good results on a small dataset, and to avoid flat-line predictions, you can add dropout layers to deactivate a fraction of the nodes, reduce the timestep, and stack more LSTM layers. My second version of the code is below.
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
import pandas as pd
import os
from keras.models import Sequential, load_model
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

dataframe = pd.read_csv('./dataraw.csv', usecols=[2], engine='python', skipfooter=3)
dataset = dataframe.values
# convert the integer values to float
dataset = dataset.astype('float32')
# normalization, explained in the next step
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)

train_size = int(len(dataset) * 0.65)
trainlist = dataset[:train_size]
testlist = dataset[train_size:]

def create_dataset(dataset, timesteps=36, predict_steps=6):
    # build the supervised dataset: each window of `timesteps` values
    # is used to predict the following `predict_steps` values
    datax = []  # inputs
    datay = []  # targets
    for each in range(len(dataset) - timesteps - predict_steps):
        x = dataset[each:each + timesteps, 0]
        y = dataset[each + timesteps:each + timesteps + predict_steps, 0]
        datax.append(x)
        datay.append(y)
    return np.array(datax), np.array(datay)
timesteps = 9
predict_steps = 10
trainX,trainY = create_dataset(trainlist,timesteps,predict_steps)
testX, testY = create_dataset(testlist, timesteps, predict_steps)
trainX = np.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))
testX = np.reshape(testX, (testX.shape[0], testX.shape[1], 1))

# create and fit the LSTM network
model = Sequential()
model.add(LSTM(128,input_shape=(timesteps,1),return_sequences= True))
model.add(Dropout(0.5))
model.add(LSTM(128,return_sequences=True))
model.add(LSTM(64,return_sequences=False))
model.add(Dense(predict_steps))
model.compile(loss="mean_squared_error",optimizer="adam")
model.fit(trainX,trainY, epochs= 400, batch_size=10)
model.save(os.path.join("DATA","Test" + ".h5"))
# make predictions
predict_xlist = []
predict_y = []  # list that collects the predicted y values
predict_xlist.extend(dataset[dataset.shape[0] - timesteps:dataset.shape[0], 0].tolist())
while len(predict_y) < 30:
    # take the latest `timesteps` values from predict_xlist and predict the next
    # `predict_steps` values; each prediction is appended to predict_xlist, so the
    # next input window is always built from the most recent timesteps values
    predictx = np.array(predict_xlist[-timesteps:])
    predictx = np.reshape(predictx, (1, timesteps, 1))  # reshape to the LSTM input format
    lstm_predict = model.predict(predictx)
    predict_xlist.extend(lstm_predict[0])  # feed the new predictions back in for the next step
    # invert the scaling
    lstm_predict = scaler.inverse_transform(lstm_predict)
    predict_y.extend(lstm_predict[0])
l = predict_y
y_true = np.array(dataset[-30:])
train_score = np.sqrt(mean_squared_error(y_true, predict_y))
print("train score RMSE: %.2f" % train_score)
dataframe1 = pd.read_csv('dataraw.csv', usecols=[0])
dfd = dataframe1.values
topic = []
date = dfd[-1][0]
date1 = []
pre = []
pla = []
tes = []
testPredict = predict_y
for jk in range(0, len(testPredict)):
    tes.append(testPredict[jk])
    date1.append(str(jk + 1))
    topic.append('trump')
    pre.append('1')
    pla.append('1')
data = {'ds':date1,'emotion_val':tes,'topic':topic,'predict':pre,'platform':pla}
df = pd.DataFrame(data)
df.to_csv('Result.csv',index=False)
plt.plot(l)
plt.plot(y_true)
plt.show()
As you can see, there are now more LSTM layers, a few dropout layers have been added, and the timestep is smaller. Combined, these changes bring a model that should have overfitted back to life, and it can predict the future situation fairly realistically. The results are shown below.
The figure above visualizes the result: blue is the prediction and yellow is the real data. Below it is the numerical analysis, the Pearson coefficient and the mean squared error, which show that the predicted data is already very close to the real data. In the end, we decided to use this version as our prediction model.
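The analysis script itself is not reproduced above; as a reference, here is a minimal sketch of how the Pearson coefficient and mean squared error between the predicted and real sequences could be computed (assuming the two sequences are aligned and of equal length; the helper name evaluate_prediction is illustrative, not taken from the project code).

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error

def evaluate_prediction(y_real, y_pred):
    # flatten both sequences into plain 1-D arrays of equal length
    y_real = np.asarray(y_real, dtype='float32').ravel()
    y_pred = np.asarray(y_pred, dtype='float32').ravel()
    corr, _ = pearsonr(y_real, y_pred)         # Pearson correlation coefficient
    mse = mean_squared_error(y_real, y_pred)   # mean squared error
    return corr, mse

# example, mirroring the comparison in the script above:
# corr, mse = evaluate_prediction(dataset[-30:], predict_y)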
Finally, to tie the system together, we wrapped my program into a function, so it can be conveniently integrated into the backend and run automatically, making keyword analysis and prediction simpler and more efficient. A rough sketch of such a wrapper is shown below.
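The integration code itself is not shown in this post; as a minimal sketch under that description (the function name predict_topic_emotion and its parameters are my own illustration, not the project's actual interface), the wrapper only needs to reload the saved model and repeat the rolling prediction from the script above.

import numpy as np
import pandas as pd
from keras.models import load_model
from sklearn.preprocessing import MinMaxScaler

def predict_topic_emotion(csv_path, topic, timesteps=9, predict_days=30,
                          model_path='DATA/Test.h5', result_path='Result.csv'):
    # read the raw emotion series and scale it the same way as during training
    dataframe = pd.read_csv(csv_path, usecols=[2], engine='python')
    scaler = MinMaxScaler(feature_range=(0, 1))
    dataset = scaler.fit_transform(dataframe.values.astype('float32'))

    # load the trained LSTM and roll predictions forward, as in the script above
    model = load_model(model_path)
    history = dataset[-timesteps:, 0].tolist()
    predict_y = []
    while len(predict_y) < predict_days:
        x = np.reshape(np.array(history[-timesteps:]), (1, timesteps, 1))
        step = model.predict(x)
        history.extend(step[0])
        predict_y.extend(scaler.inverse_transform(step)[0])

    # write the result file the backend expects
    n = len(predict_y)
    pd.DataFrame({'ds': [str(i + 1) for i in range(n)],
                  'emotion_val': predict_y,
                  'topic': [topic] * n,
                  'predict': ['1'] * n,
                  'platform': ['1'] * n}).to_csv(result_path, index=False)
    return predict_y

The backend could then, for example, call predict_topic_emotion('dataraw.csv', 'trump') whenever a keyword needs a fresh forecast.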
Through this training project, I trained both a deep forest and an LSTM on an extremely small dataset (on the order of tens of samples) and obtained forecasts of future values. The biggest challenge with such a tiny dataset is overfitting: there is simply not enough information for the model to extract generally applicable rules, so the problem has to be attacked from two angles, the data and the model. On the data side, we can collect new data by whatever means are available, or create new samples on top of the original data with methods such as bootstrap, so that a larger dataset lets the model learn more general information; a small sketch follows below. On the model side, for the LSTM we increase the depth, add dropout layers, and reduce the timestep, which makes the model less prone to overfitting and better able to learn deeper, general rules.
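As an illustration of the data-side idea, a minimal sketch of bootstrap-style augmentation could look like this; it mirrors the random-index-sampling trick already used in create_dataset in the first script, and the helper name bootstrap_windows and the sample count are illustrative.

import numpy as np

def bootstrap_windows(series, look_back, n_samples, seed=None):
    # draw random, time-ordered index subsets from the original series to create
    # extra (input, target) window pairs beyond the plain sliding windows
    rng = np.random.default_rng(seed)
    series = np.asarray(series).ravel()
    X, Y = [], []
    for _ in range(n_samples):
        idx = np.sort(rng.choice(len(series), size=2 * look_back, replace=False))
        X.append(series[idx[:look_back]])
        Y.append(series[idx[look_back:]])
    return np.array(X), np.array(Y)

# example: create 5000 extra training pairs from a tiny series
# extraX, extraY = bootstrap_windows(dataset, look_back=7, n_samples=5000)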
That concludes this article on Phase 3 of the public opinion analysis model for the innovation training project; I hope it is helpful.