Preprocessing English Text with NLTK and Computing Similarity with gensim


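"This is code I worked out on my own in my first year of graduate school. It was already somewhat dated at the time, but it still has some reference value. Kept only as a record; not of great significance." (Preface)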

Source: here

The explanatory notes there are excellent.

Reference: 52nlp, parts (3), (2), and (1)

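Tokenize and stem the whole corpus

Convert the corpus into a vector space using tf-idf
Compute the cosine distance between every pair of documents as a measure of similarity
Cluster the documents with the k-means algorithm
Reduce the dimensionality of the corpus with multidimensional scaling
Plot the resulting clusters with matplotlib and mpld3
Run Ward clustering on the corpus to produce a hierarchical clustering
Plot the Ward dendrogram
Perform topic modeling with latent Dirichlet allocation (LDA)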

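To bring in NLTK (the well-known Python natural language processing toolkit), first install its dependencies NumPy and PyYAML.

Then download the corpora that NLTK provides (otherwise later steps will raise errors):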

>>> import nltk

>>> nltk.download()

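A graphical window opens and offers two data collections for download, all-corpora and book. It is best to select and download both. This takes some time, and NLTK is only really usable on your machine once the data has finished downloading.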

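Converting the corpus into a vector space with tf-idf

a. Prepare the data

b. Import nltk, test it, and start processing the data

c. Import nltk's word_tokenize function and tokenize the text

d. Remove stopwords using the English stopword list that nltk provides

e. Filter out punctuation; first define a list of punctuation marks

f. Stem the English words. NLTK offers several stemmer interfaces to choose from (see http://nltk.org/api/nltk.stem.html), including well-known English stemmers such as the Lancaster Stemmer and the Porter Stemmer. Here we use LancasterStemmer.

f1. Remove low-frequency words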

>>> all_stems = sum(texts_stemmed, [ ])

>>> stems_once = set(stem for stem in set(all_stems) if all_stems.count(stem) == 1)

>>> texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]

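In practice this step proved well worth doing. A stem that occurs only once in the whole corpus appears in a single document, so it can never contribute to the similarity between two different documents; removing it mainly shrinks the vocabulary. (In the tfidf model it is the words that appear in every document that get value=0 and are ignored, not the rare ones.)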
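The `sum(...)` plus `list.count` version above rescans the full stem list once for every distinct stem, which gets slow as the corpus grows. Below is a minimal sketch of the same filtering with `collections.Counter`. The sample `texts_stemmed` data is made up for illustration.

```python
from collections import Counter

# Hypothetical stemmed documents, for illustration only.
texts_stemmed = [
    ["learn", "machin", "model", "learn"],
    ["model", "dat", "train"],
    ["dat", "learn", "stat"],
]

# Count every stem in a single pass over the whole corpus.
stem_counts = Counter(stem for text in texts_stemmed for stem in text)

# Stems that occur exactly once in the entire corpus.
stems_once = {stem for stem, count in stem_counts.items() if count == 1}

# Drop those stems from every document.
texts = [[stem for stem in text if stem not in stems_once]
         for text in texts_stemmed]

print(sorted(stems_once))
print(texts)
```

The result should match what the `>>>` lines above produce for the same input; only the counting strategy differs.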

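g. Extract a "bag-of-words" from the documents; map each document token to an id

[LSI stands for Latent Semantic Indexing. It refers to discovering relationships between words from a large body of documents. When two words, or a group of words, frequently appear together in the same document, they can be considered semantically related.

The bag-of-words model assumes that the words in a passage occur independently of one another. The document-term vectors built this way are therefore very high-dimensional and sparse. LSI modeling uses large-scale statistics to group related words into latent topics; in essence it clusters the vocabulary in order to reduce dimensionality.]

[The gensim.corpora.dictionary.Dictionary class assigns a unique integer id to every word that appears in the corpus. This pass also collects word counts and other related statistics.]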
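To show what the token-to-id mapping and the bag-of-words conversion amount to, here is a small pure-Python sketch. It is not gensim's implementation: `build_token2id` and this `doc2bow` are hypothetical helpers written for this illustration, and the ids gensim actually assigns may come out in a different order.

```python
from collections import Counter


def build_token2id(texts):
    """Assign each distinct token an integer id, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc2bow(text, token2id):
    """Return a sparse vector of (token_id, count) pairs sorted by id.

    Tokens missing from the mapping are skipped.
    """
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())


# Hypothetical preprocessed documents, for illustration only.
texts = [["learn", "model", "learn"], ["model", "dat"], ["dat", "learn"]]
token2id = build_token2id(texts)
corpus = [doc2bow(text, token2id) for text in texts]

print(token2id)
print(corpus)
```

Each document becomes a short list of pairs instead of a vector as long as the vocabulary, which is why the sparse form is practical for large corpora.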

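h. Convert the documents from strings into document vectors of ids

i. Compute a TF-IDF "model" from these "training documents" (the corpus): http://radimrehurek.com/gensim/tut2.html

j. Convert each document vector from raw term counts into a vector of tf-idf values

(k. Train an LSI model. With an LSI model we can map documents into a low-dimensional topic space (two-dimensional when num_topics=2)

l. Build an index on top of the LSI model: map a given document into the space of n topics through the LSI model, then compute its similarity to the other documents http://www.cnblogs.com/pinard/p/6805861.html

m. Sort by similarity)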
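To make steps i and j concrete, here is a small pure-Python sketch of the weighting. It assumes the scheme that gensim's TfidfModel uses by default as I understand it: raw term count times log2(N/df), followed by L2 normalization. The function names `fit_idf` and `to_tfidf` are made up for this illustration and are not gensim APIs.

```python
import math


def fit_idf(corpus):
    """Compute idf = log2(N / df) for every term id in a bag-of-words corpus."""
    num_docs = len(corpus)
    dfs = {}
    for doc in corpus:
        for token_id, _count in doc:
            dfs[token_id] = dfs.get(token_id, 0) + 1
    return {token_id: math.log2(num_docs / df) for token_id, df in dfs.items()}


def to_tfidf(doc, idfs):
    """Weight one bag-of-words vector by idf and scale it to unit length."""
    weighted = [(token_id, count * idfs.get(token_id, 0.0)) for token_id, count in doc]
    # Terms with zero weight (present in every document) are dropped.
    weighted = [(token_id, weight) for token_id, weight in weighted if weight > 0.0]
    norm = math.sqrt(sum(weight * weight for _, weight in weighted))
    if norm == 0.0:
        return []
    return [(token_id, weight / norm) for token_id, weight in weighted]


# Hypothetical bag-of-words corpus: term 0 appears in every document.
corpus = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 2)],
    [(0, 1), (1, 1), (2, 1)],
]
idfs = fit_idf(corpus)
corpus_tfidf = [to_tfidf(doc, idfs) for doc in corpus]

print(idfs)
for doc in corpus_tfidf:
    print(doc)
```

Term 0 occurs in all three documents, so its idf is 0 and it disappears from every tf-idf vector, while the rarer terms 1 and 2 carry all the weight.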

import nltk
from nltk.tokenize import word_tokenize

# a/b. Load the data. Assumption: the file holds one document per line.
with open('F:/iPython/newsfortfidf.txt') as f:
    testtext = [line.strip() for line in f if line.strip()]
print(testtext)

# c. Tokenize each document
texts_tokenized = [word_tokenize(document) for document in testtext]
print(texts_tokenized)

# d. Remove stopwords
from nltk.corpus import stopwords
english_stopwords = stopwords.words('english')
print(english_stopwords)
print(len(english_stopwords))
texts_filtered_stopwords = [[word for word in document if word not in english_stopwords] for document in texts_tokenized]
print(texts_filtered_stopwords)

# e. Remove punctuation
english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%', '-']
texts_filtered = [[word for word in document if word not in english_punctuations] for document in texts_filtered_stopwords]
print(texts_filtered)

# f. Stem
from nltk.stem.lancaster import LancasterStemmer
st = LancasterStemmer()
texts_stemmed = [[st.stem(word) for word in document] for document in texts_filtered]
print(texts_stemmed)

# f1. Remove stems that occur only once in the whole corpus
all_stems = sum(texts_stemmed, [])
stems_once = set(stem for stem in set(all_stems) if all_stems.count(stem) == 1)
texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]

# g. Build the dictionary
from gensim import corpora, models, similarities   # http://blog.csdn.net/questionfish/article/details/46715795
import logging
# logging.basicConfig configures the format and level of the log output
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
dictionary = corpora.Dictionary(texts)   # assigns a unique integer id to every word in the corpus
print(dictionary)
print(dictionary.token2id)               # the mapping from word to id

# h. doc2bow() counts the occurrences of each distinct word, converts the words
#    to their ids, and returns the result as a sparse vector
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)

# i/j. More on tf-idf: http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)
print(tfidf.dfs)    # dict: for each term id, the number of documents the term appears in
print(tfidf.idfs)   # dict: for each term id, how distinctive the term is; 0 if it appears in
                    # every document, larger the fewer documents contain it

# k. Train the LSI model on the tf-idf vectors
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)
corpus_lsi = lsi[corpus_tfidf]
for doc in corpus_lsi:
    print(doc)

# l. Index the documents in the same tf-idf -> LSI space the model was trained in
index = similarities.MatrixSimilarity(lsi[corpus_tfidf])

# m. Query and sort. This part follows the 52nlp course example; courses_name
#    (the list of course titles there) is not defined in the script above.
>>> print(courses_name[210])
Machine Learning
>>> ml_course = texts[210]
>>> ml_bow = dictionary.doc2bow(ml_course)
>>> ml_lsi = lsi[tfidf[ml_bow]]
>>> print(ml_lsi)
>>> sims = index[ml_lsi]
>>> sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])

Complete Chinese example

Same as the example above

Revised version
import logging
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.lancaster import LancasterStemmer
from gensim import corpora, models, similarities

def text_pre(text_raw):
    # Tokenize, remove stopwords and punctuation, stem, then drop stems that occur only once.
    texts_tokenized = [word_tokenize(text_raw)]
    english_stopwords = stopwords.words('english')
    texts_filtered_stopwords = [[word for word in document if word not in english_stopwords] for document in texts_tokenized]
    english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%', '-', "``", "''", "'s", "--", "'"]
    texts_filtered = [[word for word in document if word not in english_punctuations] for document in texts_filtered_stopwords]
    st = LancasterStemmer()
    texts_stemmed = [[st.stem(word) for word in document] for document in texts_filtered]
    # Note: the counts here cover only this one text, so every stem that appears once in it is removed.
    all_stems = sum(texts_stemmed, [])
    stems_once = set(stem for stem in set(all_stems) if all_stems.count(stem) == 1)
    texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]
    return texts

with open('F:/iPython/newsfortfidf.txt') as f:
    text_raw = f.readline()
text0 = text_pre(text_raw)
text0 = text0[0]
# text1, text2, ... are prepared the same way
texts = [text0, text1, text2, text3]

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)
tfidf = models.TfidfModel(corpus)
doc_row = [(0, 1), (1, 1), (7, 2), (8, 3)]  # doc_row is the vector representation of one document in the same vector space
corpus_tfidf0 = tfidf[doc_row]
lsi_model = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
corpus_lsi = lsi_model[corpus]
for doc in corpus_lsi:
    print(doc)
index = similarities.MatrixSimilarity(lsi_model[corpus])      # document index based on the LSI model
The rest is the same as above.

4. Compute the cosine distance between documents to measure similarity

① Compute the similarities, then write them to a txt file

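# vec_tfidf (the tf-idf vector of the product description) and storepath (the output path) are defined elsewhere
index = similarities.MatrixSimilarity(corpus_tfidf)  # build an index over all the reviews
sims = index[vec_tfidf]  # use the index to compute the similarity between each review and the product description
similarity = list(sims)  # store the similarities as a list so they can be written to a txt file
with open(storepath, 'w', encoding='utf-8') as sim_file:  # don't forget the encoding when writing the txt file
    for i in similarity:
        sim_file.write(str(i) + '\n')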
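The values that the index returns are cosine similarities between the query vector and each indexed document. As a rough illustration of that measure (not of how gensim computes it internally), here is a pure-Python sketch on sparse `(id, weight)` vectors. The helper `cosine_similarity` and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as lists of (id, weight)."""
    weights_a = dict(vec_a)
    weights_b = dict(vec_b)
    # Only ids present in both vectors contribute to the dot product.
    dot = sum(w * weights_b[i] for i, w in weights_a.items() if i in weights_b)
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and documents, for illustration only.
query = [(0, 1.0), (1, 1.0)]
documents = [
    [(0, 1.0), (1, 1.0)],  # same direction as the query
    [(2, 3.0)],            # shares no terms with the query
    [(0, 1.0)],            # partial overlap
]
sims = [cosine_similarity(query, doc) for doc in documents]

# Rank documents from most to least similar, as in step m.
ranked = sorted(enumerate(sims), key=lambda item: -item[1])
print(ranked)
```

Documents that share no terms with the query score 0, which is also why stems that occur in only one document cannot raise the similarity between two different documents.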