BOW Model, CountVectorizer Model, and TF-IDF Model

Getting Started with Natural Language Processing

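I. BOW (bag-of-words) model: a piece of text or a document is represented as an unordered set of words, and each word's occurrence is treated as independent of the others. In the form used here, the document vector is binary (1 if the word occurs, 0 if it does not).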

Example:

Doc1: practice makes perfect perfect.

Doc2: nobody is perfect.

With Doc1 and Doc2 as the corpus, the vocabulary is (practice makes perfect nobody is).

Doc1 as a BOW vector: [1,1,1,0,0]

Doc2 as a BOW vector: [0,0,1,1,1]
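The binary vectors above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper names (`tokenize`, `build_vocabulary`, `binary_bow`) are made up for illustration, and the tokenizer simply lowercases the text and keeps runs of letters, which is an assumption rather than anything the BOW model prescribes.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(docs):
    """Collect distinct words in order of first appearance across the corpus."""
    vocab = []
    for doc in docs:
        for word in tokenize(doc):
            if word not in vocab:
                vocab.append(word)
    return vocab

def binary_bow(doc, vocab):
    """Return 1 for each vocabulary word present in the document, else 0."""
    present = set(tokenize(doc))
    return [1 if word in present else 0 for word in vocab]

docs = ["practice makes perfect perfect.", "nobody is perfect."]
vocab = build_vocabulary(docs)
vectors = [binary_bow(doc, vocab) for doc in docs]

print(vocab)       # ['practice', 'makes', 'perfect', 'nobody', 'is']
print(vectors[0])  # [1, 1, 1, 0, 0]
print(vectors[1])  # [0, 0, 1, 1, 1]
```

Note that "perfect" appears twice in Doc1 but still contributes only a 1, because the binary form records presence, not frequency.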

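II. CountVectorizer model: as with BOW, a piece of text or a document is represented as an unordered set of words, and each word's occurrence is treated as independent of the others. The difference is that each entry of the document vector is the number of times the word occurs in that document.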

Example:

Doc1: practice makes perfect perfect.

Doc2: nobody is perfect.

With Doc1 and Doc2 as the corpus, the vocabulary is (practice makes perfect nobody is).

Doc1 as a count vector: [1,1,2,0,0]

Doc2 as a count vector: [0,0,1,1,1]
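To make the difference from the binary form concrete, here is a small sketch that counts occurrences with `collections.Counter` instead of calling sklearn. The vocabulary order is copied from the example above, and `count_vector` is a hypothetical helper name, not a library function.

```python
import re
from collections import Counter

docs = ["practice makes perfect perfect.", "nobody is perfect."]
# Vocabulary in the same order as the worked example.
vocab = ["practice", "makes", "perfect", "nobody", "is"]

def count_vector(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(re.findall(r"[a-z]+", doc.lower()))
    # Counter returns 0 for words that never occur in the document.
    return [counts[word] for word in vocab]

count_vectors = [count_vector(doc, vocab) for doc in docs]

print(count_vectors[0])  # [1, 1, 2, 0, 0]
print(count_vectors[1])  # [0, 0, 1, 1, 1]
```

Only the third entry of Doc1 changes (2 instead of 1), since "perfect" is the only repeated word.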

CountVectorizer is available in sklearn: from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Stop-word list
stop_list = 'is a the of'.split()

# Declare the CountVectorizer model (unigrams and bigrams, custom stop words)
cnt = CountVectorizer(min_df=1, ngram_range=(1, 2), stop_words=stop_list)

# Corpus
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Convert the text into features
X = cnt.fit_transform(corpus)
print(type(X))  # a SciPy CSR sparse matrix, e.g. <class 'scipy.sparse.csr.csr_matrix'> (exact path depends on the SciPy version)
print(X)

# Get the list of feature terms (get_feature_names() was removed in scikit-learn 1.2)
print(list(cnt.get_feature_names_out()))
# Output:
# ['and', 'and third', 'document', 'first', 'first document', 'one', 'second',
#  'second document', 'second second', 'third', 'third one', 'this', 'this first', 'this second']

# Get the term-to-column-index mapping
print(cnt.vocabulary_)
# Output (the integer formatting may differ between versions):
# {'this': 11, 'first': 3, 'document': 2, 'this first': 12, 'first document': 4, 'second': 6,
#  'this second': 13, 'second second': 8, 'second document': 7, 'and': 0, 'third': 9, 'one': 5,
#  'and third': 1, 'third one': 10}

III. TF-IDF model (TF-IDF feature extraction is available in sklearn: from sklearn.feature_extraction.text import TfidfVectorizer)

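1. Compute the term frequency (TF)

Term frequency (TF) = number of times the term occurs in the document / total number of words in the document

2. Compute the inverse document frequency (IDF)

Inverse document frequency (IDF) = log(total number of documents in the corpus / number of documents containing the term)

3. Compute TF-IDF

TF-IDF = term frequency (TF) * inverse document frequency (IDF)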
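The three steps can be sketched directly in plain Python. The block below is an illustration under stated assumptions, not the code from the repository linked in this post: the helper names are made up, the tokenizer keeps only lowercase letter runs, and the natural logarithm is used because the formula does not specify a base.

```python
import math
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tf(term, tokens):
    """Step 1: occurrences of the term divided by the document length."""
    return tokens.count(term) / len(tokens)

def idf(term, tokenized_docs):
    """Step 2: log(number of documents / number of documents containing the term).

    Raises ZeroDivisionError if the term appears in no document,
    because the plain formula has no smoothing.
    """
    containing = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log(len(tokenized_docs) / containing)

def tf_idf(term, tokens, tokenized_docs):
    """Step 3: product of TF and IDF."""
    return tf(term, tokens) * idf(term, tokenized_docs)

docs = ["practice makes perfect perfect.", "nobody is perfect."]
tokenized = [tokenize(doc) for doc in docs]

# "practice" occurs once among the 4 words of Doc1 and in 1 of 2 documents.
score_practice = tf_idf("practice", tokenized[0], tokenized)
# "perfect" occurs in every document, so its IDF is log(2/2) = 0.
score_perfect = tf_idf("perfect", tokenized[0], tokenized)

print(round(score_practice, 4))  # 0.1733
print(score_perfect)             # 0.0
```

A word that appears in every document gets a score of 0 no matter how often it occurs, which is the point of the IDF factor. Be aware that sklearn's TfidfVectorizer, with its default settings, uses raw counts for TF, a smoothed IDF, and L2 normalization of each row, so its numbers will not match the values computed here.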

For the implementation code, see git: https://github.com/frostjsy/my_study/blob/master/nlp/feature_extract/tf_idf.py

TfidfVectorizer is available in sklearn: from sklearn.feature_extraction.text import TfidfVectorizer

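# Declare the TfidfVectorizer model; usage is similar to CountVectorizer
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words=stop_list)
x = tfidf.fit_transform(corpus)
print(list(tfidf.get_feature_names_out()))  # use get_feature_names() on scikit-learn versions before 1.0
print(tfidf.vocabulary_)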
