Two Code Examples of the Bag-of-Words Model

Code 1

import numpy as np

texts = ['i have a melon', 'you have a banana', 'you and i have a melon and a banana']

# Build the vocabulary as (index, word) pairs from the unique words.
# Note: set ordering is not guaranteed, so the indices can differ between runs.
vocabulary = list(enumerate(set([word for sentence in texts for word in sentence.split()])))
print('Vocabulary:', vocabulary)

def vectorize(text):
    # text is a list of words; returns one count per vocabulary entry
    vector = np.zeros(len(vocabulary))
    for i, word in vocabulary:
        num = 0
        for w in text:
            if w == word:
                num += 1
        if num:
            vector[i] = num
    return vector

print('Vectors:')
for sentence in texts:
    print(vectorize(sentence.split()))

Vocabulary: [(0, 'a'), (1, 'have'), (2, 'and'), (3, 'melon'), (4, 'banana'), (5, 'i'), (6, 'you')]
Vectors:
[1. 1. 0. 1. 0. 1. 0.]
[1. 1. 0. 0. 1. 0. 1.]
[2. 1. 2. 1. 1. 1. 1.]

Code 2

from sklearn.feature_extraction.text import CountVectorizer

texts = ['i have a melon', 'you have a banana', 'you and i have a melon and a banana']

# Convert all texts to lowercase (optional: CountVectorizer lowercases by default)
# texts = [text.lower() for text in texts]

# Use CountVectorizer to build the vocabulary and vectorize the texts.
# To also count one-letter words such as "i" and "a", use token_pattern=r'(?u)\b\w+\b'
count = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
bag = count.fit_transform(texts)
print('Vocabulary:', count.vocabulary_)
print('Vectors:')
print(bag.toarray())

Vocabulary: {'i': 4, 'have': 3, 'a': 0, 'melon': 5, 'you': 6, 'banana': 2, 'and': 1}
Vectors:
[[1 0 0 1 1 1 0]
 [1 0 1 1 0 0 1]
 [2 2 1 1 1 1 1]]

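The first example builds the vocabulary by hand and defines a vectorize function that converts a text into a vector. The function counts how many times each vocabulary word occurs in the sentence and stores the counts in an array with the same length as the vocabulary.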
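As a minimal alternative sketch (not the code from Code 1), the same counting can be done with collections.Counter from the Python standard library. The helper name vectorize_with_counter is hypothetical, and sorting the vocabulary is an assumption made here so that the column order is the same on every run; the set-based vocabulary in Code 1 can change order between runs.

```python
from collections import Counter

texts = ['i have a melon', 'you have a banana', 'you and i have a melon and a banana']

# Sorted vocabulary: an assumption made for a reproducible column order.
vocabulary = sorted({word for sentence in texts for word in sentence.split()})

def vectorize_with_counter(sentence):
    """Return the word counts of `sentence` in vocabulary order (hypothetical helper)."""
    counts = Counter(sentence.split())
    # A Counter returns 0 for words that do not occur in the sentence.
    return [counts[word] for word in vocabulary]

vectors = [vectorize_with_counter(sentence) for sentence in texts]

print('Vocabulary:', vocabulary)
for vector in vectors:
    print(vector)
```

With the vocabulary sorted alphabetically, the resulting vectors line up column for column with the CountVectorizer output shown for Code 2.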

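The second example uses the CountVectorizer class, a tool from the scikit-learn library for converting text into a bag-of-words representation. CountVectorizer builds the vocabulary automatically and counts how many times each word occurs in each document.

By default, one-letter tokens such as i and a are not counted as words. This follows from the default token_pattern, '(?u)\b\w\w+\b', which matches runs of at least two word characters between word boundaries. It is separate from stop-word filtering, which is controlled by the stop_words parameter. To include one-letter words, set token_pattern to r'(?u)\b\w+\b', as Code 2 does.

The indices in vocabulary_ are assigned in alphabetical order of the terms (a is 0, and is 1, banana is 2, and so on), not by order of first appearance or by frequency. The printed dictionary lists its keys in the order in which the terms were first seen, but it is the index values that determine the column order of the vectors.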
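To see what the two token patterns match, the sketch below applies them with Python's re module directly instead of going through CountVectorizer. The helper name build_vocabulary is hypothetical. It assigns indices in alphabetical order, which is consistent with the index values in the Code 2 output, but it is only an imitation for illustration and not scikit-learn's implementation.

```python
import re

DEFAULT_PATTERN = r'(?u)\b\w\w+\b'  # CountVectorizer's default: tokens of 2+ characters
SINGLE_PATTERN = r'(?u)\b\w+\b'     # pattern used in Code 2: tokens of 1+ characters

def build_vocabulary(texts, pattern):
    """Map each distinct token to an index in alphabetical order (hypothetical helper)."""
    tokens = {tok for text in texts for tok in re.findall(pattern, text.lower())}
    return {term: index for index, term in enumerate(sorted(tokens))}

texts = ['i have a melon', 'you have a banana', 'you and i have a melon and a banana']

default_vocab = build_vocabulary(texts, DEFAULT_PATTERN)
full_vocab = build_vocabulary(texts, SINGLE_PATTERN)

print('Default pattern:', default_vocab)   # "i" and "a" are missing
print('Code 2 pattern: ', full_vocab)      # "i" and "a" are included
```

The only difference between the two patterns is the second \w, which is what makes the default require at least two characters per token.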
