This article shows how to segment Chinese text in Python with jieba and remove stop words from the result. It is intended as a practical reference for developers working on this problem.
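Before the full script, it may help to see what jieba segmentation looks like on its own. The snippet below is a minimal sketch; the sample sentence is only illustrative, and the exact tokens can vary slightly with the jieba version and dictionary in use.

import jieba

# jieba.lcut segments in default (precise) mode and returns a list of tokens
words = jieba.lcut("我来到北京清华大学")
print("/ ".join(words))  # typically prints something like: 我/ 来到/ 北京/ 清华大学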
import jieba

# Build the stop-word list from a file, one stop word per line
def stopwordslist(filepath):
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='utf-8').readlines()]
    return stopwords

# Segment a sentence with jieba and drop stop words
def seg_sentence(sentence):
    sentence_seged = jieba.cut(sentence.strip())
    stopwords = stopwordslist('C:\\Users\\hanxi\\PycharmProjects\\Code\\venv\\stopWords2750.txt')  # path to the stop-word file
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords:
            if word != '\t':
                outstr += word
                outstr += " "
    return outstr

inputs = open('./nlp_baidu.txt', 'r', encoding='utf-8')
outputs = open('./output.txt', 'w')
for line in inputs:
    line_seg = seg_sentence(line)  # the return value is a string
    outputs.write(line_seg + '\n')
outputs.close()
inputs.close()
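The script above works, but two refinements are worth considering: the original reloads the stop-word file for every input line, and list membership checks scan the whole list. Loading the stop words once into a set makes each lookup constant-time, and with blocks close the files automatically. The following is one possible rework under those assumptions; the function names load_stopwords and the two-argument seg_sentence are illustrative, and the file paths are taken from the original script.

import jieba

def load_stopwords(filepath):
    # A set gives constant-time membership tests for the stop-word filter
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}

def seg_sentence(sentence, stopwords):
    words = jieba.cut(sentence.strip())
    # Keep only tokens that are neither stop words nor pure whitespace
    return " ".join(w for w in words if w not in stopwords and w.strip())

stopwords = load_stopwords('stopWords2750.txt')  # adjust to your stop-word file path
with open('./nlp_baidu.txt', 'r', encoding='utf-8') as inputs, \
     open('./output.txt', 'w', encoding='utf-8') as outputs:
    for line in inputs:
        outputs.write(seg_sentence(line, stopwords) + '\n')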
That concludes this article on Chinese word segmentation and stop-word removal in Python; we hope it proves useful in your own projects.