Text Segmentation with jieba


1. Find the words in a string and print them

str1 = "The life is short,you need python"
print(str1.split())
['The', 'life', 'is', 'short,you', 'need', 'python']
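Note that split() only breaks on whitespace, so "short,you" stays together as one token. If punctuation should also count as a separator, one possible approach (not part of the original post) is a regular expression. A minimal sketch:

```python
import re

str1 = "The life is short,you need python"

# \w+ matches runs of letters, digits, and underscores,
# so the comma acts as a separator just like the spaces do.
words = re.findall(r"\w+", str1)
print(words)  # ['The', 'life', 'is', 'short', 'you', 'need', 'python']
```

This works for space-delimited languages such as English. Chinese text has no spaces between words, which is why a dedicated segmenter is needed.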

2. jieba: a third-party library for Chinese text

pip install jieba    # run in CMD to install jieba

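3. How jieba segments text
** It uses a Chinese dictionary to estimate how likely adjacent characters are to form a word
** The combinations with the highest probability are output as the segmentation result
** Besides segmenting, it lets you add custom words to the dictionary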
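To make the idea of "pick the most probable combination" concrete, here is a toy sketch of dictionary-based segmentation with dynamic programming. It is not jieba's code: the dictionary FREQ and all its numbers are made up for illustration, and jieba itself works from a far larger dictionary and treats unknown words separately.

```python
import math

# Hypothetical toy dictionary: word -> frequency (made-up numbers, not jieba's data).
FREQ = {"中国": 50, "国是": 2, "是": 80, "一个": 40, "一": 30, "个": 30,
        "伟大": 20, "的": 100, "国家": 45, "中": 10, "国": 10, "家": 10,
        "伟": 1, "大": 15}
TOTAL = sum(FREQ.values())


def segment(sentence):
    """Return the split of `sentence` with the highest total log-probability."""
    n = len(sentence)
    # best[i] = (score of the best split of sentence[i:], end index of its first word)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = sentence[i:j]
            # Unknown single characters get frequency 1 so that a path always exists.
            freq = FREQ.get(word, 1 if j == i + 1 else 0)
            if freq:
                candidates.append((math.log(freq / TOTAL) + best[j][0], j))
        best[i] = max(candidates)
    # Follow the recorded end indices to recover the words.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


print(segment("中国是一个伟大的国家"))
# ['中国', '是', '一个', '伟大', '的', '国家']
```

With these toy frequencies, "中国" + "是" beats "中" + "国是" because the product of the word probabilities is higher, which is the kind of comparison the bullet points describe.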

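4. Modes and what they do
Precise mode: splits the text exactly, with no redundant words
Full mode: outputs every word that could occur in the text, so redundant words may appear
Search engine mode: starts from precise mode, then splits long words again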

5. Common jieba functions
1> Precise mode: jieba.lcut(s)

import jieba
c = jieba.lcut("中国是一个伟大的国家")
print(c)
['中国', '是', '一个', '伟大', '的', '国家']

2> Full mode: jieba.lcut(s, cut_all=True)

import jieba
c = jieba.lcut("中国是一个伟大的国家", cut_all=True)
print(c)
['中国', '国是', '一个', '伟大', '的', '国家']

3> Search engine mode: jieba.lcut_for_search(s)

import jieba
c = jieba.lcut_for_search("中华人民共和国是伟大的")
print(c)
['中华', '华人', '人民', '共和', '共和国', '中华人民共和国', '是', '伟大', '的']

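4> Adding a word to the dictionary: jieba.add_word(s)

import jieba
jieba.add_word("帝光锡华")       # updates the dictionary in place and returns None
print(jieba.lcut("帝光锡华"))    # the new word should now come back as a single token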

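5> Counting character appearances in Romance of the Three Kingdoms (三国演义) with word segmentation

I (input): file -> one long string, via read(); define an empty dictionary: counts = {}
P (process): use jieba to segment the text into a list, then walk through every word in the list and check whether it is already in the dictionary. If it is, add 1 to its count; otherwise add the word to the dictionary as a new key.

Adding a dictionary entry: counts[key] = 1
Updating an entry's value: counts[key] = counts[key] + 1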
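The two dictionary operations combine into the usual counting loop. A small sketch on a hand-written word list (the list is illustrative and stands in for the output of jieba.lcut):

```python
words = ["曹操", "孔明", "曹操", "关羽", "孔明", "曹操"]  # stand-in for a segmented text

counts = {}
for word in words:
    if word in counts:
        counts[word] = counts[word] + 1   # key exists: update its value
    else:
        counts[word] = 1                  # new key: add it with a count of 1

print(counts)  # {'曹操': 3, '孔明': 2, '关羽': 1}
```

Writing counts[word] = counts.get(word, 0) + 1 folds both branches into a single line, because get() returns 0 when the key is missing.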

“---------------------------------------------------------------------------------------”

import jieba
text = open("三国演义.txt", "r", encoding="utf-8").read()
words = jieba.lcut(text)
counts = {}
for word in words:
    if len(word) == 1:  # skip single characters (mostly punctuation and particles)
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, highest first
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
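As an alternative to sorting the items by hand, the standard library's collections.Counter can do the counting and the ranking. This is a sketch of a different technique, not the original author's method; top_words and the sample list are names made up for illustration.

```python
from collections import Counter


def top_words(words, n):
    """Count words longer than one character and return the n most frequent."""
    counts = Counter(w for w in words if len(w) > 1)
    return counts.most_common(n)


# Hand-written stand-in for the list returned by jieba.lcut(text).
sample = ["曹操", "曰", "孔明", "曹操", "，", "关羽", "孔明", "曹操"]
for word, count in top_words(sample, 2):
    print("{0:<10}{1:>5}".format(word, count))
```

most_common(n) simply returns fewer pairs when the text has fewer than n distinct words, whereas indexing items[i] inside range(15) raises an IndexError in that case.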

“------------------------------------------------------------------------------------”

import jieba
excludes = {"将军", "却说", "荆州", "？？？", "？？？"}  # the "？？？" entries are unfilled placeholders
text = open("三国演义.txt", "r", encoding="utf-8").read()
words = jieba.lcut(text)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "云长" or word == "关公":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1  # count under the merged name
for word in excludes:
    counts.pop(word, None)  # pop() avoids a KeyError if the word never occurred
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(5):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
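As the list of aliases grows, the chain of elif branches gets long. One possible restructuring is to keep the aliases in a dictionary and look each word up. The sketch below reuses the name pairs from the listing above; ALIASES, count_characters, and the sample list are hypothetical names introduced for illustration.

```python
# Alias table built from the name pairs used in the listing above.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "云长": "关羽", "关公": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}


def count_characters(words, excludes=()):
    """Count multi-character words, folding aliases into one canonical name."""
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        name = ALIASES.get(word, word)   # fall back to the word itself
        counts[name] = counts.get(name, 0) + 1
    for word in excludes:
        counts.pop(word, None)           # ignore excluded words that never occurred
    return counts


# Hand-written stand-in for the list returned by jieba.lcut(text).
sample = ["玄德", "曰", "丞相", "孟德", "将军", "玄德曰", "曹操"]
print(count_characters(sample, excludes={"将军"}))  # {'刘备': 2, '曹操': 3}
```

Adding a new alias then means adding one dictionary entry rather than another branch.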

<<百年孤独>> (One Hundred Years of Solitude)
