Dream of the Red Chamber (红楼梦) in NLP


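After a month of going back and forth with NLP I have still only scratched the surface, so today I am going to use NLP to work through a novel I love, Dream of the Red Chamber (红楼梦).

The goal of this post is to build a knowledge base for Dream of the Red Chamber:

1. Keyword extraction from the main characters' dialogue

2.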

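I. Building the corpus

The corpus is the foundation for the word segmentation and modelling that come later. We build it by writing out each chapter of Dream of the Red Chamber with one sentence per line.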

└─data                 # root directory
    ├─chapters         # source chapter texts
    │      01.txt
    │      02.txt
    │      03.txt
    │      04.txt
    │      05.txt
    │      06.txt
    │      07.txt
    │
    └─corpus           # generated corpus files
           01.txt
           02.txt
           03.txt
           04.txt
           05.txt
           06.txt
           07.txt

# construct_corpus.py
import os
import re
# defaultdict returns a default value instead of raising KeyError when a missing key is looked up
from collections import defaultdict
from itertools import chain
from string import punctuation

import matplotlib.pyplot as plt
import pandas

# Punctuation and other characters to delete
add_punc='，。、【 】 “”：；（）《》‘’{}？！⑦()、%^>℃：.”“^-——=&#@￥『』'
all_punc=punctuation+add_punc

os.chdir('D:/good_study/NLP/红楼梦/')
chapters_path = 'D:/good_study/NLP/红楼梦/data/chapters/'
corpus_path = 'D:/good_study/NLP/红楼梦/data/corpus/'
#/*-----------------------------------------------*/
#/* 1. Build the corpus: one sentence per line for each chapter
#/*-----------------------------------------------*/
# List the chapter files
listdir = os.listdir(chapters_path)
# listdir=listdir[:9]
# Sentences of every chapter (one inner list per chapter)
sentences_all_list = []
for filename in listdir:
    print("Processing chapter file {}".format(filename))
    chapters_root_path = chapters_path + str(filename)
    # Sentences of the current chapter
    sentences_list = []
    with open(chapters_root_path, 'r', encoding='utf8') as f:
        for line in f.readlines():
            # Split on [，。！；？] to get sentence-like segments.
            line_split = re.split(r'[，。！；？]', line.strip())
            # Drop empty and single-character fragments.
            line_split = [s.strip() for s in line_split if len(s.strip()) > 1]
            # Remove Latin letters and digits
            line_split = [re.sub(r'[A-Za-z0-9]', '', s) for s in line_split]
            # Remove punctuation
            line_split = [''.join(ch for ch in s if ch not in all_punc) for s in line_split]
            sentences_list.append(line_split)
    # chain.from_iterable flattens the nested lists into one list
    sentences_list = list(chain.from_iterable(sentences_list))
    sentences_all_list.append(sentences_list)
    corpus_root_path = corpus_path + str(filename)
    with open(corpus_root_path, "w", encoding='utf8') as f:
        for line in sentences_list:
            f.write(line)
            f.write('\n')
# Build the whole-book corpus
sentences_all_list = list(chain.from_iterable(sentences_all_list))
corpus_root_path = corpus_path + 'whole_book.txt'
with open(corpus_root_path, "w", encoding='utf8') as f:
    for line in sentences_all_list:
        f.write(line)
        f.write('\n')
#/*-----------------------------------------------*/
#/* 2. Count the characters in each chapter
#/*-----------------------------------------------*/
# List the corpus files
listdir = os.listdir(corpus_path)
line_words_list = []
chapter_list = []
# listdir=listdir[:9]
for filename in listdir:
    # Extract the chapter number; skip files without one (e.g. whole_book.txt)
    digits = re.findall(r'\d+', filename)
    if not digits:
        continue
    num = int(digits[0])
    chapter_list.append(num)
    corpus_root_path = corpus_path + str(filename)
    with open(corpus_root_path, "r", encoding='utf8') as f:
        line_words = 0
        for line in f.readlines():
            # Note: len(line) includes the trailing newline of each line
            line_words += len(line)
    line_words_list.append(line_words)
    print("File {}: {} characters (parsed chapter number {})".format(filename, line_words, num))
chapter_words = pandas.DataFrame({'chapter': chapter_list, 'chapter_words': line_words_list})
chapter_words.sort_values(by='chapter', ascending=True, inplace=True)
chapter_words = chapter_words.set_index(keys=['chapter'])
chapter_words['chapter_words'].plot(kind='bar', color='g', alpha=0.5, figsize=(20, 15))
plt.show()
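
The per-line cleaning step in the script above can be tried on a single string without touching any files. The following is a minimal sketch under stated assumptions: `split_sentences` is a hypothetical helper name, `EXTRA_PUNC` is a shortened stand-in for the script's `add_punc`, and, unlike the script, segments that end up empty after cleaning are dropped.

```python
import re
from string import punctuation

# Assumption: a shortened stand-in for the add_punc string in the script.
EXTRA_PUNC = '，。、“”：；（）《》‘’？！'
ALL_PUNC = set(punctuation + EXTRA_PUNC)


def split_sentences(text):
    """Split text on ，。！；？ and return the cleaned segments."""
    cleaned = []
    for seg in re.split(r'[，。！；？]', text.strip()):
        seg = seg.strip()
        # Skip empty and single-character fragments, as the script does.
        if len(seg) <= 1:
            continue
        # Remove Latin letters and digits, then any remaining punctuation.
        seg = re.sub(r'[A-Za-z0-9]', '', seg)
        seg = ''.join(ch for ch in seg if ch not in ALL_PUNC)
        # Difference from the script: drop segments that are now empty.
        if seg:
            cleaned.append(seg)
    return cleaned


sample = '满纸荒唐言，一把辛酸泪！都云作者痴，谁解其中味？'
for sentence in split_sentences(sample):
    print(sentence)
```

Each printed line corresponds to one line that would be written to a corpus file.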

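After processing, the whole book comes to about 820,000 characters, an average of roughly 7,000 per chapter. The per-chapter counts (the bar chart plotted by the script above) rise and fall much as the story does: there is a modest peak around chapters 57-78, which is also where the love story of Baoyu and Daiyu reaches its height and the conflicts among the characters pile up.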
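
The total and the average can be checked with a few lines of arithmetic. This is only a sketch: `summarize` is a hypothetical helper, the counts in `toy_counts` are made up for illustration, and the last line assumes the standard 120-chapter edition, which the post does not state.

```python
def summarize(chapter_counts):
    """Return (total, mean, chapters above the mean) for a {chapter: count} mapping."""
    total = sum(chapter_counts.values())
    mean = total / len(chapter_counts)
    above = sorted(ch for ch, n in chapter_counts.items() if n > mean)
    return total, mean, above


# Made-up counts for illustration only; these are not the real figures.
toy_counts = {1: 6000, 2: 7000, 3: 9000, 4: 6000}
total, mean, above = summarize(toy_counts)
print(total, mean, above)

# Rough check of the quoted figures, assuming 120 chapters.
print(round(820_000 / 120))
```

With real data, the mapping could be built from the `chapter_list` and `line_words_list` values collected by the script.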

References:

Chinese-English parallel corpus of Dream of the Red Chamber: http://corpus.usx.edu.cn/hongloumeng/images/shiyongshuoming.htm

Online search system for the modern and classical Chinese corpora: http://ccl.pku.edu.cn:8080/ccl_corpus/index.jsp?dir=xiandai

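II. Word segmentation: building a lexicon for Dream of the Red Chamber

Segmentation methods fall into two families: rule-based and statistical. We do not yet have a lexicon for the novel, and the general-purpose Chinese NLP tools available today are trained mainly on modern Chinese, so they handle classical Chinese poorly. I found a package online called 甲言 (Jiayan), whose name is short for 「甲骨文言」; it is an NLP toolkit focused on classical Chinese.

The current version supports five features: lexicon construction, automatic word segmentation, part-of-speech tagging, sentence segmentation of unpunctuated classical text (句读), and punctuation. More features are under development.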
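
To make the idea of statistical lexicon construction concrete, here is a toy sketch of one statistic commonly used for it: pointwise mutual information (PMI) between adjacent characters. It is not Jiayan's implementation or API; `bigram_pmi`, `min_count`, and the toy sentences are assumptions made for illustration, and a real lexicon builder would combine more signals and handle longer candidates.

```python
import math
from collections import Counter


def bigram_pmi(sentences, min_count=2):
    """Score adjacent character pairs by PMI = log(p(xy) / (p(x) * p(y)))."""
    char_counts = Counter()
    pair_counts = Counter()
    for sent in sentences:
        char_counts.update(sent)
        pair_counts.update(sent[i:i + 2] for i in range(len(sent) - 1))
    n_chars = sum(char_counts.values())
    n_pairs = sum(pair_counts.values())
    scores = {}
    for pair, count in pair_counts.items():
        # Ignore rare pairs, whose PMI estimates are unreliable.
        if count < min_count:
            continue
        p_xy = count / n_pairs
        p_x = char_counts[pair[0]] / n_chars
        p_y = char_counts[pair[1]] / n_chars
        scores[pair] = math.log(p_xy / (p_x * p_y))
    return scores


# Toy sentences in the one-sentence-per-line form of the corpus files.
toy_corpus = ['宝玉笑道', '黛玉笑道', '宝玉听了', '黛玉听了', '宝玉道']
scores = bigram_pmi(toy_corpus)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Pairs whose characters almost always occur together score high and are candidate words. On the real corpus, the input would be the lines of whole_book.txt.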

For a fix to the error raised by pip install kenlm on Windows, see the linked post.

2.1 HMM

2.2 CRF

2.3 Measuring segmentation consistency

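III. Named entity recognition
IV. Chapter summaries
V. Chapter overviews
VI. Chapter tags
VII. The social network of Dream of the Red Chamber
VIII. Chapter overviews
IX. Chapter overviews
X. Chapter overviews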

To be continued...
