lda模型:官方处理方式和自己处理数据对比

2024-05-29 03:36

本文主要是介绍lda模型:官方处理方式和自己处理数据对比,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

自己处理数据,然后分批训练,第一步先对比自己处理的方式和官方是否一致。

官方的代码

import gensim
from gensim import corpora
from gensim.models import LdaModel# 示例数据
documents = ["Human machine interface for lab abc computer applications","A survey of user opinion of computer system response time","The EPS user interface management system","System and human system engineering testing of EPS","Relation of user perceived response time to error measurement","The generation of random binary unordered trees","The intersection graph of paths in trees","Graph minors IV Widths of trees and well quasi ordering","Graph minors A survey"
]# 预处理数据
texts = [[word for word in document.lower().split()] for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]# 训练 LDA 模型
lda_model = LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15, random_state=2024)# 打印每个主题的关键词
for idx, topic in lda_model.print_topics(-1):print(f"Topic: {idx}\nWords: {topic}\n")# 推断新文档的主题分布
new_doc = "Human computer interaction"
new_doc_processed = [word for word in new_doc.lower().split()]
new_doc_bow = dictionary.doc2bow(new_doc_processed)
print(new_doc_bow)
print("New document topic distribution:", lda_model.get_document_topics(new_doc_bow))

结果

Topic: 0
Words: 0.078*"graph" + 0.078*"trees" + 0.078*"the" + 0.078*"of" + 0.078*"in" + 0.078*"intersection" + 0.078*"paths" + 0.013*"minors" + 0.013*"interface" + 0.013*"survey"Topic: 1
Words: 0.062*"of" + 0.034*"measurement" + 0.034*"relation" + 0.034*"to" + 0.034*"error" + 0.034*"perceived" + 0.034*"lab" + 0.034*"applications" + 0.034*"for" + 0.034*"machine"Topic: 2
Words: 0.062*"minors" + 0.062*"trees" + 0.062*"the" + 0.062*"binary" + 0.062*"random" + 0.062*"generation" + 0.062*"unordered" + 0.062*"a" + 0.062*"survey" + 0.062*"graph"Topic: 3
Words: 0.134*"system" + 0.073*"human" + 0.073*"eps" + 0.073*"and" + 0.073*"of" + 0.073*"engineering" + 0.073*"testing" + 0.012*"time" + 0.012*"user" + 0.012*"response"Topic: 4
Words: 0.090*"of" + 0.090*"user" + 0.090*"system" + 0.049*"computer" + 0.049*"response" + 0.049*"time" + 0.049*"survey" + 0.049*"a" + 0.049*"interface" + 0.049*"management"[(2, 1), (4, 1)]
New document topic distribution: [(0, 0.066698), (1, 0.7288686), (2, 0.06669144), (3, 0.06943816), (4, 0.068303764)]

print(dictionary.token2id)'''
{'abc': 0, 'applications': 1, 'computer': 2, 'for': 3, 'human': 4, 'interface': 5, 'lab': 6, 'machine': 7, 'a': 8, 'of': 9, 'opinion': 10, 'response': 11, 'survey': 12, 'system': 13, 'time': 14, 'user': 15, 'eps': 16, 'management': 17, 'the': 18, 'and': 19, 'engineering': 20, 'testing': 21, 'error': 22, 'measurement': 23, 'perceived': 24, 'relation': 25, 'to': 26, 'binary': 27, 'generation': 28, 'random': 29, 'trees': 30, 'unordered': 31, 'graph': 32, 'in': 33, 'intersection': 34, 'paths': 35, 'iv': 36, 'minors': 37, 'ordering': 38, 'quasi': 39, 'well': 40, 'widths': 41}
'''print(corpus)'''
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (8, 1), (9, 2), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)], [(5, 1), (13, 1), (15, 1), (16, 1), (17, 1), (18, 1)], [(4, 1), (9, 1), (13, 2), (16, 1), (19, 1), (20, 1), (21, 1)], [(9, 1), (11, 1), (14, 1), (15, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1)], [(9, 1), (18, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1)], [(9, 1), (18, 1), (30, 1), (32, 1), (33, 1), (34, 1), (35, 1)], [(9, 1), (19, 1), (30, 1), (32, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1)], [(8, 1), (12, 1), (32, 1), (37, 1)]]
'''

自己处理方式


def get_dictionary(input_data):output_dict = {}count = 0for l in input_data:l_list = l.strip().lower().split(" ")sorted_l_list = sorted(l_list)for k in sorted_l_list:if k not in output_dict:output_dict[k] = countcount += 1return output_dictmy_dict = get_dictionary(documents)
print(my_dict)def get_corpus(input_dict, input_data):output_list = []for l in input_data:tmp_dict = {}l_list = l.strip().lower().split(" ")for k in l_list:if k not in tmp_dict:tmp_dict[k] = 0tmp_dict[k] += 1tmp_list = []for k, v in tmp_dict.items():if k in input_dict.keys():tmp_list.append((input_dict[k], v))else:continueoutput_list.append(sorted(tmp_list))return output_listmy_corpus = get_corpus(my_dict, documents)
print(my_corpus)def get_predict_corpus(input_dict, input_data):tmp_dict = {}l_list = input_data.strip().lower().split(" ")for k in l_list:if k not in tmp_dict:tmp_dict[k] = 0tmp_dict[k] += 1tmp_list = []for k, v in tmp_dict.items():if k in input_dict.keys():tmp_list.append((input_dict[k], v))else:continuereturn sorted(tmp_list)'''
{'abc': 0, 'applications': 1, 'computer': 2, 'for': 3, 'human': 4, 'interface': 5, 'lab': 6, 'machine': 7, 'a': 8, 'of': 9, 'opinion': 10, 'response': 11, 'survey': 12, 'system': 13, 'time': 14, 'user': 15, 'eps': 16, 'management': 17, 'the': 18, 'and': 19, 'engineering': 20, 'testing': 21, 'error': 22, 'measurement': 23, 'perceived': 24, 'relation': 25, 'to': 26, 'binary': 27, 'generation': 28, 'random': 29, 'trees': 30, 'unordered': 31, 'graph': 32, 'in': 33, 'intersection': 34, 'paths': 35, 'iv': 36, 'minors': 37, 'ordering': 38, 'quasi': 39, 'well': 40, 'widths': 41}
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (8, 1), (9, 2), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)], [(5, 1), (13, 1), (15, 1), (16, 1), (17, 1), (18, 1)], [(4, 1), (9, 1), (13, 2), (16, 1), (19, 1), (20, 1), (21, 1)], [(9, 1), (11, 1), (14, 1), (15, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1)], [(9, 1), (18, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1)], [(9, 1), (18, 1), (30, 1), (32, 1), (33, 1), (34, 1), (35, 1)], [(9, 1), (19, 1), (30, 1), (32, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1)], [(8, 1), (12, 1), (32, 1), (37, 1)]]
'''

my_dict == dictionary.token2id'''
True
'''my_corpus == corpus'''
True
'''


# 训练 LDA 模型
my_lda_model = LdaModel(my_corpus, num_topics=5, passes=15, random_state=2024)
print(my_lda_model)# 打印每个主题的关键词
for idx, topic in my_lda_model.print_topics(-1):print(f"Topic: {idx}\nWords: {topic}\n")# 推断新文档的主题分布
new_doc = "Human computer interaction"
new_doc_bow = get_predict_corpus(my_dict, new_doc)
print(new_doc_bow)
print("New document topic distribution:", lda_model.get_document_topics(new_doc_bow))

结果

LdaModel<num_terms=42, num_topics=5, decay=0.5, chunksize=2000>
Topic: 0
Words: 0.078*"32" + 0.078*"30" + 0.078*"18" + 0.078*"9" + 0.078*"33" + 0.078*"34" + 0.078*"35" + 0.013*"37" + 0.013*"5" + 0.013*"12"Topic: 1
Words: 0.062*"9" + 0.034*"23" + 0.034*"25" + 0.034*"26" + 0.034*"22" + 0.034*"24" + 0.034*"6" + 0.034*"1" + 0.034*"3" + 0.034*"7"Topic: 2
Words: 0.062*"37" + 0.062*"30" + 0.062*"18" + 0.062*"27" + 0.062*"29" + 0.062*"28" + 0.062*"31" + 0.062*"8" + 0.062*"12" + 0.062*"32"Topic: 3
Words: 0.134*"13" + 0.073*"4" + 0.073*"16" + 0.073*"19" + 0.073*"9" + 0.073*"20" + 0.073*"21" + 0.012*"14" + 0.012*"15" + 0.012*"11"Topic: 4
Words: 0.090*"9" + 0.090*"15" + 0.090*"13" + 0.049*"2" + 0.049*"11" + 0.049*"14" + 0.049*"12" + 0.049*"8" + 0.049*"5" + 0.049*"17"[(2, 1), (4, 1)]
New document topic distribution: [(0, 0.06669798), (1, 0.72894156), (2, 0.06669143), (3, 0.06936743), (4, 0.06830162)]

这篇关于lda模型:官方处理方式和自己处理数据对比的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/1012532

相关文章

Python+FFmpeg实现视频自动化处理的完整指南

《Python+FFmpeg实现视频自动化处理的完整指南》本文总结了一套在Python中使用subprocess.run调用FFmpeg进行视频自动化处理的解决方案,涵盖了跨平台硬件加速、中间素材处理... 目录一、 跨平台硬件加速:统一接口设计1. 核心映射逻辑2. python 实现代码二、 中间素材处

idea设置快捷键风格方式

《idea设置快捷键风格方式》在IntelliJIDEA中设置快捷键风格,打开IDEA,进入设置页面,选择Keymap,从Keymaps下拉列表中选择或复制想要的快捷键风格,点击Apply和OK即可使... 目录idea设www.chinasem.cn置快捷键风格按照以下步骤进行总结idea设置快捷键pyth

Linux镜像文件制作方式

《Linux镜像文件制作方式》本文介绍了Linux镜像文件制作的过程,包括确定磁盘空间布局、制作空白镜像文件、分区与格式化、复制引导分区和其他分区... 目录1.确定磁盘空间布局2.制作空白镜像文件3.分区与格式化1) 分区2) 格式化4.复制引导分区5.复制其它分区1) 挂载2) 复制bootfs分区3)

Spring Boot Interceptor的原理、配置、顺序控制及与Filter的关键区别对比分析

《SpringBootInterceptor的原理、配置、顺序控制及与Filter的关键区别对比分析》本文主要介绍了SpringBoot中的拦截器(Interceptor)及其与过滤器(Filt... 目录前言一、核心功能二、拦截器的实现2.1 定义自定义拦截器2.2 注册拦截器三、多拦截器的执行顺序四、过

MySQL快速复制一张表的四种核心方法(包括表结构和数据)

《MySQL快速复制一张表的四种核心方法(包括表结构和数据)》本文详细介绍了四种复制MySQL表(结构+数据)的方法,并对每种方法进行了对比分析,适用于不同场景和数据量的复制需求,特别是针对超大表(1... 目录一、mysql 复制表(结构+数据)的 4 种核心方法(面试结构化回答)方法 1:CREATE

Go异常处理、泛型和文件操作实例代码

《Go异常处理、泛型和文件操作实例代码》Go语言的异常处理机制与传统的面向对象语言(如Java、C#)所使用的try-catch结构有所不同,它采用了自己独特的设计理念和方法,:本文主要介绍Go异... 目录一:异常处理常见的异常处理向上抛中断程序恢复程序二:泛型泛型函数泛型结构体泛型切片泛型 map三:文

详解C++ 存储二进制数据容器的几种方法

《详解C++存储二进制数据容器的几种方法》本文主要介绍了详解C++存储二进制数据容器,包括std::vector、std::array、std::string、std::bitset和std::ve... 目录1.std::vector<uint8_t>(最常用)特点:适用场景:示例:2.std::arra

C++,C#,Rust,Go,Java,Python,JavaScript的性能对比全面讲解

《C++,C#,Rust,Go,Java,Python,JavaScript的性能对比全面讲解》:本文主要介绍C++,C#,Rust,Go,Java,Python,JavaScript性能对比全面... 目录编程语言性能对比、核心优势与最佳使用场景性能对比表格C++C#RustGoJavapythonjav

SpringBoot返回文件让前端下载的几种方式

《SpringBoot返回文件让前端下载的几种方式》文章介绍了开发中文件下载的两种常见解决方案,并详细描述了通过后端进行下载的原理和步骤,包括一次性读取到内存和分块写入响应输出流两种方法,此外,还提供... 目录01 背景02 一次性读取到内存,通过响应输出流输出到前端02 将文件流通过循环写入到响应输出流

java敏感词过滤的实现方式

《java敏感词过滤的实现方式》文章描述了如何搭建敏感词过滤系统来防御用户生成内容中的违规、广告或恶意言论,包括引入依赖、定义敏感词类、非敏感词类、替换词类和工具类等步骤,并指出资源文件应放在src/... 目录1.引入依赖2.定义自定义敏感词类3.定义自定义非敏感类4.定义自定义替换词类5.最后定义工具类