IMDB Movie Review Text Classification with Genetic-Algorithm Feature Selection and a Single-Layer Perceptron Model

This article presents a case study of classifying IMDB movie review text with a single-layer perceptron model, using a genetic algorithm to select input features. It is intended as a practical reference for developers tackling similar problems.


  • 1. Data Loading and Preprocessing
  • 2. Building the Perceptron Model
  • 3. Model Training
  • 4. Feature Selection with a Genetic Algorithm
    • Notes
  • 5. Contact Us

1. Data Loading and Preprocessing

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from keras.datasets import imdb
from keras.preprocessing import sequence
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt

max_features = 10000
maxlen = 200
batch_size = 32

# Load the IMDB dataset
print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')

# Truncate/pad the reviews to a fixed length and keep the first 2000 of each split
print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)[:2000]
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)[:2000]
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)

# Map the integer sequences back to text (IMDB word indices are offset by 3)
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in input_train[0]])

# Represent the text with a bag-of-words model
vectorizer = CountVectorizer(max_features=max_features)
X_train = vectorizer.fit_transform([' '.join([reverse_word_index.get(i - 3, '?') for i in seq]) for seq in input_train])
X_test = vectorizer.transform([' '.join([reverse_word_index.get(i - 3, '?') for i in seq]) for seq in input_test])

# Convert the data to PyTorch tensors (labels are sliced to match the 2000 kept reviews)
X_train_tensor = torch.tensor(X_train.toarray(), dtype=torch.float32)
y_train_tensor = torch.tensor(y_train[:2000], dtype=torch.float32)
X_test_tensor = torch.tensor(X_test.toarray(), dtype=torch.float32)
y_test_tensor = torch.tensor(y_test[:2000], dtype=torch.float32)

batch_size = 2000
train_iter = DataLoader(TensorDataset(X_train_tensor, y_train_tensor), batch_size)
test_iter = DataLoader(TensorDataset(X_test_tensor, y_test_tensor), batch_size)
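
As an optional sanity check (a small sketch assuming the code above has just been run), the bag-of-words tensors should contain 2000 rows each, with one column per vocabulary term kept by CountVectorizer:

# Optional sanity check on the preprocessed data
print('X_train_tensor:', X_train_tensor.shape)  # (2000, number of bag-of-words features)
print('y_train_tensor:', y_train_tensor.shape)  # (2000,)
print('First decoded review (truncated):', decoded_review[:80])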

2. Building the Perceptron Model

# Define the perceptron network
class Perceptron(nn.Module):
    def __init__(self, input_size):
        super(Perceptron, self).__init__()
        self.fc = nn.Linear(input_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.fc(x)
        x = self.sigmoid(x)
        return x

# Train the perceptron model for one epoch
def train(model, iterator, optimizer, criterion):
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text, label = batch
        predictions = model(text).squeeze(1)
        loss = criterion(predictions, label)
        loss.backward()
        optimizer.step()

# Evaluate the perceptron model
def evaluate(model, iterator, criterion):
    model.eval()
    total_loss = 0
    total_correct = 0
    with torch.no_grad():
        for batch in iterator:
            text, label = batch
            predictions = model(text).squeeze(1)
            loss = criterion(predictions, label)
            total_loss += loss.item()
            rounded_preds = torch.round(predictions)
            total_correct += (rounded_preds == label).sum().item()
    return total_loss / len(iterator), total_correct / len(iterator.dataset)

# Initialize the perceptron model
input_size = X_train_tensor.shape[1]
model = Perceptron(input_size)
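
Before training, a dummy forward pass is a quick way to confirm the model returns one probability per example; this is an optional sketch that reuses the model and input_size defined just above.

# Optional: verify the output shape with a dummy batch of four all-zero vectors
dummy_batch = torch.zeros(4, input_size)
with torch.no_grad():
    probs = model(dummy_batch).squeeze(1)  # same squeeze used in train()/evaluate()
print(probs.shape)  # expected: torch.Size([4])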

3. Model Training

# Define the loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

N_EPOCHS = 10
eval_acc_list = []
for epoch in range(N_EPOCHS):
    train(model, train_iter, optimizer, criterion)
    eval_loss, eval_acc = evaluate(model, test_iter, criterion)
    eval_acc_list.append(eval_acc)
    print(f'Epoch: {epoch+1}, Test Loss: {eval_loss:.3f}, Test Acc: {eval_acc*100:.2f}%')

plt.plot(range(N_EPOCHS), eval_acc_list)
plt.title('Test Accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.show()

[Figure: test accuracy on the test set over the 10 training epochs]

4. Feature Selection with a Genetic Algorithm

# Randomly initialize the chromosomes
def initialize_population(population_size, num_genes):
    # # Option 1: bias the initial population toward keeping most features
    # p = np.array([0.05, 0.95])
    # return np.random.choice([0, 1], size=(population_size, num_genes), p=p.ravel())
    # Option 2: pick each feature with probability 0.5
    return np.random.choice([0, 1], size=(population_size, num_genes))

# Fitness is the classifier's accuracy on the test set
def calculate_fitness(population, model, criterion):
    fitness = []
    for chromosome in population:  # each chromosome is a 0-1 sequence
        selected_features = np.where(chromosome == 1)[0]
        # Update the model's input dimension to match the selected features
        input_dim = len(selected_features)
        model.fc = nn.Linear(input_dim, 1)
        optimizer = optim.Adam(model.parameters(), lr=0.001)
        idx = torch.tensor(selected_features)
        train_iter = DataLoader(TensorDataset(X_train_tensor[:, idx], y_train_tensor), batch_size)
        test_iter = DataLoader(TensorDataset(X_test_tensor[:, idx], y_test_tensor), batch_size)
        # Train on the selected features and record the resulting accuracy
        N_EPOCHS = 10
        for epoch in range(N_EPOCHS):
            train(model, train_iter, optimizer, criterion)
        test_loss, test_acc = evaluate(model, test_iter, criterion)
        model.train()
        fitness.append(test_acc)
    return np.array(fitness)

# Selection
def selection(population, fitness):  # input: the population and each individual's accuracy
    probabilities = fitness / sum(fitness)  # accuracy-based selection probabilities
    # # Option 1: no randomness, always take the two best individuals as parents
    # probabilities_copy = probabilities.copy()
    # probabilities_copy.sort()
    # max_1 = probabilities_copy[-1]
    # max_2 = probabilities_copy[-2]
    # max_1_index = np.where(probabilities == max_1)
    # max_2_index = np.where(probabilities == max_2)
    # selected_indices = [max_1_index[0].tolist()[0], max_2_index[0].tolist()[0]] * 25
    # Option 2: fitness-proportional random selection
    selected_indices = np.random.choice(range(len(population)), size=len(population), p=probabilities)
    return population[selected_indices]

# Crossover
def crossover(parents, crossover_rate):
    children = []
    for i in range(0, len(parents), 2):
        parent1, parent2 = parents[i], parents[i + 1]
        if np.random.rand() < crossover_rate:
            crossover_point = np.random.randint(1, len(parent1))
            child1 = np.concatenate((parent1[:crossover_point], parent2[crossover_point:]))
            child2 = np.concatenate((parent2[:crossover_point], parent1[crossover_point:]))
        else:
            child1, child2 = parent1, parent2
        children.extend([child1, child2])
    return np.array(children)

# Mutation
def mutation(children, mutation_rate):
    for i in range(len(children)):
        mutation_points = np.where(np.random.rand(len(children[i])) < mutation_rate)[0]
        children[i][mutation_points] = 1 - children[i][mutation_points]  # flip the selected bits
    return children

# Main loop of the genetic algorithm
def genetic_algorithm(population_size, num_genes, generations, crossover_rate, mutation_rate, model, criterion):
    # Initialize the chromosomes
    population = initialize_population(population_size, num_genes)
    fitness_list = []
    for generation in range(generations):
        print('Generation', generation + 1, ":")
        fitness = calculate_fitness(population, model, criterion)  # array of test accuracies, shape (population_size,)
        # Selection
        selected_population = selection(population, fitness)  # shape (population_size, num_genes); adjacent rows are parents
        # Crossover
        children = crossover(selected_population, crossover_rate)
        # Mutation
        mutated_children = mutation(children, mutation_rate)
        # Form the new population
        population = mutated_children
        # Report the current best individual
        best_individual = population[np.argmax(fitness)]
        fitness_list.append(fitness.max())
        print(f"Generation {generation + 1}, Best Individual: {best_individual}, Fitness: {fitness.max()}")
    plt.plot(range(generations), fitness_list)
    plt.title('Test Accuracy with feature selection via genetic algorithm')
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.show()
    # Return the best individual
    best_individual = population[np.argmax(fitness)]
    return best_individual

# Run the genetic algorithm
model = Perceptron(input_size)
best_solution = genetic_algorithm(population_size=50, num_genes=input_size, generations=10, crossover_rate=0.8, mutation_rate=0.1, model=model, criterion=criterion)
print(f"Final Best Solution: {best_solution}")

# Interpret the best solution
selected_features = np.where(best_solution == 1)[0]
print(f"Selected Features: {selected_features}")
print("Shape of Selected Features = ", selected_features.shape)

[Figure: best fitness (test accuracy) per generation with genetic-algorithm feature selection]

Notes

  1. In this task, Option 1 in the selection function (always taking the two best chromosomes as parents) is more effective than Option 2 (random, fitness-proportional selection from the population): after 10 generations, validation accuracy reaches 74% vs. 71%.
  2. In this task, initializing the population so that most features are kept (95%, Option 1 in initialize_population) is more effective than picking features uniformly at random (50%, Option 2).
  3. Every time the model structure is changed to match the dimensionality of the selected input features, the optimizer must be re-declared, because its construction captures model.parameters(); see the sketch after this list.
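
The following minimal sketch illustrates note 3. It assumes the model and input_size defined in the code above are still in scope; the chromosome here is a hypothetical stand-in for one individual produced by the genetic algorithm.

import numpy as np
import torch.nn as nn
import torch.optim as optim

# Hypothetical chromosome: a 0-1 mask over the bag-of-words features
chromosome = np.random.choice([0, 1], size=input_size)
selected_features = np.where(chromosome == 1)[0]

# Replacing model.fc creates brand-new parameter tensors ...
model.fc = nn.Linear(len(selected_features), 1)

# ... so an optimizer built before this point still references the old, discarded tensors.
# Re-declare it after every structural change, exactly as calculate_fitness does above.
optimizer = optim.Adam(model.parameters(), lr=0.001)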

5. Contact Us

Email: oceannedlg@outlook.com

This concludes the case study of IMDB movie review text classification with genetic-algorithm feature selection and a single-layer perceptron model. We hope it proves helpful to fellow programmers!



http://www.chinasem.cn/article/529850
