CIC-DDoS2019-Detection

2024-06-19 03:44
Tags: cic ddos2019 detection

This article introduces CIC-DDoS2019-Detection and aims to serve as a practical reference for developers working on the same problem.

CIC-DDoS2019

This article runs detection experiments on the CIC-DDoS2019 dataset and covers:

  • Data cleaning and merging
  • Machine learning models
  • Deep learning models
  • PCA and t-SNE analysis
  • Visualization of the data and results

Code: [daetz-coder](https://github.com/daetz-coder/CIC-DDoS2019-Detection)

1. Dataset Loading

The CSV files used here are from the CIC-DDoS2019 dataset on Kaggle (kaggle.com).

image-20240618203852993

Link: https://pan.baidu.com/s/1gP86I08ZQhAOgcfCd5OVVw?pwd=2019
Extraction code: 2019

2. Data Splitting

import os
import pandas as pd

# Directory containing the per-class CSV files
directory = 'class_split'  # replace with your own path

# List all CSV files in the directory
csv_files = [f for f in os.listdir(directory) if f.endswith('.csv')]

# Read each CSV file and print its row count
for csv_file in csv_files:
    file_path = os.path.join(directory, csv_file)
    try:
        data = pd.read_csv(file_path)
        num_rows = len(data)
        print(f"{csv_file}: {num_rows} rows")
    except Exception as e:
        print(f"Could not read {csv_file}: {e}")
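The `class_split` directory itself can be produced by splitting the raw capture CSV on its label column. A minimal sketch, assuming a single source frame with a ` Label` column (the CIC-DDoS2019 CSVs prefix column names with a space; the tiny inline frame below is a hypothetical stand-in for `pd.read_csv('raw.csv')`):

```python
import os
import pandas as pd

# Hypothetical stand-in for the raw merged capture CSV
raw = pd.DataFrame({
    ' Flow Duration': [10, 20, 30, 40],
    ' Label': ['BENIGN', 'Syn', 'BENIGN', 'Syn'],
})

os.makedirs('class_split', exist_ok=True)

# Write one CSV per class, named after the label value
for label, group in raw.groupby(' Label'):
    group.to_csv(os.path.join('class_split', f'{label}.csv'), index=False)

print(sorted(os.listdir('class_split')))
```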

image-20240618205624924

image-20240618205239085

3. Data Visualization

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the data
file_path = './class_split/WebDDoS.csv'
data = pd.read_csv(file_path)

# Set the plot style
sns.set(style="whitegrid")

# Create a 1x2 figure
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))

# Scatter plot: flow duration vs. forward packet count
sns.scatterplot(ax=axes[0], x=data[' Flow Duration'], y=data[' Total Fwd Packets'], color='blue')
axes[0].set_title('Flow Duration vs Total Fwd Packets')
axes[0].set_xlabel('Flow Duration')
axes[0].set_ylabel('Total Fwd Packets')

# Box plot: distribution of forward and backward packet counts
sns.boxplot(data=data[[' Total Fwd Packets', ' Total Backward Packets']], ax=axes[1])
axes[1].set_title('Distribution of Packet Counts')
axes[1].set_ylabel('Packet Counts')

plt.tight_layout()
plt.show()

image-20240618205846107

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load the data
file_path = './class_split/WebDDoS.csv'
data = pd.read_csv(file_path)

# Convert the timestamp column to datetime
data[' Timestamp'] = pd.to_datetime(data[' Timestamp'])

# Keep only the numeric columns for the correlation matrix
numeric_data = data.select_dtypes(include=[np.number])

# Set the plot style
sns.set(style="whitegrid")

# Create a 1x2 figure
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))

# Time series: packet count over time
data.sort_values(' Timestamp', inplace=True)
data['Packet Count'] = data[' Total Fwd Packets'] + data[' Total Backward Packets']
data.plot(x=' Timestamp', y='Packet Count', ax=axes[0], title='Packet Count Over Time')

# Heatmap: correlations between the numeric features
correlation_matrix = numeric_data.corr()
sns.heatmap(correlation_matrix, ax=axes[1])
axes[1].set_title('Feature Correlation Heatmap')

plt.tight_layout()
plt.show()

image-20240618205929292

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
file_path = './class_split/WebDDoS.csv'
data = pd.read_csv(file_path)

# Set the plot style
sns.set(style="whitegrid")

# Create a 1x2 figure
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(18, 9))

# Violin plot: forward vs. backward packet counts
sns.violinplot(data=data[[' Total Fwd Packets', ' Total Backward Packets']], ax=axes[0])
axes[0].set_title('Violin Plot of Packet Sizes')

# Top 5 most frequent source and destination ports
top_src_ports = data[' Source Port'].value_counts().nlargest(5)
top_dst_ports = data[' Destination Port'].value_counts().nlargest(5)

# Pie chart: counts of the top 5 source ports
axes[1].pie(top_src_ports, labels=top_src_ports.index, autopct='%1.1f%%', startangle=140)
axes[1].set_title('Pie Chart of Top 5 Source Ports')

plt.tight_layout()
plt.show()

violin_pie

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Load the data
file_path = './class_split/WebDDoS.csv'
data = pd.read_csv(file_path)

# Clean the data: coerce to numeric and zero out inf / NaN values
data['Flow Bytes/s'] = pd.to_numeric(data['Flow Bytes/s'], errors='coerce').replace([np.inf, -np.inf], np.nan).fillna(0)

# Select a few numeric variables to analyze
selected_columns = [' Flow Duration', ' Total Fwd Packets', ' Total Backward Packets', 'Flow Bytes/s']
selected_data = data[selected_columns]

# Set the plot style
sns.set(style="whitegrid")

# Pair plot with distributions on the diagonal
pair_plot = sns.pairplot(selected_data)
pair_plot.fig.suptitle("Pair Plot of Selected Features", y=1.02)  # overall title, nudged above the grid
plt.savefig("pair_plot.png")
plt.show()

pair_plot

4. Data Merging

import pandas as pd
import os

# Directory containing the per-class files
directory = './class_split/'

# Files to merge
files = ['BENIGN.csv', 'DrDoS_DNS.csv', 'DrDoS_LDAP.csv', 'DrDoS_MSSQL.csv',
         'DrDoS_NTP.csv', 'DrDoS_NetBIOS.csv', 'DrDoS_SNMP.csv', 'DrDoS_UDP.csv',
         'LDAP.csv', 'MSSQL.csv', 'NetBIOS.csv', 'Portmap.csv',
         'Syn.csv', 'TFTP.csv', 'UDP.csv', 'UDP-lag.csv']

# Empty DataFrame to collect the samples
combined_data = pd.DataFrame()

# Sample 500 random rows from each file (assumes each file has at least 500 rows)
for file in files:
    file_path = os.path.join(directory, file)
    data = pd.read_csv(file_path)
    sample_data = data.sample(n=500, random_state=1)
    combined_data = pd.concat([combined_data, sample_data], ignore_index=True)

# Save to a new CSV file
combined_data.to_csv('./combined_data.csv', index=False)
print("Merge complete; saved to combined_data.csv")

500 samples are drawn for each class and written to combined_data.csv.

[Note: the CSV provided here is enough for simple training; download the official dataset if you need more data.]

5. Machine Learning
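The snippets in this section assume that X_train, X_test, y_train, y_test already exist. The original post does not show how they were built, so here is a minimal preprocessing sketch; the tiny synthetic frame stands in for pd.read_csv('./combined_data.csv'), and the column names are assumptions based on the dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Synthetic stand-in for combined_data.csv
rng = np.random.default_rng(0)
data = pd.DataFrame({
    ' Flow Duration': rng.integers(0, 10_000, 200),
    ' Total Fwd Packets': rng.integers(1, 50, 200),
    ' Total Backward Packets': rng.integers(0, 50, 200),
    ' Label': rng.choice(['BENIGN', 'Syn', 'UDP'], 200),
})

# Encode string labels to integers and standardize the features
y = LabelEncoder().fit_transform(data[' Label'])
X = StandardScaler().fit_transform(data.drop(columns=[' Label']))

# Stratified 80/20 split keeps the class proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(X_train.shape, X_test.shape)
```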

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the models
logreg = LogisticRegression(max_iter=1000)
rf = RandomForestClassifier(n_estimators=100)
svm = SVC()
xgb = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')

Logistic

# Train the logistic regression model
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)
print("Logistic Regression Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred_logreg) * 100))

Random Forest

# Train the random forest model
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred_rf) * 100))

SVM

# Train the support vector machine
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
print("SVM Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred_svm) * 100))

XGBoost

# Train the XGBoost model
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
print("XGBoost Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred_xgb) * 100))

# Print a classification report (XGBoost as the example)
print("\nClassification Report for XGBoost:")
print(classification_report(y_test, y_pred_xgb))
Logistic Regression Accuracy: 54.96%
Random Forest Accuracy: 62.04%
SVM Accuracy: 50.17%
XGBoost Accuracy: 62.75%

Classification Report for XGBoost:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       170
           1       0.50      0.42      0.45       143
           2       0.31      0.25      0.28       174
           3       0.56      0.52      0.54       159
           4       0.99      0.99      0.99       145
           5       0.45      0.42      0.43       146
           6       0.60      0.65      0.63       148
           7       0.46      0.55      0.50       121
           8       0.36      0.46      0.40       144
           9       0.54      0.56      0.55       156
          10       0.38      0.40      0.39       154
          11       0.40      0.44      0.42       146
          12       0.99      0.98      0.99       150
          13       1.00      0.97      0.99       158
          14       0.51      0.49      0.50       130
          15       0.92      0.90      0.91       156

    accuracy                           0.63      2400
   macro avg       0.62      0.62      0.62      2400
weighted avg       0.63      0.63      0.63      2400
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import seaborn as sns
import matplotlib.pyplot as plt

# Initialize the models
logreg = LogisticRegression(max_iter=1000)
rf = RandomForestClassifier(n_estimators=100)
svm = SVC()

# Train the models
logreg.fit(X_train, y_train)
rf.fit(X_train, y_train)
svm.fit(X_train, y_train)

# Predictions
y_pred_logreg = logreg.predict(X_test)
y_pred_rf = rf.predict(X_test)
y_pred_svm = svm.predict(X_test)

# Confusion matrices
cm_logreg = confusion_matrix(y_test, y_pred_logreg)
cm_rf = confusion_matrix(y_test, y_pred_rf)
cm_svm = confusion_matrix(y_test, y_pred_svm)

# Plot the confusion matrices as heatmaps
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(18, 6))

sns.heatmap(cm_logreg, annot=True, fmt="d", ax=axes[0], cmap='Blues')
axes[0].set_title('Logistic Regression Confusion Matrix')
axes[0].set_xlabel('Predicted labels')
axes[0].set_ylabel('True labels')

sns.heatmap(cm_rf, annot=True, fmt="d", ax=axes[1], cmap='Blues')
axes[1].set_title('Random Forest Confusion Matrix')
axes[1].set_xlabel('Predicted labels')
axes[1].set_ylabel('True labels')

sns.heatmap(cm_svm, annot=True, fmt="d", ax=axes[2], cmap='Blues')
axes[2].set_title('SVM Confusion Matrix')
axes[2].set_xlabel('Predicted labels')
axes[2].set_ylabel('True labels')

plt.tight_layout()
plt.savefig("confusion.png")
plt.show()

confusion

6. PCA and t-SNE

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler, LabelEncoder
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Load the data
data = pd.read_csv('./combined_data.csv')

# Drop columns that should not be used as features, e.g. timestamps or IP addresses
data.drop([' Timestamp'], axis=1, inplace=True)

# Encode the class labels as integers
label_encoder = LabelEncoder()
data[' Label'] = label_encoder.fit_transform(data[' Label'])

# Handle infinities and extremely large values
data.replace([np.inf, -np.inf], np.nan, inplace=True)  # replace inf with NaN
data.fillna(data.median(), inplace=True)  # fill NaN with the median (computed after inf removal)

# Standardize the features (the label column is excluded)
scaler = StandardScaler()
X = scaler.fit_transform(data.drop(' Label', axis=1))
y = data[' Label']

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

# Visualize PCA
plt.figure(figsize=(8, 8))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.5)
plt.title('PCA of Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar()
plt.show()

# Visualize t-SNE
plt.figure(figsize=(8, 8))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', alpha=0.5)
plt.title('t-SNE of Dataset')
plt.xlabel('t-SNE Feature 1')
plt.ylabel('t-SNE Feature 2')
plt.colorbar()
plt.show()
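A 2-D PCA scatter is only as informative as the variance the first two components actually retain, so it is worth printing explained_variance_ratio_ before reading too much into the plot. A small illustration on synthetic correlated data (standing in for the scaled X above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Multiplying by a mixing matrix makes the 5 features correlated
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

pca = PCA(n_components=2).fit(X)
ratios = pca.explained_variance_ratio_  # fraction of total variance per component
print(ratios, ratios.sum())
```

If the two ratios sum to a small fraction, the 2-D projection is hiding most of the structure and the scatter plot should be interpreted cautiously.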

image-20240618211011385

image-20240618211021915

7. Deep Learning

MLP
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Define the model
class NeuralNetwork(nn.Module):
    def __init__(self, input_size, num_classes):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(input_size, 64)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(64, 64)
        self.output_layer = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.relu(self.layer1(x))
        x = self.relu(self.layer2(x))
        x = self.output_layer(x)
        return x

# Initialize the model
input_size = X_train.shape[1]
num_classes = len(np.unique(y))
model = NeuralNetwork(input_size, num_classes)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
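The original post stops at the model and optimizer; a minimal training-loop sketch follows. The tensor shapes, batch size, and epoch count are assumptions for illustration, and the random tensors stand in for the standardized features and encoded labels:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Tiny synthetic stand-ins for X_train / y_train (10 features, 3 classes)
X_train_t = torch.randn(64, 10)
y_train_t = torch.randint(0, 3, (64,))

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 3))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

loader = DataLoader(TensorDataset(X_train_t, y_train_t), batch_size=16, shuffle=True)

# Standard loop: forward pass, loss, backward pass, parameter update
for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

print(float(loss))
```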

image-20240618211132917

image-20240618211140162

CNN
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Define the model
class CNN(nn.Module):
    def __init__(self, input_size, num_classes):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv1d(1, 16, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv1d(16, 32, kernel_size=3, stride=1, padding=1)
        # Work out the flattened size after the conv/pool stack
        conv1_out_size = (input_size + 2 * 1 - 3) / 1 + 1      # Conv1
        pool1_out_size = conv1_out_size / 2                    # Pool1
        conv2_out_size = (pool1_out_size + 2 * 1 - 3) / 1 + 1  # Conv2
        pool2_out_size = conv2_out_size / 2                    # Pool2
        final_size = int(pool2_out_size) * 32  # conv2 output channels * output length
        self.fc = nn.Linear(final_size, num_classes)

    def forward(self, x):
        x = x.unsqueeze(1)  # add a channel dimension
        x = self.relu(self.conv1(x))
        x = self.pool(x)
        x = self.relu(self.conv2(x))
        x = self.pool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x

# Initialize the model
input_size = X_train.shape[1]
num_classes = len(np.unique(y))
model = CNN(input_size, num_classes)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
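The size arithmetic in CNN.__init__ can be sanity-checked by pushing a dummy batch through the same conv/pool stack; here for a hypothetical 20-feature input:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 1, 20)  # (batch, channels, feature length)
# Conv1d with kernel 3 and padding 1 preserves length; MaxPool1d(2, 2) halves it
x = nn.MaxPool1d(2, 2)(nn.Conv1d(1, 16, 3, padding=1)(x))   # length 20 -> 20 -> 10
x = nn.MaxPool1d(2, 2)(nn.Conv1d(16, 32, 3, padding=1)(x))  # length 10 -> 10 -> 5
flat = torch.flatten(x, 1)
print(flat.shape)  # 32 channels * length 5 = 160 features per sample
```

This matches final_size = 32 * (20 // 4) = 160, the in_features the fully connected layer expects.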

image-20240618211231779

image-20240618211238465

That wraps up this article on CIC-DDoS2019-Detection; hopefully it proves helpful.



http://www.chinasem.cn/article/1073913
