MapReduce 实践题：Web 访问日志分析与异常检测

本文主要是介绍MapReduce 实践题：Web 访问日志分析与异常检测，希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

文章目录

- 作业描述
- - MapReduce 实践题：Web 访问日志分析与异常检测
  - - 题目背景
    - 数据集说明
    - 任务要求
    - 输入数据示例
    - 输出数据示例
    - 实现步骤
  - 解题思路
  - 1. 数据预处理
  - 2. 访问统计
  - 3. 异常检测
  - 4. 主方法
  - 5. 结果输出

作业描述

MapReduce 实践题：Web 访问日志分析与异常检测

题目背景

你被要求设计和实现一个基于 MapReduce 的大规模 Web 访问日志分析与异常检测系统。该系统的目标是从每日数百万条访问日志中提取有用的信息，并检测出潜在的异常访问行为。访问日志文件格式如下：

127.0.0.1 - - [10/Oct/2021:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 1043
192.168.0.1 - - [10/Oct/2021:13:56:12 -0700] "POST /login HTTP/1.1" 200 2326
...

数据集说明

IP 地址：例如，127.0.0.1。
时间戳：例如，[10/Oct/2021:13:55:36 -0700]。
请求方法：例如，"GET" 或 "POST"。
请求 URL：例如，"/index.html"。
HTTP 响应码：例如，200、404 或 500。
响应大小：例如，1043。

任务要求

数据预处理：
- 解析每条日志记录，提取以下字段：IP 地址、请求时间、请求方法、请求 URL、HTTP 响应码、响应大小。
- 将解析后的数据格式化为结构化格式（例如，JSON）。
访问统计：
- 统计每个 IP 地址在一天中的访问次数。
- 统计每个请求 URL 在一天中的访问次数。
异常检测：
- 检测异常高的访问频率：对于每个 IP 地址，计算访问次数的平均值和标准差，标记访问次数超过均值加三倍标准差的 IP 地址。
- 检测潜在的恶意请求：检测 HTTP 响应码为 4xx 和 5xx 的请求，统计每个 IP 地址的异常请求次数，并标记异常请求次数占总请求次数比例超过 20% 的 IP 地址。
结果输出：
- 输出访问统计结果：每个 IP 地址的访问次数，每个请求 URL 的访问次数。
- 输出异常检测结果：异常高访问频率的 IP 地址及其访问次数，潜在的恶意请求 IP 地址及其异常请求次数和总请求次数的比例。

输入数据示例

127.0.0.1 - - [10/Oct/2021:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 1043
192.168.0.1 - - [10/Oct/2021:13:56:12 -0700] "POST /login HTTP/1.1" 200 2326
...

输出数据示例

访问统计结果：

IP访问次数:
127.0.0.1  150
192.168.0.1  200URL访问次数:
/index.html  300
/login  400

异常检测结果：

异常高访问频率 IP:
192.168.0.1  1200潜在恶意请求 IP:
127.0.0.1  50  25.0%

实现步骤

数据预处理 Mapper：
- 解析日志记录，提取必要字段并输出结构化数据。
访问统计 Mapper 和 Reducer：
- Mapper：统计每个 IP 地址和每个 URL 的访问次数。
- Reducer：汇总每个 IP 地址和每个 URL 的访问次数。
异常检测 Mapper 和 Reducer：
- Mapper：计算每个 IP 地址的访问次数，检测 HTTP 响应码为 4xx 和 5xx 的请求。
- Reducer：计算每个 IP 地址访问次数的均值和标准差，标记异常高访问频率的 IP 地址；统计每个 IP 地址的异常请求次数并计算异常请求比例，标记潜在的恶意请求 IP 地址。

解题思路

数据预处理 Mapper：解析日志记录，提取必要字段并输出结构化数据。
访问统计 Mapper 和 Reducer：统计每个 IP 地址和每个 URL 的访问次数。
异常检测 Mapper 和 Reducer：计算每个 IP 地址的访问次数，检测 HTTP 响应码为 4xx 和 5xx 的请求。
主方法：设置三个 MapReduce 作业：数据预处理、访问统计和异常检测。

1. 数据预处理

PreprocessMapper

package org.example.mapreduce.t1;import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;import java.io.IOException;/*** @author 撕得失败的标签* @version 1.0* @description: 数据预处理* @date 2024/6/22 22:35*/public class PreprocessMapper extends Mapper<LongWritable, Text, Text, Text> {/*** @description:* 1. 解析每条日志记录，提取以下字段：IP 地址、请求时间、请求方法、请求 URL、HTTP 响应码、响应大小。* 2. 将解析后的数据格式化为结构化格式（例如，JSON）。* @author 撕得失败的标签* @date 2024/6/23 11:09*/@Overrideprotected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {String line = value.toString();String[] strings = line.split(" ");if (strings.length == 10) {// 提取匹配到的字段String ipAddress = strings[0];String timestamp = strings[3] + " " + strings[4];String requestMethod = strings[5];String requestUrl = strings[6];String httpStatusCode = strings[8];String responseSize = strings[9];context.write(new Text(ipAddress), new Text(timestamp + "," + requestMethod + "," + requestUrl + "," + httpStatusCode + "," + responseSize));}}
}

2. 访问统计

AccessStatistics

package org.example.mapreduce.t1;import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;import java.io.IOException;/*** @author 撕得失败的标签* @version 1.0* @description: 访问统计* @date 2024/6/22 22:55*/
public class AccessStatistics {/*** @description:* 1. 统计每个 IP 地址在一天中的访问次数。* 2. 统计每个请求 URL 在一天中的访问次数。* @author 撕得失败的标签* @date 2024/6/23 11:08*/public static class Map extends Mapper<LongWritable, Text, Text, LongWritable> {@Overrideprotected void map(LongWritable key, Text value, Context context) throws InterruptedException, IOException {String line = value.toString();String[] strings = line.split(" ");// 统计一天的，以 20/Jun/2024 为例if (strings[3].contains("20/Jun/2024")) {// IPcontext.write(new Text(strings[0]), new LongWritable(1));// URLcontext.write(new Text(strings[6]), new LongWritable(1));}}}public static class Reduce extends Reducer<Text, LongWritable, Text, LongWritable> {@Overrideprotected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {long sum = 0;for (LongWritable value : values) {sum += value.get();}context.write(key, new LongWritable(sum));}}
}

3. 异常检测

AnomalyDetection

package org.example.mapreduce.t1;import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;import java.io.IOException;
import java.util.HashMap;/*** @author 撕得失败的标签* @version 1.0* @description: 异常检测* @date 2024/6/23 11:08*/
public class AnomalyDetection {/*** @description:* 1. 检测异常高的访问频率：对于每个 IP 地址，计算访问次数的平均值和标准差，标记访问次数超过均值加三倍标准差的 IP 地址。* 2. 检测潜在的恶意请求：检测 HTTP 响应码为 4xx 和 5xx 的请求，*    统计每个 IP 地址的异常请求次数，*    并标记异常请求次数占总请求次数比例超过 20% 的 IP 地址。* @author 撕得失败的标签* @date 2024/6/23 11:08*/public static class Map extends Mapper<LongWritable, Text, Text, LongWritable> {@Overrideprotected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {String[] strings = value.toString().split(" ");String ip = strings[0];context.write(new Text(ip), new LongWritable(1));String httpStatusCode = strings[8];if (httpStatusCode.startsWith("4") || httpStatusCode.startsWith("5")) {String anomaly = "+" + ip;context.write(new Text(anomaly), new LongWritable(1));}}}public static class Reduce extends Reducer<Text, LongWritable, Text, LongWritable> {private final HashMap<String, Long> ipToCount = new HashMap<String, Long>();private final HashMap<String, Long> ipToAnomalyCount = new HashMap<String, Long>();@Overrideprotected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {long sum = 0;for (LongWritable value : values) {sum += value.get();}
//            context.write(key, new LongWritable(sum));String ip = key.toString();if (ip.startsWith("+")) {ip = ip.substring(1);ipToAnomalyCount.put(ip, sum);}ipToCount.put(ip, sum);}@Overrideprotected void cleanup(Context context) throws IOException, InterruptedException {// 实现异常检测的逻辑long sum = 0;for (String k : ipToCount.keySet()) {sum += ipToCount.get(k);}double avg = (double) (sum / ipToCount.size());double std = 0;for (String k : ipToCount.keySet()) {std += Math.pow(ipToCount.get(k) - avg, 2);}// 异常高访问频率 IPfor (String k : ipToCount.keySet()) {if (ipToCount.get(k) > avg + 3 * std) {context.write(new Text(k), new LongWritable(ipToCount.get(k)));}}// 潜在恶意请求 IPfor (String k : ipToAnomalyCount.keySet()) {double anomaly = (double) ipToAnomalyCount.get(k) / ipToCount.get(k);if (anomaly > 0.2) {context.write(new Text(k + "\t" + String.format("%.1f", anomaly * 100) + "%"), new LongWritable(ipToAnomalyCount.get(k)));}}}}
}

4. 主方法

Main

package org.example.mapreduce.t1;import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;import java.io.IOException;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.LinkedList;
import java.util.List;/*** @author 撕得失败的标签* @version 1.0* @description: 主方法* @date 2024/6/22 22:34*/
public class Main {public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {// 创建配置信息Configuration conf = new Configuration();conf.set("fs.default.name","hdfs://hadoop102:9000");// 1. 数据预处理 PreprocessMapperJob preprocessJob = Job.getInstance(conf, "preprocess job");preprocessJob.setJarByClass(Main.class);preprocessJob.setMapperClass(PreprocessMapper.class);preprocessJob.setOutputKeyClass(Text.class);preprocessJob.setOutputValueClass(Text.class);FileInputFormat.addInputPath(preprocessJob, new Path("/m1"));FileSystem fs = FileSystem.get(conf);Path outPath = new Path("/t1/preprocess");if(fs.exists(outPath)) {fs.delete(outPath, true);}FileOutputFormat.setOutputPath(preprocessJob, outPath);preprocessJob.waitForCompletion(true);// 2. 访问统计 AccessStatisticsJob accessStatisticsJob = Job.getInstance(conf, "access statistics job");accessStatisticsJob.setJarByClass(Main.class);accessStatisticsJob.setMapperClass(AccessStatistics.Map.class);accessStatisticsJob.setReducerClass(AccessStatistics.Reduce.class);accessStatisticsJob.setOutputKeyClass(Text.class);accessStatisticsJob.setOutputValueClass(LongWritable.class);FileInputFormat.addInputPath(accessStatisticsJob, new Path("/m1"));FileSystem fs1 = FileSystem.get(conf);Path outPath1 = new Path("/t1/statistics");if(fs1.exists(outPath1)) {fs1.delete(outPath1, true);}FileOutputFormat.setOutputPath(accessStatisticsJob, outPath1);accessStatisticsJob.waitForCompletion(true);// 3. 异常检测 AnomalyDetectionJob anomalyDetectionJob = Job.getInstance(conf, "anomaly detection job");anomalyDetectionJob.setJarByClass(Main.class);anomalyDetectionJob.setMapperClass(AnomalyDetection.Map.class);anomalyDetectionJob.setReducerClass(AnomalyDetection.Reduce.class);anomalyDetectionJob.setOutputKeyClass(Text.class);anomalyDetectionJob.setOutputValueClass(LongWritable.class);FileInputFormat.addInputPath(anomalyDetectionJob, new Path("/m1"));FileSystem fs2 = FileSystem.get(conf);Path outPath2 = new Path("/t1/anomaly");if(fs2.exists(outPath2)) {fs2.delete(outPath2, true);}FileOutputFormat.setOutputPath(anomalyDetectionJob, outPath2);anomalyDetectionJob.waitForCompletion(true);// 4. 输出结果 Output// 访问统计结果：FileSystem fs3 = FileSystem.get(conf);Path outPath3 = new Path("/t1/statistics/part-r-00000");BufferedReader br = new BufferedReader(new InputStreamReader(fs3.open(outPath3)));List<String> ip = new LinkedList<String>();List<String> url = new LinkedList<String>();String line;while ((line = br.readLine()) != null) {if (line.startsWith("/")) {url.add(line);} else {ip.add(line);}}// IP访问次数:System.out.println("\nIP访问次数:");for (String s : ip) {System.out.println(s);}// URL访问次数:System.out.println("\nURL访问次数:");for (String s : url) {System.out.println(s);}// 异常检测结果：FileSystem fs4 = FileSystem.get(conf);Path outPath4 = new Path("/t1/anomaly/part-r-00000");BufferedReader br1 = new BufferedReader(new InputStreamReader(fs4.open(outPath4)));List<String> potential = new LinkedList<String>();List<String> anomaly = new LinkedList<String>();String line1;while ((line1 = br1.readLine()) != null) {String[] strings = line1.split("\t");if (strings.length == 2) {anomaly.add(line1);} else {potential.add(line1);}}// 异常高访问频率 IP:System.out.println("\n异常高访问频率 IP:");if (anomaly.size() == 0) {System.out.println("无");} else {for (String s : anomaly) {System.out.println(s);}}// 潜在恶意请求 IP:System.out.println("\n潜在异常高访问频率 IP:");if (potential.size() == 0) {System.out.println("无");} else {for (String s : potential) {String[] strings = s.split("\t");System.out.println(strings[0] + "\t" + strings[2] + "\t" + strings[1]);}}}
}

5. 结果输出

IP访问次数:
10.0.0.1	334003
10.0.0.2	334350
10.0.0.3	333056
10.0.0.4	333947
10.0.0.5	333263
127.0.0.1	332347
127.0.0.2	333025
127.0.0.3	332450
127.0.0.4	333005
127.0.0.5	333428
192.168.0.1	334054
192.168.0.2	332883
192.168.0.3	333681
192.168.0.4	333133
192.168.0.5	333375URL访问次数:
/cart	713975
/checkout	713453
/contact	715382
/home.html	712570
/index.html	715544
/login	714255
/products	714821异常高访问频率 IP:
无潜在异常高访问频率 IP:
192.168.0.2	222498	66.8%
192.168.0.1	222165	66.5%
127.0.0.5	221778	66.5%
192.168.0.4	222096	66.7%
127.0.0.4	222156	66.7%
192.168.0.3	222227	66.6%
192.168.0.5	222070	66.6%
10.0.0.4	222243	66.6%
10.0.0.3	221966	66.6%
10.0.0.5	222347	66.7%
10.0.0.2	222664	66.6%
10.0.0.1	222493	66.6%
127.0.0.3	221464	66.6%
127.0.0.2	222197	66.7%
127.0.0.1	221702	66.7%Process finished with exit code 0

这篇关于MapReduce 实践题：Web 访问日志分析与异常检测的文章就介绍到这儿，希望我们推荐的文章对编程师们有所帮助！

MapReduce 实践题：Web 访问日志分析与异常检测

文章目录

作业描述

MapReduce 实践题：Web 访问日志分析与异常检测

题目背景

数据集说明

任务要求

输入数据示例

输出数据示例

实现步骤

解题思路

1. 数据预处理

2. 访问统计

3. 异常检测

4. 主方法

5. 结果输出

相关文章

Spring Security简介、使用与最佳实践

java.sql.SQLTransientConnectionException连接超时异常原因及解决方案

防止Linux rm命令误操作的多场景防护方案与实践

Python中 try / except / else / finally 异常处理方法详解

C++统计函数执行时间的最佳实践

PHP应用中处理限流和API节流的最佳实践

SpringBoot日志级别与日志分组详解

ShardingProxy读写分离之原理、配置与实践过程

深入浅出Spring中的@Autowired自动注入的工作原理及实践应用

Debian 13升级后网络转发等功能异常怎么办? 并非错误而是管理机制变更