Hadoop for Beginners: Counting Flights per Carrier with MapReduce

2024-08-24 02:58


This example is based on Hadoop 2.7.3, running as a pseudo-distributed cluster.

I. Create a MapReduce application

The structure of the MapReduce application is shown in the figure:
[Figure: MapReduce application structure]

1. Add the Maven dependencies

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.hadoop</groupId>
    <artifactId>beginner</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>beginner</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <!-- hadoop-core is the Hadoop 1.x artifact. hadoop-client 2.7.3 already
             provides the MapReduce API, so this entry is probably unnecessary
             and may cause classpath conflicts. -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>1.2.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.3</version>
        </dependency>
        <!-- CSV parsing in the mapper -->
        <dependency>
            <groupId>au.com.bytecode</groupId>
            <artifactId>opencsv</artifactId>
            <version>2.4</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- Builds a single jar whose manifest points at the driver class -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>1.2.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.hadoop.FlightsByCarrier</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

2. MapReduce Driver code

The driver is the client through which the user interacts with the Hadoop cluster; the MapReduce Job is configured here.

package com.hadoop;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FlightsByCarrier {

    public static void main(String[] args) throws Exception {
        // Job.getInstance() replaces the deprecated new Job() constructor
        Job job = Job.getInstance();
        job.setJarByClass(FlightsByCarrier.class);
        job.setJobName("FlightsByCarrier");

        // args[0]: input path on HDFS
        TextInputFormat.addInputPath(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapperClass(FlightsByCarrierMapper.class);
        job.setReducerClass(FlightsByCarrierReducer.class);

        // args[1]: output directory on HDFS (it must not exist yet)
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Block until the job finishes; return a non-zero exit code on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

3. MapReduce Mapper code

package com.hadoop;

import au.com.bytecode.opencsv.CSVParser;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class FlightsByCarrierMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The key is the byte offset of the line within the file.
        // Offset 0 is the CSV header row, so it is skipped.
        if (key.get() > 0) {
            String[] fields = new CSVParser().parseLine(value.toString());
            // Column 8 holds the carrier code; emit (carrier, 1) for each flight record
            context.write(new Text(fields[8]), new IntWritable(1));
        }
    }
}
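To make the mapper's behaviour easier to follow without a cluster, here is a minimal plain-Java sketch of the same map step. It is not the Hadoop code: the class name `CarrierMapSketch` is hypothetical, `String.split` stands in for opencsv's `CSVParser` (assuming no quoted commas in the line), and the header and sample row in `main` are made up for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Optional;

/** Plain-Java sketch of the map step; no Hadoop classes are involved. */
final class CarrierMapSketch {

    // Index of the carrier-code column, the same index the mapper uses.
    static final int CARRIER_COLUMN = 8;

    /**
     * Mimics map(): offset is the byte offset of the line within the file,
     * which is what TextInputFormat passes to the mapper as the key.
     */
    static Optional<Map.Entry<String, Integer>> map(long offset, String line) {
        if (offset == 0) {
            return Optional.empty(); // header row: emit nothing
        }
        // Assumption: fields contain no quoted commas, so a plain split is enough here.
        String[] fields = line.split(",", -1);
        return Optional.of(new SimpleEntry<>(fields[CARRIER_COLUMN], 1));
    }

    public static void main(String[] args) {
        // Illustrative header and row (assumed layout, not copied from 2008.csv)
        String header = "Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier";
        String row = "2008,1,3,4,2003,1955,2211,2225,WN";
        System.out.println(map(0, header));                 // header is skipped
        System.out.println(map(header.length() + 1, row));  // emits the pair (WN, 1)
    }
}
```

Because the check relies on the byte offset, only the line at the very start of the file is skipped, which is why it works even when the input is divided into several splits.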

4. MapReduce Reducer code

package com.hadoop;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class FlightsByCarrierReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text token, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s emitted by the mappers for this carrier code
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        // Emit (carrier, total number of flights)
        context.write(token, new IntWritable(sum));
    }
}
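The grouping that Hadoop performs between the mapper and the reducer can be hard to picture. The following plain-Java sketch imitates the shuffle and reduce steps in memory; the class name `CarrierReduceSketch` and the sample keys are hypothetical, and a `TreeMap` is used only to approximate the sorted key order that a reducer sees.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of shuffle + reduce for (carrier, 1) pairs. */
final class CarrierReduceSketch {

    /** Groups the mapper's keys and sums the implicit value 1 for each occurrence. */
    static Map<String, Integer> countByCarrier(List<String> carriers) {
        // TreeMap keeps the keys sorted, similar to the order in which a reducer receives them.
        Map<String, Integer> totals = new TreeMap<>();
        for (String carrier : carriers) {
            totals.merge(carrier, 1, Integer::sum); // every mapper output carries the value 1
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical mapper output keys
        List<String> mapped = Arrays.asList("WN", "AA", "WN", "DL", "AA", "WN");
        System.out.println(countByCarrier(mapped)); // prints {AA=2, DL=1, WN=3}
    }
}
```

In the real job each reducer call handles one key and its iterable of values; the loop above simply folds both steps into one pass.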

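5. Build the jar with Maven in IDEA

The resulting jar is named beginner-1.0-SNAPSHOT.jar

6. Upload the jar to the Linux virtual machine

The code was written in IDEA on Windows, so the jar has to be uploaded to the Linux virtual machine.

7. Run the MapReduce driver to process the flight data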

hadoop jar beginner-1.0-SNAPSHOT.jar  /user/root/2008.csv /user/root/output/flightsCount

The job runs as follows:

18/01/09 02:29:52 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/01/09 02:29:52 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/01/09 02:29:53 INFO input.FileInputFormat: Total input paths to process : 1
18/01/09 02:29:54 INFO mapreduce.JobSubmitter: number of splits:6
18/01/09 02:29:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1515491426576_0002
18/01/09 02:29:54 INFO impl.YarnClientImpl: Submitted application application_1515491426576_0002
18/01/09 02:29:55 INFO mapreduce.Job: The url to track the job: http://slave1:8088/proxy/application_1515491426576_0002/
18/01/09 02:29:55 INFO mapreduce.Job: Running job: job_1515491426576_0002
18/01/09 02:30:01 INFO mapreduce.Job: Job job_1515491426576_0002 running in uber mode : false
18/01/09 02:30:01 INFO mapreduce.Job:  map 0% reduce 0%
18/01/09 02:30:17 INFO mapreduce.Job:  map 39% reduce 0%
18/01/09 02:30:19 INFO mapreduce.Job:  map 52% reduce 0%
18/01/09 02:30:21 INFO mapreduce.Job:  map 86% reduce 0%
18/01/09 02:30:22 INFO mapreduce.Job:  map 100% reduce 0%
18/01/09 02:30:31 INFO mapreduce.Job:  map 100% reduce 100%
18/01/09 02:30:32 INFO mapreduce.Job: Job job_1515491426576_0002 completed successfully
18/01/09 02:30:32 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=63087558
        FILE: Number of bytes written=127016400
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=689434454
        HDFS: Number of bytes written=197
        HDFS: Number of read operations=21
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=6
        Launched reduce tasks=1
        Data-local map tasks=6
        Total time spent by all maps in occupied slots (ms)=110470
        Total time spent by all reduces in occupied slots (ms)=7315
        Total time spent by all map tasks (ms)=110470
        Total time spent by all reduce tasks (ms)=7315
        Total vcore-milliseconds taken by all map tasks=110470
        Total vcore-milliseconds taken by all reduce tasks=7315
        Total megabyte-milliseconds taken by all map tasks=113121280
        Total megabyte-milliseconds taken by all reduce tasks=7490560
    Map-Reduce Framework
        Map input records=7009729
        Map output records=7009728
        Map output bytes=49068096
        Map output materialized bytes=63087588
        Input split bytes=630
        Combine input records=0
        Combine output records=0
        Reduce input groups=20
        Reduce shuffle bytes=63087588
        Reduce input records=7009728
        Reduce output records=20
        Spilled Records=14019456
        Shuffled Maps =6
        Failed Shuffles=0
        Merged Map outputs=6
        GC time elapsed (ms)=6818
        CPU time spent (ms)=38010
        Physical memory (bytes) snapshot=1807056896
        Virtual memory (bytes) snapshot=13627478016
        Total committed heap usage (bytes)=1370488832
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=689433824
    File Output Format Counters
        Bytes Written=197
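
A few of these counters can be cross-checked with simple arithmetic. The sketch below copies four values from the log and derives three ratios. The class name `CounterCheck` is hypothetical, and the interpretations in the comments are my reading of the numbers rather than anything the log states.

```java
/** Arithmetic cross-check of counters copied from the job log above. */
final class CounterCheck {

    static final long MAP_INPUT_RECORDS = 7_009_729L;
    static final long MAP_OUTPUT_RECORDS = 7_009_728L;
    static final long MAP_OUTPUT_BYTES = 49_068_096L;
    static final long SPILLED_RECORDS = 14_019_456L;

    /** Input lines for which the mapper emitted nothing. */
    static long skippedRecords() {
        return MAP_INPUT_RECORDS - MAP_OUTPUT_RECORDS;
    }

    /** Average serialized size of one (carrier, 1) pair. */
    static long bytesPerOutputRecord() {
        return MAP_OUTPUT_BYTES / MAP_OUTPUT_RECORDS;
    }

    /** How many times, on average, each map output record was spilled. */
    static long spillsPerRecord() {
        return SPILLED_RECORDS / MAP_OUTPUT_RECORDS;
    }

    public static void main(String[] args) {
        // 1: consistent with the single header row skipped by the mapper
        System.out.println("skipped records   = " + skippedRecords());
        // 7: consistent with a 2-character Text key plus a 4-byte IntWritable
        System.out.println("bytes per record  = " + bytesPerOutputRecord());
        // 2: consistent with one spill on the map side and one on the reduce side
        System.out.println("spills per record = " + spillsPerRecord());
    }
}
```

Combine input records=0 shows that no combiner was configured. Since the reducer only sums integers, registering it with `job.setCombinerClass(FlightsByCarrierReducer.class)` in the driver would probably cut the shuffle volume considerably, although that was not tried in this run.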

8. View the flight counts

hadoop fs -cat /user/root/output/flightsCount/part-r-00000

The result is as follows:

9E  262208
AA  604885
AQ  7800
AS  151102
B6  196091
CO  298455
DL  451931
EV  280575
F9  95762
FL  261684
HA  61826
MQ  490693
NW  347652
OH  197607
OO  567159
UA  449515
US  453589
WN  1201754
XE  374510
YV  254930
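
As a sanity check, the per-carrier totals should add up to the number of records the mappers emitted. The sketch below hard-codes the 20 counts from the output above and compares their sum with the Map output records counter from the job log (7009728). The class name `ResultCheck` is hypothetical.

```java
/** Verifies that the per-carrier counts add up to the map output record count. */
final class ResultCheck {

    // Counts copied from part-r-00000, in the order listed above (9E ... YV)
    static final long[] COUNTS = {
        262208, 604885, 7800, 151102, 196091, 298455, 451931, 280575, 95762, 261684,
        61826, 490693, 347652, 197607, 567159, 449515, 453589, 1201754, 374510, 254930
    };

    // "Map output records" from the job counters
    static final long MAP_OUTPUT_RECORDS = 7_009_728L;

    static long total() {
        long sum = 0;
        for (long count : COUNTS) {
            sum += count;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("carriers = " + COUNTS.length);
        System.out.println("total    = " + total());
        System.out.println("matches  = " + (total() == MAP_OUTPUT_RECORDS));
    }
}
```

The 20 rows also agree with the Reduce input groups and Reduce output records counters, which both report 20.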

References:
1. Hadoop For Dummies
