Hadoop入门实践之从WordCount程序说起

本文主要是介绍Hadoop入门实践之从WordCount程序说起，希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

这段时间需要学习Hadoop了，以前一直听说Hadoop，但是从来没有研究过，这几天粗略看完了《Hadoop实战》这本书，对Hadoop编程有了大致的了解。接下来就是多看多写了。以Hadoop自带的例子WordCount程序开始，来记录我的Hadoop学习过程。

Hadoop自带例子WordCount.java

[java]  view plain copy  
 /** 
  *  Licensed under the Apache License, Version 2.0 (the "License"); 
  *  you may not use this file except in compliance with the License. 
  *  You may obtain a copy of the License at 
  * 
  *      http://www.apache.org/licenses/LICENSE-2.0 
  * 
  *  Unless required by applicable law or agreed to in writing, software 
  *  distributed under the License is distributed on an "AS IS" BASIS, 
  *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
  *  See the License for the specific language governing permissions and 
  *  limitations under the License. 
  */  
   
   
 package org.apache.hadoop.examples;  
   
 import java.io.IOException;  
 import java.util.StringTokenizer;  
   
 import org.apache.hadoop.conf.Configuration;  
 import org.apache.hadoop.fs.Path;  
 import org.apache.hadoop.io.IntWritable;  
 import org.apache.hadoop.io.Text;  
 import org.apache.hadoop.mapreduce.Job;  
 import org.apache.hadoop.mapreduce.Mapper;  
 import org.apache.hadoop.mapreduce.Reducer;  
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
 import org.apache.hadoop.util.GenericOptionsParser;  
   
 public class WordCount {  
   
   public static class TokenizerMapper   
        extends Mapper<Object, Text, Text, IntWritable>{  
       
     private final static IntWritable one = new IntWritable(1);  
     private Text word = new Text();  
         
     public void map(Object key, Text value, Context context  
                     ) throws IOException, InterruptedException {  
       StringTokenizer itr = new StringTokenizer(value.toString());  
       while (itr.hasMoreTokens()) {  
         word.set(itr.nextToken());  
         context.write(word, one);  
       }  
     }  
   }  
     
   public static class IntSumReducer   
        extends Reducer<Text,IntWritable,Text,IntWritable> {  
     private IntWritable result = new IntWritable();  
   
     public void reduce(Text key, Iterable<IntWritable> values,   
                        Context context  
                        ) throws IOException, InterruptedException {  
       int sum = 0;  
       for (IntWritable val : values) {  
         sum += val.get();  
       }  
       result.set(sum);  
       context.write(key, result);  
     }  
   }  
   
   public static void main(String[] args) throws Exception {  
     Configuration conf = new Configuration();  
     String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();  
     if (otherArgs.length != 2) {  
       System.err.println("Usage: wordcount <in> <out>");  
       System.exit(2);  
     }  
     Job job = new Job(conf, "word count");  
     job.setJarByClass(WordCount.class);  
     job.setMapperClass(TokenizerMapper.class);  
     job.setReducerClass(IntSumReducer.class);  
     job.setOutputKeyClass(Text.class);  
     job.setOutputValueClass(IntWritable.class);  
     FileInputFormat.addInputPath(job, new Path(otherArgs[0]));  
     FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));  
     System.exit(job.waitForCompletion(true) ? 0 : 1);  
   }  
 }  

这个程序的功能是对文件中各个单词的数目进行统计。

在Wordount.java中有两个静态内部类TokenizerMapper，IntSumReducer，关于静态内部类，可以参考另一篇文章 Java中的静态内部类。这两个类分别对应与MapReduce中的map和reduce。至于为什么要用静态的内部类，个人理解是这样的：一般一个简单作业（Job）包含了一个map过程和一个reduce过程，Job，Map，Reduce写在一个文件中便于文件的组织。但是，Hadoop内部需要使用反射的方式来实例化客户端的Map和Reduce，所以使用了静态内部类的方式，参考了StackOverflow上的一个帖子： Do Mappers and Reducers in Hadoop have to be static classes?，如果不许要将Job，Map和Reduce组织在一起，完全可以将这三个类写在三个类文件中。

在程序的main函数中首先实例化一个Configuration，用于加载Hadoop的配置信息，然后就解析给程序传递的参数，这里我们传递了两个字符串参数，经过解析之后保存在有两个元素的数组otherArgs中，其中otherArgs[0]为要进行统计的文件的路径，otherArgs[1]为经过MapReduce计算之后的结果所保存的位置。

[java]  view plain copy  
 Job job = new Job(conf, "word count");  

语句实例化一个Job对象，然后就为Job对像指定运行时所需的类

[java]  view plain copy  
 job.setJarByClass(WordCount.class);  

表示告诉Hadoop集群，作业从哪个类开始运行，

[java]  view plain copy  
 job.setMapperClass(TokenizerMapper.class);  

表示执行哪个类的map方法，我们这里指定的是方法

[java]  view plain copy  
 public void map(Object key, Text value, Context context  
                    ) throws IOException, InterruptedException {  
      StringTokenizer itr = new StringTokenizer(value.toString());  
      while (itr.hasMoreTokens()) {  
        word.set(itr.nextToken());  
        context.write(word, one);  
      }  
    }  

这个方法对要进行map的每行数据，使用StringTokenizer类进行分割，分割出来的值在保存到context中进行，从而在reduce中进行单词数量统计。

[java]  view plain copy  
 job.setReducerClass(IntSumReducer.class);  

这行语句设置用于进行Reduce的类，告诉Hadoop集群执行哪个reduce函数：

[java]  view plain copy  
 public void reduce(Text key, Iterable<IntWritable> values,   
                       Context context  
                       ) throws IOException, InterruptedException {  
      int sum = 0;  
      for (IntWritable val : values) {  
        sum += val.get();  
      }  
      result.set(sum);  
      context.write(key, result);  
    }  

在这个函数执行之前，Hadoop已经为我们将各个单词的个数大概的归并在一起了，函数的前两个参数是Text 类型和Iterable类型，参数名分别为key和alues，其中在这里key表示在map方法中分割得到的单词，values表示在map阶段统计的单词的数量（由于reduce阶段接收到多个数据结点发送过来的统计结果，所以对应于一个key，可能有多个value，所以将这些value都保存在一迭代器中，然后对迭代器进行遍历，这个过程以后再讨论。），遍历values迭代器，对每个key的数量进行汇总，然后再记录在context中。

[java]  view plain copy  
 job.setOutputKeyClass(Text.class);  
 job.setOutputValueClass(IntWritable.class);  

表示MapReduce执行结束之后，将结果保存在HDFS中时，保存的数据类型。这里将结果的key以Text类型保存，value以IntWritable类型保存。

[java]  view plain copy  
 FileInputFormat.addInputPath(job, new Path(otherArgs[0]));  
 FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));  

分别表示输入和输出的路径。

这个程序相对于Hadoop的例子，我去掉了

[java]  view plain copy  
 job.setCombinerClass(IntSumReducer.class);  

这行语句，在Hadoop中，Combiner主要用于提升Hadoop的处理效率，为了集中于理解MapReduce，我去掉了这行代码，待以后讨论提升Hadoop性能时，再学习Combiner。

原文地址：点击打开链接

这篇关于Hadoop入门实践之从WordCount程序说起的文章就介绍到这儿，希望我们推荐的文章对编程师们有所帮助！

Hadoop入门实践之从WordCount程序说起

相关文章

Spring WebFlux 与 WebClient 使用指南及最佳实践

MyBatis-Plus 中 nested() 与 and() 方法详解(最佳实践场景)

Spring Boot @RestControllerAdvice全局异常处理最佳实践

从入门到精通MySQL联合查询

Spring事务传播机制最佳实践

Java中的雪花算法Snowflake解析与实践技巧

从入门到精通C++11 ＜chrono＞库特性

MySQL 中 ROW_NUMBER() 函数最佳实践

解析C++11 static_assert及与Boost库的关联从入门到精通

深度解析Spring AOP @Aspect 原理、实战与最佳实践教程