This post is part 3 of a source-code analysis of the shuffle in Hadoop 1.x; I hope it offers a useful reference for developers digging into this code.
There are two variants of this shuffle: one keeps the fetched map output in memory, the other writes it to a local file; apart from where the data lands, the two are almost identical.
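For context, the caller (getMapOutput in the reduce-side copier) routes each map output to one of the two paths depending on whether the decompressed segment fits into the shuffle memory budget. Below is a minimal, self-contained sketch of that decision; the class name, field name and 25% threshold are illustrative assumptions, not the actual ShuffleRamManager API.

// Simplified stand-in for the in-memory vs. on-disk routing decision.
// In Hadoop 1.x the copier asks its ShuffleRamManager whether the
// decompressed segment can fit; the names and threshold here are assumed.
public class ShuffleRouteSketch {
    // Assumed fraction of the shuffle RAM that a single segment may occupy
    // before it is sent straight to disk.
    private static final double SINGLE_SHUFFLE_FRACTION = 0.25;

    static boolean fitsInMemory(long decompressedLength, long ramBudgetBytes) {
        return decompressedLength < (long) (ramBudgetBytes * SINGLE_SHUFFLE_FRACTION);
    }

    public static void main(String[] args) {
        long budget = 200L * 1024 * 1024; // 200 MB of shuffle RAM
        // small segment -> the in-memory path (shuffleInMemory)
        System.out.println(fitsInMemory(10L * 1024 * 1024, budget));  // true
        // large segment -> the on-disk path (shuffleToDisk)
        System.out.println(fitsInMemory(80L * 1024 * 1024, budget));  // false
    }
}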
Take the path that shuffles through a local file as the example. The copier calls:

mapOutput = shuffleToDisk(mapOutputLoc, input, filename, compressedLength);
The shuffleToDisk function is as follows:
private MapOutput shuffleToDisk(MapOutputLocation mapOutputLoc,
                                InputStream input,
                                Path filename,
                                long mapOutputLength) throws IOException {
  // Find out a suitable location for the output on local-filesystem,
  // i.e. the path of the output file on the local file system
  Path localFilename =
    lDirAlloc.getLocalPathForWrite(filename.toUri().getPath(),
                                   mapOutputLength, conf);

  // Create the MapOutput describing this on-disk map output
  MapOutput mapOutput =
    new MapOutput(mapOutputLoc.getTaskId(), mapOutputLoc.getTaskAttemptId(),
                  conf, localFileSys.makeQualified(localFilename),
                  mapOutputLength);

  // Copy data to local-disk: read from 'input' (the stream opened over the
  // HTTP connection) and write it to the local file
  OutputStream output = null;
  long bytesRead = 0;
  try {
    output = rfs.create(localFilename);

    byte[] buf = new byte[64 * 1024];
    int n = -1;
    try {
      n = input.read(buf, 0, buf.length);
    } catch (IOException ioe) {
      readError = true;
      throw ioe;
    }
    while (n > 0) {
      bytesRead += n;
      shuffleClientMetrics.inputBytes(n);
      output.write(buf, 0, n);
      // indicate we're making progress
      reporter.progress();
      try {
        n = input.read(buf, 0, buf.length);
      } catch (IOException ioe) {
        readError = true;
        throw ioe;
      }
    }

    LOG.info("Read " + bytesRead + " bytes from map-output for " +
             mapOutputLoc.getTaskAttemptId());

    // All data read successfully: close both streams
    output.close();
    input.close();
  } catch (IOException ioe) {
    LOG.info("Failed to shuffle from " + mapOutputLoc.getTaskAttemptId(), ioe);

    // Discard the map-output
    try {
      mapOutput.discard();
    } catch (IOException ignored) {
      LOG.info("Failed to discard map-output from " +
               mapOutputLoc.getTaskAttemptId(), ignored);
    }
    mapOutput = null;

    // Close the streams
    IOUtils.cleanup(LOG, input, output);

    // Re-throw
    throw ioe;
  }

  // Sanity check: verify we read exactly as many bytes as expected
  if (bytesRead != mapOutputLength) {
    try {
      mapOutput.discard();
    } catch (Exception ioe) {
      // IGNORED because we are cleaning up
      LOG.info("Failed to discard map-output from " +
               mapOutputLoc.getTaskAttemptId(), ioe);
    } catch (Throwable t) {
      String msg = getTaskID() + " : Failed in shuffle to disk :" +
                   StringUtils.stringifyException(t);
      reportFatalError(getTaskID(), t, msg);
    }
    mapOutput = null;

    throw new IOException("Incomplete map output received for " +
                          mapOutputLoc.getTaskAttemptId() + " from " +
                          mapOutputLoc.getOutputLocation() + " (" +
                          bytesRead + " instead of " +
                          mapOutputLength + ")");
  }
  return mapOutput;
}
So the essence of this piece of the shuffle is: read data from the HTTP input stream and write it to a file on the local disk; once the copy finishes, record the task ID, the task attempt ID, the qualified local file name and the output length in a MapOutput object, and return that object.
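To make that concrete, here is a minimal, self-contained sketch of just the copy loop and the final length check; the class and method names, the delete-on-mismatch cleanup and the much-simplified error handling are illustrative assumptions, not the actual Hadoop code.

import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Reads an input stream in 64 KB chunks, writes it to a local file,
// counts the bytes and rejects an incomplete transfer.
public class DiskCopySketch {

    static long copyToDisk(InputStream input, File localFile, long expectedLength)
            throws IOException {
        long bytesRead = 0;
        try (OutputStream output = new FileOutputStream(localFile)) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = input.read(buf, 0, buf.length)) > 0) {
                output.write(buf, 0, n);
                bytesRead += n;
                // the real method also updates shuffleClientMetrics and calls
                // reporter.progress() here so the reduce task is not marked as hung
            }
        }
        if (bytesRead != expectedLength) {
            // mirrors the sanity check at the end of shuffleToDisk:
            // drop the partial output and fail this copy attempt
            localFile.delete();
            throw new IOException("Incomplete map output: got " + bytesRead
                    + " bytes instead of " + expectedLength);
        }
        return bytesRead;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello shuffle".getBytes("UTF-8");
        File out = File.createTempFile("map-output", ".out");
        long copied = copyToDisk(new ByteArrayInputStream(data), out, data.length);
        System.out.println("copied " + copied + " bytes to " + out);
    }
}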
The Java code here is very direct, with nothing fancy going on; apart from being a little verbose, there is really nothing to fault :)
That concludes part 3 of the Hadoop 1.x shuffle source-code analysis; I hope it is of some help.