Beyond Hadoop: Next-Generation Big Data Architectures

2024-01-23 05:58



http://duanple.blog.163.com/blog/static/7097176720101118102636423/

After 25 years of dominance, relational databases and SQL have in recent years come under fire from the growing “NoSQL movement.” A key element of this movement is Hadoop, the open-source clone of Google’s internal MapReduce system. Whether it’s interpreted as “No SQL” or “Not Only SQL,” the message has been clear: If you have big data challenges, then your programming tool of choice should be Hadoop.

The only problem with this story is that the people who really do have cutting-edge performance and scalability requirements today have already moved on from the Hadoop model. A few have moved back to SQL, but the much more significant trend is that, as developers have come to grips with the capabilities and limitations of MapReduce and Hadoop, a whole raft of radically new post-Hadoop architectures is now being developed that are, in most cases, orders of magnitude faster at scale than Hadoop. We are now at the start of a new “NoHadoop” era, as companies increasingly realize that big data requires “Not Only Hadoop.”

Simple batch processing tools like MapReduce and Hadoop are just not powerful enough in any one of the dimensions of the big data space that really matters. Sure, Hadoop is great for simple batch processing tasks that are “embarrassingly parallel”, but most of the difficult big data tasks confronting companies today are much more complex than that. They can involve complex joins, ACID requirements, real-time requirements, supercomputing algorithms, graph computing, interactive analysis, or the need for continuous incremental updates. In each case, Hadoop is unable to provide anything close to the levels of performance required. Fortunately, however, in each case there now exist next-generation big data architectures that can provide that required scale and performance. Over the next couple of years, these architectures will break out into the mainstream.
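To make the “embarrassingly parallel” batch model concrete, here is a minimal in-process sketch of MapReduce-style word counting (the data and function names are invented for illustration). Each chunk is mapped independently, and the reduce step cannot begin until every mapper has finished, which is exactly why the model is batch-only:

```python
from collections import Counter
from functools import reduce

def map_phase(chunk):
    # Map: each input chunk is processed independently -- the
    # "embarrassingly parallel" part that Hadoop handles well.
    return Counter(chunk.split())

def reduce_phase(counts_a, counts_b):
    # Reduce: partial results are merged; this stage waits for every
    # mapper to finish, which is what makes the model batch-only.
    return counts_a + counts_b

chunks = ["big data needs big tools", "hadoop handles big batches"]
total = reduce(reduce_phase, map(map_phase, chunks))
print(total["big"])  # 3
```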

Here is a brief overview of the current NoHadoop or post-Hadoop space. In each case, the next-gen architecture beats MapReduce/Hadoop by anything from 10x to 10,000x in terms of performance at scale.

SQL. Having been around for 25 years, it’s a bit weird to call SQL next-gen, but it is! There’s currently a tremendous amount of innovation going on around SQL from companies like VoltDB, Clustrix and others. If you need to handle complex joins, or need ACID requirements, SQL is still the way to go. Applications: Complex business queries, online transaction processing.
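The two strengths mentioned above, joins and ACID transactions, can be sketched in a few lines of SQL. This toy example uses Python's built-in SQLite driver (the table and column names are invented): the `with` block is a single atomic transaction, so a failure rolls back both balance updates together, and the final query is a join-plus-aggregate that MapReduce would handle far less naturally:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts(id INTEGER PRIMARY KEY, owner TEXT, balance INTEGER);
    CREATE TABLE orders(id INTEGER PRIMARY KEY, account_id INTEGER, amount INTEGER);
    INSERT INTO accounts VALUES (1, 'alice', 100), (2, 'bob', 50);
    INSERT INTO orders VALUES (1, 1, 30), (2, 2, 20);
""")

with conn:  # one ACID transaction: both updates commit, or neither does
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")

# A complex-join style query: per-owner order totals across two tables.
row = conn.execute("""
    SELECT a.owner, SUM(o.amount)
    FROM accounts a JOIN orders o ON o.account_id = a.id
    GROUP BY a.owner ORDER BY a.owner
""").fetchone()
print(row)  # ('alice', 30)
```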

Cloudscale. [McColl is the CEO of Cloudscale. See his bio below.] For realtime analytics on big data, it’s essential to break free from the constraints of batch processing. For example, if you’re looking to continuously analyze a stream of events at a rate of one million events per second per server, and deliver results with a maximum latency of five seconds between data in and analytics out, then you need a real-time data flow architecture. The Cloudscale architecture provides this kind of realtime big data analytics, with latency that is up to 10,000X faster than batch processing systems such as Hadoop. Applications: Algorithmic trading, fraud detection, mobile advertising, location services, marketing intelligence.
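The key architectural difference from batch processing is that state is updated incrementally as each event arrives, so results are always fresh. Below is a toy sliding-window counter in that style (this is a generic sketch of real-time data flow, not Cloudscale's actual API): each event is absorbed in amortized constant time, and expired events are evicted as the window slides:

```python
from collections import deque

class SlidingWindowCounter:
    """Count events per key over the last `window_seconds`, updating
    incrementally per event instead of recomputing in batches."""
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # (timestamp, key) in arrival order
        self.counts = {}

    def on_event(self, ts, key):
        self.events.append((ts, key))
        self.counts[key] = self.counts.get(key, 0) + 1
        # Evict events that fell out of the window; each event is added
        # and removed exactly once, so the cost per event is amortized O(1).
        while self.events and self.events[0][0] <= ts - self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if not self.counts[old_key]:
                del self.counts[old_key]

w = SlidingWindowCounter(window_seconds=5)
for ts, key in [(0, "click"), (1, "click"), (2, "buy"), (7, "click")]:
    w.on_event(ts, key)
print(w.counts)  # only the click at t=7 is still inside the window
```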

MPI and BSP. Many supercomputing applications require complex algorithms on big data, in which processors communicate directly at very high speed in order to deliver performance at scale. Parallel programming tools such as MPI and BSP are necessary for this kind of high performance supercomputing. Applications: Modelling and simulation, fluid dynamics.
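The BSP model organizes computation into "supersteps": every processor computes on local data, then all processors synchronize at a barrier before anyone proceeds. A real code would use MPI (e.g. via mpi4py); this thread-based toy sketches one BSP round of a global sum:

```python
import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
partial = [0] * NUM_WORKERS
total = [0] * NUM_WORKERS

def worker(rank, data):
    partial[rank] = sum(data)   # superstep 1: purely local computation
    barrier.wait()              # barrier: all partial sums are now visible
    total[rank] = sum(partial)  # superstep 2: combine everyone's results

chunks = [[1, 2], [3, 4], [5, 6], [7, 8]]
threads = [threading.Thread(target=worker, args=(r, c))
           for r, c in enumerate(chunks)]
for t in threads: t.start()
for t in threads: t.join()
print(total[0])  # every worker computes the same global sum, 36
```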

Pregel. Need to analyse a complex social graph? Need to analyse the web? It’s not just big data, it’s big graphs! We’re rapidly moving to a world where the ability to analyse very-large-scale dynamic graphs (billions of nodes, trillions of edges) is becoming critical for some important applications. Google’s Pregel architecture uses a BSP model to enable highly efficient graph computing at enormous scale. Applications: Web algorithms, social graph algorithms, location graphs, learning and discovery, network optimisation, internet of things.
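Pregel's vertex-centric model can be sketched on a tiny graph. Each loop iteration below plays the role of one BSP superstep: active vertices send candidate distances along their out-edges, and a vertex reactivates only when a message improves its value, so work naturally shrinks as the computation converges. (This is a single-machine sketch of the programming model, not Google's implementation.)

```python
import math

# Single-source shortest paths, Pregel style, on a 3-vertex toy graph.
edges = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
dist = {v: math.inf for v in edges}
dist["a"] = 0
active = {"a"}

while active:                         # one iteration = one BSP superstep
    messages = {}
    for v in active:                  # compute phase: emit messages
        for target, weight in edges[v]:
            cand = dist[v] + weight
            messages[target] = min(messages.get(target, math.inf), cand)
    active = set()
    for v, cand in messages.items():  # delivery: improved vertices reactivate
        if cand < dist[v]:
            dist[v] = cand
            active.add(v)

print(dist)  # {'a': 0, 'b': 1, 'c': 3}
```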

Dremel. Need to interact with web-scale data sets? Google’s Dremel architecture is designed to support interactive, ad hoc queries over trillion-row tables in seconds! It executes queries natively without translating them into MapReduce jobs. Dremel has been in production since 2006 and has thousands of users within Google. Applications: Data exploration, customer support, data center monitoring.
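A large part of Dremel's speed comes from columnar storage: an aggregate over one column reads only that column's values rather than whole records. This toy contrast (invented data, not Dremel's actual format) shows the same query answered from a row layout and a column layout; at trillion-row scale, the column version scans a small fraction of the bytes:

```python
rows = [
    {"url": "/home", "latency_ms": 12, "status": 200},
    {"url": "/cart", "latency_ms": 95, "status": 500},
    {"url": "/home", "latency_ms": 31, "status": 200},
]

# Row store: the query must walk every field of every record.
row_answer = sum(r["latency_ms"] for r in rows) / len(rows)

# Column store: one array per column; the query touches only the
# "latency_ms" array and skips url/status data entirely.
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_answer = sum(columns["latency_ms"]) / len(columns["latency_ms"])

print(row_answer == col_answer)  # same answer, far less data scanned
```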

Percolator (Caffeine). If you need to incrementally update the analytics on a massive data set continuously, as Google now has to do on its index of the web, then an architecture like Percolator (Caffeine) beats Hadoop easily; Google Instant just wouldn’t be possible without it. “Because the index can be updated incrementally, the median document moves through Caffeine over 100 times faster than it moved through the company’s old MapReduce setup.” Applications: Real time search.
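The core idea of incremental processing can be sketched with a toy inverted index: when one document changes, only the postings it touches are rewritten, instead of recomputing the whole index in a batch job. (The function and variable names here are illustrative, not Percolator's actual API.)

```python
inverted = {}   # word -> set of doc ids
docs = {}       # doc id -> current text

def on_document_changed(doc_id, new_text):
    # Observer-style hook: fires per changed document, diffs old vs new,
    # and updates only the affected index entries.
    old_words = set(docs.get(doc_id, "").split())
    new_words = set(new_text.split())
    docs[doc_id] = new_text
    for w in old_words - new_words:           # drop stale postings only
        inverted[w].discard(doc_id)
    for w in new_words - old_words:           # add new postings only
        inverted.setdefault(w, set()).add(doc_id)

on_document_changed(1, "big data tools")
on_document_changed(2, "graph data")
on_document_changed(1, "big graph tools")  # incremental: one doc touched
print(sorted(inverted["graph"]))  # [1, 2]
```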

The fact that Hadoop is freely available to everyone means it will remain an important entry point to the world of big data for many people. However, as the performance demands for big data apps continue to increase, we will find these new, more powerful forms of big data architecture will be required in many cases.

Bill McColl is the founder and CEO of Cloudscale Inc. and a former professor of Computer Science, Head of the Parallel Computing Research Center, and Chairman of the Computer Science Faculty at Oxford University.

