DOT--A Matrix Model for Analyzing, Optimizing and Deploying Software for Big Data Analytics in Distributed Systems

This article introduces DOT--A Matrix Model for Analyzing, Optimizing and Deploying Software for Big Data Analytics in Distributed Systems, in the hope of providing a useful reference for developers working on related problems; interested readers are welcome to follow along.

1. Abstract
      Traditional parallel processing models, such as BSP, are “scale up” based, aiming to achieve high performance by increasing computing power, interconnection network bandwidth, and memory/storage capacity within dedicated systems, while big data analytics tasks aiming for high throughput demand that large distributed systems “scale out” by continuously adding computing and storage resources through networks. The “scale up” and “scale out” models each have a different set of performance requirements and system bottlenecks.
      In this paper, we develop a general model that abstracts critical computation and communication behavior and computation-communication interactions for big data analytics in a scalable and fault-tolerant manner. Our model is called DOT, represented by three matrices for data sets (D), concurrent data processing operations (O), and data transformations (T), respectively. 
      With the DOT model, any big data analytics job execution in various software frameworks can be represented by a specific or non-specific number of elementary/composite DOT blocks, each of which performs operations on the data sets, stores intermediate results, makes necessary data transfers, and performs data transformations in the end. The DOT model achieves the goals of scalability and fault-tolerance by enforcing a data-dependency-free relationship among concurrent tasks. Under the DOT model, we provide a set of optimization guidelines, which are framework and implementation independent, and applicable to a wide variety of big data analytics jobs. Finally, we demonstrate the effectiveness of the DOT model through several case studies.
2. Two common traditional goals of existing frameworks (including Google MapReduce, Hadoop, Dryad and Pregel):
      (1) for distributed applications, to provide a scalable and fault-tolerant system infrastructure and supporting environment; and
      (2) for software developers and application practitioners, to provide an easy-to-use programming model that hides the technical details of parallelization and fault-tolerance.
3. The following three issues demand more basic and fundamental research efforts:
      -Behavior Abstraction: The “scale out” model of big data analytics mainly concerns two issues:
            (1) how to maintain scalability, namely to ensure a proportional increase in data processing throughput as the size of the data and the number of computing nodes increase; and
            (2) how to provide a strong fault-tolerance mechanism in underlying distributed systems, namely to be able to quickly recover processing activities when some service nodes crash.
            However, the basis and principles on which jobs can be executed with scalability and fault-tolerance are not well studied.
       -Application Optimization:
       Current practice in application optimization for big data analytics jobs depends on the underlying software framework, so optimization opportunities are only applicable to a specific software framework or a specific system implementation. A bridging model between applications and underlying software frameworks would enable software-framework- and implementation-independent optimizations, which can enhance performance and productivity without impairing scalability and fault tolerance. With such a bridging model, system designers and application practitioners can focus on a set of general optimization rules regardless of the structures of software frameworks and underlying infrastructures.
      -System Comparison, Simulation and Migration:
       The diverse requirements of various big data analytics applications create the need for system comparison and application migration among existing and/or newly designed software frameworks. However, without a general abstract model of the processing paradigm of the various software frameworks for big data analytics, it is hard to fairly compare different frameworks in several critical aspects, including scalability, fault-tolerance and framework functionality. Additionally, a general model can guide the building of software framework simulators, which are greatly desirable when designing new frameworks or customizing existing frameworks for certain big data analytics applications. Moreover, since a bridging model between applications and the various underlying software frameworks is not available, application migration from one software framework to another depends strongly on programmers’ knowledge of both frameworks and is hard to do efficiently. Thus, it is desirable to have guidance for designing automatic tools for application migration from one software framework to another.
All of the above three issues demand a general model that bridges applications and the various underlying software frameworks for big data analytics.
4. We propose a candidate for the general model, called DOT, which characterizes the basic behavior of big data analytics and identifies its critical issues. The DOT model also serves as a powerful tool for analyzing, optimizing and deploying software for big data analytics. The three symbols “D”, “O”, and “T” are matrix representations of distributed data sets, concurrent data processing operations, and data transformations, respectively. Specifically, in the DOT model, the dataflow of a big data analytics job is represented by a DOT expression containing multiple root building blocks, called elementary DOT blocks, or their extensions, called composite DOT blocks. For every elementary DOT block, a matrix representation is used to abstract the basic computing and communication behavior of a big data analytics job. The DOT model eliminates data dependency among concurrent tasks executed by concurrent data processing units (called “workers” in the rest of the paper), which is a critical requirement for achieving scalability and fault-tolerance in a large distributed system.
5. THE DOT MODEL:  The DOT model consists of three major components to describe a big data analytics job:
     (1) a root building block, called an elementary DOT block: a big data (multi-)set, a set of workers, and mechanisms that regulate how the workers interact with the big data (multi-)set in two steps;
     (2) an extended building block, called a composite DOT block, organized as a group of independent elementary DOT blocks; and
     (3) a method for building the dataflow of a big data analytics job out of elementary/composite DOT blocks.
      
      An elementary DOT block is illustrated by Figure 1 with a three-layer structure. The bottom layer (D-layer) represents the big data (multi-)set. A big data (multi-)set is divided into n parts (from D1 to Dn) in a distributed system, where each part is a sub-dataset (called a chunk in the rest of the paper). In the middle layer (O-layer), n workers directly process the data (multi-)set, and oi is the data-processing operator associated with the i-th worker. Each worker processes only one chunk (as shown by the arrow from Di to oi) and stores intermediate results. At the top layer (T-layer), a single worker with operator t collects all intermediate results (as shown by the arrows from oi to t, i = 1, . . . , n), then performs the last-stage data transformation on the intermediate results, and finally outputs the final result.
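      To make the two-step processing concrete, here is a minimal, sequential Python sketch of an elementary DOT block (this is my illustration, not code from the paper; the word-count operators used below are hypothetical examples):

      # D-layer: the data (multi-)set divided into chunks D1 ... Dn.
      # O-layer: worker i applies operator oi only to its own chunk Di (no O-O communication).
      # T-layer: a single worker collects all intermediate results and applies operator t.
      def elementary_dot_block(chunks, o_ops, t_op):
          intermediates = [o(d) for o, d in zip(o_ops, chunks)]   # step 1: concurrent, independent
          return t_op(intermediates)                              # step 2: final transformation

      # Hypothetical usage: word count over three chunks.
      chunks = [["a", "b", "a"], ["b", "b"], ["a"]]
      count = lambda chunk: {w: chunk.count(w) for w in set(chunk)}
      def merge(parts):
          total = {}
          for part in parts:
              for w, c in part.items():
                  total[w] = total.get(w, 0) + c
          return total

      print(elementary_dot_block(chunks, [count] * 3, merge))     # {'a': 3, 'b': 3}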
      Based on the definition of the composite DOT block, there are three restrictions on communications among workers (a small sketch follows this list):
      (1) workers in the O-layer cannot communicate with each other;
      (2) workers in the T-layer cannot communicate with each other; and
      (3) intermediate data transfers from workers in the O-layer to their corresponding workers in the T-layer are the only communications occurring in a composite DOT block.
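      A composite DOT block can be sketched in the same illustrative style (again my own sketch; the indexing convention o_matrix[i][j], meaning the operator of worker i in the j-th elementary block, is an assumption): the m elementary DOT blocks run independently over the same chunks, and the only communication is from each O-layer worker to its corresponding T-layer worker.

      def composite_dot_block(chunks, o_matrix, t_ops):
          results = []
          for j, t in enumerate(t_ops):                    # the j-th elementary DOT block
              # O-layer workers process their own chunks independently (restriction 1) ...
              intermediates = [o_matrix[i][j](chunks[i]) for i in range(len(chunks))]
              # ... and send their intermediate results only to T-layer worker j (restriction 3).
              results.append(t(intermediates))             # T-layer workers are also independent (restriction 2)
          return results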
6. Big Data Analytics Jobs: a job is described by its dataflow, global information and halting conditions (a small sketch follows the list below).
    (1) Dataflow of a Job: represented by a specific or non-specific number of elementary/composite DOT blocks.
    (2) Global Information: a job may need to access some lightweight global information, e.g., system configurations.
    (3) Halting Conditions: determine when or under what conditions a job will stop.
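    As a small illustration of how these three components fit together (the field names and the driver loop below are my assumptions for illustration, not the paper's API):

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List

    @dataclass
    class BigDataAnalyticsJob:
        dataflow: List[Callable]                 # elementary/composite DOT blocks, applied in order
        global_info: Dict[str, str] = field(default_factory=dict)        # lightweight global information
        halted: Callable[[object, int], bool] = lambda result, it: True  # halting condition

        def run(self, data):
            result, iteration = data, 0
            while True:
                for block in self.dataflow:      # execute the dataflow once
                    result = block(result)
                iteration += 1
                if self.halted(result, iteration):
                    return result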
7. Formal Definitions
     7.1 The Elementary DOT Block:
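     The paper's formula is not reproduced in this post; the following LaTeX reconstruction is a best-effort sketch based on the three-layer description above (a 1 × n data vector, an n × n diagonal operator matrix, and an n × 1 column filled with the single operator t):

     \widehat{DOT} \;=\;
     \begin{pmatrix} D_1 & D_2 & \cdots & D_n \end{pmatrix}
     \begin{pmatrix} o_1 & & \\ & \ddots & \\ & & o_n \end{pmatrix}
     \begin{pmatrix} t \\ \vdots \\ t \end{pmatrix}
     \;=\; t\Big(\bigoplus_{i=1}^{n} o_i(D_i)\Big)

     That is, each O-layer operator is applied to its own chunk, the group operator collects the n intermediate results, and the single T-layer operator t is applied to the collection.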
     In the above matrix representation, matrix multiplication follows the row-column pair rule of the conventional matrix product. The multiplication of corresponding elements of the two matrices is defined as: 
    (1) a multiplication between a data chunk Di and an operator f ( f can either be the operator in matrix O or the one in matrix T) means to apply the operator on the chunk, represented by f(Di);
    (2) a multiplication between two operators (e.g., f1 × f2) means to form a composition of the operators (i.e., f(·) = f2(f1(·))). In contrast to the conventional matrix product, in the DOT model the summation operator Σ is replaced by a group operator ⊕. The operation ⊕_{i=1..n} fi(Di) = (f1(D1), . . . , fn(Dn)) means to compose a collection of data sets f1(D1) to fn(Dn). It is not required that all elements of the collection be located in a single place.
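    A tiny worked instance of these rules (my own example, with n = 2), written in LaTeX:

    \begin{pmatrix} D_1 & D_2 \end{pmatrix}
    \begin{pmatrix} o_1 & \\ & o_2 \end{pmatrix}
    \begin{pmatrix} t \\ t \end{pmatrix}
    \;=\; t\Big(\bigoplus_{i=1}^{2} o_i(D_i)\Big)
    \;=\; t\big(\,(o_1(D_1),\; o_2(D_2))\,\big)

    so the single T-layer worker receives the collection of the two intermediate results and applies t once.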
   7.2 The Composite DOT Block: 
     Given m elementary DOT blocks ~DO1T1 to ~DOmTm, a composite DOT block ~DOT is formulated as:
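      The formula is again not shown in the post; a hedged reconstruction, assuming the m elementary DOT blocks share the same data vector so that O becomes an n × m operator matrix and T a diagonal matrix of the m transformation operators t1, . . . , tm:

      \widehat{DOT} \;=\;
      \begin{pmatrix} D_1 & \cdots & D_n \end{pmatrix}
      \begin{pmatrix} o_{1,1} & \cdots & o_{1,m} \\ \vdots & \ddots & \vdots \\ o_{n,1} & \cdots & o_{n,m} \end{pmatrix}
      \begin{pmatrix} t_1 & & \\ & \ddots & \\ & & t_m \end{pmatrix}
      \;=\; \Big( t_1\big(\textstyle\bigoplus_{i=1}^{n} o_{i,1}(D_i)\big),\; \cdots,\; t_m\big(\textstyle\bigoplus_{i=1}^{n} o_{i,m}(D_i)\big) \Big)

      Each column j of O together with tj forms the j-th elementary DOT block, which is consistent with the three communication restrictions listed above.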
     7.3  An Algebra for Representing the Dataflow of Big Data Analytics Jobs
      A big data analytics job can be represented by an expression, called a DOT expression.
      For example, a job can be composed of three composite DOT blocks, ~D1O1T1, ~D2O2T2 and ~D3O3T3, where the results of ~D1O1T1 and ~D2O2T2 are the input of ~D3O3T3. With the algebra defined in this section, the DOT expression of this job is shown below.
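      The expression itself is omitted from the post; a plausible reconstruction, assuming the algebra writes independent blocks side by side with “+” and feeds their combined output into a downstream block with “∘”, is:

      \big(\widehat{D_1O_1T_1} + \widehat{D_2O_2T_2}\big) \circ \widehat{D_3O_3T_3}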
      
          A context-free grammar for deriving a DOT expression is given in Figure 5 of the paper.
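          Figure 5 is not reproduced in this post. As a rough illustration only (the nonterminal names and operator symbols below are my assumptions, not the paper's exact grammar), such a grammar could take a shape like:

          <DOT expression> ::= <term> | <DOT expression> "∘" <term>
          <term>           ::= <block> | <term> "+" <block>
          <block>          ::= <elementary DOT block> | <composite DOT block>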
 
     With this algebra, the dataflow of a big data analytics job can be written as a DOT expression, and the job itself is then described by the DOT expression, its global information and its halting conditions.
  
8. Scalability and fault-tolerance: the DOT model achieves both goals by enforcing a data-dependency-free relationship among concurrent tasks; because workers within the O-layer and within the T-layer never communicate with each other, their tasks can be executed and recovered independently of one another.

This concludes our introduction to DOT--A Matrix Model for Analyzing, Optimizing and Deploying Software for Big Data Analytics in Distributed Systems; we hope it is helpful to fellow developers!


