HadoopDB Introduction
Hadoop Overview

I. Hadoop Overview
Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS).
Hadoop has natural advantages in data extraction, transformation, and loading (ETL).
HDFS provides bulk storage for files, while Hadoop's MapReduce functionality splits a single job into fragments, sends the fragment tasks (Map) to multiple nodes, and then loads the results back (Reduce) into the data warehouse as a single dataset.
Hadoop-based ETL can operate on data in batches, writing processing results directly to storage.
Hadoop has the following characteristics:
1. High reliability. Because it assumes that compute elements and storage will fail, it keeps multiple working copies of the data and can redistribute processing away from failed nodes.
2. High scalability. Hadoop distributes data and computation across clusters of available machines, and these clusters can easily grow to thousands of nodes.
3. High efficiency. It works in parallel, moving data dynamically between nodes and keeping the nodes balanced, so processing is very fast.
4. High fault tolerance. Hadoop automatically keeps multiple copies of data and automatically reassigns failed tasks.
5. Elastic capacity. Hadoop can handle data at the petabyte scale.
6. Low cost. Hadoop is open source, which greatly reduces software costs.
Hadoop's composition:
1. At the bottom is HDFS (Hadoop Distributed File System), which stores files across all the storage nodes of a Hadoop cluster and is the main carrier of data. It consists of a NameNode and DataNodes.
2. Above HDFS sits the MapReduce engine, made up of JobTrackers and TaskTrackers. It processes data through the MapReduce workflow.
3. YARN carries out task assignment and cluster resource management. It consists of the ResourceManager, NodeManagers, and ApplicationMasters.
Hadoop is built from these three parts, which we now describe in detail:
1. HDFS
The Hadoop HDFS architecture is built on a specific set of nodes: (1) the NameNode (of which there is only one), which manages the file system namespace and controls access by external clients.
Hadoop - Introduction

[Figure: HDFS architecture. A Client contacts the NameNode (a Secondary NameNode keeps a namespace backup); the NameNode coordinates heartbeats, balancing, replication, etc. across the DataNodes, which serve the data.]
[Figure: Google's cloud computing stack. MapReduce, BigTable, and Chubby built on top of GFS.]
What can Hadoop do?
Case 1: Suppose I want to know the highest temperature in each of the past 100 years.
This is a very typical case, and the problem involves a large amount of data.
For meteorological data, there are very many collection points around the world; each point samples at its own frequency, 24 hours a day, 365 days a year, and the data has to be collected for 100 years. From those 100 years of data, the highest temperature of each year must be extracted to produce the result. The process involves a great deal of data analysis over a large body of semi-structured data. Computed on a high-end mainframe (Unix environment), the job completes on the order of tens of minutes to hours; with Hadoop, given a reasonable number of nodes and a sensible architecture, it completes on the order of seconds.
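This yearly-maximum problem maps naturally onto MapReduce. Below is a minimal Hadoop Streaming sketch in Python; it assumes a hypothetical input format of one `year<TAB>temperature` pair per line, so field extraction would need adjusting for a real weather dataset.

```python
#!/usr/bin/env python3
# Combined mapper/reducer for the yearly maximum-temperature job.
# Run as "max_temp.py map" or "max_temp.py reduce" (names are illustrative).
import sys

def mapper():
    # Emit (year, temperature) pairs; assumes "1998\t31.4" style lines.
    for line in sys.stdin:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # skip malformed records
        year, temp = parts
        try:
            float(temp)
        except ValueError:
            continue
        print(f"{year}\t{temp}")

def reducer():
    # Hadoop sorts mapper output by key, so all records for one year
    # arrive together; track the running maximum per year.
    current_year, current_max = None, None
    for line in sys.stdin:
        year, temp = line.strip().split("\t")
        value = float(temp)
        if year != current_year:
            if current_year is not None:
                print(f"{current_year}\t{current_max}")
            current_year, current_max = year, value
        else:
            current_max = max(current_max, value)
    if current_year is not None:
        print(f"{current_year}\t{current_max}")

if __name__ == "__main__":
    mapper() if (len(sys.argv) > 1 and sys.argv[1] == "map") else reducer()
```

The shuffle phase between the two functions is what delivers every record for a given year to the same reducer; that grouping, not the per-record logic, is where Hadoop's parallelism pays off.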
HIVE

Clients reach Hive through ODBC, the command line, or JDBC, via a Thrift Server; underneath sit the Metastore and the Driver (compiler, optimizer, executor). Hive's parts include: the metadata store (Metastore), the Driver, and the Query Compiler.
1. HDFS (Hadoop Distributed File System)
HDFS grew out of Google's GFS paper, published in October 2003; HDFS is a clone of GFS. It is the foundation of data storage management in the Hadoop stack. It is a highly fault-tolerant system, able to detect and respond to hardware failure, and designed to run on low-cost commodity hardware. HDFS simplifies the file consistency model and, through streaming data access, provides high-throughput access to application data, suiting applications with large datasets.
- Client: splits files; accesses HDFS; interacts with the NameNode to obtain file location information; interacts with DataNodes to read and write data.
- NameNode: the master node (only one in Hadoop 1.x). It manages the HDFS namespace and block mapping information, sets the replication policy, and handles client requests.
- DataNode: a slave node that stores the actual data and reports its storage information to the NameNode.
- Secondary NameNode: assists the NameNode and shares part of its workload; it periodically merges the fsimage and fsedits and pushes the result to the NameNode, and in an emergency it can help recover the NameNode. It is not, however, a hot standby for the NameNode.
Hadoop Principles

Hadoop is an open-source distributed computing framework that can process very large datasets while providing high reliability, high scalability, and efficient computation.
This article explains how Hadoop works.
I. Overview
1.1 Definition: Hadoop is a distributed computing framework written in Java, developed and maintained by the Apache Foundation.
1.2 Characteristics:
- handles very large datasets
- highly reliable, highly scalable, and efficient
- supports multiple data storage methods
- supports multiple computation models and programming languages
- easy to deploy and manage
1.3 Components: Hadoop consists of the following:
- HDFS (Hadoop Distributed File System): a distributed file system that stores very large datasets.
- MapReduce: a distributed computing framework for parallel processing of large-scale data.
- YARN (Yet Another Resource Negotiator): a resource manager that coordinates resource usage among the applications running on the cluster.
II. How HDFS works
2.1 Overview: HDFS is a distributed file system that stores very large datasets across a cluster. It uses a master-slave architecture: the NameNode, the master, manages the file system's metadata, while the DataNodes, the slaves, store the data blocks.
2.2 File storage: HDFS splits each file into multiple blocks for storage. Each block is 128 MB by default, and the size can be changed in the configuration. When a file is uploaded to HDFS it is split into blocks, and the blocks are replicated to different DataNodes as backups.
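As a quick back-of-the-envelope check (the file size here is made up for illustration): a 1 GB file at the default 128 MB block size occupies 8 logical blocks, and with HDFS's usual replication factor of 3 that means 24 block replicas on disk.

```python
import math

def hdfs_block_footprint(file_bytes, block_bytes=128 * 1024**2, replication=3):
    """Return (logical_blocks, stored_replicas) for one HDFS file."""
    logical_blocks = math.ceil(file_bytes / block_bytes)
    return logical_blocks, logical_blocks * replication

print(hdfs_block_footprint(1 * 1024**3))  # 1 GB file -> (8, 24)
```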
2.3 Reads and writes: when a client wants to read a file, it sends a request to the NameNode. The NameNode returns a list identifying the DataNodes that hold the file's blocks. Using that list, the client communicates directly with the DataNodes to fetch the data. When a client wants to upload a file, it sends a request to the NameNode and then uploads the file as a series of blocks.
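Client libraries wrap exactly this request flow. An illustrative sketch using the third-party Python `hdfs` package, which speaks WebHDFS; the NameNode address, port, and user below are placeholders:

```python
from hdfs import InsecureClient  # pip install hdfs

# Metadata requests go to the NameNode; block data is then streamed
# directly to and from DataNodes, as described above.
client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

# Write: the NameNode chooses target DataNodes for each block.
client.write("/tmp/demo.txt", data=b"hello hdfs\n", overwrite=True)

# Read: the NameNode returns block locations; bytes come from DataNodes.
with client.read("/tmp/demo.txt") as reader:
    print(reader.read())
```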
Overview of Big Data Storage Methods

Introduction: with the continuous development of information technology, big data has become an important information resource in today's society. To manage and exploit big data effectively, a variety of storage methods have emerged. This article surveys those methods to help readers better understand big data storage.
I. Distributed file systems
1.1 HDFS (Hadoop Distributed File System): the distributed file system of the Apache Hadoop project, suited to large-scale data, with high reliability and high scalability.
1.2 GFS (Google File System): Google's distributed file system; it uses a master-slave architecture and handles the storage and access of large-scale data efficiently.
1.3 Ceph: an open-source distributed storage system with high availability and performance, supporting object, block, and file storage.
II. NoSQL databases
2.1 MongoDB: a document-oriented NoSQL database, suited to semi-structured data, with high performance and scalability.
2.2 Cassandra: a highly scalable NoSQL database for storing large-scale data across a distributed cluster, with high availability and fault tolerance.
2.3 Redis: an open-source in-memory database for caching and real-time processing, with fast reads and writes.
III. Columnar databases
3.1 HBase: a Hadoop-based column-oriented database for large-scale structured data, with high availability and performance.
3.2 Vertica: a high-performance columnar database for data warehousing and real-time analytics, with fast queries and high compression ratios.
3.3 ClickHouse: an open-source columnar database for real-time analytics and data warehousing, with high performance and scalability.
IV. Cloud storage
4.1 AWS S3 (Amazon Simple Storage Service): Amazon's cloud storage service for large-scale data, with high reliability and security.
Hadoop Fundamentals Training

Storage + compute (HDFS2 + YARN)

The main bottleneck of centralized storage and compute: traditional scale-up (vertical) stacks built on Oracle and IBM machines with EMC storage grow compute capacity in proportion to machine count, but I/O capacity does not keep pace.

Hadoop, by contrast, has many vendors (Intel, Cloudera, Hortonworks, MapR):
- the hardware is x86 servers, inexpensive and available from many suppliers;
- clusters can be maintained in house, cutting maintenance costs;
- there are large-scale success stories at internet companies (BAT).

Summary
- The Hadoop platform has natural architecture and cost advantages for building a data-as-a-service (DaaS) cloud platform.
- Cost estimate: derive the required hardware and system software resources from the storage requirements (taking 50 million users as an example).
Writing a file into HDFS
- The primary goal is fast, parallel processing of the data. To achieve it, we want as many machines as possible working at once.
- The Client negotiates with the NameNode (typically over TCP) and receives a list of three DataNodes that will receive copies of each block. The Client then writes each block directly to the DataNodes (again typically over TCP). The NameNode only supplies the location of the data and where it belongs in the cluster (the file system metadata).
- The second and third DataNodes are typically in the same rack, so the transfer between them gets high bandwidth and low latency. Only once a block has been successfully written to all three nodes does the next block start.
- If the NameNode dies, the files preserved by the Secondary NameNode can be used to recover it.
- Each DataNode both stores data and runs a daemon that communicates with its master: TaskTracker daemons report to the JobTracker, while DataNodes report to the NameNode.
Hadoop Ecosystem Introduction

The Hadoop ecosystem is an open-source big data processing platform, supported and maintained by the Apache Foundation, that provides distributed storage and processing for very large datasets. The ecosystem is built from many components and tools, including the Hadoop core, Hive, HBase, Pig, Spark, and others. Below we introduce each component and what it does.
I. The Hadoop core
The Hadoop core is the heart of the ecosystem and consists of two main parts: the Hadoop Distributed File System (HDFS) and the MapReduce programming model. HDFS is a highly scalable distributed file system that can store massive amounts of data across thousands of machines, providing dispersed storage and efficient access. MapReduce is Hadoop's programming model for big data processing; it handles massive datasets in a distributed fashion, making large-scale analysis easy and fast.
II. Hive
Hive is an open-source data warehouse system that uses Hadoop for computation and storage. It offers a SQL-like query syntax, so large bodies of structured data can be queried and analyzed with HiveQL. Hive supports multiple data sources, such as text and serialized files, and can export results to HDFS or the local file system.
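For example, a HiveQL aggregation can be issued from Python through the third-party PyHive library; the host, port, and table below are placeholders, and a running HiveServer2 endpoint is assumed:

```python
from pyhive import hive  # pip install "pyhive[hive]"

# Connect to a (hypothetical) HiveServer2 instance.
conn = hive.Connection(host="hive.example.com", port=10000, username="hadoop")
cursor = conn.cursor()

# HiveQL reads like SQL; Hive compiles it into distributed jobs.
cursor.execute(
    "SELECT year(sale_date), SUM(revenue) FROM sales GROUP BY year(sale_date)"
)
for sale_year, total in cursor.fetchall():
    print(sale_year, total)
```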
III. HBase
HBase is an open-source, column-oriented distributed database built on Hadoop. It can handle massive amounts of non-structured data with high availability and high performance. HBase supports fast data storage and retrieval, works with distributed computation models, and offers an easy-to-use API.
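A small sketch of that API through the third-party happybase Python client; the host and table below are invented, and HBase's Thrift server is assumed to be enabled:

```python
import happybase  # pip install happybase

# Connect to a (hypothetical) HBase Thrift server.
connection = happybase.Connection("hbase.example.com")
table = connection.table("user_events")

# Rows are keyed byte strings; cells live inside column families.
table.put(b"user42#2009-08-24", {b"cf:event": b"login", b"cf:ip": b"10.0.0.7"})

# Point lookups by row key stay fast regardless of table size.
row = table.row(b"user42#2009-08-24")
print(row[b"cf:event"])
```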
IV. Pig
Pig is a Hadoop-based platform for large-scale data analysis. It provides a simple, easy-to-use analysis language, Pig Latin, for cleaning, managing, and processing data. Pig processes data in two stages: in the first, Pig Latin statements transform the raw data into intermediate data; in the second, the intermediate data is processed as collections.
V. Spark
Spark is a fast, general-purpose big data processing engine. It handles large-scale data and supports SQL queries, streaming data processing, machine learning, and other processing styles.
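A minimal PySpark sketch of that style (the HDFS path and column names are invented for illustration):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# Read a (hypothetical) CSV of sales records stored in HDFS.
df = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

# The same yearly aggregation as the Hive example, expressed on DataFrames.
(df.groupBy(F.year("sale_date").alias("year"))
   .agg(F.sum("revenue").alias("total_revenue"))
   .show())

spark.stop()
```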
The Hadoop ecosystem and the uses of its components

Hadoop is an ecosystem made up of many components. Its core components and their uses are:
1. Hadoop Distributed File System (HDFS): Hadoop's distributed file system, used to store large datasets. It is designed for high reliability and high throughput and runs on low-cost commodity hardware. Through streaming data access it gives applications high-throughput access to their data, which suits applications with large datasets.
2. MapReduce: Hadoop's distributed computing framework for parallel processing and analysis of large datasets. The MapReduce model breaks a processing job into a Map phase and a Reduce phase, so the data can be handled efficiently by a distributed, parallel environment of many machines (see the sketch after this list).
3. YARN: Hadoop's resource management and job scheduling system. It manages cluster resources, schedules tasks, and monitors applications.
4. Hive: a Hadoop-based data warehouse tool that provides a SQL-like query language and data warehouse features.
5. Kafka: a high-throughput distributed message queue for collecting and transporting real-time data streams.
6. Pig: a data analysis platform for large datasets, with a SQL-like query language and data transformation features.
7. Ambari: a Hadoop cluster management and monitoring tool with a visual interface and cluster configuration management.
In addition, HBase is a distributed column-oriented database that can be used together with Hadoop. Data kept in HBase can be processed with MapReduce, neatly combining data storage with parallel computation.
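To make the two-phase model concrete, here is a tiny single-process simulation of map, shuffle, and reduce (a word count over in-memory strings; no Hadoop involved):

```python
from collections import defaultdict

def map_phase(record):
    # Emit one (key, value) pair per word.
    for word in record.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group values by key; on a real cluster this is the network shuffle.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)

records = ["Hadoop stores data", "Hadoop processes data"]
pairs = [kv for r in records for kv in map_phase(r)]
results = [reduce_phase(k, vs) for k, vs in shuffle(pairs).items()]
print(sorted(results))
# [('data', 2), ('hadoop', 2), ('processes', 1), ('stores', 1)]
```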
Hadoop Big Data Technology Fundamentals (Python Edition)

With the continuous development of internet technology and the explosive growth of data volumes, big data has become one of the hottest topics in the internet industry. Hadoop, an open-source big data processing platform, is applied ever more widely in the field, and Python, a concise, readable, easy-to-learn programming language, plays an indispensable role in big data analysis and processing. This article introduces the fundamentals of Hadoop and, together with Python, examines its application to big data processing.
I. Hadoop fundamentals
1. Introduction: Hadoop is an open-source framework for storing and processing large-scale data; its main parts are the Hadoop Distributed File System (HDFS) and the MapReduce computation framework. HDFS stores the data, while MapReduce performs distributed processing over it.
2. The Hadoop ecosystem: beyond HDFS and MapReduce, the ecosystem includes many other components, such as HBase, Hive, Pig, and ZooKeeper. Together they form a complete big data platform that can meet all kinds of processing needs.
3. Hadoop clusters: Hadoop stores and processes data on clusters built from many servers. The compute nodes of a cluster jointly store and process the data, achieving distributed processing at scale.
II. Python in Hadoop data processing
1. Hadoop Streaming: Hadoop Streaming is a tool shipped with Hadoop for writing MapReduce programs in any language. Through Hadoop Streaming, users can write the Map and Reduce programs in Python and so process and analyze large-scale data.
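A streaming job reads from stdin and writes to stdout, so mapper and reducer scripts can be smoke-tested locally by recreating the map-sort-reduce pipeline before any cluster is involved. A sketch (the script names, input file, and streaming-jar path are assumptions; the jar's location varies by Hadoop version and installation):

```python
import subprocess

# Locally emulate: cat input.txt | mapper.py | sort | reducer.py
# This mirrors what Hadoop does, minus distribution and fault tolerance.
with open("input.txt", "rb") as f:
    mapped = subprocess.run(["python3", "mapper.py"], stdin=f,
                            capture_output=True, check=True).stdout
shuffled = subprocess.run(["sort"], input=mapped,
                          capture_output=True, check=True).stdout
reduced = subprocess.run(["python3", "reducer.py"], input=shuffled,
                         capture_output=True, check=True).stdout
print(reduced.decode())

# On a real cluster, the same scripts would be submitted roughly as:
#   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#     -input /data/in -output /data/out \
#     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
```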
2. Connecting Python to Hadoop: besides Hadoop Streaming, Python can connect to a Hadoop cluster through third-party libraries and interfaces, reading, storing, and computing over the data held in the cluster. This opens up many more possibilities for Python programmers working in big data.
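One such option is Apache Arrow's HDFS binding. A sketch, assuming the Hadoop native client (libhdfs) is installed and a NameNode is reachable at the made-up address below:

```python
from pyarrow import fs  # pip install pyarrow

# Connects through libhdfs (needs HADOOP_HOME and CLASSPATH configured).
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# List a directory and stream a file without copying it locally first.
for info in hdfs.get_file_info(fs.FileSelector("/data")):
    print(info.path, info.size)

with hdfs.open_input_stream("/data/sales.csv") as stream:
    head = stream.read(1024)  # first KB of the file
print(head[:80])
```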
HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Azza Abouzeid¹, Kamil Bajda-Pawlikowski¹, Daniel Abadi¹, Avi Silberschatz¹, Alexander Rasin²
¹Yale University, ²Brown University

ABSTRACT

The production environment for analytical data management applications is rapidly changing. Many enterprises are shifting away from deploying their analytical databases on high-end proprietary machines, and moving towards cheaper, lower-end, commodity hardware, typically arranged in a shared-nothing MPP architecture, often in a virtualized environment inside public or private "clouds". At the same time, the amount of data that needs to be analyzed is exploding, requiring hundreds to thousands of machines to work in parallel to perform the analysis.

There tend to be two schools of thought regarding what technology to use for data analysis in such an environment. Proponents of parallel databases argue that the strong emphasis on performance and efficiency of parallel databases makes them well-suited to perform such analysis. On the other hand, others argue that MapReduce-based systems are better suited due to their superior scalability, fault tolerance, and flexibility to handle unstructured data. In this paper, we explore the feasibility of building a hybrid system that takes the best features from both technologies; the prototype we built approaches parallel databases in performance and efficiency, yet still yields the scalability, fault tolerance, and flexibility of MapReduce-based systems.

1. INTRODUCTION

The analytical database market currently consists of $3.98 billion [25] of the $14.6 billion database software market [21] (27%) and is growing at a rate of 10.3% annually [25]. As business "best-practices" trend increasingly towards basing decisions off data and hard facts rather than instinct and theory, the corporate thirst for systems that can manage, process, and granularly analyze data is becoming insatiable. Venture capitalists are very much aware of this trend, and have funded no fewer than a dozen new companies in recent years that build specialized analytical data management software (e.g., Netezza, Vertica, DATAllegro, Greenplum, Aster Data, Infobright, Kickfire, Dataupia, ParAccel, and Exasol), and continue to fund them, even in pressing economic times [18].

At the same time, the amount of data that needs to be stored and processed by analytical database systems is exploding. This is partly due to the increased automation with which data can be produced (more business processes are becoming digitized), the proliferation of sensors and data-producing devices, Web-scale interactions with customers, and government compliance demands along with strategic corporate initiatives requiring more historical data to be kept online for analysis. It is no longer uncommon to hear of companies claiming to load more than a terabyte of structured data per day into their analytical database system and claiming data warehouses of size more than a petabyte [19].

Given the exploding data problem, all but three of the above-mentioned analytical database start-ups deploy their DBMS on a shared-nothing architecture (a collection of independent, possibly virtual, machines, each with local disk and local main memory, connected together on a high-speed network). This architecture is widely believed to scale the best [17], especially if one takes hardware cost into account. Furthermore, data analysis workloads tend to consist of many large scan operations, multidimensional aggregations, and star schema joins, all of which are fairly easy to parallelize across nodes in a shared-nothing network. Analytical DBMS vendor leader, Teradata, uses a shared-nothing architecture. Oracle and Microsoft have recently announced shared-nothing analytical DBMS products in their Exadata¹ and Madison projects, respectively. For the purposes of this paper, we will call analytical DBMS systems that deploy on a shared-nothing architecture parallel databases².

¹To be precise, Exadata is only shared-nothing in the storage layer.
²This is slightly different than textbook definitions of parallel databases, which sometimes include shared-memory and shared-disk architectures as well.

Parallel databases have been proven to scale really well into the tens of nodes (near-linear scalability is not uncommon). However, there are very few known parallel database deployments consisting of more than one hundred nodes, and to the best of our knowledge, there exists no published deployment of a parallel database with nodes numbering into the thousands. There are a variety of reasons why parallel databases generally do not scale well into the hundreds of nodes. First, failures become increasingly common as one adds more nodes to a system, yet parallel databases tend to be designed with the assumption that failures are a rare event. Second, parallel databases generally assume a homogeneous array of machines, yet it is nearly impossible to achieve pure homogeneity at scale. Third, until recently, there have only been a handful of applications that required deployment on more than a few dozen nodes for reasonable performance, so parallel databases have not been tested at larger scales, and unforeseen engineering hurdles await.

As the data that needs to be analyzed continues to grow, the number of applications that require more than one hundred nodes is beginning to multiply. Some argue that MapReduce-based systems [8] are best suited for performing analysis at this scale since they were designed from the beginning to scale to thousands of nodes in a shared-nothing architecture, and have had proven success in Google's internal operations and on the TeraSort benchmark [7]. Despite being originally designed for a largely different application (unstructured text data processing), MapReduce (or one of its publicly available incarnations such as open source Hadoop [1]) can nonetheless be used to process structured data, and can do so at tremendous scale. For example, Hadoop is being used to manage Facebook's 2.5 petabyte data warehouse [20].

Unfortunately, as pointed out by DeWitt and Stonebraker [9], MapReduce lacks many of the features that have proven invaluable for structured data analysis workloads (largely due to the fact that MapReduce was not originally designed to perform structured data analysis), and its immediate gratification paradigm precludes some of the long-term benefits of first modeling and loading data before processing. These shortcomings can cause an order of magnitude slower performance than parallel databases [23].

Ideally, the scalability advantages of MapReduce could be combined with the performance and efficiency advantages of parallel databases to achieve a hybrid system that is well suited for the analytical DBMS market and can handle the future demands of data intensive applications. In this paper, we describe our implementation of and experience with HadoopDB, whose goal is to serve as exactly such a hybrid system. The basic idea behind HadoopDB is to use MapReduce as the communication layer above multiple nodes running single-node DBMS instances. Queries are expressed in SQL, translated into MapReduce by extending existing tools, and as much work as possible is pushed into the higher performing single-node databases.

One of the advantages of MapReduce relative to parallel databases not mentioned above is cost. There exists an open source version of MapReduce (Hadoop) that can be obtained and used without cost. Yet all of the parallel databases mentioned above have a nontrivial cost, often coming with seven-figure price tags for large installations. Since it is our goal to combine all of the advantages of both data analysis approaches in our hybrid system, we decided to build our prototype completely out of open source components in order to achieve the cost advantage as well. Hence, we use PostgreSQL as the database layer and Hadoop as the communication layer, Hive as the translation layer, and all code we add we release as open source [2].

One side effect of such a design is a shared-nothing version of PostgreSQL. We are optimistic that our approach has the potential to help transform any single-node DBMS into a shared-nothing parallel database.

Given our focus on cheap, large scale data analysis, our target platform is virtualized public or private "cloud computing" deployments, such as Amazon's Elastic Compute Cloud (EC2) or VMware's private VDC-OS offering. Such deployments significantly reduce up-front capital costs, in addition to lowering operational, facilities, and hardware costs (through maximizing current hardware utilization). Public cloud offerings such as EC2 also yield tremendous economies of scale [14], and pass on some of these savings to the customer. All experiments we run in this paper are on Amazon's EC2 cloud offering; however, our techniques are applicable to non-virtualized cluster computing grid deployments as well.

In summary, the primary contributions of our work include:

- We extend previous work [23] that showed the superior performance of parallel databases relative to Hadoop. While this previous work focused only on performance in an ideal setting, we add fault tolerance and heterogeneous node experiments to demonstrate some of the issues with scaling parallel databases.
- We describe the design of a hybrid system that is designed to yield the advantages of both parallel databases and MapReduce. This system can also be used to allow single-node databases to run in a shared-nothing environment.
- We evaluate this hybrid system on a previously published benchmark to determine how close it comes to parallel DBMSs in performance and Hadoop in scalability.

2. RELATED WORK

There has been some recent work on bringing together ideas from MapReduce and database systems; however, this work focuses mainly on language and interface issues. The Pig project at Yahoo [22], the SCOPE project at Microsoft [6], and the open source Hive project [11] aim to integrate declarative query constructs from the database community into MapReduce-like software to allow greater data independence, code reusability, and automatic query optimization. Greenplum and Aster Data have added the ability to write MapReduce functions (instead of, or in addition to, SQL) over data stored in their parallel database products [16].

Although these five projects are without question an important step in the hybrid direction, there remains a need for a hybrid solution at the systems level in addition to at the language and interface levels. This paper focuses on such a systems-level hybrid.

3. DESIRED PROPERTIES

In this section we describe the desired properties of a system designed for performing data analysis at the (soon to be more common) petabyte scale. In the following section, we discuss how parallel database systems and MapReduce-based systems do not meet some subset of these desired properties.

Performance. Performance is the primary characteristic that commercial database systems use to distinguish themselves from other solutions, with marketing literature often filled with claims that a particular solution is many times faster than the competition. A factor of ten can make a big difference in the amount, quality, and depth of analysis a system can do.

High performance systems can also sometimes result in cost savings. Upgrading to a faster software product can allow a corporation to delay a costly hardware upgrade, or avoid buying additional compute nodes as an application continues to scale. On public cloud computing platforms, pricing is structured in a way such that one pays only for what one uses, so the vendor price increases linearly with the requisite storage, network bandwidth, and compute power. Hence, if data analysis software product A requires an order of magnitude more compute units than data analysis software product B to perform the same task, then product A will cost (approximately) an order of magnitude more than B. Efficient software has a direct effect on the bottom line.

Fault Tolerance. Fault tolerance in the context of analytical data workloads is measured differently than fault tolerance in the context of transactional workloads. For transactional workloads, a fault tolerant DBMS can recover from a failure without losing any data or updates from recently committed transactions, and in the context of distributed databases, can successfully commit transactions and make progress on a workload even in the face of worker node failures. For read-only queries in analytical workloads, there are neither write transactions to commit, nor updates to lose upon node failure. Hence, a fault tolerant analytical DBMS is simply one that does not have to restart a query if one of the nodes involved in query processing fails.

Given the proven operational benefits and resource consumption savings of using cheap, unreliable commodity hardware to build a shared-nothing cluster of machines, and the trend towards extremely low-end hardware in data centers [14], the probability of a node failure occurring during query processing is increasing rapidly. This problem only gets worse at scale: the larger the amount of data that needs to be accessed for analytical queries, the more nodes are required to participate in query processing. This further increases the probability of at least one node failing during query execution. Google, for example, reports an average of 1.2 failures per analysis job [8]. If a query must restart each time a node fails, then long, complex queries are difficult to complete.
Hence,if data analysis software product A requires an order of mag-nitude more compute units than data analysis software product B to perform the same task,then product A will cost(approximately) an order of magnitude more than B.Efficient software has a direct effect on the bottom line.Fault Tolerance.Fault tolerance in the context of analytical data workloads is measured differently than fault tolerance in the con-text of transactional workloads.For transactional workloads,a fault tolerant DBMS can recover from a failure without losing any data or updates from recently committed transactions,and in the con-text of distributed databases,can successfully commit transactions and make progress on a workload even in the face of worker node failures.For read-only queries in analytical workloads,there are neither write transactions to commit,nor updates to lose upon node failure.Hence,a fault tolerant analytical DBMS is simply one thatdoes not have to restart a query if one of the nodes involved in query processing fails.Given the proven operational benefits and resource consumption savings of using cheap,unreliable commodity hardware to build a shared-nothing cluster of machines,and the trend towards extremely low-end hardware in data centers[14],the probability of a node failure occurring during query processing is increasing rapidly.This problem only gets worse at scale:the larger the amount of data that needs to be accessed for analytical queries,the more nodes are required to participate in query processing.This further increases the probability of at least one node failing during query execution.Google,for example,reports an average of1.2 failures per analysis job[8].If a query must restart each time a node fails,then long,complex queries are difficult to complete. 
Ability to run in a heterogeneous environment.As described above,there is a strong trend towards increasing the number of nodes that participate in query execution.It is nearly impossible to get homogeneous performance across hundreds or thousands of compute nodes,even if each node runs on identical hardware or on an identical virtual machine.Part failures that do not cause com-plete node failure,but result in degraded hardware performance be-come more common at scale.Individual node disk fragmentation and software configuration errors can also cause degraded perfor-mance on some nodes.Concurrent queries(or,in some cases,con-current processes)further reduce the homogeneity of cluster perfor-mance.On virtualized machines,concurrent activities performed by different virtual machines located on the same physical machine can cause2-4%variation in performance[5].If the amount of work needed to execute a query is equally di-vided among the nodes in a shared-nothing cluster,then there is a danger that the time to complete the query will be approximately equal to time for the slowest compute node to complete its assigned task.A node with degraded performance would thus have a dis-proportionate effect on total query time.A system designed to run in a heterogeneous environment must take appropriate measures to prevent this from occurring.Flexible query interface.There are a variety of customer-facing business intelligence tools that work with database software and aid in the visualization,query generation,result dash-boarding,and advanced data analysis.These tools are an important part of the analytical data management picture since business analysts are of-ten not technically advanced and do not feel comfortable interfac-ing with the database software directly.Business Intelligence tools typically connect to databases using ODBC or JDBC,so databases that want to work with these tools must accept SQL queries through these interfaces.Ideally,the data analysis system should also have a robust mech-anism for allowing the user to write user defined functions(UDFs) and queries that utilize UDFs should automatically be parallelized across the processing nodes in the shared-nothing cluster.Thus, both SQL and non-SQL interface languages are desirable.4.BACKGROUND AND SHORTFALLS OFA V AILABLE APPROACHESIn this section,we give an overview of the parallel database and MapReduce approaches to performing data analysis,and list the properties described in Section3that each approach meets.4.1Parallel DBMSsParallel database systems stem from research performed in the late1980s and most current systems are designed similarly to the early Gamma[10]and Grace[12]parallel DBMS research projects.These systems all support standard relational tables and SQL,and implement many of the performance enhancing techniques devel-oped by the research community over the past few decades,in-cluding indexing,compression(and direct operation on compressed data),materialized views,result caching,and I/O sharing.Most (or even all)tables are partitioned over multiple nodes in a shared-nothing cluster;however,the mechanism by which data is parti-tioned is transparent to the end-user.Parallel databases use an op-timizer tailored for distributed workloads that turn SQL commands into a query plan whose execution is divided equally among multi-ple nodes.Of the desired properties of large scale data analysis workloads described in Section3,parallel databases best meet the“perfor-mance property”due to the performance push required to compete on 
the open market,and the ability to incorporate decades worth of performance tricks published in the database research commu-nity.Parallel databases can achieve especially high performance when administered by a highly skilled DBA who can carefully de-sign,deploy,tune,and maintain the system,but recent advances in automating these tasks and bundling the software into appliance (pre-tuned and pre-configured)offerings have given many parallel databases high performance out of the box.Parallel databases also score well on theflexible query interface property.Implementation of SQL and ODBC is generally a given, and many parallel databases allow UDFs(although the ability for the query planner and optimizer to parallelize UDFs well over a shared-nothing cluster varies across different implementations). However,parallel databases generally do not score well on the fault tolerance and ability to operate in a heterogeneous environ-ment properties.Although particular details of parallel database implementations vary,their historical assumptions that failures are rare events and“large”clusters mean dozens of nodes(instead of hundreds or thousands)have resulted in engineering decisions that make it difficult to achieve these properties.Furthermore,in some cases,there is a clear tradeoff between fault tolerance and performance,and parallel databases tend to choose the performance extreme of these tradeoffs.For example, frequent check-pointing of completed sub-tasks increase the fault tolerance of long-running read queries,yet this check-pointing reduces performance.In addition,pipelining intermediate results between query operators can improve performance,but can result in a large amount of work being lost upon a failure.4.2MapReduceMapReduce was introduced by Dean et.al.in2004[8]. Understanding the complete details of how MapReduce works is not a necessary prerequisite for understanding this paper.In short, MapReduce processes data distributed(and replicated)across many nodes in a shared-nothing cluster via three basic operations. 
First,a set of Map tasks are processed in parallel by each node in the cluster without communicating with other nodes.Next,data is repartitioned across all nodes of the cluster.Finally,a set of Reduce tasks are executed in parallel by each node on the partition it receives.This can be followed by an arbitrary number of additional Map-repartition-Reduce cycles as necessary.MapReduce does not create a detailed query execution plan that specifies which nodes will run which tasks in advance;instead,this is determined at runtime.This allows MapReduce to adjust to node failures and slow nodes on thefly by assigning more tasks to faster nodes and reassigning tasks from failed nodes.MapReduce also checkpoints the output of each Map task to local disk in order to minimize the amount of work that has to be redone upon a failure.Of the desired properties of large scale data analysis workloads,MapReduce best meets the fault tolerance and ability to operate in heterogeneous environment properties.It achieves fault tolerance by detecting and reassigning Map tasks of failed nodes to other nodes in the cluster(preferably nodes with replicas of the input Map data).It achieves the ability to operate in a heterogeneous environ-ment via redundant task execution.Tasks that are taking a long time to complete on slow nodes get redundantly executed on other nodes that have completed their assigned tasks.The time to complete the task becomes equal to the time for the fastest node to complete the redundantly executed task.By breaking tasks into small,granular tasks,the effect of faults and“straggler”nodes can be minimized. MapReduce has aflexible query interface;Map and Reduce func-tions are just arbitrary computations written in a general-purpose language.Therefore,it is possible for each task to do anything on its input,just as long as its output follows the conventions defined by the model.In general,most MapReduce-based systems(such as Hadoop,which directly implements the systems-level details of the MapReduce paper)do not accept declarative SQL.However,there are some exceptions(such as Hive).As shown in previous work,the biggest issue with MapReduce is performance[23].By not requiring the user tofirst model and load data before processing,many of the performance enhancing tools listed above that are used by database systems are not possible. Traditional business data analytical processing,that have standard reports and many repeated queries,is particularly,poorly suited for the one-time query processing model of MapReduce.Ideally,the fault tolerance and ability to operate in heterogeneous environment properties of MapReduce could be combined with the performance of parallel databases systems.In the following sec-tions,we will describe our attempt to build such a hybrid system.5.HADOOPDBIn this section,we describe the design of HadoopDB.The goal of this design is to achieve all of the properties described in Section3. 
The basic idea behind HadoopDB is to connect multiple single-node database systems using Hadoop as the task coordinator and network communication layer. Queries are parallelized across nodes using the MapReduce framework; however, as much of the single-node query work as possible is pushed inside of the corresponding node databases. HadoopDB achieves fault tolerance and the ability to operate in heterogeneous environments by inheriting the scheduling and job tracking implementation from Hadoop, yet it achieves the performance of parallel databases by doing much of the query processing inside of the database engine.

5.1 Hadoop Implementation Background

At the heart of HadoopDB is the Hadoop framework. Hadoop consists of two layers: (i) a data storage layer, the Hadoop Distributed File System (HDFS), and (ii) a data processing layer, the MapReduce Framework.

HDFS is a block-structured file system managed by a central NameNode. Individual files are broken into blocks of a fixed size and distributed across multiple DataNodes in the cluster. The NameNode maintains metadata about the size and location of blocks and their replicas.

The MapReduce Framework follows a simple master-slave architecture. The master is a single JobTracker and the slaves or worker nodes are TaskTrackers. The JobTracker handles the runtime scheduling of MapReduce jobs and maintains information on each TaskTracker's load and available resources. Each job is broken down into Map tasks, based on the number of data blocks that require processing, and Reduce tasks. The JobTracker assigns tasks to TaskTrackers based on locality and load balancing. It achieves locality by matching a TaskTracker to Map tasks that process data local to it. It load-balances by ensuring all available TaskTrackers are assigned tasks. TaskTrackers regularly update the JobTracker with their status through heartbeat messages.

[Figure 1: The Architecture of HadoopDB]

The InputFormat library represents the interface between the storage and processing layers. InputFormat implementations parse text/binary files (or connect to arbitrary data sources) and transform the data into key-value pairs that Map tasks can process. Hadoop provides several InputFormat implementations, including one that allows a single JDBC-compliant database to be accessed by all tasks in one job in a given cluster.

5.2 HadoopDB's Components

HadoopDB extends the Hadoop framework (see Fig. 1) by providing the following four components:

5.2.1 Database Connector

The Database Connector is the interface between independent database systems residing on nodes in the cluster and TaskTrackers. It extends Hadoop's InputFormat class and is part of the InputFormat Implementations library. Each MapReduce job supplies the Connector with an SQL query and connection parameters such as: which JDBC driver to use, query fetch size, and other query tuning parameters. The Connector connects to the database, executes the SQL query, and returns results as key-value pairs. The Connector could theoretically connect to any JDBC-compliant database that resides in the cluster. However, different databases require different read query optimizations. We implemented connectors for MySQL and PostgreSQL. In the future we plan to integrate other databases, including open-source column-store databases such as MonetDB and InfoBright. By extending Hadoop's InputFormat, we integrate seamlessly with Hadoop's MapReduce Framework. To the framework, the databases are data sources similar to data blocks in HDFS.
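The Connector's essential behavior (run the pushed-down SQL on the node-local database and hand each result tuple to Hadoop as a key-value pair) can be sketched in Python with psycopg2, though the real implementation is Java code extending InputFormat; the connection details and query below are placeholders:

```python
import psycopg2  # pip install psycopg2-binary

def connector_fetch(sql, fetch_size=1000):
    """Execute a pushed-down SQL query on the node-local database
    and yield its rows as (key, value) pairs, Connector-style."""
    conn = psycopg2.connect(host="localhost", dbname="chunk0",
                            user="hadoopdb", password="secret")
    cur = conn.cursor(name="connector")  # server-side cursor
    cur.itersize = fetch_size            # analogous to the JDBC fetch size
    cur.execute(sql)
    for row in cur:
        yield row[0], row[1:]            # first column as key, rest as value
    conn.close()

for key, value in connector_fetch(
        "SELECT EXTRACT(YEAR FROM saleDate), SUM(revenue) "
        "FROM sales GROUP BY EXTRACT(YEAR FROM saleDate)"):
    print(key, value)
```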
5.2.2 Catalog

The catalog maintains metainformation about the databases. This includes the following: (i) connection parameters such as database location, driver class, and credentials; (ii) metadata such as data sets contained in the cluster, replica locations, and data partitioning properties.

The current implementation of the HadoopDB catalog stores its metainformation as an XML file in HDFS. This file is accessed by the JobTracker and TaskTrackers to retrieve information necessary to schedule tasks and process data needed by a query. In the future, we plan to deploy the catalog as a separate service that would work in a way similar to Hadoop's NameNode.

5.2.3 Data Loader

The Data Loader is responsible for (i) globally repartitioning data on a given partition key upon loading, (ii) breaking apart single-node data into multiple smaller partitions or chunks, and (iii) finally bulk-loading the single-node databases with the chunks. The Data Loader consists of two main components: Global Hasher and Local Hasher. The Global Hasher executes a custom-made MapReduce job over Hadoop that reads in raw data files stored in HDFS and repartitions them into as many parts as the number of nodes in the cluster. The repartitioning job does not incur the sorting overhead of typical MapReduce jobs.

The Local Hasher then copies a partition from HDFS into the local file system of each node and secondarily partitions the file into smaller sized chunks based on the maximum chunk size setting.

The hashing functions used by the Global Hasher and the Local Hasher differ to ensure chunks are of a uniform size. They also differ from Hadoop's default hash-partitioning function to ensure better load balancing when executing MapReduce jobs over the data.

5.2.4 SQL to MapReduce to SQL (SMS) Planner

HadoopDB provides a parallel database front-end to data analysts, enabling them to process SQL queries. The SMS planner extends Hive [11]. Hive transforms HiveQL, a variant of SQL, into MapReduce jobs that connect to tables stored as files in HDFS. The MapReduce jobs consist of DAGs of relational operators (such as filter, select (project), join, aggregation) that operate as iterators: each operator forwards a data tuple to the next operator after processing it. Since each table is stored as a separate file in HDFS, Hive assumes no collocation of tables on nodes. Therefore, operations that involve multiple tables usually require most of the processing to occur in the Reduce phase of a MapReduce job. This assumption does not completely hold in HadoopDB, as some tables are collocated and, if partitioned on the same attribute, the join operation can be pushed entirely into the database layer.

To understand how we extended Hive for SMS as well as the differences between Hive and SMS, we first describe how Hive creates an executable MapReduce job for a simple GroupBy-Aggregation query. Then, we describe how we modify the execution plan for HadoopDB by pushing most of the query processing logic into the database layer. Consider the following query:

SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY YEAR(saleDate);

Hive processes the above SQL query in a series of phases:

(1) The parser transforms the query into an Abstract Syntax Tree.

(2) The Semantic Analyzer connects to Hive's internal catalog, the MetaStore, to retrieve the schema of the sales table. It also populates different data structures with meta information such as the Deserializer and InputFormat classes required to scan the table and extract the necessary fields.

(3) The logical plan generator then creates a DAG of relational operators, the query plan.

(4) The optimizer restructures the query plan to create a more optimized plan. For example, it pushes filter operators closer to the table scan operators. A key function of the optimizer is to break up the plan into Map or Reduce phases. In particular, it adds a Repartition operator, also known as a Reduce Sink operator, before Join or GroupBy operators. These operators mark the Map and Reduce phases of a query plan. The Hive optimizer is a simple, naïve, rule-based optimizer. It does not use cost-based optimization techniques. Therefore, it does not always generate efficient query plans. This is another advantage of pushing as much as possible of the query processing logic into DBMSs that have more sophisticated, adaptive or cost-based optimizers.

[Figure 2: (a) MapReduce job generated by Hive; (b) MapReduce job generated by SMS assuming sales is partitioned by YEAR(saleDate) (this feature is still unsupported); (c) MapReduce job generated by SMS assuming no partitioning of sales.]

(5) Finally, the physical plan generator converts the logical query plan into a physical plan executable by one or more MapReduce jobs. The first and every other Reduce Sink operator marks a transition from a Map phase to a Reduce phase of a MapReduce job, and the remaining Reduce Sink operators mark the start of new MapReduce jobs. The above SQL query results in a single MapReduce job with the physical query plan illustrated in Fig. 2(a). The boxes stand for the operators and the arrows represent the flow of data.

(6) Each DAG enclosed within a MapReduce job is serialized into an XML plan. The Hive driver then executes a Hadoop job. The job reads the XML plan and creates all the necessary operator objects that scan data from a table in HDFS, and parse and process one tuple at a time.

The SMS planner modifies Hive. In particular, we intercept the normal Hive flow in two main areas:

(i) Before any query execution, we update the MetaStore with references to our database tables. Hive allows tables to exist externally, outside HDFS. The HadoopDB catalog (Section 5.2.2) provides information about the table schemas and required Deserializer and InputFormat classes to the MetaStore. We implemented these specialized classes.

(ii) After the physical query plan generation and before the execution of the MapReduce jobs, we perform two passes over the physical plan. In the first pass, we retrieve data fields that are actually processed by the plan and we determine the partitioning keys used by the Reduce Sink (Repartition) operators. In the second pass, we traverse the DAG bottom-up from table scan operators to the output or File Sink operator. All operators until the first repartition operator with a partitioning key different from the database's key are converted into one or more SQL queries and pushed into the database layer. SMS uses a rule-based SQL generator to recreate SQL from the relational operators. The query processing logic that could be pushed into the database layer ranges from none (each