COMP5349 2014 Semester 1, Week 7, Lecture 07: Dataflow Systems
Outline
- Motivation
- UC Berkeley BDAS
- The Stratosphere
Today's Cloud Analytics Stack
- … mostly focused on large, on-disk datasets: great for batch, but slow
COMP5349 "Cloud Computing" - 2014 (U. Röhm) 07-3
[Figure: the Berkeley Data Analytics Stack (BDAS), bottom-up: Infrastructure / resource management (Mesos, YARN Resource Manager), Storage (HDFS, Tachyon), Data Processing (Spark, Hadoop, Storm), and Applications (Spark Streaming, Shark, BlinkDB, Hive, Pig, Spark Graph, MLbase). Slide by Ion Stoica, UCB, 2013]
Example: Log Mining
- Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to three workers, each holding an HDFS block (Block 1-3); each worker caches its partition of the messages (Cache 1-3) and returns results. Base RDD -> transformed RDD -> cached RDD; the counts run as parallel operations.]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
[slide by Zaharia et al., UCB amplab]
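The same pipeline can be mimicked on plain Scala collections (no Spark, no cluster, no caching; the log lines below are made up) just to show what each transformation computes:

```scala
// Toy stand-in for the Spark log-mining pipeline: filter error lines,
// project out the message field (3rd tab-separated column), then query.
val lines = Seq("ERROR\tfs\tdisk full",
                "INFO\tok\tall good",
                "ERROR\tnet\ttimeout")
val errors   = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))      // "disk full", "timeout"
val fooCount = messages.count(_.contains("disk")) // interactive query: 1
```

On a cluster, `cache()` is what makes the repeated queries fast: the filtered, projected messages stay in memory across queries instead of being re-read from HDFS.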
RDD Fault Tolerance
- RDDs maintain lineage information that can be used to reconstruct lost partitions
- Example:
cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()

HdfsRDD (path: hdfs://…) -> FilteredRDD (func: contains(...)) -> MappedRDD (func: split(…)) -> CachedRDD
[slide by Zaharia et al., UCB amplab]
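The lineage idea can be sketched in a few lines of plain Scala: each node remembers its parent and the function to reapply, so a lost cached result can be rebuilt from the base data. This is a toy illustration, not Spark's actual classes:

```scala
// Toy lineage chain: BaseRDD -> FilteredRDD -> MappedRDD -> CachedRDD.
// Losing the cache is harmless because compute() can replay the lineage.
trait ToyRDD[T] { def compute(): Seq[T] }

case class BaseRDD[T](data: Seq[T]) extends ToyRDD[T] {
  def compute(): Seq[T] = data
}
case class FilteredRDD[T](parent: ToyRDD[T], p: T => Boolean) extends ToyRDD[T] {
  def compute(): Seq[T] = parent.compute().filter(p)
}
case class MappedRDD[A, B](parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
  def compute(): Seq[B] = parent.compute().map(f)
}
class CachedRDD[T](parent: ToyRDD[T]) extends ToyRDD[T] {
  private var cache: Option[Seq[T]] = None
  def compute(): Seq[T] = cache.getOrElse {
    val v = parent.compute(); cache = Some(v); v  // recompute via lineage
  }
  def evict(): Unit = cache = None                // simulate a lost partition
}

val logLines = BaseRDD(Seq("ERROR\tdisk\tfull", "INFO\tok\tfine"))
val cachedMsgs = new CachedRDD(
  MappedRDD(FilteredRDD(logLines, (_: String).contains("ERROR")),
            (_: String).split('\t')(2)))
cachedMsgs.compute()  // materialises the cache: Seq("full")
cachedMsgs.evict()    // "lose" the cached partition
cachedMsgs.compute()  // rebuilt from lineage, same result
```

The key design point carries over to real Spark: lineage is tiny (just the transformation graph), so fault tolerance costs almost nothing until a partition is actually lost.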
Example: Logistic Regression Code

val data = spark.textFile(...).map(readPoint).cache()  // load data in memory once

var w = Vector.random(D)                               // initial parameter vector

for (i <- 1 to ITERATIONS) {                           // repeated MapReduce steps
  val gradient = data.map(p =>                         // to do gradient descent
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

Further iterations: 1 s
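The per-point gradient term from the slide can be checked with a small plain-Scala sketch (no Spark; scalar features instead of vectors, and the toy data set is made up):

```scala
import scala.math.exp

// One data point: scalar feature x, label y in {-1, +1}.
case class Point(x: Double, y: Double)

val data = Seq(Point(1.0, 1.0), Point(2.0, 1.0), Point(-1.0, -1.0))
var w = 0.0  // initial parameter (Vector.random(D) on the slide)

// Same per-point gradient term as the slide, with .sum playing
// the role of reduce(_ + _):
for (i <- 1 to 100) {
  val gradient = data.map { p =>
    (1.0 / (1.0 + exp(-p.y * (w * p.x))) - 1.0) * p.y * p.x
  }.sum
  w -= gradient
}
// After training, sign(w * x) agrees with y for every point.
```

The reason Spark shines here is the `cache()` on the first line of the slide: the same `data` RDD is scanned in memory on every iteration, whereas a Hadoop MapReduce job would re-read it from disk each time.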
Common Data Types (predefined)
CoGroup Operator
- Operator with two inputs.
- Requires the specification of a record key for each input.
- Records of both inputs are grouped on their key; groups with matching keys are handed together to the user function.
- Application: outer join
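The CoGroup semantics can be sketched in plain Scala (a toy groupBy-based stand-in, not the Stratosphere API; all names below are hypothetical):

```scala
// Toy CoGroup: group both inputs by key, then hand each key with its
// (possibly empty) group from either side to the user function -- the
// empty-group case is exactly what makes an outer join easy to express.
def coGroup[K, A, B, R](left: Seq[(K, A)], right: Seq[(K, B)])
                       (udf: (K, Seq[A], Seq[B]) => R): Seq[R] = {
  val l = left.groupBy(_._1).map { case (k, v) => k -> v.map(_._2) }
  val r = right.groupBy(_._1).map { case (k, v) => k -> v.map(_._2) }
  (l.keySet ++ r.keySet).toSeq
    .map(k => udf(k, l.getOrElse(k, Seq()), r.getOrElse(k, Seq())))
}

// Full outer join as a CoGroup user function: pad the missing side
// with None instead of dropping the record.
val users  = Seq((1, "alice"), (2, "bob"))
val orders = Seq((2, "book"), (3, "pen"))
val joined = coGroup(users, orders) { (k, us, os) =>
  (k,
   if (us.isEmpty) None else Some(us.head),
   if (os.isEmpty) None else Some(os.head))
}.sortBy(_._1)
// key 1 has no order, key 3 has no user -- both survive the join
```

A plain join operator would silently drop keys present on only one side; because CoGroup always calls the user function once per key, the user code decides what to do with unmatched groups.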