COMP5349 2014 Semester 1, Week 7, Lecture 07: Dataflow Systems
Outline
- Motivation
- UC Berkeley BDAS
- The Stratosphere
Today's Cloud Analytics Stack
- … mostly focused on large, on-disk datasets: great for batch, but slow
COMP5349 "Cloud Computing" - 2014 (U. Röhm) 07-3
[Figure: the Berkeley Data Analytics Stack (BDAS), bottom-up: Infrastructure / resource management (Mesos, YARN Resource Manager), Storage (HDFS, Tachyon), Data Processing (Spark, Hadoop, Storm), and Applications (Spark Streaming, Shark, BlinkDB, Hive, Pig, Spark Graph, MLbase). Slide by Ion Stoica, UCB, 2013]
Example: Log Mining
- Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to three workers, each holding an HDFS block (Block 1-3); each worker caches its partition of the messages (Cache 1-3) and returns results. Base RDD -> transformed RDD -> cached RDD; the counts run as parallel operations.]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
[slide by Zaharia et al., UCB amplab]
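The same pipeline can be mimicked on plain Scala collections (no Spark, no cluster, no caching; the log lines below are made up) just to show what each transformation computes:

```scala
// Toy stand-in for the Spark log-mining pipeline: filter error lines,
// project out the message field (3rd tab-separated column), then query.
val lines = Seq("ERROR\tfs\tdisk full",
                "INFO\tok\tall good",
                "ERROR\tnet\ttimeout")
val errors   = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))      // "disk full", "timeout"
val fooCount = messages.count(_.contains("disk")) // interactive query: 1
```

On a cluster, `cache()` is what makes the repeated queries fast: the filtered, projected messages stay in memory across queries instead of being re-read from HDFS.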
RDD Fault Tolerance
- RDDs maintain lineage information that can be used to reconstruct lost partitions
- Example:
cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()

HdfsRDD (path: hdfs://…) -> FilteredRDD (func: contains(...)) -> MappedRDD (func: split(…)) -> CachedRDD
[slide by Zaharia et al., UCB amplab]
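The lineage idea can be sketched in a few lines of plain Scala: each node remembers its parent and the function to reapply, so a lost cached result can be rebuilt from the base data. This is a toy illustration, not Spark's actual classes:

```scala
// Toy lineage chain: BaseRDD -> FilteredRDD -> MappedRDD -> CachedRDD.
// Losing the cache is harmless because compute() can replay the lineage.
trait ToyRDD[T] { def compute(): Seq[T] }

case class BaseRDD[T](data: Seq[T]) extends ToyRDD[T] {
  def compute(): Seq[T] = data
}
case class FilteredRDD[T](parent: ToyRDD[T], p: T => Boolean) extends ToyRDD[T] {
  def compute(): Seq[T] = parent.compute().filter(p)
}
case class MappedRDD[A, B](parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
  def compute(): Seq[B] = parent.compute().map(f)
}
class CachedRDD[T](parent: ToyRDD[T]) extends ToyRDD[T] {
  private var cache: Option[Seq[T]] = None
  def compute(): Seq[T] = cache.getOrElse {
    val v = parent.compute(); cache = Some(v); v  // recompute via lineage
  }
  def evict(): Unit = cache = None                // simulate a lost partition
}

val logLines = BaseRDD(Seq("ERROR\tdisk\tfull", "INFO\tok\tfine"))
val cachedMsgs = new CachedRDD(
  MappedRDD(FilteredRDD(logLines, (_: String).contains("ERROR")),
            (_: String).split('\t')(2)))
cachedMsgs.compute()  // materialises the cache: Seq("full")
cachedMsgs.evict()    // "lose" the cached partition
cachedMsgs.compute()  // rebuilt from lineage, same result
```

The key design point carries over to real Spark: lineage is tiny (just the transformation graph), so fault tolerance costs almost nothing until a partition is actually lost.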
Example: Logistic Regression Code

val data = spark.textFile(...).map(readPoint).cache()  // load data in memory once

var w = Vector.random(D)                               // initial parameter vector

for (i <- 1 to ITERATIONS) {                           // repeated MapReduce steps
  val gradient = data.map(p =>                         // to do gradient descent
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

Further iterations: 1 s
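The per-point gradient term from the slide can be checked with a small plain-Scala sketch (no Spark; scalar features instead of vectors, and the toy data set is made up):

```scala
import scala.math.exp

// One data point: scalar feature x, label y in {-1, +1}.
case class Point(x: Double, y: Double)

val data = Seq(Point(1.0, 1.0), Point(2.0, 1.0), Point(-1.0, -1.0))
var w = 0.0  // initial parameter (Vector.random(D) on the slide)

// Same per-point gradient term as the slide, with .sum playing
// the role of reduce(_ + _):
for (i <- 1 to 100) {
  val gradient = data.map { p =>
    (1.0 / (1.0 + exp(-p.y * (w * p.x))) - 1.0) * p.y * p.x
  }.sum
  w -= gradient
}
// After training, sign(w * x) agrees with y for every point.
```

The reason Spark shines here is the `cache()` on the first line of the slide: the same `data` RDD is scanned in memory on every iteration, whereas a Hadoop MapReduce job would re-read it from disk each time.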
Common Data Types (predefined)
CoGroup Operator
- Operator with two inputs.
- Requires the specification of a record key for each input.
- Records of both inputs are grouped on their key; groups with matching keys are handed together to the user function.
- Application: outer join
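The CoGroup semantics can be sketched in plain Scala (a toy groupBy-based stand-in, not the Stratosphere API; all names below are hypothetical):

```scala
// Toy CoGroup: group both inputs by key, then hand each key with its
// (possibly empty) group from either side to the user function -- the
// empty-group case is exactly what makes an outer join easy to express.
def coGroup[K, A, B, R](left: Seq[(K, A)], right: Seq[(K, B)])
                       (udf: (K, Seq[A], Seq[B]) => R): Seq[R] = {
  val l = left.groupBy(_._1).map { case (k, v) => k -> v.map(_._2) }
  val r = right.groupBy(_._1).map { case (k, v) => k -> v.map(_._2) }
  (l.keySet ++ r.keySet).toSeq
    .map(k => udf(k, l.getOrElse(k, Seq()), r.getOrElse(k, Seq())))
}

// Full outer join as a CoGroup user function: pad the missing side
// with None instead of dropping the record.
val users  = Seq((1, "alice"), (2, "bob"))
val orders = Seq((2, "book"), (3, "pen"))
val joined = coGroup(users, orders) { (k, us, os) =>
  (k,
   if (us.isEmpty) None else Some(us.head),
   if (os.isEmpty) None else Some(os.head))
}.sortBy(_._1)
// key 1 has no order, key 3 has no user -- both survive the join
```

A plain join operator would silently drop keys present on only one side; because CoGroup always calls the user function once per key, the user code decides what to do with unmatched groups.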