
Distributed Computing with Spark. Reza Zadeh. Thanks to Matei Zaharia. Tweet with #fearthespark

Problem: Data growing faster than processing speeds. Only solution is to parallelize on large clusters » Wide use in both enterprises and web industry. How do we program these things?

Outline Data flow vs. traditional network programming Limitations of MapReduce Spark computing engine Machine Learning Example Current State of Spark Ecosystem

Data flow vs. traditional network programming

Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale: » How to split problem across nodes? Must consider network & data locality » How to deal with failures? (inevitable at scale) » Even worse: stragglers (node not failed, but slow) » Ethernet networking not fast » Have to write programs for each machine Rarely used in commodity datacenters

Data Flow Models: Restrict the programming interface so that the system can do more automatically. Express jobs as graphs of high-level operators » System picks how to split each operator into tasks and where to run each task » Run parts twice for fault recovery. Biggest example: MapReduce [diagram: Map tasks feeding Reduce tasks]

Example MapReduce Algorithms Matrix-vector multiplication Power iteration (e.g. PageRank) Gradient descent methods Stochastic SVD Tall skinny QR Many others!
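To make the data-flow shape of these concrete, here is a minimal, illustrative sketch of the first one, matrix-vector multiplication, written with just the map and reduce operators. It is shown in Spark-style Scala (the engine introduced later in the talk); the tiny input and variable names are made up for the example, and a SparkContext sc is assumed (e.g. in spark-shell).

// Sparse matrix A as (row, col, value) triples; small dense vector v broadcast to all tasks
val A = sc.parallelize(Seq((0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)))
val v = sc.broadcast(Array(1.0, 4.0))

// map: emit one partial product per nonzero entry, keyed by row
// reduce: sum the partial products for each row, giving (A * v)(i)
val Av = A.map { case (i, j, aij) => (i, aij * v.value(j)) }
          .reduceByKey(_ + _)

Av.collect()   // Array((0, 6.0), (1, 12.0))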

Why Use a Data Flow Engine? Ease of programming » High-level functions instead of message passing Wide deployment » More common than MPI, especially “near” data Scalability to very largest clusters » Even HPC world is now concerned about resilience Examples: Pig, Hive, Scalding, Storm

Limitations of MapReduce

Limitations of MapReduce MapReduce is great at one-pass computation, but inefficient for multi-pass algorithms No efficient primitives for data sharing » State between steps goes to distributed file system » Slow due to replication & disk storage

Example: Iterative Apps. [diagram: iterative jobs read from and write to the file system between every iteration (Input → iter. 1 → iter. 2 → ...), and interactive queries each re-read the input from the file system (query 1 → result 1, query 2 → result 2, ...)]. Commonly spend 90% of time doing I/O.

Example: PageRank. Repeatedly multiply sparse matrix and vector. Requires repeatedly hashing together page adjacency lists and rank vector: the same file is grouped over and over. [diagram: Neighbors (id, edges) and Ranks (id, rank) re-joined at iteration 1, iteration 2, iteration 3, ...]
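A minimal sketch of this computation in Spark, shown here to contrast with the MapReduce version (the toy graph, names and iteration count are illustrative; Spark itself is introduced in the next section). Because the adjacency lists are partitioned once and cached in memory, the repeated join does not have to re-hash and re-ship the same file every iteration:

import org.apache.spark.HashPartitioner

// links: (pageId, list of neighbor ids), partitioned once and kept in memory
val links = sc.parallelize(Seq(
    ("a", Seq("b", "c")), ("b", Seq("c")), ("c", Seq("a"))))
  .partitionBy(new HashPartitioner(8))
  .cache()

var ranks = links.mapValues(_ => 1.0)

for (i <- 1 to 10) {
  // each page sends rank / out-degree to its neighbors; contributions are summed per page
  val contribs = links.join(ranks).values.flatMap {
    case (neighbors, rank) => neighbors.map(n => (n, rank / neighbors.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(v => 0.15 + 0.85 * v)
}

ranks.collect()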

Result While MapReduce is simple, it can require asymptotically more communication or I/O

Spark computing engine

Spark Computing Engine Extends a programming language with a distributed collection data-structure » “Resilient distributed datasets” (RDD) Open source at Apache » Most active community in big data, with 50 companies contributing Clean APIs in Java, Scala, Python, R

Resilient Distributed Datasets (RDDs)
Main idea: Resilient Distributed Datasets
» Immutable collections of objects, spread across cluster
» Statically typed: RDD[T] has objects of type T

val sc = new SparkContext()
val lines = sc.textFile("log.txt")            // RDD[String]

// Transform using standard collection operations
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))   // lazily evaluated

messages.saveAsTextFile("errors.txt")         // kicks off a computation

Key Idea: Resilient Distributed Datasets (RDDs) » Collections of objects across a cluster with user-controlled partitioning & storage (memory, disk, ...) » Built via parallel transformations (map, filter, ...) » The world only lets you make RDDs such that they can be: automatically rebuilt on failure

Python, Java, Scala, R

// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

// Java (better in Java 8!):
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("error");
  }
}).count();

Fault Tolerance
RDDs track lineage info to rebuild lost data

file.map(lambda rec: (rec.type, 1))
    .reduceByKey(lambda x, y: x + y)
    .filter(lambda (type, count): count > 10)

[lineage: Input file → map → reduce → filter]
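A small, hypothetical way to see this lineage (in Scala rather than the Python above; the file and key layout are made up): every RDD can print the chain of transformations Spark would replay to rebuild lost partitions.

val counts = sc.textFile("log.txt")
  .map(line => (line.split("\t")(0), 1))
  .reduceByKey(_ + _)
  .filter { case (_, count) => count > 10 }

// Prints the recorded lineage (the chain textFile -> map -> reduceByKey -> filter)
println(counts.toDebugString)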


Partitioning
RDDs know their partitioning functions

file.map(lambda rec: (rec.type, 1))
    .reduceByKey(lambda x, y: x + y)           # known to be hash-partitioned
    .filter(lambda (type, count): count > 10)  # partitioning also known

[Input file → map → reduce → filter]
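A hypothetical sketch of the same idea in Scala: the partitioner is an explicit, inspectable property of an RDD, and operations that cannot move keys (like filter) keep it.

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("ERROR", 1), ("WARN", 1), ("ERROR", 1)))

// reduceByKey with an explicit partitioner: output is hash-partitioned into 4 parts
val counts = pairs.reduceByKey(new HashPartitioner(4), _ + _)
counts.partitioner     // Some(HashPartitioner) - Spark knows the layout

// filter does not move records across partitions, so the partitioner is preserved
val frequent = counts.filter { case (_, c) => c > 1 }
frequent.partitioner   // still Some(HashPartitioner)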

Machine Learning example

Logistic Regression

data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient

print "Final w: %s" % w

Logistic Regression Results: 100 GB of data on 50 m1.xlarge EC2 machines. Hadoop: 110 s / iteration. Spark: first iteration 80 s, further iterations 1 s. [chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30)]

Behavior with Less RAM: [chart: iteration time (s) vs. % of working set in memory: 0% → 68.8 s, 25% → 58.1 s, 50% → 40.7 s, 75% → 29.7 s, 100% → 11.5 s]

Benefit for Users: Same engine performs data extraction, model training and interactive queries. [diagram: with separate engines, each stage (parse, train, query) does its own DFS read and DFS write; with Spark, a single DFS read feeds parse, train and query in one engine]

State of the Spark ecosystem

Spark Community: Most active open source community in big data. 200 developers, 50 companies contributing. [chart: contributors in past year, Spark vs. Giraph and Storm]

Project Activity: [charts: commits and lines of code changed over the past 6 months, Spark vs. HDFS, Storm, YARN, MapReduce]

Continuing Growth: [chart: contributors per month to Spark. Source: ohloh.net]

Built-in libraries

Standard Library for Big Data: Big data apps lack libraries of common algorithms. Spark's generality and support for multiple languages make it suitable to offer this. [diagram: SQL, ML and graph libraries on a common core, accessible from Python, Scala, Java and R]. Much of future activity will be in these libraries.

A General Platform: Standard libraries included with Spark: Spark SQL (structured data), Spark Streaming (real-time), GraphX (graph), MLlib (machine learning), all built on Spark Core.

Machine Learning Library (MLlib)
40 contributors in past year

points = context.sql("select latitude, longitude from tweets")
model = KMeans.train(points, 10)

MLlib algorithms
classification: logistic regression, linear SVM, naïve Bayes, classification tree
regression: generalized linear models (GLMs), regression tree
collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
clustering: k-means
decomposition: SVD, PCA
optimization: stochastic gradient descent, L-BFGS
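A minimal MLlib usage sketch (assuming the Spark 1.x MLlib Scala API and a toy, made-up dataset): the hand-rolled gradient loop from the earlier logistic regression example becomes a single library call.

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Toy training set: LabeledPoint(label, feature vector)
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(2.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(1.5, 0.8)),
  LabeledPoint(0.0, Vectors.dense(-1.0, -0.5)),
  LabeledPoint(0.0, Vectors.dense(-0.7, -1.2))))

// Train with stochastic gradient descent for 20 iterations
val model = LogisticRegressionWithSGD.train(training, 20)

model.predict(Vectors.dense(1.0, 0.5))   // predicted class label (0.0 or 1.0)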

GraphX

GraphX: General graph processing library. Build graph using RDDs of nodes and edges. Large library of graph algorithms with composable steps.
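A brief, illustrative sketch of building a graph from two RDDs and composing one of the built-in algorithm steps (GraphX Scala API; the three-node graph is made up):

import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (Long id, attribute) pairs; edges carry a source id, destination id and attribute
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

val graph = Graph(vertices, edges)

// One of the built-in algorithm steps: PageRank, run to a convergence tolerance of 0.001
val ranks = graph.pageRank(0.001).vertices
ranks.collect()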

GraphX Algorithms
Collaborative Filtering » Alternating Least Squares » Stochastic Gradient Descent » Tensor Factorization
Structured Prediction » Loopy Belief Propagation » Max-Product Linear Programs » Gibbs Sampling
Semi-supervised ML » Graph SSL » CoEM
Community Detection » Triangle Counting » K-core Decomposition » K-Truss
Graph Analytics » PageRank » Personalized PageRank » Shortest Path » Graph Coloring
Classification » Neural Networks

Spark Streaming: Run a streaming computation as a series of very small, deterministic batch jobs. Chop up the live stream into batches of X seconds. Spark treats each batch of data as RDDs and processes them using RDD operations. Finally, the processed results of the RDD operations are returned in batches. [diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

Spark Streaming: Run a streaming computation as a series of very small, deterministic batch jobs. Batch sizes as low as ½ second, latency about 1 second. Potential for combining batch processing and streaming processing in the same system. [diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
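A minimal Spark Streaming sketch in Scala (the socket source and 1-second batch size are illustrative): each batch of the stream becomes an RDD and is processed with the same RDD-style operations.

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Reuse the existing SparkContext sc; chop the stream into 1-second batches
val ssc = new StreamingContext(sc, Seconds(1))

// Text lines arriving on a socket; each batch is treated as an RDD of lines
val lines = ssc.socketTextStream("localhost", 9999)

val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

counts.print()         // emit the per-batch results

ssc.start()            // start receiving and processing
ssc.awaitTermination()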

Spark SQL

// Run SQL statements
val teenagers = context.sql(
  "SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are RDDs of Row objects
val names = teenagers.map(t => "Name: " + t(0)).collect()
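For context on where a table like people above could come from, a hypothetical sketch using the Spark 1.x Scala API (file path, case class and schema are made up for the example):

case class Person(name: String, age: Int)

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx.createSchemaRDD   // lets an RDD of case classes be used as a table

// Each line of people.txt is "name,age"
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTable("people")

sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19").collect()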

Spark SQL: Enables loading & querying structured data in Spark

From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()

From JSON:
c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")

tweets.json:
{"text": "hi", "user": {"name": "matei", "id": 123}}

Conclusions

Spark and Research Spark has all its roots in research, so we hope to keep incorporating new ideas!

Conclusion Data flow engines are becoming an important platform for numerical algorithms While early models like MapReduce were inefficient, new ones like Spark close this gap More info: spark.apache.org
