
A New Parallel Framework for Machine Learning. Joseph Gonzalez. Joint work with Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Alex Smola, Guy Blelloch, Joe Hellerstein, and David O’Hallaron. Carnegie Mellon

How will we design and implement parallel learning systems? Carnegie Mellon

We could use ... Threads, Locks, & Messages: "low-level parallel primitives". Carnegie Mellon

Threads, Locks, and Messages. Graduate students and ML experts repeatedly solve the same parallel design challenges: implement and debug a complex parallel system, tune for a specific parallel platform. Two months later the conference paper contains: "We implemented ______ in parallel." The resulting code: is difficult to maintain, is difficult to extend, and couples the learning model to the parallel implementation. 6

... a better answer: Map-Reduce / Hadoop. Build learning algorithms on top of high-level parallel abstractions. Carnegie Mellon

MapReduce – Map Phase. [Diagram: data records distributed across CPU 1–4.] Embarrassingly parallel: independent computation, no communication needed. 8

MapReduce – Map Phase. [Diagram: each CPU independently computes image features for its records.] 9

MapReduce – Map Phase. [Diagram: CPUs continue processing independent records.] Embarrassingly parallel: independent computation, no communication needed. 10

MapReduce – Reduce Phase. [Diagram: CPUs aggregate the image features into "attractive face" statistics and "ugly face" statistics.] 11

Map-Reduce for Data-Parallel ML. Excellent for large data-parallel tasks! Data-Parallel (Map Reduce): feature extraction, cross validation, computing sufficient statistics. Graph-Parallel: Lasso, label propagation, belief propagation, kernel methods, tensor factorization, deep belief networks, PageRank, neural networks. Is there more to Machine Learning? 12

Concrete Example Label Propagation Carnegie Mellon

Label Propagation Algorithm. Social Arithmetic: I like 60% cameras, 40% biking, computed as 50% × (what I list on my profile: 50% cameras, 50% biking) + 40% × (what Sue Ann likes: 80% cameras, 20% biking) + 10% × (what Carlos likes: 30% cameras, 70% biking). Recurrence Algorithm: Likes[i] ← Σ_{j ∈ Friends[i]} W_ij × Likes[j]; iterate until convergence. Parallelism: compute all Likes[i] in parallel.
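To make the recurrence concrete, here is a minimal Python sketch of one synchronous sweep, followed by the social-arithmetic numbers from the slide. The function and variable names are illustrative, not code from the talk.

```python
# Illustrative sketch of the Likes[i] recurrence; names are hypothetical.
def label_prop_sweep(likes, weights, friends):
    """One synchronous sweep: every vertex i recomputes Likes[i] as the
    weighted mixture of Likes[j] over j in Friends[i]."""
    new_likes = {}
    for i in friends:
        mixed = {}
        for j in friends[i]:
            for interest, p in likes[j].items():
                mixed[interest] = mixed.get(interest, 0.0) + weights[i][j] * p
        new_likes[i] = mixed
    return new_likes

# The social-arithmetic example from the slide (the profile acts as a self-edge):
likes = {
    "me":      {"cameras": 0.5, "biking": 0.5},   # what I list on my profile
    "sue_ann": {"cameras": 0.8, "biking": 0.2},
    "carlos":  {"cameras": 0.3, "biking": 0.7},
}
weights = {"me": {"me": 0.5, "sue_ann": 0.4, "carlos": 0.1}}
friends = {"me": ["me", "sue_ann", "carlos"]}
print(label_prop_sweep(likes, weights, friends)["me"])  # cameras ~= 0.60, biking ~= 0.40
```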

Properties of Graph-Parallel Algorithms: dependency graph, factored computation, iterative computation (what I like depends on what my friends like).

Map-Reduce for Data-Parallel ML. Excellent for large data-parallel tasks! Data-Parallel (Map Reduce): feature extraction, cross validation, computing sufficient statistics. Graph-Parallel (Map Reduce?): Lasso, label propagation, belief propagation, kernel methods, tensor factorization, deep belief networks, PageRank, neural networks. 16

Why not use Map-Reduce for Graph Parallel Algorithms? Carnegie Mellon

Data Dependencies. Map-Reduce does not efficiently express dependent data; it assumes independent rows of data. The user must code substantial data transformations, and data replication is costly.

Iterative Algorithms. Map-Reduce does not efficiently express iterative algorithms. [Diagram: each iteration re-reads and re-processes all data across the CPUs, with a barrier at the end of every iteration; a single slow processor stalls each barrier.]

MapAbuse: Iterative MapReduce. Only a subset of the data needs computation in each iteration, yet every iteration reprocesses all of the data and synchronizes at a barrier.

MapAbuse: Iterative MapReduce. The system is not optimized for iteration: every iteration pays a startup penalty and a disk penalty.

Map-Reduce for Data-Parallel ML. Excellent for large data-parallel tasks! Data-Parallel (Map Reduce): feature extraction, cross validation, computing sufficient statistics. Graph-Parallel (Map Reduce? Pregel (Giraph)?): SVM, Lasso, kernel methods, tensor factorization, deep belief networks, belief propagation, PageRank, neural networks. 22

Pregel (Giraph). Bulk Synchronous Parallel Model: Compute → Communicate → Barrier.
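To make the model concrete, here is a minimal single-process Python sketch of the compute / communicate / barrier pattern; this is illustrative only and is not the Pregel or Giraph API.

```python
# Illustrative Bulk Synchronous Parallel loop: compute, communicate, barrier.
def bsp_run(vertices, compute, max_supersteps=30):
    """vertices: dict vid -> state.
    compute(vid, state, inbox) -> (new_state, outgoing), where outgoing is a
    list of (dst_vid, message) pairs."""
    inboxes = {vid: [] for vid in vertices}
    for _ in range(max_supersteps):
        outboxes = {vid: [] for vid in vertices}
        # Compute phase: every vertex runs in every superstep.
        for vid, state in vertices.items():
            new_state, outgoing = compute(vid, state, inboxes[vid])
            vertices[vid] = new_state
            for dst, msg in outgoing:
                outboxes[dst].append(msg)
        # Communicate + barrier: messages only become visible next superstep.
        inboxes = outboxes
    return vertices
```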

Problem Bulk synchronous computation can be highly inefficient. Example: Loopy Belief Propagation 25

Loopy Belief Propagation (Loopy BP). Iteratively estimate the "beliefs" about vertices: read in messages, update the marginal estimate (belief), send updated outgoing messages. Repeat for all variables until convergence. 26
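The per-vertex work can be sketched as a generic sum-product update that reads incoming messages, updates the belief, and recomputes outgoing messages. This is a textbook-style sketch, not the implementation used in the talk; all names are illustrative.

```python
import numpy as np

def loopy_bp_update(i, node_pot, edge_pot, neighbors, messages):
    """Recompute the belief at vertex i and its outgoing messages (sum-product).
    messages[(k, i)]: current message from k to i, a vector over the states of i.
    edge_pot[(i, j)]: pairwise potential, shape (#states of i, #states of j)."""
    # Read in messages and update the marginal estimate (belief).
    belief = node_pot[i].copy()
    for k in neighbors[i]:
        belief *= messages[(k, i)]
    belief /= belief.sum()

    # Send updated outgoing messages (exclude the recipient's own message).
    new_messages = {}
    for j in neighbors[i]:
        cavity = node_pot[i].copy()
        for k in neighbors[i]:
            if k != j:
                cavity *= messages[(k, i)]
        m = edge_pot[(i, j)].T @ cavity          # marginalize over states of i
        new_messages[(i, j)] = m / m.sum()       # normalize for stability
    return belief, new_messages
```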

Bulk Synchronous Loopy BP. Often considered embarrassingly parallel: associate a processor with each vertex, receive all messages, update all beliefs, send all messages. Proposed by Brunton et al. CRV'06, Mendiburu et al. GECC'07, Kang et al. LDMTA'10. 27

Sequential Computational Structure 28

Hidden Sequential Structure 29

Hidden Sequential Structure (evidence enters at both ends of the chain). Running Time = (time for a single parallel iteration) × (number of iterations). 30

Optimal Sequential Algorithm. Running time: Bulk Synchronous: 2n²/p with p ≤ 2n processors; Forward-Backward (sequential): 2n with p = 1; Optimal Parallel: n with p = 2. There is a large gap between bulk synchronous and the optimal parallel schedule. 31

The Splash Operation. Generalizes the optimal chain algorithm to arbitrary cyclic graphs: 1) grow a BFS spanning tree with fixed size, 2) forward pass computing all messages at each vertex, 3) backward pass computing all messages at each vertex. 32
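A minimal sketch of those three steps, assuming a `send_messages(v)` callback such as the BP update sketched earlier; this simplifies the Splash operation described in the paper.

```python
from collections import deque

def splash(root, neighbors, splash_size, send_messages):
    """Grow a bounded BFS tree from `root`, then do a forward pass
    (root to leaves) and a backward pass (leaves back to root), recomputing
    all messages at each visited vertex via the send_messages(v) callback."""
    order, visited, frontier = [], {root}, deque([root])
    # 1) Grow a BFS spanning tree with fixed size.
    while frontier and len(order) < splash_size:
        v = frontier.popleft()
        order.append(v)
        for u in neighbors[v]:
            if u not in visited:
                visited.add(u)
                frontier.append(u)
    # 2) Forward pass: update messages from the root outward.
    for v in order:
        send_messages(v)
    # 3) Backward pass: update messages from the leaves back to the root.
    for v in reversed(order):
        send_messages(v)
```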

Data-Parallel Algorithms can be Inefficient. [Plot: runtime in seconds (0 to 9000) vs. number of CPUs (1 to 8), comparing optimized in-memory bulk synchronous BP with asynchronous Splash BP.] The limitations of the Map-Reduce abstraction can lead to inefficient parallel algorithms.

The Need for a New Abstraction. Map-Reduce is not well suited for graph-parallelism. Data-Parallel (Map Reduce): feature extraction, cross validation, computing sufficient statistics. Graph-Parallel (Pregel (Giraph)): SVM, kernel methods, tensor factorization, deep belief networks, belief propagation, PageRank, neural networks, Lasso. 34

What is GraphLab? Carnegie Mellon

The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model. 36

Data Graph. A graph with arbitrary data (C++ objects) associated with each vertex and edge. Example graph: social network. Vertex data: user profile text, current interest estimates. Edge data: similarity weights. 37

Implementing the Data Graph.
Multicore setting (in memory): relatively straightforward. Lookups: vertex_data(vid) → data, edge_data(vid, vid) → data, neighbors(vid) → vid list. Challenge: fast lookup, low overhead. Solution: dense data structures, fixed Vdata & Edata types, immutable graph structure.
Cluster setting (in memory): partition the graph with ParMETIS or random cuts; keep cached "ghost" copies of boundary vertices on each node.
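A toy shared-memory sketch along these lines (dense adjacency structures, immutable graph structure, mutable per-vertex and per-edge data). It only illustrates the vertex data / edge data / neighbors lookups named above and is not the GraphLab C++ implementation.

```python
class DataGraph:
    """Immutable graph structure with mutable per-vertex / per-edge data,
    stored in dense adjacency lists for fast, low-overhead lookup."""
    def __init__(self, num_vertices, edges, vdata_init, edata_init):
        self._adj = [[] for _ in range(num_vertices)]      # vid -> [(nbr, eid), ...]
        self._vdata = [vdata_init() for _ in range(num_vertices)]
        self._edata = [edata_init() for _ in edges]
        self._eindex = {}                                   # (src, dst) -> eid
        for eid, (src, dst) in enumerate(edges):
            self._adj[src].append((dst, eid))
            self._adj[dst].append((src, eid))
            self._eindex[(src, dst)] = self._eindex[(dst, src)] = eid

    def vertex_data(self, vid):
        return self._vdata[vid]

    def edge_data(self, src, dst):
        return self._edata[self._eindex[(src, dst)]]

    def neighbors(self, vid):
        return [nbr for nbr, _ in self._adj[vid]]
```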

The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model. 39

Update Functions. An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

label_prop(i, scope) {
  // Get neighborhood data
  (Likes[i], W_ij, Likes[j]) ← scope;

  // Update the vertex
  Likes[i] ← Σ_{j ∈ Friends[i]} W_ij × Likes[j];

  // Reschedule neighbors if needed
  if Likes[i] changes then
    reschedule_neighbors_of(i);
} 40
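The same pseudocode rendered as runnable Python against the toy data-graph sketch above, assuming a scheduler object that exposes an add_task(vid, fn) method (a matching sketch appears with the scheduler slides below). This is illustrative, not the GraphLab API.

```python
def label_prop_update(vid, graph, scheduler, tolerance=1e-3):
    """Recompute Likes[vid] from the data in the vertex's scope, then
    reschedule the neighbors only if the estimate actually moved."""
    old = dict(graph.vertex_data(vid)["likes"])
    new = {}
    for j in graph.neighbors(vid):
        w = graph.edge_data(vid, j)["weight"]              # W_ij
        for interest, p in graph.vertex_data(j)["likes"].items():
            new[interest] = new.get(interest, 0.0) + w * p
    graph.vertex_data(vid)["likes"] = new
    # Dynamic computation: only wake up the neighborhood when needed.
    changed = any(abs(new.get(k, 0.0) - old.get(k, 0.0)) > tolerance
                  for k in set(new) | set(old))
    if changed:
        for j in graph.neighbors(vid):
            scheduler.add_task(j, label_prop_update)
```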

The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model. 41

The Scheduler. The scheduler determines the order in which vertices are updated: CPUs repeatedly pull the next vertex task from the scheduler and apply its update function. The process repeats until the scheduler is empty. 42
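A single-threaded sketch of that loop, compatible with the update-function sketch above. A real GraphLab scheduler feeds many CPUs concurrently and supports priority and other orderings; this FIFO version only shows the contract.

```python
from collections import deque

class FifoScheduler:
    """A task is (vertex id, update function); tasks run in FIFO order and an
    executing update may add new tasks."""
    def __init__(self):
        self._queue = deque()
        self._pending = set()            # avoid queueing the same task twice

    def add_task(self, vid, update_fn):
        if (vid, update_fn) not in self._pending:
            self._pending.add((vid, update_fn))
            self._queue.append((vid, update_fn))

    def run(self, graph):
        # The process repeats until the scheduler is empty.
        while self._queue:
            vid, update_fn = self._queue.popleft()
            self._pending.discard((vid, update_fn))
            update_fn(vid, graph, self)  # the update may reschedule neighbors
```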

Implementing the Schedulers.
Multicore setting (challenging!): per-CPU work queues giving approximate FIFO/priority ordering, with random placement and work stealing, built on fine-grained locking and atomic operations.
Cluster setting: a multicore scheduler on each node schedules only "local" vertices; nodes exchange update functions (e.g., f(v1), f(v2)) for remote vertices.

The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model. 45

Ensuring Race-Free Code How much can computation overlap?

Importance of Consistency. Many algorithms require strict consistency, or perform significantly better under strict consistency. [Plot: Alternating Least Squares error (RMSE) vs. number of iterations; consistent updates reach much lower error than inconsistent updates.]

Importance of Consistency. Machine learning algorithms require "model debugging": a Build → Test → Debug → Tweak loop over the model.

GraphLab Ensures Sequential Consistency. For each parallel execution, there exists a sequential execution of update functions which produces the same result. [Diagram: a parallel execution across CPU 1 and CPU 2 over time, and an equivalent sequential execution on a single CPU.] 49

Consistency Rules. Guaranteed sequential consistency for all update functions. 51

Full Consistency 52

Obtaining More Parallelism 53

Edge Consistency. [Diagram: the scopes of CPU 1 and CPU 2 overlap only in a read, so the concurrent updates are safe.] 54

Consistency Through R/W Locks. Read/write locks: Full Consistency takes write locks on the center vertex and its entire scope; Edge Consistency takes a write lock on the center vertex and read locks on its neighbors. Locks are acquired in a canonical lock ordering.
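An illustrative sketch of the locking protocol: one lock per vertex, always acquired in canonical (sorted id) order so overlapping scopes cannot deadlock. Plain mutexes stand in for the reader/writer locks (write on the center vertex, read on neighbors) that an edge-consistent implementation would use.

```python
import threading

class ScopeLocks:
    """Toy sketch of scope locking for the consistency models; not GraphLab code."""
    def __init__(self, num_vertices):
        self._locks = [threading.Lock() for _ in range(num_vertices)]

    def lock_scope(self, vid, neighbor_ids):
        scope = sorted(set(neighbor_ids) | {vid})   # canonical lock ordering
        for v in scope:
            self._locks[v].acquire()
        return scope

    def unlock_scope(self, scope):
        for v in reversed(scope):
            self._locks[v].release()
```

An update function would call lock_scope(vid, graph.neighbors(vid)) before touching its scope and unlock_scope afterwards; the canonical ordering is what prevents two overlapping scopes from deadlocking.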

Consistency Through R/W Locks. Multicore setting: pthread R/W locks. Distributed setting: distributed locking over the partitioned data graph; prefetch locks and data, and allow computation to proceed while locks/data are in flight (a lock pipeline).

Consistency Through Scheduling. Edge Consistency Model: two vertices can be updated simultaneously if they do not share an edge. Graph Coloring: two vertices can be assigned the same color if they do not share an edge. Execute one color at a time: Phase 1 → Barrier → Phase 2 → Barrier → Phase 3 → Barrier.
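A minimal sketch of this schedule: greedily color the graph, then execute the vertices one color at a time with a barrier between phases. The coloring heuristic and function names are illustrative.

```python
def greedy_color(neighbors):
    """Greedy graph coloring: two vertices that share an edge never receive the
    same color, so all vertices of one color can run in parallel under the
    edge consistency model."""
    color = {}
    for v in sorted(neighbors, key=lambda u: -len(neighbors[u])):  # high degree first
        taken = {color[u] for u in neighbors[v] if u in color}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color

def run_by_color(neighbors, update):
    """Execute one color class per phase, with an (implicit) barrier between phases."""
    color = greedy_color(neighbors)
    for c in sorted(set(color.values())):
        for v in [u for u in color if color[u] == c]:
            update(v)        # these updates are safe to run in parallel
        # barrier between phases
```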

The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model. 58

Algorithms Implemented: PageRank, Loopy Belief Propagation, Gibbs Sampling, CoEM, Graphical Model Parameter Learning, Probabilistic Matrix/Tensor Factorization, Alternating Least Squares, Lasso with Sparse Features, Support Vector Machines with Sparse Features, Label Propagation.

Shared Memory Experiments Shared Memory Setting 16 Core Workstation Carnegie Mellon 60

Loopy Belief Propagation: 3D retinal image denoising. Data Graph: 1 million vertices, 3 million edges. Update Function: Loopy BP update equation. Scheduler: approximate priority. Consistency Model: edge consistency. 61

Loopy Belief Propagation. [Plot: speedup vs. number of CPUs (up to 16); SplashBP approaches the optimal line.] 15.5x speedup. 62

CoEM (Rosie Jones, 2005): Named Entity Recognition task. Is "Dog" an animal? Is "Catalina" a place? Bipartite graph of noun phrases ("the dog", "Australia", "Catalina Island") and contexts ("X ran quickly", "travelled to X", "X is pleasant"). Vertices: 2 million. Edges: 200 million. Hadoop: 95 cores, 7.5 hrs.

CoEM (Rosie Jones, 2005). [Plot: speedup vs. number of CPUs (up to 16); GraphLab CoEM approaches the optimal line.] Hadoop: 95 cores, 7.5 hrs. GraphLab: 16 cores, 30 min. 6x fewer CPUs, 15x faster! 64

Experiments Amazon EC2 High-Performance Nodes Carnegie Mellon 65

Video Cosegmentation: group segments that "mean the same" across frames, using Gaussian EM clustering and BP on a 3D grid. Model: 10.5 million nodes, 31 million edges.

Video Coseg. Speedups

Prefetching Data & Locks

Matrix Factorization: Netflix collaborative filtering via Alternating Least Squares matrix factorization. Model: 0.5 million nodes, 99 million edges. [Diagram: bipartite graph of users and movies with latent dimension d.]
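For reference, a hedged sketch of one ALS half-sweep: with the movie factors held fixed, each user's d-dimensional factor is the solution of a small regularized least-squares problem over the movies that user rated, so every user can be updated in parallel (its graph neighbors are only movie vertices). Names and the regularization constant are illustrative, not the talk's configuration.

```python
import numpy as np

def als_update_users(ratings, movie_factors, d, reg=0.1):
    """One ALS half-sweep: solve a d x d ridge-regression system per user.
    ratings[user] is a list of (movie_id, rating); movie_factors[movie_id]
    is a fixed length-d numpy vector."""
    user_factors = {}
    for user, rated in ratings.items():
        A = reg * np.eye(d)
        b = np.zeros(d)
        for movie, r in rated:
            v = movie_factors[movie]
            A += np.outer(v, v)      # accumulate normal equations
            b += r * v
        user_factors[user] = np.linalg.solve(A, b)
    return user_factors
```

The symmetric half-sweep (updating movies with user factors fixed) alternates with this one until the RMSE stops improving.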

Netflix Speedup: increasing the size of the matrix factorization. [Plot: speedup vs. #nodes (up to 64) against ideal speedup, for d = 5 (44.85 IPB), d = 20 (48.72 IPB), d = 50 (85.68 IPB), and d = 100 (159.91 IPB).]

The Cost of Hadoop. [Log-log plots: cost ($) vs. runtime (s), and cost ($) vs. error (RMSE), comparing Hadoop and GraphLab at D = 5, 20, 50, 100.]

Summary. An abstraction tailored to machine learning: targets graph-parallel algorithms; naturally expresses data/computational dependencies and dynamic, iterative computation; simplifies parallel algorithm design; automatically ensures data consistency; achieves state-of-the-art parallel performance on a variety of problems. 72

Check out GraphLab: documentation, code, and tutorials at http://graphlab.org. Questions & feedback: [email protected]. Carnegie Mellon 73

Current/Future Work: out-of-core storage; Hadoop/HDFS integration (graph construction, graph storage, launching GraphLab from Hadoop, fault tolerance through HDFS checkpoints); sub-scope parallelism to address the challenge of very high-degree nodes; improved graph partitioning; support for dynamic graph structure.
