Introduction to Hadoop
Owen O’Malley
Yahoo!, Grid Team
owen@yahoo-inc.com



Problem

How do you scale up applications?
– Run jobs processing hundreds of terabytes of data
– It takes 11 days to read that much on one computer
Need lots of cheap computers
– Fixes the speed problem (15 minutes on 1,000 computers), but
– Creates reliability problems:
  – In large clusters, computers fail every day
  – Cluster size is not fixed
Need common infrastructure
– Must be efficient and reliable
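The arithmetic behind those numbers, assuming a single disk scans at roughly 100 MB/s (my assumption for 2008-era hardware, not the slide's):

  100 TB ÷ 100 MB/s ≈ 10^6 seconds ≈ 11.6 days on one machine
  10^6 seconds ÷ 1,000 machines ≈ 1,000 seconds ≈ 17 minutes in parallel

which is the order of the quoted 11 days and 15 minutes.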

Solution

Open source Apache project
Hadoop Core includes:
– Distributed File System: distributes data
– Map/Reduce: distributes application
Written in Java
Runs on:
– Linux, Mac OS/X, Windows, and Solaris
– Commodity hardware

Commodity Hardware Cluster

Typically a 2-level architecture
– Nodes are commodity PCs
– 40 nodes/rack
– Uplink from rack is 8 gigabit
– Rack-internal is 1 gigabit
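Worth spelling out from those numbers: 40 nodes at 1 gigabit each can offer up to 40 gigabits of traffic inside a rack, but only 8 gigabits can leave it, roughly 5:1 oversubscription. That asymmetry is why the framework works to schedule computation near its data (the locality optimizations described later).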

Distributed File System

Single namespace for the entire cluster
– Managed by a single namenode
– Files are single-writer and append-only
– Optimized for streaming reads of large files
Files are broken into large blocks
– Typically 128 MB
– Replicated to several datanodes for reliability
Clients talk to both the namenode and the datanodes
– Data is not sent through the namenode
– File system throughput scales nearly linearly with the number of nodes
Access from Java, C, or the command line (Java sketch below)
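A minimal sketch of the Java access path using the 0.18-era FileSystem API; the file name is hypothetical. The client asks the namenode where the blocks live, then streams the bytes directly from the datanodes:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsCat {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();   // reads hadoop-site.xml
      FileSystem fs = FileSystem.get(conf);       // connects to the namenode
      // Hypothetical path, for illustration only.
      FSDataInputStream in = fs.open(new Path("/user/owen/input.txt"));
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) > 0) {            // bytes come from datanodes
        System.out.write(buf, 0, n);
      }
      in.close();
    }
  }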

Block Placement

Default is 3 replicas, but settable
Blocks are placed (writes are pipelined):
– First replica on the same node as the writer
– Second on a node in a different rack
– Third on another node in that other rack
Clients read from the closest replica
If the replication for a block drops below target, it is automatically re-replicated
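A small sketch of how the replication target is controlled, again with a hypothetical path: dfs.replication sets the default for new files, and FileSystem.setReplication overrides it per file (the namenode re-replicates or drops copies in the background):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SetReplication {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      conf.setInt("dfs.replication", 3);          // default for new files
      FileSystem fs = FileSystem.get(conf);
      // Hypothetical hot file: extra replicas let more tasks read locally.
      fs.setReplication(new Path("/data/shared-dictionary"), (short) 5);
    }
  }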

Data Correctness

Data is checked with CRC32
File creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksums
File access
– Client retrieves the data and checksums from the DataNode
– If validation fails, the client tries other replicas
Periodic validation
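To make the checksum granularity concrete, here is an illustrative Java computation of one CRC32 per 512-byte chunk; it mirrors the scheme described above but is not HDFS's actual code:

  import java.util.zip.CRC32;

  public class ChunkChecksums {
    // One CRC32 per 512-byte chunk, the granularity HDFS verifies at.
    static long[] checksums(byte[] data) {
      int chunks = (data.length + 511) / 512;
      long[] sums = new long[chunks];
      CRC32 crc = new CRC32();
      for (int i = 0; i < chunks; i++) {
        crc.reset();
        int off = i * 512;
        crc.update(data, off, Math.min(512, data.length - off));
        sums[i] = crc.getValue();  // stored alongside the data on a datanode
      }
      return sums;
    }
  }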

Map/Reduce

Map/Reduce is a programming model for efficient distributed computing
It works like a Unix pipeline:
– cat input | grep | sort | uniq -c | cat > output
– Input | Map | Shuffle & Sort | Reduce | Output
Efficiency from
– Streaming through data, reducing seeks
– Pipelining
A good fit for a lot of applications
– Log processing
– Web index building
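The canonical word count makes those stages concrete: grep corresponds to the map, uniq -c to the reduce. A sketch using the old org.apache.hadoop.mapred API that was current in 2008:

  import java.io.IOException;
  import java.util.Iterator;
  import java.util.StringTokenizer;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class WordCount {
    // Map: emit (word, 1) for every token in the line.
    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private final Text word = new Text();
      public void map(LongWritable key, Text value,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }

    // Reduce: sum the counts for each word (the shuffle groups by key).
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(Map.class);
      conf.setReducerClass(Reduce.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }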

Map/Reduce Dataflow

[diagram]

Map/Reduce features

Java and C APIs
– The Java API presents records as objects; the C API presents raw bytes
Each task can process data sets larger than RAM
Automatic re-execution on failure
– In a large cluster, some nodes are always slow or flaky
– The framework re-executes failed tasks (settings sketched below)
Locality optimizations
– Map/Reduce queries HDFS for the locations of input data
– Map tasks are scheduled close to their inputs when possible
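A sketch of the old-API knobs behind the re-execution behavior; the values shown are illustrative, not recommendations:

  import org.apache.hadoop.mapred.JobConf;

  public class FailureSettings {
    public static void main(String[] args) {
      JobConf conf = new JobConf();
      conf.setMaxMapAttempts(4);           // re-run a failed map up to 4 times
      conf.setMaxReduceAttempts(4);        // likewise for reduces
      conf.setSpeculativeExecution(true);  // run backup copies of slow tasks
    }
  }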

How is Yahoo! using Hadoop?

We started with building better applications
– Scale up web-scale batch applications (search, ads, …)
– Factor out common code from existing systems, so new applications will be easier to write
– Manage the many clusters we have more easily
The mission now includes research support
– Build a huge data warehouse with many Yahoo! data sets
– Couple it with a huge compute cluster and programming models to make using the data easy
– Provide this as a service to our researchers
– We are seeing great results! Experiments can be run much more quickly in this environment

Running Production WebMap

Search needs a graph of the “known” web
– Invert edges, compute link text, whole-graph heuristics
Periodic batch job using Map/Reduce
– Uses a chain of 100 map/reduce jobs
Scale
– 1 trillion edges in the graph
– Largest shuffle is 450 TB
– Final output is 300 TB, compressed
– Runs on 10,000 cores
– Raw disk used: 5 PB
Written mostly using Hadoop’s C interface

Research Clusters

The grid team runs the research clusters as a service to Yahoo! researchers
Mostly data mining / machine learning jobs
Most research jobs are *not* Java:
– 42% Streaming: uses Unix text processing to define the map and reduce
– 28% Pig: a higher-level dataflow scripting language
– 28% Java
– 2% C

NY Times

Needed offline conversion of public-domain articles from 1851–1922
Used Hadoop to convert scanned images to PDF
Ran 100 Amazon EC2 instances for around 24 hours
4 TB of input
1.5 TB of output
[Image caption: published 1892, copyright New York Times]

Terabyte Sort Benchmark

Started by Jim Gray at Microsoft in 1998
Sorting 10 billion 100-byte records
Hadoop won the general category in 209 seconds
– 910 nodes
– 2 quad-core Xeons @ 2.0 GHz per node
– 4 SATA disks per node
– 8 GB RAM per node
– 1 gigabit Ethernet per node
– 40 nodes per rack
– 8 gigabit Ethernet uplink per rack
Previous record was 297 seconds
The only hard parts were:
– Getting a total order (see the partitioner sketch below)
– Converting the data generator to map/reduce
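“Getting a total order” means the partitioner must hand each reduce a contiguous key range rather than a hash bucket, so that concatenating the reduce outputs yields one globally sorted file. A simplified sketch in the old mapred API, assuming uniformly distributed keys so that the first byte alone can pick the range (the real benchmark samples the input to choose cut points):

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class RangePartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) { }

    public int getPartition(Text key, Text value, int numPartitions) {
      // Range partitioning: every key in partition i sorts before every
      // key in partition i+1, unlike hash partitioning.
      int firstByte = key.getBytes()[0] & 0xff;  // assumes non-empty keys
      return firstByte * numPartitions / 256;
    }
  }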

Hadoop clusters

We have 20,000 machines running Hadoop
Our largest clusters are currently 2,000 nodes
Several petabytes of user data (compressed, unreplicated)
We run hundreds of thousands of jobs every month

Research Cluster Usage

[chart]

Hadoop Community

Apache is focused on project communities
– Users
– Contributors write patches
– Committers can commit patches too
– The Project Management Committee votes on new committers and releases
Apache is a meritocracy
Use, contribution, and diversity are growing
– But we need and want more!

Size of Releases

[chart]

Who Uses Hadoop?

– Amazon/A9
– AOL
– Facebook
– Fox Interactive Media
– Google / IBM
– New York Times
– Powerset (now Microsoft)
– Quantcast
– Rackspace/Mailtrust
– Veoh
– Yahoo!
More at http://wiki.apache.org/hadoop/PoweredBy

What’s Next?

Better scheduling
– Pluggable scheduler
– Queues for controlling resource allocation between groups
Splitting Core into sub-projects
– HDFS, Map/Reduce, Hive
Total-order sampler and partitioner
Table store library
HDFS and Map/Reduce security
High availability via ZooKeeper
Get ready for Hadoop 1.0

Q&A

For more information:
– Website: http://hadoop.apache.org/core
– Mailing lists: [email protected], [email protected]
– IRC: #hadoop on irc.freenode.org
