Tango: distributed data structures over a shared log


Tango: distributed data structures over a shared log

Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, Aviad Zuck

Microsoft Research

big metadata

design pattern: distribute data, centralize metadata
- schedulers, allocators, coordinators, namespaces, indices (e.g. HDFS namenode, SDN controller)
- usual plan: harden the centralized service later

"Coordinator failures will be handled safely using the ZooKeeper service [14]." (Fast Crash Recovery in RAMCloud, Ongaro et al., SOSP 2011)

"Efforts are also underway to address high availability of a YARN cluster by having passive/active failover of RM to a standby node." (Apache Hadoop YARN: Yet Another Resource Negotiator, Vavilapalli et al., SoCC 2013)

"However, adequate resilience can be achieved by applying standard replication techniques to the decision element." (NOX: Towards an Operating System for Networks, Gude et al., SIGCOMM CCR 2008)

but hardening is difficult!

the abstraction gap for metadata

centralized metadata services are built using in-memory data structures (e.g. Java / C# Collections)
- state resides in maps, trees, queues, counters, graphs
- transactional access to data structures
- example: a scheduler atomically moves a node from a free list to an allocation map

adding high availability requires different abstractions
- move state to an external service like ZooKeeper
- restructure code to use state machine replication
- implement custom replication protocols

the Tango abstraction

a Tango object is a view (an in-memory data structure) plus a history (its ordered updates in the shared log)

1. Tango objects are easy to use
2. Tango objects are easy to build
3. Tango objects are fast and scalable

the shared log is the source of
- persistence
- availability
- elasticity
- atomicity and isolation across multiple objects

no messages between clients: only appends to and reads of the shared log! (see the sketch below)
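To make the view/history split concrete, here is a minimal sketch of how a runtime can rebuild a view by playing the log forward; the SharedLog, Entry, and View names are illustrative assumptions, not Tango's actual API.

    // toy model: the history is a vector of entries, the view is derived state
    #include <cstdint>
    #include <string>
    #include <vector>

    struct Entry {
        int oid;              // which object this update belongs to
        std::string update;   // opaque, object-defined payload
    };

    struct SharedLog {
        std::vector<Entry> entries;                      // the history
        uint64_t tail() const { return entries.size(); }
    };

    class View {
    public:
        // apply, in log order, every entry that belongs to this object;
        // any client that does this reaches the same state, with no
        // client-to-client messages
        void sync(const SharedLog& log, int my_oid) {
            for (; pos_ < log.tail(); ++pos_) {
                const Entry& e = log.entries[pos_];
                if (e.oid == my_oid) state_.push_back(e.update);
            }
        }
    private:
        std::vector<std::string> state_;  // the view: rebuildable from the log
        uint64_t pos_ = 0;                // how much history has been applied
    };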

Tango objects are easy to use

implement standard interfaces (Java/C# Collections)
linearizability for single operations

example:

    curowner = ownermap.get("ledger");
    if (curowner.equals(myname))
        ledger.add(item);

under the hood: the runtime turns these operations into appends to and reads of the shared log

Tango objects are easy to use

implement standard interfaces (Java/C# Collections)
linearizability for single operations
serializable transactions

example:

    TR.BeginTX();
    curowner = ownermap.get("ledger");
    if (curowner.equals(myname))
        ledger.add(item);
    status = TR.EndTX();

under the hood: the TX commits if the read set (ownermap) has not changed in the conflict window

TX commit record:
- read-set: (ownermap, ver:2)
- write-set: (ledger, ver:6)

speculative commit records: each client decides if the TX commits or aborts independently but deterministically [similar to Hyder (Bernstein et al., CIDR 2011)]
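A minimal sketch of that deterministic decision, assuming a hypothetical record layout (oid-to-version maps); every client evaluates the same check against the versions it has played up to the record's log position, so all clients reach the same verdict without coordination.

    #include <cstdint>
    #include <map>

    struct TxCommitRecord {
        std::map<int, uint64_t> read_set;   // oid -> version the TX observed
        std::map<int, uint64_t> write_set;  // oid -> version the TX installs
    };

    // commit iff nothing in the read set changed between the TX's reads and
    // the commit record's position in the log (the conflict window)
    bool tx_commits(const TxCommitRecord& rec,
                    const std::map<int, uint64_t>& version_at_record) {
        for (const auto& [oid, observed] : rec.read_set) {
            auto it = version_at_record.find(oid);
            if (it == version_at_record.end() || it->second != observed)
                return false;   // e.g. (ownermap, ver:2) is stale: abort
        }
        return true;            // safe to apply (ledger, ver:6): commit
    }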

Tango objects are easy to build

a persistent, highly available, transactional register in ~15 LOC:

    class TangoRegister {
        int oid;                 // object-specific state
        TangoRuntime *T;
        int state;

        // invoked by the Tango runtime on EndTX to change state
        void apply(void *X) {
            state = *(int *)X;
        }

        // mutator: updates the TX write-set, appends to the shared log
        void writeRegister(int newstate) {
            T->update_helper(&newstate, sizeof(int), oid);
        }

        // accessor: updates the TX read-set, returns local state
        int readRegister() {
            T->query_helper(oid);
            return state;
        }
    };

simple API exposed by the runtime to the object: 1 upcall plus two helper methods
arbitrary API exposed by the object to the application: mutators and accessors

other examples:
- Java ConcurrentMap: 350 LOC
- Apache ZooKeeper: 1000 LOC
- Apache BookKeeper: 300 LOC
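Putting the last two slides together, a hedged usage sketch: application code driving two TangoRegister instances inside a transaction, assuming the register class above, the BeginTX/EndTX calls from the TX slide, and that EndTX returns the commit status as a bool.

    // atomically move a value from one register to another; if another
    // client changed src in the conflict window, EndTX aborts and nothing
    // in the write-set is applied
    bool transfer(TangoRuntime& TR, TangoRegister& src, TangoRegister& dst) {
        TR.BeginTX();
        int v = src.readRegister();   // accessor: src joins the TX read-set
        dst.writeRegister(v);         // mutator: dst joins the TX write-set
        src.writeRegister(0);
        return TR.EndTX();            // speculative commit record decides
    }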

the secret sauce: a fast shared log

shared log API:

    O = append(V)
    V = read(O)
    trim(O)      // GC
    O = check()  // tail

the CORFU decentralized shared log [NSDI 2012]:
- appends go to the tail: clients obtain the tail # from a sequencer
- the sequencer is only a hint! it helps performance, but is not required for safety or liveness
- reads go anywhere on the flash cluster and scale linearly with the number of flash drives
- 600K appends/sec (limited by sequencer speed)
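Written out as an interface, the four calls might look like this; the byte-vector payloads and the class shape are assumptions, only the operation names and meanings come from the slide.

    #include <cstdint>
    #include <vector>

    class SharedLogApi {
    public:
        using Offset  = uint64_t;
        using Payload = std::vector<uint8_t>;

        virtual Offset  append(const Payload& v) = 0;  // O = append(V)
        virtual Payload read(Offset o) = 0;            // V = read(O)
        virtual void    trim(Offset o) = 0;            // GC up to offset O
        virtual Offset  check() = 0;                   // current tail
        virtual ~SharedLogApi() = default;
    };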

a fast shared log isn't enough

[figure: two services, one maintaining an aggregation tree and a free list, the other an allocation table, each playing back entries A, B, C from the same shared log]

the playback bottleneck: every client must read all entries, so the inbound NIC becomes the bottleneck

solution: the stream abstraction
- readnext(streamid)
- append(value, streamid1, ...)

each client only plays the entries of interest to it
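A sketch of the stream calls with assumed signatures; the key point is that one append can carry several stream ids, so a single entry (such as a cross-object TX record) lands on every stream it concerns.

    #include <cstdint>
    #include <optional>
    #include <vector>

    struct StreamEntry {
        uint64_t offset;                // global position in the shared log
        std::vector<uint8_t> payload;
    };

    class StreamedLog {
    public:
        // append one entry, tagged with every stream it belongs to
        virtual uint64_t append(const std::vector<uint8_t>& value,
                                const std::vector<int>& stream_ids) = 0;
        // next entry on this stream; entries on other streams are skipped,
        // so a client's inbound NIC carries only the objects it hosts
        virtual std::optional<StreamEntry> readnext(int stream_id) = 0;
        virtual ~StreamedLog() = default;
    };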

transactions over streams

[figure: a TX (beginTX; read A; write C; endTX) is appended to both A's and C's streams; service 1 hosts A (aggregation tree / free list), service 2 hosts C (allocation table)]

service 1, playing A's stream: commit/abort? has A changed? yes, abort
service 2, playing only C's stream: commit/abort? has A changed? don't know!

fix: a decision record with a commit/abort bit, appended to the stream once the outcome is known
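A hedged sketch of that playback-time logic (names and shapes are illustrative): a client that hosts the read-set object can decide locally and append the verdict, while a client that hosts only write-set objects must stall the entry until the decision record reaches its stream.

    #include <optional>

    enum class Verdict { Commit, Abort };

    // called during playback at the TX's commit record; nullopt means
    // "cannot decide yet: block until the decision record arrives"
    std::optional<Verdict> resolve_tx(bool hosts_read_set,
                                      bool read_set_unchanged,
                                      std::optional<Verdict> decision_record) {
        if (hosts_read_set)   // e.g. service 1, which plays A's stream
            return read_set_unchanged ? Verdict::Commit : Verdict::Abort;
        return decision_record;  // e.g. service 2: wait for the appended bit
    }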

evaluation: linearizable operations (latency: 1 ms)

- constant write load (10K writes/sec); each client adds 10K reads/sec
- adding more clients yields more reads/sec until the shared log is saturated
- a beefier shared log lets scaling continue; the ultimate bottleneck is the sequencer
- a Tango object provides elasticity for strongly consistent reads

evaluation: single-object TXes

- each client runs transactions over its own TangoMap
- adding more clients yields more transactions/sec until the shared log is saturated
- a beefier shared log lets scaling continue; the ultimate bottleneck is the sequencer
- scales like conventional partitioning, but with a cap on aggregate throughput

evaluation: multi-object TXes

- 18 clients, each hosting its own TangoMap
- cross-partition TX: a client moves an element from its own TangoMap to some other client's TangoMap
- over 100K TXes/sec when 16% of TXes are cross-partition
- similar scaling to 2PL, without a complex distributed protocol

Tango enables fast, distributed transactions across multiple objects

conclusion

Tango objects: data structures backed by a shared log

key idea: the shared log does all the heavy lifting (persistence, consistency, atomicity, isolation, history, elasticity)

Tango objects are easy to use, easy to build, and fast!

Tango democratizes the construction of highly available metadata services

thank you!
