TOWARDS RELIABLE APPLICATION DEPLOYMENT IN THE CLOUD Ruichuan Chen


Joint work with Istemi Ekin Akkus, Bimal Viswanath, Ivica Rimac, Volker Hilt

Today, how do we reliably deploy an application into the cloud?
Applications are being moved from self-maintained infrastructure to the cloud.
How do we achieve high reliability in the cloud? Redundancy!
[Figure: application instances deployed in Region 1, Region 2, ..., Region N, all resting on common infrastructure (storage, power supply, etc.)]

Cloud service outages are still common

Existing efforts
Cloud providers detect, localize, or tolerate failures via:
– Diagnosis systems, e.g., NSDMiner, Orion, Sherlock, Sieve.
– Fault-tolerant systems, e.g., F10, NetPilot.
Existing efforts address the problem after the outage occurs.
– Require human intervention.
– Prolonged failure recovery.

Our proposal -- reCloud
reCloud takes proactive actions to prevent cloud service outages.
– Enable cloud providers to deploy applications with a user-specified reliability level.
– Work with complex applications such as micro-service applications.
– Balance between reliability, application performance, and resource utilization.
– Achieve all of the above with no changes to existing cloud infrastructure.

reCloud workflow
[Figure: workflow diagram]
– Inputs: the user's reliability requirements (e.g., three 9s, 2-of-3 redundancy, etc.) and a dependency DB (topology, failure probabilities, etc.) filled by dependency acquisition from the cloud system.
– reCloud system: generate initial deployment plan, assess reliability, then check if user requirements are met. If no, evolve a new deployment plan and assess it again. If yes, pass the plan to the application deployment engine.

Step 0: specify reliability requirements
User specifies reliability requirements:
– N: total number of application instances to be deployed.
– K: minimal number of deployed instances to be alive.
– Rdesired: desired reliability score, i.e., the probability that at least K out of N deployed instances are alive.
– Tmax: maximum time to search for a reliable deployment plan.
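The four inputs above can be captured in a small record. This is a minimal sketch, not reCloud's actual interface: the class name, the field names, and the choice of seconds as the unit for Tmax are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReliabilityRequirements:
    """User inputs from Step 0 (hypothetical field names)."""
    n: int            # N: total number of application instances to deploy
    k: int            # K: minimal number of instances that must be alive
    r_desired: float  # Rdesired: target probability that >= K of N are alive
    t_max_s: float    # Tmax: search time budget (seconds is an assumed unit)

    def __post_init__(self):
        # Reject inputs that cannot describe a K-of-N requirement.
        if not 1 <= self.k <= self.n:
            raise ValueError("need 1 <= K <= N")
        if not 0.0 <= self.r_desired <= 1.0:
            raise ValueError("Rdesired must be a probability")
        if self.t_max_s <= 0:
            raise ValueError("Tmax must be positive")


# "Three 9s" with 2-of-3 redundancy and a 30-second search budget.
req = ReliabilityRequirements(n=3, k=2, r_desired=0.999, t_max_s=30.0)
```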

Step 0’: acquire dependency information
Three types of infrastructure components:
– Hardware, software, and network components.
Cloud providers normally use cloud management platforms to:
– Monitor the topology among various components.
– Measure the failure probability of various components.
Example: a cloud data center organized as a fat-tree.
[Figure: fat-tree topology connecting the Internet to core switches, aggregation switches, edge switches, and hosts]
Switches and hosts may share additional common dependencies.
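To give a feel for the scale of the fat-tree example, the sketch below counts the components of a standard k-ary fat-tree (k pods, each with k/2 edge and k/2 aggregation switches, and k/2 hosts per edge switch). The function name is hypothetical, and the slides do not state which k the evaluated data centers use.

```python
def fat_tree_sizes(k):
    """Component counts for a standard k-ary fat-tree (k must be even)."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even integer >= 2")
    half = k // 2
    return {
        "core": half * half,       # (k/2)^2 core switches
        "agg": k * half,           # k/2 aggregation switches in each of k pods
        "edge": k * half,          # k/2 edge switches in each of k pods
        "hosts": k * half * half,  # k/2 hosts under every edge switch
    }


small = fat_tree_sizes(4)
large = fat_tree_sizes(48)
```

With k = 4 this gives 16 hosts. With k = 48 it gives 27,648 hosts, which is the same order of size as the 27K-host data center in the evaluation, although the slides do not say that this exact topology was used.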

Step 1: generate initial deployment plan
reCloud generates an initial deployment plan by placing application instances onto random hosts.
– A deployment plan is a choice of hosts on which to deploy application instances.
Example:
– User requires 1-of-2 redundancy.
[Figure: fat-tree (core, aggregation, and edge switches; hosts) with the hosts chosen for deployment highlighted]
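A random initial plan can be sketched in a few lines. The host naming scheme and the function name are illustrative, and placing each instance on a distinct host is an assumption that the slides do not state explicitly.

```python
import random


def initial_plan(hosts, n, rng):
    """Pick n distinct hosts uniformly at random (distinctness is assumed)."""
    return tuple(sorted(rng.sample(hosts, n)))


# Hosts of a small fat-tree: 4 pods x 2 edge switches x 2 hosts = 16 hosts.
hosts = [f"pod{p}-edge{e}-host{h}"
         for p in range(4) for e in range(2) for h in range(2)]

# 1-of-2 redundancy means N = 2 instances to place.
plan = initial_plan(hosts, 2, random.Random(0))
```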

Step 2: assess reliability of a deployment plan
Fix the application’s deployment plan.
Generate failure states for all infrastructure components based on their failure probabilities.
[Figure: fat-tree with the hosts chosen for deployment and the failed switches/hosts marked]

Step 2: assess reliability of a deployment plan
Test reliability in the generated topology with failed components.
– Consider the routing protocol and the user-specified K-of-N redundancy.
[Figure: fat-tree with failed switches/hosts marked; one host chosen for deployment is unreachable, the other is reachable]
This deployment plan is considered reliable because the user requires 1-of-2 redundancy.
Generate component failure states for X rounds, and test reliability in these rounds. If the deployment plan is considered reliable in Y rounds, then its reliability score is Y/X.
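The Y/X score can be illustrated with a deliberately simplified model. In this sketch each host has one fixed set of components it depends on, and an instance counts as reachable only if none of them failed. Real routing over redundant paths, which reCloud considers, is ignored here, and all names (`assess`, `deps`, `fail_prob`) are hypothetical.

```python
import random


def assess(plan, deps, fail_prob, k, rounds, rng):
    """Estimate reliability as Y/X: the fraction of rounds in which
    at least k instances of the plan are reachable."""
    components = sorted(fail_prob)
    reliable_rounds = 0
    for _ in range(rounds):
        # Draw a failure state for every component in this round.
        failed = {c for c in components if rng.random() < fail_prob[c]}
        # An instance is reachable if none of its dependencies failed.
        alive = sum(1 for host in plan if not (deps[host] & failed))
        if alive >= k:
            reliable_rounds += 1
    return reliable_rounds / rounds


# Toy topology: two hosts under different edge switches, sharing one core.
deps = {
    "h1": {"h1", "e1", "core"},
    "h2": {"h2", "e2", "core"},
}
fail_prob = {c: 0.1 for c in ("h1", "h2", "e1", "e2", "core")}

score = assess(("h1", "h2"), deps, fail_prob, k=1, rounds=20000,
               rng=random.Random(42))
```

For this toy model the exact value is 0.9 × (1 − (1 − 0.9²)²) ≈ 0.8675, so the estimate should land close to that figure.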

Step 2: assess reliability of a deployment plan
Need to generate failure states for each component in each round.
– This is quite expensive.
reCloud uses dagger sampling to generate failure states.
Example: a component fails with probability 0.2, meaning 1 failure every 5 rounds on average.
– Monte-Carlo sampling: generate 5 random numbers to produce failed/alive states for 5 rounds.
– Dagger sampling: generate only 1 random integer in [1,5] to decide in which round the component fails (e.g., alive, failed, alive, alive, alive).
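The contrast between the two samplers can be sketched as follows. This follows the textbook form of dagger sampling (one uniform draw per block of ⌊1/p⌋ rounds) and is not taken from the reCloud code; the function names are hypothetical.

```python
import random


def monte_carlo_states(p, rounds, rng):
    """One random draw per round; True means the component failed."""
    return [rng.random() < p for _ in range(rounds)]


def dagger_states(p, rounds, rng):
    """One random draw per block of floor(1/p) rounds; True means failed."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Small epsilon guards against 1/p landing just below an integer.
    block = int(1.0 / p + 1e-9)
    states = []
    while len(states) < rounds:
        r = rng.random()
        # Index of the length-p subinterval of [0, 1) that contains r.
        # If r falls in the leftover part (hit >= block), no round fails.
        hit = int(r / p)
        states.extend(j == hit for j in range(block))
    return states[:rounds]


# p = 0.2: blocks of 5 rounds, one draw each, exactly one failure per block.
mc = monte_carlo_states(0.2, 1000, random.Random(1))
dg = dagger_states(0.2, 1000, random.Random(1))
```

Here the dagger sampler uses 200 random draws for 1000 rounds where the plain Monte-Carlo sampler uses 1000. Each round still fails with marginal probability p, though rounds within one block are no longer independent of each other.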

Step 3: search for reliable deployment plan
There are a huge number of potential deployment plans.
reCloud uses simulated annealing to search for a reliable deployment plan.
– Evolve new deployment plans.
– Accept not only more reliable deployment plans, but also less reliable ones with some probability.
The search ends when it finds a deployment plan that satisfies the user-specified reliability, or when it times out.
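A generic simulated-annealing loop with this stopping rule might look like the sketch below. The neighbor move (relocate one instance), the temperature schedule, and the toy scoring function are all assumptions chosen for illustration; the slides do not specify reCloud's actual choices.

```python
import math
import random
import time


def anneal(hosts, n, score, r_desired, t_max_s, rng, temp=0.05, cooling=0.95):
    """Search for a plan whose score reaches r_desired within t_max_s seconds.
    Requires len(hosts) > n so that a move to an unused host always exists."""
    plan = rng.sample(hosts, n)
    current = score(plan)
    best_plan, best = list(plan), current
    deadline = time.monotonic() + t_max_s
    while best < r_desired and time.monotonic() < deadline:
        # Evolve a new plan: move one instance to a host not yet in the plan.
        candidate = list(plan)
        unused = [h for h in hosts if h not in plan]
        candidate[rng.randrange(n)] = rng.choice(unused)
        cand = score(candidate)
        # Always accept a plan that is at least as reliable; accept a less
        # reliable one with probability exp(delta / temp), delta < 0.
        if cand >= current or rng.random() < math.exp((cand - current) / temp):
            plan, current = candidate, cand
            if current > best:
                best_plan, best = list(plan), current
        temp = max(temp * cooling, 1e-6)
    return best_plan, best


# Toy stand-in for the reliability assessment: reward spreading over racks.
hosts = [(rack, slot) for rack in range(4) for slot in range(4)]


def toy_score(plan):
    return len({rack for rack, _ in plan}) / len(plan)


best_plan, best = anneal(hosts, 3, toy_score, r_desired=1.0, t_max_s=2.0,
                         rng=random.Random(0))
```

In a real setting `score` would be the Y/X assessment from Step 2, which is why the time budget Tmax matters: every candidate plan costs one assessment.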

Step 3: search for reliable deployment plan
Cloud data centers are normally designed to be symmetric.
reCloud uses network transformations to check the equivalence of multiple deployment plans.
[Figure: two deployment plans that are equivalent under the symmetry of the topology]
No need to assess a deployment plan that is equivalent to one already assessed.
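The benefit of equivalence checking can be shown with a much simpler stand-in than reCloud's network transformations. The sketch below assumes a fully symmetric fat-tree in which all pods, edge switches, and hosts are interchangeable and have identical failure probabilities, and reduces each plan to a canonical signature. The host encoding `(pod, edge, host)` and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def signature(plan):
    """Canonical form of a plan under the assumed full symmetry: how many
    instances share each edge switch, grouped by pod, with labels erased."""
    per_edge = Counter((pod, edge) for pod, edge, _ in plan)
    per_pod = defaultdict(list)
    for (pod, _), count in per_edge.items():
        per_pod[pod].append(count)
    return tuple(sorted(tuple(sorted(counts)) for counts in per_pod.values()))


cache = {}


def cached_score(plan, score):
    """Assess a plan only if no equivalent plan has been assessed before."""
    key = signature(plan)
    if key not in cache:
        cache[key] = score(plan)
    return cache[key]


same_pod_a = [(0, 0, 0), (0, 1, 0)]   # two edge switches in pod 0
same_pod_b = [(2, 1, 1), (2, 0, 0)]   # two edge switches in pod 2
same_edge = [(0, 0, 0), (0, 0, 1)]    # both under one edge switch
two_pods = [(0, 0, 0), (1, 0, 0)]     # spread over two pods
```

Under the stated assumption, `same_pod_a` and `same_pod_b` share a signature and only one of them needs to be assessed, while `same_edge` and `two_pods` each get their own.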

reCloud workflow (recap)
[Figure: workflow diagram]
– Inputs: the user's reliability requirements (e.g., three 9s, 2-of-3 redundancy, etc.) and a dependency DB (topology, failure probabilities, etc.) filled by dependency acquisition from the cloud system.
– reCloud system: generate initial deployment plan, assess reliability, then check if user requirements are met. If no, evolve a new deployment plan and assess it again. If yes, pass the plan to the application deployment engine.

Evaluation
We have implemented a functional prototype (5.3K lines of Java code).
We evaluate reCloud with 4 data center topologies, from tiny scale to large scale.

Evaluation
How efficient is dagger sampling at generating failure states for components?
– 1 to 2 orders of magnitude faster than Monte-Carlo sampling.

Evaluation
How efficient is reCloud at assessing a given deployment plan?
– 270 ms even in a large-scale data center.
– Redundancy level does not affect performance significantly.

Evaluation
How efficient is reCloud at searching for a reliable deployment plan?
– Needs only 30 seconds to find a deployment plan that is (at least) 10X more reliable than the current practice (CP) in a large-scale data center with 27K hosts.
Example: To achieve 4-of-5 redundancy, the current practice (CP) can find a deployment plan with 99.62% reliability (i.e., 33.3 hours downtime per year). reCloud can find a deployment plan with 99.97% reliability (i.e., 2.6 hours downtime per year), within 30 seconds.
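The downtime figures in the example follow directly from the reliability scores. A small sketch of the arithmetic, assuming a 365-day year (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(reliability):
    """Expected yearly downtime for a given reliability score in [0, 1]."""
    return (1.0 - reliability) * HOURS_PER_YEAR


cp = downtime_hours_per_year(0.9962)       # current practice
recloud = downtime_hours_per_year(0.9997)  # reCloud
ratio = cp / recloud
```

This yields roughly 33.3 hours for the current practice and 2.6 hours for reCloud, a ratio of about 12.7, consistent with the "at least 10X" claim.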

Summary
reCloud finds a reliable deployment plan for an application that fulfills the user’s requirements, before the application gets deployed.
– Dagger sampling to generate failures when assessing the reliability of a given deployment plan.
– Simulated annealing to explore the huge space of potential deployment plans.
– Network transformations to check the equivalence of different deployment plans.
reCloud can also:
– Work with complex applications such as micro-service applications.
– Balance between reliability, application performance, and resource utilization.
– Achieve all of the above with no changes to existing cloud infrastructure.
Please refer to the paper.
