
MIGRATION AND SCHEDULING * LECTURE 17 * 14-848 (CLOUD INFRASTRUCTURE) * FALL 2018

BENEFITS OF MIGRATION
- Load balancing: use more of the available resources with less waiting
- Work packing: better align resource needs with available packages
- Fault tolerance: move work away from failure (or maintenance)
- Reduced power consumption: improved efficiency means fewer machines
- Use newer, more efficient hosts, when available
- Use hosts better aligned for the work, when available
- Etc.

AN IMAGINARY WORLD
- In an imaginary world, redistributing work within a cloud would be perfectly efficient
- Scheduling would be dramatically less important
- Utilization would be much higher
- If hotspots emerged, we could just move things around to fix them

IMAGINARY WORLD: STILL NOT A SILVER BULLET
- This still wouldn't fix all utilization challenges (unless we imagine some more):
- Imbalanced use of resources, where no one place is a good fit
- Unused resources (e.g., memory, CPU time) on active hosts
- The cost to start and stop host hardware
- Etc.

REAL WORLD COSTS
- VM migration is easier than process migration without VMs
  - The VM provides an abstraction for bundling resources
  - Especially when combined with a DFS/NFS shared by hosts
- But there are still huge costs:
  - Copying VMs involves pausing them
  - External interactions, such as network communication, can be entangling
  - Lost network, memory, and storage time, and the resulting latency
  - Other details, such as cache impact
- Key metrics: downtime and migration time
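To make the two key metrics concrete, here is a back-of-the-envelope sketch for a simple pre-copy migration; all of the sizes, rates, and the cost model are illustrative assumptions, not numbers from the lecture.

```python
# Rough estimate of migration time and downtime for a pre-copy VM migration.
# All numbers and the simplified model below are illustrative assumptions.

ram_gb = 16            # VM memory footprint
link_gbps = 10         # network bandwidth between hosts
dirty_rate_gbps = 1    # rate at which the running VM re-dirties memory
stop_copy_gb = 0.5     # remaining dirty pages copied during the final freeze

def seconds_to_copy(gb, gbps):
    """Time to push `gb` gigabytes over a `gbps` gigabit-per-second link."""
    return gb * 8 / gbps

# Migration time ~ time to push the RAM image at the effective bandwidth
# left over after the VM keeps dirtying pages (one-pass simplification).
migration_time = seconds_to_copy(ram_gb, link_gbps - dirty_rate_gbps)

# Downtime ~ only the final stop-and-copy pass while the VM is frozen.
downtime = seconds_to_copy(stop_copy_gb, link_gbps)

print(f"migration time ~ {migration_time:.1f}s, downtime ~ {downtime:.2f}s")
```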

COLD VS LIVE MIGRATION
Cold Migration
- Suspend and serialize the VM
- Copy the VM to the new host
- Possibly fix any dangling dependencies (copy files, fix the network layer, etc.)
- Resume the VM
- Relatively easy, but extended down (unavailable) time
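A minimal sketch of the cold-migration sequence above; the host and VM operations (suspend, serialize, receive, resume, etc.) are hypothetical helper names for illustration, not a real hypervisor API.

```python
def cold_migrate(vm, src_host, dst_host, shared_fs=True):
    """Cold migration: the VM is fully down from suspend to resume."""
    src_host.suspend(vm)                   # stop execution, flush state
    image = src_host.serialize(vm)         # memory + device state to an image
    dst_host.receive(image)                # copy the image over the network
    if not shared_fs:
        dst_host.copy_disks(vm, src_host)  # fix dangling storage dependencies
    dst_host.fix_network(vm)               # re-home addresses, update the network layer
    dst_host.resume(vm)                    # downtime covers everything above
    src_host.release(vm)
```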

COLD VS LIVE MIGRATION
Live Migration
- Transfer the VM with "no" (very limited) down time
- General steps:
  - Verify/reserve resources on the new host: CPUs and slots, disk, RAM, network time, etc.
  - Pre-copy: a single pass, or iterate until a certain number of passes or a small enough delta on the last pass
  - Short freeze and clean-up to ensure consistency on the new host and set the environment, e.g., network configuration
    - Can also use some form of LRU and fault in additional pages as needed
  - Resume execution on the new host
    - Possibly fault in additional pages as needed
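A sketch of the iterative pre-copy loop described above; the page-tracking and copy helpers are hypothetical names, not a real hypervisor interface.

```python
def live_migrate(vm, src, dst, max_passes=10, delta_threshold_mb=50):
    """Iterative pre-copy: keep copying dirty pages while the VM runs,
    then freeze briefly to copy the final delta and switch over."""
    dst.reserve(cpu=vm.cpu, ram=vm.ram, disk=vm.disk)  # verify/reserve resources

    dirty = vm.all_pages()                             # first pass: everything
    for _ in range(max_passes):
        copy_pages(src, dst, dirty)                    # VM keeps running meanwhile
        dirty = vm.get_dirty_pages()                   # pages touched during the copy
        if size_mb(dirty) <= delta_threshold_mb:       # small enough delta: stop iterating
            break

    src.freeze(vm)                                     # short freeze: downtime starts
    copy_pages(src, dst, vm.get_dirty_pages())         # final delta for consistency
    dst.fix_network(vm)                                # set up environment on the new host
    dst.resume(vm)                                     # downtime ends
    src.release(vm)                                    # any stragglers could instead be
                                                       # faulted in on demand (post-copy)
```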

ALLOCATION VS MIGRATION VS KILL-AND-RESTART
- Migration is expensive
- Better to do a good job scheduling and migrate less than to schedule poorly and migrate more
- It is sometimes cheaper to kill a VM on one host and restart it from scratch on another than to migrate it
  - May depend on whether the value is in the state or in the work accomplished
  - And on the size of the state that needs migration
  - And on entangling things, such as communication
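One way to frame that decision is a back-of-the-envelope cost comparison; the cost model, parameter names, and numbers below are assumptions for illustration only.

```python
def cheaper_to_restart(state_gb, link_gbps, redo_work_seconds, entanglement_penalty_s=0.0):
    """Compare a rough migration cost (moving state, untangling communication)
    against the cost of killing the VM and redoing its work elsewhere."""
    migrate_cost = state_gb * 8 / link_gbps + entanglement_penalty_s
    restart_cost = redo_work_seconds
    return restart_cost < migrate_cost

# Little state to move, but five seconds of work to redo: migrating wins.
print(cheaper_to_restart(state_gb=0.1, link_gbps=10, redo_work_seconds=5))    # False
# Huge state, but the work can be recomputed in 30 seconds: restarting wins.
print(cheaper_to_restart(state_gb=256, link_gbps=10, redo_work_seconds=30))   # True
```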

SCHEDULING RESOURCES: CHECK-OUT
- Sort of like checking out PE equipment:
  - Ask for what you need
  - Wait for it to be available, or reserve it in advance
  - Take it and use it
  - Return it when done
- You get a whole unit of the resource, whether you need all of it or not
- Obvious inefficiencies: resource packaging vs. the needed profile (sizes and types)
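A minimal sketch of the check-out model, where a requester blocks until a whole unit is free and returns it when done; the class and unit names are made up for illustration.

```python
import queue

class CheckoutPool:
    """Whole-unit check-out: you get an entire resource, needed or not."""
    def __init__(self, units):
        self._free = queue.Queue()
        for unit in units:
            self._free.put(unit)

    def check_out(self):
        return self._free.get()   # wait until a whole unit is available

    def check_in(self, unit):
        self._free.put(unit)      # return it when done

pool = CheckoutPool(["host-1", "host-2"])
h = pool.check_out()              # take a whole host, even for a tiny job
try:
    pass                          # ... use it ...
finally:
    pool.check_in(h)
```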

SCHEDULING RESOURCES: WHOLE-HOST SCHEDULER
- The allocation decision is made by a scheduler
  - The user requests a host from the scheduler
- The resource is freed automatically when the job is done
  - The resource tells the scheduler

SCHEDULING RESOURCES: HOST-SHARING SCHEDULER
- The scheduler can dispatch more than one job onto the same host
  - Traditionally this allows overlapping of I/O and processor burst cycles
  - In the more modern world, it allows overlapping of many types of resources
- But:
  - It can generate hot spots
  - It requires memory for all jobs on the host
  - How do we know what will be available in the future?
  - How do we know what existing and new VMs will require in the future?
  - Hosts are not necessarily symmetric in capabilities or in the quality thereof
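A sketch of one simple host-sharing placement policy, first-fit against remaining CPU and memory; the capacity bookkeeping shown here is an assumption for illustration, not a policy from the lecture.

```python
def first_fit(hosts, vm):
    """Place a VM on the first host with enough leftover cores and memory.
    Each host tracks what its already-dispatched jobs have claimed."""
    for host in hosts:
        if host["free_cores"] >= vm["cores"] and host["free_ram_gb"] >= vm["ram_gb"]:
            host["free_cores"] -= vm["cores"]
            host["free_ram_gb"] -= vm["ram_gb"]
            return host["name"]
    return None   # nothing fits: queue the request, or scale out

hosts = [{"name": "h1", "free_cores": 4, "free_ram_gb": 8},
         {"name": "h2", "free_cores": 16, "free_ram_gb": 64}]
print(first_fit(hosts, {"cores": 8, "ram_gb": 32}))   # -> "h2"
```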

SCHEDULING RESOURCES: CLASSIFYING RESOURCES AND REQUESTS
- Create classes of VMs with certain properties
  - You've seen this on Amazon
  - Processor, network, and disk throughput; number of cores; speed of cores
- Give each classification of resource its "share" of the physical hardware
  - Can have a guarantee (lower bound) and a limit (upper bound)
- Create classes of physical hosts to normalize/unitize different capabilities
  - Approximate host resources in a manner compatible with the VM measurements
  - Often better to define in terms of throughput and/or bandwidth than percentage
  - As physical hosts differ, a share may be different across them
- Together, these provide the information needed for binning
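A sketch of how classes with guarantees and limits might be expressed against normalized host capacity; the class names and the "compute unit" (CU) measure are invented for illustration, not Amazon's or the lecture's definitions.

```python
# Hypothetical VM classes: guarantee = lower bound, limit = upper bound,
# both in normalized "compute units" (CU) rather than host percentages.
VM_CLASSES = {
    "small":  {"cu_guarantee": 1, "cu_limit": 2, "ram_gb": 4},
    "medium": {"cu_guarantee": 2, "cu_limit": 4, "ram_gb": 16},
    "large":  {"cu_guarantee": 8, "cu_limit": 8, "ram_gb": 64},
}

# Hypothetical host classes: different hardware normalized to the same CU
# scale, so the same "share" maps to different physical capacity per host.
HOST_CLASSES = {
    "old_gen": {"cu_total": 16, "ram_gb": 128},
    "new_gen": {"cu_total": 48, "ram_gb": 512},
}

def fits(host_free, vm_class):
    """Binning check: admit a VM only if its *guarantee* can still be honored."""
    c = VM_CLASSES[vm_class]
    return host_free["cu"] >= c["cu_guarantee"] and host_free["ram_gb"] >= c["ram_gb"]
```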

SCHEDULING RESOURCES: OVERCOMMITMENT
- Airlines overbook seats because people cancel reservations
- Workplace cafeterias have fewer seats than employees, because not everyone eats lunch "in" every day
- Overcommitment avoids wasting resources
- But it risks broken promises: what does the SLA say?
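A sketch of admission control with an overcommit ratio; the 2:1 vCPU-to-core ratio is an arbitrary illustrative policy knob, and whether it is acceptable depends on the SLA.

```python
def admit(host, vm_vcpus, overcommit_ratio=2.0):
    """Admit a VM if the total promised vCPUs stay within the physical
    cores times an overcommit ratio (a policy knob, bounded by the SLA)."""
    promised = host["committed_vcpus"] + vm_vcpus
    if promised <= host["physical_cores"] * overcommit_ratio:
        host["committed_vcpus"] = promised
        return True
    return False

host = {"physical_cores": 32, "committed_vcpus": 48}
print(admit(host, 8))    # True:  56 vCPUs promised <= 64 allowed
print(admit(host, 16))   # False: 72 > 64, the promise could be broken
```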

SCHEDULING RESOURCES: MULTI-VM REQUESTS
- What about the need to allocate a compute cluster, e.g., for an ML workload?
- Affinity vs. anti-affinity?
  - Affinity allows faster rack-local communication, shared common caches, etc.
  - Anti-affinity prevents competing for common resources and/or isolates VMs from host failure
- What to do with large requests?
  - If enough resources are available: no worries
  - Otherwise: run as resources become available? Wait until enough are available (scheduling priority)? Collect resources as they become available until the request can be covered simultaneously?
- Many more types of resource-usage constraints exist
  - A high-level abstraction and language for communicating them provides more expressiveness and better options
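A sketch of how affinity and anti-affinity constraints might filter the candidate hosts for the next VM of a multi-VM request; the constraint names and data layout are hypothetical.

```python
def candidates(hosts, placed, constraint):
    """Filter candidate hosts for the next VM of a group.
    `placed` maps already-placed group members to their (host, rack)."""
    used_racks = {rack for (_, rack) in placed.values()}
    used_hosts = {host for (host, _) in placed.values()}
    if constraint == "rack-affinity":        # pack onto the same rack(s)
        return [h for h in hosts if not used_racks or h["rack"] in used_racks]
    if constraint == "host-anti-affinity":   # spread across distinct hosts
        return [h for h in hosts if h["name"] not in used_hosts]
    return hosts                             # unconstrained

hosts = [{"name": "h1", "rack": "r1"}, {"name": "h2", "rack": "r1"},
         {"name": "h3", "rack": "r2"}]
placed = {"vm0": ("h1", "r1")}
print([h["name"] for h in candidates(hosts, placed, "rack-affinity")])       # ['h1', 'h2']
print([h["name"] for h in candidates(hosts, placed, "host-anti-affinity")])  # ['h2', 'h3']
```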

SCALING BACK
- Workloads tend to be bursty and to vary wildly
  - In many cases predictably, at least for the low-frequency component(s)
- Turning off unused machines saves a ton of energy
  - But they are also slow and costly to turn back on
  - Good for low-frequency changes, e.g., parts of the day, days of the week, weeks of the year, etc.
  - Too costly in time and energy to respond to rapid changes
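One way to make that trade-off concrete is to power a host off only when the predicted idle window exceeds a break-even period covering the shutdown and boot cost; all of the durations and power figures below are illustrative assumptions.

```python
def should_power_off(predicted_idle_s,
                     shutdown_s=60, boot_s=180,
                     transition_power_w=200, idle_power_w=100):
    """Power off only if the energy saved while off exceeds the energy spent
    shutting down and booting back up (the boot latency when load returns
    is the other, implicit, cost)."""
    transition_energy = (shutdown_s + boot_s) * transition_power_w
    saved_energy = max(predicted_idle_s - (shutdown_s + boot_s), 0) * idle_power_w
    return saved_energy > transition_energy

print(should_power_off(predicted_idle_s=300))       # False: a short lull, not worth it
print(should_power_off(predicted_idle_s=6 * 3600))  # True: e.g., an overnight trough
```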

SCALING BACK
- Shrinking hosts saves some energy and bounces back faster
  - Turn off cores, slow down clocks, dynamic frequency scaling, etc.
  - Spin down disks
- Good for medium-frequency changes
- But it still introduces latency and an energy cost to start back up
  - Even changing power states is not as fast as one might expect
  - Other costs, e.g., swapping things back in, disrupted communication, etc.
