
CS6456: Graduate Operating Systems. Brad Campbell – bradjc@virginia.edu, https://www.cs.virginia.edu/~bjc8c/class/cs6456-f19/

What is virtualization? Virtualization is the ability to run multiple operating systems on a single physical system and share the underlying hardware resources¹. It allows one computer to provide the appearance of many computers. Goals: provide flexibility for users; amortize hardware costs; isolate completely separate users. (¹ VMware white paper, “Virtualization Overview”)

Formal Requirements for Virtualizable Third Generation Architectures (Popek & Goldberg, 1974): “First, the VMM provides an environment for programs which is essentially identical with the original machine; second, programs run in this environment show at worst only minor decreases in speed; and last, the VMM is in complete control of system resources.”

VMM Platform Types. Hosted architecture: installs as an application on an existing x86 “host” OS (e.g. Windows, Linux, OS X); a small context-switching driver; leverages the host I/O stack and resource management. Examples: VMware Player/Workstation/Server, Microsoft Virtual PC/Server, Parallels Desktop. Bare-metal architecture: the “hypervisor” installs directly on hardware; acknowledged as the preferred architecture for high-end servers. Examples: VMware ESX Server, Xen, Microsoft Viridian (2008).

Virtualization: rejuvenation. 1960s: first wave of virtualization, with time and resource sharing on expensive mainframes (IBM VM/370). Late 1970s and early 1980s: became unpopular as hardware got cheap and multiprocessing OSes matured. Late 1990s: became popular again to handle the wide variety of OS and hardware configurations (VMware). Since 2000: hot and important, driven by cloud computing and Docker containers.

IBM VM/370. The Control Program (CP) hypervisor runs on System/370 hardware; each virtual machine above it can run the Conversational Monitor System (CMS), a specialized VM subsystem (RSCS, RACF, GCS), a mainstream OS (MVS, DOS/VSE, etc.), or even another copy of VM.

IBM VM/370 technology: trap-and-emulate. Applications and the guest kernel run in normal (unprivileged) mode; when the guest executes a privileged instruction, it traps, and CP emulates the instruction on the guest’s behalf.

Trap and Emulate. Virtualization on the x86 architecture: challenges. Correctness: not all privileged instructions produce traps! Example: popf, which does different things in kernel mode vs. user mode yet never traps. Performance: system calls trap on both entry and exit (10x cost); I/O has high CPU overhead; virtual memory suffers because x86 has no software-controlled TLB.
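The popf problem can be made concrete with a toy simulator. This is purely illustrative (the instruction names and state layout are stand-ins, not real x86 semantics): cli traps when executed deprivileged, so the VMM can emulate it, but popf silently drops its privileged effect instead of trapping, so the VMM never gets a chance to intervene.

```python
# Toy trap-and-emulate simulator (illustrative only, not real x86).
# The guest kernel runs deprivileged. Privileged instructions trap to
# the VMM, which emulates them against the guest's virtual CPU state.
# "popf" is the problem case: in user mode it silently ignores the
# interrupt-flag (IF) bit instead of trapping.

class Trap(Exception):
    pass

def execute_user_mode(insn, arg, vstate):
    if insn == "cli":                     # privileged: traps in user mode
        raise Trap()
    elif insn == "popf":                  # does NOT trap: IF silently dropped
        vstate["flags"] = arg & ~0x200    # hardware masks off IF (bit 9)
    # ordinary unprivileged instructions would run at full speed here

def emulate(insn, arg, vstate):
    if insn == "cli":
        vstate["virtual_if"] = 0          # VMM updates the virtual IF

def run_guest(program, vstate):
    """Execute guest-kernel instructions deprivileged, emulating on trap."""
    for insn, arg in program:
        try:
            execute_user_mode(insn, arg, vstate)
        except Trap:
            emulate(insn, arg, vstate)    # VMM steps in

vcpu = {"flags": 0, "virtual_if": 1}
run_guest([("cli", None)], vcpu)
assert vcpu["virtual_if"] == 0            # trap-and-emulate worked

run_guest([("popf", 0x200)], vcpu)        # guest tries to set IF via popf
assert vcpu["flags"] & 0x200 == 0         # lost silently: no trap, no emulation
```

The second assertion is exactly the correctness failure Popek & Goldberg’s criteria rule out: the guest’s intent never reaches the VMM.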

Virtualization on x86 architecture: solutions. Dynamic binary translation & shadow page tables. Para-virtualization (Xen). Hardware extensions.

Dynamic binary translation. Idea: intercept privileged instructions by changing the binary. We cannot patch the guest kernel directly (the change would be visible to guests); instead, make a copy, change it, and execute it from there. Use a cache to improve performance.

Binary translation. Directly execute unprivileged guest application code; it runs at full speed until it traps, we get an interrupt, etc. “Binary translate” all guest kernel code and run it unprivileged: since x86 has non-virtualizable instructions, proactively transfer control to the VMM (no need for traps). Safe instructions are emitted without change; for “unsafe” instructions, emit a controlled emulation sequence. A VMM translation cache keeps performance good.

How does VMware do this? Binary: input is x86 “hex”, not source. Dynamic: interleave translation and execution. On demand: translate only what is about to execute (lazy). System level: makes no assumptions about guest code. Subsetting: full x86 down to a safe subset. Adaptive: adjust translations based on guest behavior.

Convert unsafe operations and cache them. Each translator invocation consumes a basic block (BB) of guest binary (e.g. 55 ff 33 c7 03 …) and produces a compiled code fragment (CCF), which is stored in the translation cache for future reuse. The cache captures the working set of the guest kernel and amortizes translation costs. Note this is not “patching in place”.
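The BB-to-CCF pipeline can be sketched in a few lines. This is a schematic model, not VMware's implementation: instructions are symbolic names rather than real x86 bytes, and the "emulation sequence" is a placeholder string.

```python
# Sketch of a dynamic binary translator's caching layer (illustrative:
# "instructions" are symbolic names, not x86 bytes). Each basic block
# (BB) is translated once into a compiled code fragment (CCF); the
# translation cache makes repeated execution of the same BB cheap.

UNSAFE = {"cli", "popf", "hlt"}           # instructions needing emulation

translation_cache = {}                    # BB id -> CCF
translations_done = 0

def translate(bb_id, block):
    """Return the CCF for a basic block, translating on first use."""
    global translations_done
    if bb_id in translation_cache:        # hit: reuse prior translation
        return translation_cache[bb_id]
    translations_done += 1
    ccf = []
    for insn in block:
        if insn in UNSAFE:
            ccf.append(f"call vmm_emulate_{insn}")  # controlled sequence
        else:
            ccf.append(insn)              # safe: emitted unchanged
    translation_cache[bb_id] = ccf        # not "patching in place"
    return ccf

bb = ["mov", "add", "cli", "mov"]
first = translate("bb1", bb)
second = translate("bb1", bb)             # served from the cache
assert first == ["mov", "add", "call vmm_emulate_cli", "mov"]
assert translations_done == 1             # translated only once
```

The cache-hit path is why translation cost amortizes over the guest kernel's working set.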

Dynamic binary translation. Pros: makes x86 virtualizable; can reduce traps. Cons: overhead; hard to improve system calls and I/O operations; hard to handle complex code.

Shadow page table

Shadow page table (figure): the guest page table maps guest-virtual to guest-physical addresses; the VMM-maintained shadow page table maps guest-virtual addresses directly to host-physical addresses.

Shadow page table. Pros: transparent to guest VMs; good performance when the working set is stable. Cons: big overhead of keeping the two page tables consistent; introduces new issues such as hidden faults and double paging.
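A minimal sketch of the shadow scheme, with dicts standing in for page tables and page numbers for real entries: the guest manages guest-virtual to guest-physical, the VMM knows guest-physical to host-physical, and the hardware only ever walks the shadow table.

```python
# Minimal shadow-page-table sketch. The MMU walks only `shadow`,
# which maps guest-virtual pages directly to host-physical pages.
# Every guest page-table update must be intercepted (e.g. by
# write-protecting guest_pt) so the shadow stays consistent --
# this interception is the "big overhead" on the cons list.

guest_pt = {}                  # gVA -> gPA, managed by the guest OS
pmap     = {0: 7, 1: 3, 2: 9}  # gPA -> hPA, managed by the VMM
shadow   = {}                  # gVA -> hPA, what the MMU actually uses

def guest_map(gva, gpa):
    """Guest page-table update, intercepted and mirrored by the VMM."""
    guest_pt[gva] = gpa
    shadow[gva] = pmap[gpa]    # VMM refreshes the shadow entry

guest_map(0x10, 1)
guest_map(0x11, 2)
assert shadow[0x10] == 3       # hardware walk resolves straight to hPA
assert shadow[0x11] == 9
assert guest_pt[0x10] == 1     # guest still sees its own mapping
```

The scheme is transparent because the guest never observes `shadow`; it only ever reads back its own `guest_pt`.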

Xen

Xen and the Art of Virtualization (SOSP ’03). Very high impact: [chart of Google Scholar citation counts for well-known systems papers, data collected in 2013; the top bar exceeds 8,800 citations].

Para-virtualization. [Figure: full vs. para-virtualization.]

Overview of the Xen approach. Support unmodified application binaries (but not OSes): keep the Application Binary Interface (ABI). Modify the guest OS to be aware of virtualization: this gets around the issues of the x86 architecture and gives better performance. Keep the hypervisor as small as possible: device drivers live in Dom0.

Xen architecture. [Figure.]

Virtualization on x86 architecture: challenges (recap). Correctness: not all privileged instructions produce traps (example: popf). Performance: system calls trap on both entry and exit (10x cost); I/O has high CPU overhead; no software-controlled TLB for virtual memory.

CPU virtualization. Protection: Xen runs in ring 0, guest kernels in ring 1; privileged instructions are replaced with hypercalls. Exceptions and system calls: guest OS handlers are registered with and validated by Xen; direct system calls from an application into its guest OS are allowed; page faults are redirected by Xen.

Memory virtualization. Xen exists in a 64MB section at the top of every address space. The guest sees real physical addresses. Guest kernels are responsible for allocating and managing the hardware page tables; after a page table is registered with Xen, all subsequent updates must be validated.

I/O virtualization. Shared-memory, asynchronous buffer-descriptor rings.
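The descriptor-ring idea can be sketched as follows. This is heavily simplified relative to Xen's real split drivers (no grant tables, no event-channel notifications, a single request direction): the guest advances a producer index over shared memory, and the driver domain drains up to that index asynchronously.

```python
# Minimal sketch of a shared-memory descriptor ring in the style of
# Xen's split drivers (heavily simplified: no grant tables, no event
# channels). The guest produces request descriptors; the driver
# domain (Dom0) consumes them asynchronously, in batches.

RING_SIZE = 4
ring = [None] * RING_SIZE      # memory shared between guest and Dom0
req_prod = req_cons = 0        # producer/consumer indices

def guest_post_request(desc):
    global req_prod
    assert req_prod - req_cons < RING_SIZE, "ring full"
    ring[req_prod % RING_SIZE] = desc
    req_prod += 1              # publish only after the slot is written

def dom0_consume_requests():
    """Driver domain drains all outstanding requests in one batch."""
    global req_cons
    handled = []
    while req_cons < req_prod:
        handled.append(ring[req_cons % RING_SIZE])
        req_cons += 1
    return handled

guest_post_request({"op": "read", "sector": 42})
guest_post_request({"op": "write", "sector": 7})
done = dom0_consume_requests()
assert [d["sector"] for d in done] == [42, 7]
assert req_cons == req_prod    # ring drained
```

Batching is the point: one notification can cover many descriptors, which is how the asynchronous ring amortizes domain-crossing costs.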

Porting effort is quite low

Evaluation. [Figures.]

Conclusion. The x86 architecture makes virtualization challenging. Full virtualization: unmodified guest OS and good isolation, but performance issues (especially I/O). Para-virtualization: better performance (potentially), but the guest kernel must be updated. Full and para-virtualization will keep evolving together.

Corollary: how often do we lose our audience? My tip: put the point of the slide directly in the title: “Evaluation” vs. “Xen is 1.1x to 3x more performant than VMware”.

Instead: leverage hardware support. First generation: processor. Second generation: memory. Third generation: I/O devices. Still in progress.

IA Protection Rings (CPL). Actually, IA has four protection levels, not two (kernel/user): the CPU Privilege Level (CPL). Ring 0 is “kernel mode” (most privileged); ring 3 is “user mode”; rings 1 & 2 are other. Linux only uses 0 and 3 (“kernel vs. user mode”). Pre-VT Xen modified the guest OS kernel to run in ring 1, reserving ring 0 for the hypervisor. [Figure: rings 0–3, privilege increasing toward ring 0. Credit: Fischbach]

Why aren’t (IA) rings good enough? [Figure: guest applications run at CPL 3 and the guest kernel at CPL 1 inside the VM, with the hypervisor at CPL 0; which ring properly separates the guest kernel from the hypervisor?]

A short list of pre-VT problems. Early IA hypervisors (VMware, Xen) had to emulate various machine behaviors and generally bend over backwards. IA32 page protection does not distinguish CPL 0–2; there is only segment-grained memory protection. Ring aliasing: some IA instructions expose the CPL to the guest, or fail silently. Syscalls don’t work properly and require emulation: sysenter always transitions to CPL 0 (d’oh!), and sysexit faults if the core is not in CPL 0. Interrupts don’t work properly and require emulation: interrupt disable/enable is reserved to CPL 0.

First generation: Intel VT-x & AMD SVM. Eliminates the need for binary translation or modifying OSes. [Figure: host mode and guest mode each have rings 0–3; VMRUN enters guest mode, VMEXIT returns to host mode.]

VT in a Nutshell. A new VM mode bit, orthogonal to CPL (e.g. kernel/user mode). If the VM bit is off (host mode), the machine “looks just like it always did” (“VMX root”). If the VM bit is on (guest mode), the machine is running a guest VM in “VMX non-root mode”: it looks just like it always did to the guest, BUT various events trigger a gated entry to the hypervisor (in VMX root). Such a “virtualization intercept” exits VM mode to the VMM (VM Exit); the hypervisor can control which events cause intercepts, examine and manipulate guest VM state, and return to the VM (VM Entry).

CPU Virtualization With VT-x. Two new VT-x operating modes: a less-privileged mode (VMX non-root) for guest OSes and a more-privileged mode (VMX root) for the VMM; guests keep their usual ring 3 (apps) and ring 0 (OS). Two new transitions: VM entry to non-root operation, and VM exit to root operation. Execution controls determine when exits occur (access to privileged state, occurrence of exceptions, etc.), with flexibility provided to minimize unwanted exits. The VM Control Structure (VMCS) controls VT-x operation and also holds guest and host state.
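The entry/exit loop can be modeled as a toy state machine. Everything here is illustrative (the real VMCS is a hardware-defined structure, and exit reasons are encoded numerically); the point is that the hypervisor chooses which events exit, and everything else runs without its involvement.

```python
# Toy model of VT-x operation (illustrative, not the real VMCS
# layout). A VMCS-like dict holds execution controls that decide
# which guest events cause a VM exit to VMX root mode; all other
# instructions retire in VMX non-root mode with no VMM involvement.

vmcs = {
    "exit_on": {"cpuid", "io"},  # execution controls chosen by the VMM
    "guest_rip": 0,
    "exits": [],
}

def handle_exit(event):
    """VMX root mode: examine/manipulate guest state, then re-enter."""
    pass

def vmrun(guest_trace):
    """VM entry: run guest events, exiting only on controlled ones."""
    for event in guest_trace:
        if event in vmcs["exit_on"]:
            vmcs["exits"].append(event)  # VM exit to VMX root mode
            handle_exit(event)           # hypervisor handles, VM entry resumes
        # otherwise the instruction retires in VMX non-root mode
        vmcs["guest_rip"] += 1

vmrun(["add", "cpuid", "mov", "io", "add"])
assert vmcs["exits"] == ["cpuid", "io"]  # only controlled events exited
assert vmcs["guest_rip"] == 5            # all instructions retired
```

Minimizing entries in `exit_on` is exactly the "flexibility provided to minimize unwanted exits" the slide mentions: each avoided exit saves a round trip to the VMM.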

Second generation: Intel EPT & AMD NPT. Eliminates the need for shadow page tables.
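The contrast with shadow paging can be sketched with the same dict-as-page-table model: nothing is pre-combined, so the guest may rewrite its page table freely with no VMM intercepts, and the hardware performs both translation stages on each walk.

```python
# Minimal sketch of nested paging (Intel EPT / AMD NPT) with dicts as
# page tables. Unlike a shadow page table, no combined table is
# maintained: the hardware walks the guest page table and then the
# EPT, so guest page-table updates need no hypervisor intercepts.

guest_pt = {0x10: 1, 0x11: 2}   # gVA -> gPA, guest-managed
ept      = {1: 3, 2: 9}         # gPA -> hPA, hypervisor-managed

def translate(gva):
    """Two-stage hardware walk: guest page table, then EPT."""
    gpa = guest_pt[gva]         # stage 1 (in real hardware, this walk's
    return ept[gpa]             # own memory accesses also go through EPT)

assert translate(0x10) == 3
guest_pt[0x12] = 1              # guest remaps freely: no VM exit needed
assert translate(0x12) == 3
```

The trade-off versus shadow paging: no consistency-maintenance overhead, but a longer worst-case walk, since each level of the guest table may itself require an EPT walk.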

Containers and isolation

Containers: idea. Run a process with a restricted view of system resources: hide other processes, and limit access to system resources. This is not full virtualization: the process must use the existing kernel and OS.

Pre-container isolation features in Linux. chroot: set the current root directory for processes; added to Unix in 1979. Namespaces: provide processes with their own view of resources (process IDs, networking sockets, hostnames, etc.), akin to virtual address spaces; originally introduced in 2002. Copy-on-write filesystem: allow a process to view the existing filesystem, but any modification results in a copy that is then updated, akin to virtual memory after fork.
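The copy-on-write idea can be shown in miniature, with dicts standing in for directory trees. This is in the spirit of overlay filesystems but greatly simplified: reads fall through to a shared base image, writes copy up into a private layer, and the base is never modified.

```python
# Miniature copy-on-write view of a filesystem (dicts stand in for
# directory trees; greatly simplified relative to real overlay
# filesystems). Containers sharing one base image each get a private
# upper layer; reads fall through to the base, writes land only in
# the upper layer, so the shared base is never modified.

base = {"/etc/hosts": "127.0.0.1 localhost", "/bin/sh": "ELF..."}

class CowView:
    def __init__(self, base):
        self.base = base         # shared, never written
        self.upper = {}          # private per-container layer

    def read(self, path):
        return self.upper.get(path, self.base[path])

    def write(self, path, data):
        self.upper[path] = data  # copy-up: base stays untouched

c1, c2 = CowView(base), CowView(base)
c1.write("/etc/hosts", "10.0.0.2 myservice")
assert c1.read("/etc/hosts") == "10.0.0.2 myservice"
assert c2.read("/etc/hosts") == "127.0.0.1 localhost"  # isolated
assert base["/etc/hosts"] == "127.0.0.1 localhost"     # base intact
```

This mirrors the fork analogy on the slide: both containers start from identical shared state, and divergence costs space only for what each one actually writes.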

cgroups in Linux. Enforce policies that restrict how processes use resources. Example: a process can only use 1 Mbps of network bandwidth. Policies are enforced on groups of processes; controllers enforce the policies for various resources.

cgroups are hierarchical. Resources are inherited from parent groups, and processes can only be placed in leaf groups.
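Inheritance can be modeled as taking the tightest limit along the path from a leaf to the root. This is a simplified model (real controllers differ in how limits compose), but it captures why a loose limit on a leaf cannot escape a stricter ancestor.

```python
# Simplified model of hierarchical cgroup limits: a process in a leaf
# group is constrained by every ancestor, so its effective limit is
# the tightest (minimum) limit on the path to the root. Real
# controllers compose limits in controller-specific ways; this only
# illustrates the inheritance idea. All group names are made up.

cgroups = {
    "root":     {"parent": None,     "mem_limit": float("inf")},
    "system":   {"parent": "root",   "mem_limit": 8_000},  # MB
    "web":      {"parent": "system", "mem_limit": 2_000},
    "worker-1": {"parent": "web",    "mem_limit": 4_000},  # looser than parent
}

def effective_limit(group):
    """Walk up the hierarchy, keeping the tightest limit seen."""
    limit = float("inf")
    while group is not None:
        limit = min(limit, cgroups[group]["mem_limit"])
        group = cgroups[group]["parent"]
    return limit

# The loose 4000 MB leaf setting cannot exceed the 2000 MB parent cap.
assert effective_limit("worker-1") == 2_000
assert effective_limit("system") == 8_000
```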

Controllers enforce restrictions for cgroups. There are different controllers for different system resources; each provides a policy for how processes should be restricted.

Controllers in Linux:
io – limit I/O requests, either capped per process or proportionally.
memory – enforce memory caps on processes.
pids – limit the number of new processes in a cgroup.
perf_event – allow monitoring performance.
rdma – limit remote DMA.
cpu – limit CPU usage when the CPU is busy.
freezer – allow suspending all processes in a cgroup.

cgroups + namespaces + CoW enable containerized processes on a shared kernel: lightweight virtualization.

Docker. Docker provides an interface on top of these underlying designs and popularized containers. It also packages a filesystem into a Docker image, providing more of a “virtual machine” notion, though processes still run on the existing kernel. An interesting case of commercializing existing OS functionality.

Third generation: Intel VT-d & AMD IOMMU. I/O device assignment: the VM owns a real device. DMA remapping: supports address translation for DMA. Interrupt remapping: routes device interrupts. Next up: Dune (OSDI ’12).

Dune: run an application as a “process” in guest mode, and let it use all CPLs.


Hypervisor calls. The VT VMCALL operation (an instruction) voluntarily traps to the hypervisor, similar to SYSCALL trapping to kernel mode. It is not needed for transparent virtualization, but Dune uses it to call the “real” kernel.

Memory management in Dune. Configure the EPT to provide process memory; user programs can then directly access the page table. [Figure: in a normal process, kernel-managed host-virtual addresses map through the kernel page table to host-physical RAM; in a Dune process, guest-virtual addresses map through a user page table to guest-physical addresses, and then through the EPT to host-physical RAM.]
