OS and VM isolation
===================

Paper: comparing different isolation techniques.
  OS processes; Linux containers; gVisor; Firecracker.
  Common concerns: strong isolation, high performance, compatibility.

Motivation: serverless compute platforms (AWS Lambda, Azure Functions, etc).
  Customer provides code to run, but doesn't manage specific VMs / machines.
  Cloud provider spins up additional machines to handle load as needed.
  Security challenge: arbitrary code, need to isolate it from other customers.
  Performance challenge: load might vary widely.
    Can vary from a small fraction of a machine to many machines.
    Can vary quickly, so need to start new instances quickly.

Motivation: convergence of isolation technologies on container abstraction.
  Same logical abstraction (running processes with a given Linux file system image).
  Many ways to run this abstraction.  Are there important differences?

Motivation for this class: understand the different options for isolation.
  Also, lab 2 will use containers.

Isolation approaches:
  Linux processes (as in OKWS).
  Containers, using Linux namespaces + cgroups.
  "User-space kernel" or "library OS" (gVisor, Drawbridge).
  VMs.
  Language runtimes (next lecture).

Linux processes.
  User IDs.  Per-file permissions.
  Intended for fine-grained sharing between users, rather than coarse isolation.

"Linux containers" terminology refers to two different roles.
  Abstraction: packaging software along with all dependencies.
    Libraries, packages, files, etc.
    Useful even without isolation: what if two apps need different libssl versions?
  Isolation: ensuring that the code runs securely on same machine as other apps.
  Container abstraction could be used with other isolation plans.
    Open Container Initiative (OCI) has a standard container format.
    Docker container can run with Linux container isolation or gVisor.
    Even support for running Docker containers on Firecracker.
      [[ Ref: https://github.com/weaveworks/ignite ]]

Container isolation built on two Linux mechanisms: namespaces and cgroups.
  Namespaces enable control over what files, processes, etc, are visible.
  Cgroups control resource use for performance isolation.

Why is isolation challenging in Linux?
  Lots of shared state ("resources") in the kernel.
    System calls access shared state by naming it.
    PIDs.  File names.  IP addresses / ports.  (Even user IDs, in some form.)
  Typical access control revolves around user IDs (e.g., file permissions).
    Hard to use that to enforce isolation between two applications.
    Lots of files with permissions.
    Applications create shared files by accident or on purpose (e.g., world-writable).

Linux mechanism: chroot.
  Saw this in OKWS.
  Benefit: limits the files that one application can name.
    Doesn't matter if application accidentally creates world-writable files.
  Some technical limitations, but a good starting point for better isolation.

Namespaces provide a way of scoping the resources that can be named (sketch below).
  [[ Ref: https://blog.quarkslab.com/digging-into-linux-namespaces-part-1.html ]]
  Process belongs to a particular namespace (for each namespace kind).
  New processes inherit namespace of the parent process.
  E.g., PID namespace limits the PIDs that a process can name.
  Coarse-grained isolation, not subject to what application might do.
  A better-designed chroot for different kinds of resources (not just file system).

Linux cgroups.
  Limit / scheduling for resource use.
    Memory, CPU, disk I/O, network I/O, etc.
  Applies to processes, much like namespaces.
    New processes inherit cgroup of the parent process.
  Not a security boundary, but important for preventing DoS attacks.
    E.g., one process or VM tries to monopolize all CPU or memory.
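Aside: a minimal sketch of the namespace + chroot primitives from C.  This is
not how Docker/LXC is actually implemented (real runtimes use pivot_root rather
than chroot, and also set up cgroups, user namespaces, veth devices, etc); the
rootfs path and /bin/sh are placeholders, and it needs root.

    /* Minimal namespace sketch: run a shell in fresh PID, mount, and network
     * namespaces, confined to an unpacked container file tree. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int child(void *arg)
    {
        const char *rootfs = arg;   /* unpacked container file tree */

        /* Confine file names to the container's tree. */
        if (chroot(rootfs) != 0 || chdir("/") != 0) {
            perror("chroot");
            exit(1);
        }
        /* In the new namespaces: this process is PID 1, sees no host network
         * interfaces, and its mounts don't affect the host. */
        execl("/bin/sh", "sh", (char *)NULL);
        perror("execl");
        exit(1);
    }

    int main(int argc, char **argv)
    {
        static char stack[1024 * 1024];
        int flags = CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | SIGCHLD;

        /* clone() is like fork(), but the CLONE_NEW* flags put the child
         * into fresh PID, mount, and network namespaces. */
        pid_t pid = clone(child, stack + sizeof(stack), flags,
                          argc > 1 ? argv[1] : "/srv/container-root");
        if (pid < 0) {
            perror("clone");
            return 1;
        }
        waitpid(pid, NULL, 0);
        return 0;
    }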
Containers using namespaces + cgroups.
  Unpack container's intended files somewhere in the file system.
  Allocate new namespace to run container.
  Point the root directory of the container namespace at container's file tree.
  Set up cgroup for container based on any scheduling policy.
  Set up a virtual network interface for container.
  Run processes in this container.
  Appears to run in a separate Linux system.
    Its own file system, its own network interface, its own processes (PIDs), etc.

Why might namespaces not be enough?
  Shared Linux kernel.
  Wide attack surface: 350+ system calls, many specialized functions under ioctl...
  Large amount of code, written in C.
    Bugs (buffer overflows, use-after-free, ...) continue to be discovered.
  No isolation within the Linux kernel itself.
    Kernel bugs let adversary escape isolation ("local privilege escalation" or LPE).
    Relatively common: new LPE bugs every year.

Additional security mechanism: seccomp-bpf.
  Idea: filter what system calls can be invoked by a process.
  Might help us address the wide attack surface of the Linux kernel.
    Common pattern: rarely-used syscalls or features are more likely to be buggy.
  Every process is (optionally) associated with a system call filter.
    Filter written as a little program in the BPF bytecode language.
    Linux kernel runs this filter on each syscall invocation, before running syscall.
    Filter program can decide if syscall should be allowed or not.
      Can look at syscall number, arguments, etc.
  New processes inherit syscall filter of parent process: "sticky".
  Can use seccomp-bpf to prevent access to suspect syscalls.
    Used by some container implementations.
    Set up bpf filter to disallow suspect system calls.
  Why might this not be good enough?
    Restricting syscalls could limit compatibility.
      Could break application code that uses uncommon syscalls.
    But still might not be enough for security (lots of code/bugs in common syscalls).

Additional security mechanisms: mandatory access control.
  E.g., Linux has LSM (https://en.wikipedia.org/wiki/Linux_Security_Modules).
    Many variants: AppArmor, SELinux, etc.
  Administrator specifies policy on what operations are allowed.
    Typically based on what user is running process, or what binary is executing.
    Can specify broad rules (e.g., no write access to any files).

More heavy-weight approach: VMs (QEMU).
  Run Linux in a guest VM.
    Much like the VM used to run lab code for 6.858.
  Why is this better than Linux?
    Smaller attack surface: no complex syscalls, just x86 + virtual devices.
    Fewer bugs / vulnerabilities: VM escape bugs discovered less than once a year.
  What's the downside?
    High start-up cost: takes a long time to boot up VM.
    High overhead: large memory cost for every running VM.
    Rigid/coarse resource allocation and sharing (VM memory; virtual disk; vCPU).
    Potential bugs in VMM itself (QEMU): 1.4M lines of C code.

What's involved in implementing support for VMs?
  Virtualizing the CPU and memory.
    Hardware support in modern processors.
    Nested page tables.
    Virtualizing privileged registers that are normally only accessible to kernel.
  Virtualizing devices.
    Disk controller.  Network card.  PCI.  Graphics card.  Keyboard, serial ports, ...
  Virtualizing the boot process.
    BIOS, boot loader.

Linux KVM.
  [[ Ref: https://www.kernel.org/doc/html/latest/virt/kvm/api.html ]]
  Abstraction for using hardware support for virtualization.
  Manages virtual CPUs, virtual memory.
    Corresponding hardware support: nested page tables.
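Aside: to make the KVM abstraction concrete, a minimal sketch of the /dev/kvm
API in C, following the standard pattern: create a VM, give it guest memory,
create a vCPU, and handle VM exits in user space.  Error handling is omitted,
and the two-byte guest program, port number, and register values are made up
for illustration; this is not code from QEMU or Firecracker, but it is the
loop both are built around.

    #include <fcntl.h>
    #include <linux/kvm.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Two-byte guest, in 16-bit real mode: out %al,(%dx); hlt */
        const uint8_t code[] = { 0xee, 0xf4 };

        int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
        int vmfd = ioctl(kvm, KVM_CREATE_VM, 0UL);

        /* Give the guest 4 KB of "physical" memory at guest address 0x1000. */
        void *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        memcpy(mem, code, sizeof(code));
        struct kvm_userspace_memory_region region = {
            .slot = 0,
            .guest_phys_addr = 0x1000,
            .memory_size = 0x1000,
            .userspace_addr = (uint64_t)mem,
        };
        ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region);

        /* One virtual CPU; its kvm_run area is shared with the kernel. */
        int vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0UL);
        int mmap_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0UL);
        struct kvm_run *run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, vcpufd, 0);

        /* Point the vCPU at the guest code: real mode, cs base 0, rip 0x1000. */
        struct kvm_sregs sregs;
        ioctl(vcpufd, KVM_GET_SREGS, &sregs);
        sregs.cs.base = 0;
        sregs.cs.selector = 0;
        ioctl(vcpufd, KVM_SET_SREGS, &sregs);
        struct kvm_regs regs = {
            .rip = 0x1000, .rax = 42, .rdx = 0x3f8, .rflags = 0x2,
        };
        ioctl(vcpufd, KVM_SET_REGS, &regs);

        /* Run until the guest does something the VMM must handle (a VM exit). */
        for (;;) {
            ioctl(vcpufd, KVM_RUN, 0UL);
            if (run->exit_reason == KVM_EXIT_IO) {
                /* Here is where a VMM would emulate a device. */
                printf("guest wrote %d to port 0x%x\n",
                       *((uint8_t *)run + run->io.data_offset), run->io.port);
            } else if (run->exit_reason == KVM_EXIT_HLT) {
                printf("guest halted\n");
                return 0;
            }
        }
    }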
QEMU.
  Implements virtual devices, similar to what real hardware would have.
  Also implements purely-virtual devices (virtio).
    Well-defined interface through shared memory regions.
  Also implements emulation of CPU instructions.
    Mostly not needed when using hardware support.
    But still used for instructions that hardware doesn't support natively.
      E.g., CPUID, INVD, ..
      [[ Ref: https://revers.engineering/day-5-vmexits-interrupts-cpuid-emulation/ ]]
  Also provides some BIOS implementation to start running the VM.

Firecracker design.
  Figure 3 in the paper.
  Use KVM for virtual CPU and memory.
  Re-implement QEMU, in Rust.
  Support minimal set of devices.
    virtio network, virtio block (disk), keyboard, serial.
  Block devices instead of file system: stronger isolation boundary.
    File system has complex state.
      Directories, files of variable length, symlinks / hardlinks.
    File system has complex operations.
      Create/delete/rename files, move whole directories, r/w range, append, ...
    Block device is far simpler: 4 KByte blocks.
      Blocks are numbered 0 through N-1, where N is the number of blocks on the disk.
      Read and write a whole block.  (And maybe flush / barrier.)
  Do not support instruction emulation.
    (Except for necessary instructions like CPUID, VMCALL/VMEXIT, ..)
  Do not support any BIOS at all.
    Just load the kernel into VM at initialization and start running it.

Firecracker implementation: Rust.
  Memory-safe language (modulo "unsafe" code).
  50K lines of code: much smaller than QEMU.
  Makes it unlikely that VMM implementation has bugs like buffer overflows.
  [[ Ref: https://github.com/firecracker-microvm/firecracker ]]

Firecracker VMM runs in a "jailed" process.
  chroot to limit files that can be accessed by VMM.
  namespaces to limit VMM from accessing other processes and network.
  running as a separate user ID.
  seccomp-bpf to limit what system calls the VMM can invoke.
  All to ensure that, if bugs in VMM are exploited, hard to escalate attack.

Firecracker security seems pretty good.
  Rust implementation: less bug-prone.
  Much less code in the VMM.
  Jailed VMM process.
  Linux KVM still part of TCB, but much smaller than QEMU.
    Still, KVM bugs would undermine Firecracker's isolation.
    [[ Ref: https://googleprojectzero.blogspot.com/2021/06/an-epyc-escape-case-study-of-kvm.html ]]
  Some bugs found in Firecracker:
    [[ Ref: https://github.com/firecracker-microvm/firecracker/issues/1462 ]]
      Memory bounds-checking issue, despite being written in Rust.
    [[ Ref: https://github.com/firecracker-microvm/firecracker/issues/2057 ]]
      DoS bug in network interface.
    [[ Ref: https://github.com/firecracker-microvm/firecracker/issues/2177 ]]
      Serial console buffer grew without bound.
      Could cause one VM to use lots of memory through the Firecracker process.

gVisor plan: re-implement the OS syscall interface in a separate user-space process.
  Figure 2 in the paper.
  Intercept syscalls from processes running in the container (using ptrace or KVM);
    see the ptrace sketch below.
  User-space process that implements those syscalls, written in Go.
    Again, better language than C for avoiding buffer overflows, other mistakes.
  Benefit: less likely to have memory management bugs in Go code.
  Benefit: bugs aren't in kernel code, likely contained by Linux process.
    Use seccomp-bpf to limit what syscalls the gVisor emulator can invoke.
  Benefit: finer-grained sharing.
    Could share specific files or directories.
  Benefit: finer-grained resource allocation.
    Not just a monolithic virtual disk or entire VM memory allocation.
    Perhaps important for running a small application in isolation.
  Downside: performance overheads could be significant.
    Every system call must be redirected to gVisor process.
    Context-switch overhead, data copying overhead, etc.
  Possible downside (or upside): compatibility (real Linux vs gVisor).
    gVisor does a credible job faithfully implementing Linux syscalls, though!
    Could make it possible to emulate new syscalls on old host.
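Aside: a minimal sketch of syscall interception with ptrace, the basic
mechanism behind gVisor's ptrace platform.  This is not gVisor's code (the
Sentry is written in Go and actually emulates the syscalls itself, rather
than just observing them); the traced workload (/bin/ls) and the intercepted
syscall (openat) are arbitrary examples, and the sketch is x86-64 specific.

    #include <stdio.h>
    #include <sys/ptrace.h>
    #include <sys/syscall.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);  /* let the parent trace us */
            execl("/bin/ls", "ls", (char *)NULL);   /* the "sandboxed" code */
            return 1;
        }

        int status;
        waitpid(pid, &status, 0);                   /* child stops at execve */
        while (1) {
            /* Resume the child until its next syscall boundary (this stops
             * twice per syscall: once at entry, once at exit). */
            ptrace(PTRACE_SYSCALL, pid, NULL, NULL);
            waitpid(pid, &status, 0);
            if (WIFEXITED(status))
                break;

            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, pid, NULL, &regs);
            /* A real supervisor would emulate the syscall here instead of
             * letting the host kernel run it. */
            if (regs.orig_rax == SYS_openat)
                fprintf(stderr, "intercepted openat() in child %d\n", pid);
        }
        return 0;
    }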
Common pattern: supervisor/monitor process (gVisor's Sentry, Firecracker's VMM).
  Handles requests from isolated code, instead of exposing the host OS.
  Supervisor process is itself sandboxed, isolated.
    seccomp, cgroups used to limit damage from Sentry and Firecracker VMM.
  Privilege separation within gVisor: Gofer is separate from Sentry.
    Gofer has extra privileges (file access).
    Don't want arbitrary file access by compromised Sentry.

Security comparison: syscalls accessible.
  Total of ~350 syscalls.
  LXC (Docker): blocks 44 syscalls (so 300+ allowed).
  Firecracker: 36 syscalls allowed for VMM.
  gVisor: 53-68 syscalls allowed for Sentry.

Code coverage comparison.
  Native processes execute the least code.
  Significant overlap between all three isolation platforms.
  Firecracker, gVisor use hardware virtualization support.
    LXC (and native) doesn't use it.
  Why are all isolation frameworks hitting the same code?
    One possible answer: ultimately need to implement desired function.
      Context-switching, storage, networking, ..
    Indirection layers might be important for isolation.
    Still need the host to access storage, network, etc.
  Some evidence for hitting the same code: Figures 9-12 (network flame graphs).
    All 4 isolation platforms need to send packets.
    Copying data is the biggest cost.  Memory allocation.
    Firecracker and gVisor don't use host TCP stack.

CPU performance?
  All platforms have pretty reasonable performance.

Network performance?
  User-space network stack in gVisor is not nearly as optimized.
  Firecracker ends up running Linux network stack inside the VM, great throughput.
  High latency for Firecracker, due to packets going through two network stacks.

Memory management?
  Coarse-grained memory allocation in Firecracker: lower dynamic allocation costs.
  Similar code running for native, LXC, and gVisor.

Storage?
  Firecracker has coarse-grained virtual disk file, fewer per-file operations.
  Native, LXC, and gVisor all end up mapping guest files 1-to-1 to host files.
  Firecracker is really fast, but mostly because it's not making writes durable.
    Forcing Firecracker to flush brings its storage perf in line with other systems.

What are some potential benefits or downsides of each of the platforms?
  Native: simple, least code being executed, least overhead.
  LXC: isolation and container abstraction, flexible sharing, near-native perf.
  gVisor: strong isolation but still flexible sharing, resource allocation.
  Firecracker: strong isolation, better perf than gVisor, but coarse-grained.

Summary.
  Isolation is a key building block for security (yet again).
  Challenging to achieve isolation along with other goals:
    High performance.
    Low overheads (memory, context switching, etc).
    Compatibility with existing systems (e.g., Linux).
  Comparison of different OS/VM-based isolation mechanisms.