OS and VM isolation
===================

Case study: Amazon's Firecracker paper.
  Lambda service: run customer-supplied Linux app, scaling to load.
  Security challenge: arbitrary code, need to isolate it from other customers.
  Performance challenge: load might vary widely.
    Could be much less than a machine (so lots of customers per machine).
    Could quickly grow, so need to start running customer code on a new machine.
  Goal: low-overhead but strong isolation domains.

Isolation approaches:
  Linux processes (as in OKWS).
  Containers, using Linux namespaces + cgroups.
  VMs.
  Language runtimes.

Linux processes.
  User IDs.  Per-file permissions.
  Intended for fine-grained sharing between users, rather than coarse isolation.

Linux containers serve two purposes.
  Packaging software along with all of its dependencies (libraries, packages, files, etc).
  Security and performance isolation for running that software.
  Both rely on Linux namespaces: abstraction of running on a separate Linux machine.
  Performance isolation uses Linux cgroups to control resource use.

Why is isolation challenging in Linux?
  Lots of shared state ("resources") in the kernel.
  System calls access shared state by naming it.
    PIDs.  File names.  IP addresses / ports.  (Even user IDs, in some form.)
  Typical access control revolves around user IDs (e.g., file permissions).
    Hard to use that to enforce isolation between two applications.
    Lots of files with permissions.
    Applications create shared files by accident or on purpose (e.g., world-writable).

Linux mechanism: chroot.
  Saw this in OKWS.
  Benefit: limits the files that one application can name.
    Doesn't matter if the application accidentally creates world-writable files.
  Some technical limitations, but a good starting point for better isolation.

Namespaces provide a way of scoping the resources that can be named.
  [[ Ref: https://blog.quarkslab.com/digging-into-linux-namespaces-part-1.html ]]
  Process belongs to a particular namespace (for each namespace kind).
  New processes inherit the namespaces of the parent process.
  E.g., PID namespace limits the PIDs that a process can name.
  Coarse-grained isolation, not subject to what the application might do.
  A better-designed chroot for different kinds of resources (not just the file system).

Linux cgroups.
  Limits / scheduling for resource use.
    Memory, CPU, disk I/O, network I/O, etc.
  Applies to processes, much like namespaces.
  New processes inherit the cgroup of the parent process.
  Not a security boundary, but important for preventing DoS attacks.
    E.g., one process or VM tries to monopolize all CPU or memory.

Containers using namespaces + cgroups.
  Unpack the container's intended files somewhere in the file system.
  Allocate new namespaces to run the container.
  Point the root directory of the container's namespace at the container's file tree.
  Set up a cgroup for the container based on any scheduling policy.
  Set up a virtual network interface for the container.
  Run processes in this container.
    Appears to run in a separate Linux system.
    Its own file system, its own network interface, its own processes (PIDs), etc.
  (A minimal C sketch of these steps appears below, after the next section.)

Why are namespaces not sufficient for Lambda?
  Shared Linux kernel.
  Wide attack surface: 300+ system calls, many specialized functions under ioctl, ...
  Large amount of code, written in C.
    Bugs (buffer overflows, use-after-free, ...) continue to be discovered.
  No isolation within the Linux kernel itself.
    Kernel bugs let an adversary escape isolation ("local privilege escalation" or LPE).
    Relatively common: new LPE bugs every year.
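A minimal C sketch of the container recipe above (namespaces + chroot + cgroup).
  Illustrative only, not how any particular container runtime is implemented:
  the unpacked file tree at /containers/demo (with its own /bin/sh) and the
  cgroup directory /sys/fs/cgroup/demo are hypothetical paths that must already
  exist, it needs to run as root, and most error handling is omitted.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int child(void *arg) {
        (void) arg;
        /* Inside the new namespaces: scope file names to the container's tree. */
        if (chroot("/containers/demo") != 0 || chdir("/") != 0)
            return 1;
        /* This shell sees only the container's files, PIDs, hostname, network. */
        execl("/bin/sh", "sh", (char *) NULL);
        return 1;
    }

    int main(void) {
        static char stack[1024 * 1024];   /* child stack; grows downward on x86 */

        /* New PID, mount, hostname (UTS), and network namespaces for the child. */
        int flags = CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWNET | SIGCHLD;
        pid_t pid = clone(child, stack + sizeof(stack), flags, NULL);
        if (pid < 0) {
            perror("clone");
            return 1;
        }

        /* cgroup (v2) for performance isolation: cap the container at 256 MB.
         * A careful runtime would add the child to the cgroup before it execs. */
        FILE *f = fopen("/sys/fs/cgroup/demo/memory.max", "w");
        if (f) { fprintf(f, "268435456\n"); fclose(f); }
        f = fopen("/sys/fs/cgroup/demo/cgroup.procs", "w");
        if (f) { fprintf(f, "%d\n", (int) pid); fclose(f); }

        waitpid(pid, NULL, 0);
        return 0;
    }

  Real container runtimes do more (e.g., user namespaces, pivot_root rather than
  chroot, a virtual network interface for the new network namespace), plus the
  seccomp-bpf filtering described next.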
Additional security mechanism: seccomp-bpf.
  Idea: filter what system calls can be invoked by a process.
    Might help us address the wide attack surface of the Linux kernel.
    Common pattern pointed out in the Lambda paper:
      Rarely-used syscalls or features are more likely to be buggy.
  Every process is (optionally) associated with a system call filter.
    Filter written as a little program in the BPF bytecode language.
    Linux kernel runs this filter on each syscall invocation, before running the syscall.
    Filter program can decide if the syscall should be allowed or not.
      Can look at the syscall number, arguments, etc.
    New processes inherit the syscall filter of the parent process: "sticky".
  Can use seccomp-bpf to prevent access to suspect syscalls.
    Used by some container implementations.
    Set up a BPF filter to disallow suspect system calls.
    (A minimal C sketch of installing such a filter appears near the end of these notes.)
  Why is this not good enough for Lambda?
    Bad trade-off.
    Starts to break some customer code that uses uncommon syscalls.
    But still might not be enough for security (lots of code/bugs in common syscalls).

Additional security mechanism: mandatory access control.
  E.g., Linux has LSM (https://en.wikipedia.org/wiki/Linux_Security_Modules)
  Many variants: AppArmor, SELinux, etc.
  Administrator specifies a policy on what operations are allowed.
    Typically based on what user is running the process, or what binary is executing.
    Can specify broad rules (e.g., no write access to any files).

Why not language runtimes for Lambda?
  Want to support arbitrary Linux binaries.
  Will look in more detail at language runtimes next week.
  Language runtimes are appealing for Lambda-like workloads, though: low overheads.
    [[ Ref: https://developers.cloudflare.com/workers/learning/security-model ]]

More heavy-weight approach: VMs.
  Run Linux in a guest VM.
  Much like the VM used to run lab code for 6.858.
  Why is this better than Linux?
    Smaller attack surface: no complex syscalls, just x86 + virtual devices.
    Fewer bugs / vulnerabilities: VM escape bugs discovered less than once a year.
  Why are VMs not good enough for Lambda either?
    High start-up cost: takes a long time to boot up a VM.
    High overhead: large memory cost for every running VM.
    Potential bugs in the VMM itself (QEMU): 1.4M lines of C code.
  Paper's plan: write a new VMM, but keep using KVM.

What's involved in implementing support for VMs?
  Virtualizing the CPU and memory.
    Hardware support in modern processors.
    Nested page tables.
    Virtualizing privileged registers that are normally accessible only to the kernel.
  Virtualizing devices.
    Disk controller.
    Network card.
    PCI.
    Graphics card.
    Keyboard, serial ports, ...
  Virtualizing the boot process.
    BIOS, boot loader.

Linux KVM.
  [[ Ref: https://www.kernel.org/doc/html/latest/virt/kvm/api.html ]]
  Abstraction for using hardware support for virtualization.
  Manages virtual CPUs, virtual memory.
  Corresponding hardware support: nested page tables.
  (A minimal usage sketch appears below, after the QEMU notes.)

QEMU.
  Implements virtual devices, similar to what real hardware would have.
  Also implements purely-virtual devices (virtio).
    Well-defined interface through shared memory regions.
  Also implements emulation of CPU instructions.
    Mostly not needed when using hardware support.
    But still used for instructions that hardware doesn't support natively.
    E.g., CPUID, INVD, ..
    [[ Ref: https://revers.engineering/day-5-vmexits-interrupts-cpuid-emulation/ ]]
  Also provides a BIOS implementation to start running the VM.
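A minimal C sketch of the KVM API flow that both QEMU and Firecracker build on.
  Simplified from the KVM API reference above, not Firecracker's actual code:
  no guest kernel is loaded and no registers are set up, so the single KVM_RUN
  below exits almost immediately instead of executing a real guest, and error
  handling is omitted.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/kvm.h>

    int main(void) {
        /* The KVM module is driven entirely through ioctls on file descriptors. */
        int kvm = open("/dev/kvm", O_RDWR);
        int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

        /* Guest physical memory: one 64 KB region backed by anonymous host memory.
         * The hardware's nested page tables map guest-physical addresses to it. */
        void *mem = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        struct kvm_userspace_memory_region region = {
            .slot            = 0,
            .guest_phys_addr = 0,
            .memory_size     = 0x10000,
            .userspace_addr  = (unsigned long) mem,
        };
        ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

        /* One virtual CPU; KVM shares its kvm_run state with us via mmap. */
        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
        int run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
        struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, vcpu, 0);

        /* A real VMM loops here: KVM_RUN executes the guest until it needs help
         * (e.g., KVM_EXIT_IO / KVM_EXIT_MMIO for device accesses), the VMM
         * emulates the device (virtio, serial, ...), then calls KVM_RUN again. */
        ioctl(vcpu, KVM_RUN, 0);
        printf("vcpu exited, reason %u\n", run->exit_reason);
        return 0;
    }

  Everything beyond this (virtio block/network devices, serial, loading the guest
  kernel directly without a BIOS) is the user-space VMM's job, which is exactly
  the part Firecracker re-implements in Rust and jails.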
Firecracker design.
  Use KVM for virtual CPU and memory.
  Re-implement QEMU.
  Support a minimal set of devices.
    virtio network, virtio block (disk), keyboard, serial.
  Block devices instead of a file system: stronger isolation boundary.
    File system has complex state.
      Directories, files of variable length, symlinks / hardlinks.
    File system has complex operations.
      Create/delete/rename files, move whole directories, r/w a range, append, ...
    Block device is far simpler: 4-KByte blocks.
      Blocks are numbered 0 through N-1, where N is the size of the disk (in blocks).
      Read and write a whole block.  (And maybe flush / barrier.)
  Do not support instruction emulation.
    (Except for necessary instructions like CPUID, VMCALL/VMEXIT, ..)
  Do not support any BIOS at all.
    Just load the kernel into the VM at initialization and start running it.

Firecracker implementation: Rust.
  Memory-safe language (modulo "unsafe" code).
  50K lines of code: much smaller than QEMU.
  Makes it unlikely that the VMM implementation has bugs like buffer overflows.
  [[ Ref: https://github.com/firecracker-microvm/firecracker ]]

Firecracker VMM runs in a "jailed" process.
  chroot to limit the files that can be accessed by the VMM.
  namespaces to limit the VMM from accessing other processes and the network.
  running as a separate user ID.
  seccomp-bpf to limit what system calls the VMM can invoke (sketched below).
  All to ensure that, if bugs in the VMM are exploited, it is hard to escalate the attack.

Lambda architecture around the Firecracker core mechanism.
  Many workers (physical machines running Firecracker-based MicroVMs).
  Each worker has a fixed number of MicroVM "slots" available to run customer code.
  Customer code gets loaded into a "slot" by starting a MicroVM and loading the code into it.
  Worker manager is in charge of deciding where requests get routed.
  Frontend looks up the worker via the worker manager, sends the request directly there.

How well does Firecracker achieve its goals?
  Overhead seems quite low.
    3 MB memory overhead per idle VM.
    125 msec boot time.
  Performance seems OK.
    CPU performance is basically KVM (so, unchanged).
    Device I/O performance is not so great.
      Slow virtual disk: need concurrency.
      Slow virtual network: need PCI pass-through.
  Security probably pretty good.
    Rust implementation: less bug-prone.
    Much less code in the VMM.
    Jailed VMM process.
    Linux KVM still part of the TCB, but much smaller than QEMU.
    Still, KVM bugs would undermine Firecracker's isolation.
      [[ Ref: https://googleprojectzero.blogspot.com/2021/06/an-epyc-escape-case-study-of-kvm.html ]]
  Some bugs found in Firecracker:
    [[ Ref: https://github.com/firecracker-microvm/firecracker/issues/1462 ]]
      Memory bounds-checking issue, despite being written in Rust.
    [[ Ref: https://github.com/firecracker-microvm/firecracker/issues/2057 ]]
      DoS bug in the network interface.
    [[ Ref: https://github.com/firecracker-microvm/firecracker/issues/2177 ]]
      Serial console buffer grew without bound.
      Could cause one VM to use lots of memory through the Firecracker process.
  Firecracker used outside of Lambda.
    [[ Ref: https://fly.io/blog/sandboxing-and-workload-isolation/ ]]

Alternative plan: redirect system calls to a different implementation.
  Instead of blocking system calls, intercept them and implement them elsewhere.
  gVisor: user-space implementation of many Linux system calls, in Go.
  Benefit: less likely to have memory-management bugs in Go code.
  Benefit: bugs aren't in kernel code, likely contained by the Linux process.
    Use seccomp-bpf to limit what syscalls the gVisor emulator can invoke.
  Downside: performance overheads could be significant.
    Every system call must be redirected to the gVisor process.
    Context-switch overhead, data copying overhead, etc.
  Possible downside: compatibility (real Linux vs gVisor).
    gVisor does a credible job faithfully implementing Linux syscalls, though!
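A minimal C sketch of installing a seccomp-bpf filter.
  seccomp-bpf appears three times in these notes: container runtimes, Firecracker's
  jailer, and gVisor's sandbox all use it to shrink the kernel's attack surface.
  This sketch is illustrative only: it blocks a single rarely-used syscall (add_key)
  and allows everything else, whereas real filters (e.g., Firecracker's) are strict
  allow-lists and also check the architecture field of seccomp_data.

    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        struct sock_filter filter[] = {
            /* Load the syscall number from the seccomp_data handed to the filter. */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
            /* If it is add_key, fall through to the kill rule; otherwise skip it. */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_add_key, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
            /* Default: let the syscall proceed into the kernel. */
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            .len    = (unsigned short) (sizeof(filter) / sizeof(filter[0])),
            .filter = filter,
        };

        /* Required so an unprivileged process may install a filter. */
        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
            perror("prctl(SECCOMP_MODE_FILTER)");
            return 1;
        }

        /* From here on -- and in any children, since the filter is "sticky" --
         * calling add_key kills the process. */
        printf("seccomp filter installed\n");
        return 0;
    }

  Because the filter is inherited by children ("sticky"), a container runtime or
  Firecracker's jailer can install it once and then exec the untrusted workload
  or the VMM.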
Summary.
  Isolation is a key building block for security (yet again).
  Challenging to achieve isolation along with other goals:
    High performance.
    Low overheads (memory, context switching, etc).
    Compatibility with existing systems (e.g., Linux).
  Real-world case study of engineering an isolation mechanism.