Software fault isolation: WebAssembly
=====================================

Goal: run fast code in web applications.
  Javascript is universally supported but not a great fit in some cases.
    High-performance code.
    Applications written in a language that's not JS.

Challenge: code must be isolated.
  Cannot have code from one website tamper with data from another site.
  Cannot have code from a website tamper with data on the user's computer.

We've seen a number of isolation mechanisms already.  Why yet another one?
  Some require special privileges.
    Need to be root in order to set up chroot, UIDs, etc.
  Some work only on specific platforms.
    Firecracker runs only on Linux.
      But then what about browsers on Windows, MacOS, ...?
    VMs work well on CPUs that support hardware virtualization.
      But then what about devices without direct hardware VM support?
  Some require the system to be pre-configured by an administrator.
    Containers require Docker / LXC / etc to be installed first.
    Administrator must create virtual network interfaces.
    Administrator must set up UID namespaces.

Approach: software fault isolation.
  Powerful idea for achieving isolation.
    Does not rely on hardware or OS support.
    Does not require special privileges to deploy.
    Can provide isolation even in the face of some hw/os bugs.
  But requires cooperation between developers and devices that run their code.
    Cannot take an existing binary and run it in isolation.
    Contrast with containers or VMs: can run existing Linux applications.

Aside: Java sandboxing.
  Java VM.  Stack-based access control.
    Some code is trusted (e.g., can access files), but some code is not.
    File-access code checks whether the callers (stack frames) are trusted.
  Easy to make a mistake: confused-deputy-style problems.
  [[ Ref: http://phrack.org/issues/70/7.html ]]

Modern SFI system: WebAssembly.

[[ Demo: wasm/README ]]
  lib.c has some code.
  Can run it on the command line:
    wasmtime main.wasm "test Hello world"
  Can also run it in the browser:
    python -m http.server 8080
    load http://localhost:8080/demo.html
  Corrupting memory in lib.c gets caught:
    uncomment memset, recompile (make)
    wasmtime main.wasm test catches the error
    shift-reload in the browser catches the error as well

WebAssembly module.
  Functions.  Define all of the code in the module.
  Globals.  Global variables.
  Tables.  Targets of function pointers.
  Memory.  Contiguous memory from 0 to sz.
  Imports and exports.
    E.g., can mark a function as exported, or import a function.
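As a concrete sketch of imports and exports, here is what a tiny module
might look like in C (the demo's source language).  The import_module /
import_name / export_name attributes are real clang attributes for wasm
targets, but the host function host_log, the module name "env", and the
compile flags below are made up for illustration.

    /* sketch.c: one import, one export, one global. */

    /* Import: the embedder must supply env.host_log at instantiation. */
    __attribute__((import_module("env"), import_name("host_log")))
    void host_log(int x);

    int counter;   /* lives inside the module's own linear memory */

    /* Export: callable by the embedder (browser JS, wasmtime, ...). */
    __attribute__((export_name("bump")))
    int bump(int n) {
        counter += n;
        host_log(counter);   /* calls out through the import */
        return counter;
    }

  One plausible compile invocation (flags vary by toolchain):
    clang --target=wasm32 -nostdlib -Wl,--no-entry -o sketch.wasm sketch.c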
WebAssembly workflow.
  [[ Application code (e.g., C) ]]
    -> C compiler that can generate WebAssembly
  [[ WebAssembly module / file ]]
    -> Validator in web browser (typing checks, Section 4)
    -> Interpreter or compiler (execution, Section 3)
  [[ Native code that can execute the WebAssembly module ]]

Challenge: performance + isolation.
  Could write an interpreter for whatever code we want to run.
    Javascript sort-of works this way.
    Could even be x86: just run qemu in the browser to emulate x86.
    Interpreter will not allow code to access outside the isolation boundary.
  But that's slow; the goal is high performance.
  Why would it be difficult to just isolate arbitrary x86 running natively?
    Ultimately want to prevent:
      Syscalls.
      Accessing memory outside of the isolated module.
    But it's hard to control what the code does through validation.
      Hard to determine what memory addresses will be accessed.
      Hard to tell if code will jump somewhere else, outside of the module.
      Perhaps hard to tell if there are even syscall instructions present:
        variable-length instructions in x86 make disassembly ambiguous.
    At runtime, code uses computed jumps and computed memory accesses.
      Without hardware support, cannot control what the addresses will be.

Why is WebAssembly better than x86?
  Clever design enables a simple (and thus secure) high-performance runtime.
  High performance: translate WebAssembly into native code (e.g., x86-64).
    Basic operations like arithmetic should be more-or-less one-to-one.
      That is, one WebAssembly operation compiles into one x86-64 instruction.
    But need to be more careful about state and control flow.
      State: shouldn't be able to access anything outside the module.
      Control flow: shouldn't jump to code that's not "properly translated."

What's the plan for ensuring safety of the executed code?
  WebAssembly instructions that access state should not escape the sandbox.
    Need to make sure we access legal local var / global var / heap memory.
    Add checks to the generated native code as needed to make this safe.
  Ensure that the generated code can only jump to other generated code.
    Cannot jump to code outside the sandbox.
    Cannot jump to mis-aligned code.
    Cannot jump to code in the sandbox bypassing a check.
    Typically this is called "control flow integrity" or CFI.

What state can the WebAssembly code access?
  Stack values.
  Local variables.
  Global variables.
  Heap memory, grows from 0 upwards.

How to bounds-check global variable accesses?
  Known ahead of time how many global vars are in a module.
  Can check global var accesses at compile time.
    Emit code that is always safe to run:
    just access globals[off] where off is an already-checked constant.
    Runtime maintains some way to access the base of the globals.
  Accessing in-bounds globals is safe.
    WebAssembly code could write arbitrary values to a global variable.
    But that's OK: isolation doesn't depend on specific values.

How to bounds-check local variable accesses?
  Every function has a well-known (constant) number of local variables.
    Local variables include the arguments to a function call.
  As with globals, can check that the offset is in range at compile time.
  Slight complication: how do we access the local variables?
    The same function could be invoked many times recursively;
    want the local variables for the current invocation.
    At runtime, these local variables are stored on the (native) stack,
    intermingled with (WebAssembly) stack values.  See next.

How to bounds-check stack locations?
  Big question: is this stack offset valid to access?
  How are WebAssembly locals + stack implemented on a native stack?
    Some WebAssembly stack values are never materialized: kept in registers.
    The real stack stores WebAssembly local vars, stack values (that aren't
    omitted), plus return addresses and saved registers across calls.
    Need to make sure a store to a local var can't corrupt a return address.
  WebAssembly's structured control flow helps check stack accesses.
    At every point in a function, it's known how many values are on the
    runtime stack.
    Control flow within a function must always agree on the stack contents.
      E.g., converging control flow must agree on how many stack items
      there are.
  One benefit: compile-time check on how far up the stack code can access.
    The compiler knows the stack depth at every WebAssembly instruction.
  Second benefit: the compiler knows where the local variables are located.
    Accessing a local variable requires skipping the WebAssembly stack
    values on the stack; since the stack depth is known, the compiler
    can simply add that offset.

Bounds-checking memory operations.
  E.g., i64.store stores a 64-bit value (8 bytes) at some address A.
  Need to make sure the resulting code cannot corrupt state outside the
  sandbox: addresses A through A+7 must be within the memory region.
  What should the compiled code look like?
    if a+8 > memsz { panic }
    do the store
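Spelled out as a C sketch: the compiled store conceptually behaves like
the function below.  membase, memsz, and wasm_trap are made-up names for
per-instance runtime state; a real compiler emits this logic inline.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    static uint8_t *membase;   /* start of the module's linear memory */
    static uint64_t memsz;     /* current memory size, in bytes */

    static void wasm_trap(void) { abort(); }  /* stop module, not host */

    /* What compiled "i64.store" at dynamic 32-bit address a does. */
    static void i64_store(uint32_t a, uint64_t value) {
        if ((uint64_t)a + 8 > memsz)     /* bytes a..a+7 must be in bounds */
            wasm_trap();
        memcpy(membase + a, &value, 8);  /* the actual 8-byte store */
    }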
  Compiler can potentially eliminate repeated bounds checks.
    uint64 a[4];
    a[0] = ..; a[1] = ..; a[2] = ..; a[3] = ..;
    Suffices to check once:
      if a+4*8 > memsz { panic }
      do the 4 stores

Clever use of virtual memory for bounds checks.
  WebAssembly specification says memory is at most 4GB in size.
  The address used for i64.store is a 32-bit unsigned address.
  The instruction can also specify a static offset.
    Common for accessing arrays, which means two offsets: a[5] = 6;
    So the full memory store operation is:
      *(membase + address + offset) = value
  WebAssembly requires both address and offset to be 32-bit values.
    Thus, the resulting address is at most 8GB past membase.
  Trick: reserve an 8GB region of virtual memory, covering all possible
  addresses, and map only the allocated part.
    Worst case, an access beyond the allocated size hits a page fault.
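A minimal sketch of that reservation in C, assuming a POSIX-style mmap
on a 64-bit machine; reserve_linear_memory is a made-up helper name.

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define REGION (8ULL << 30)   /* 8GB: max address + max offset */

    /* Reserve 8GB of inaccessible address space.  Every computed
       membase + address + offset lands inside this region, so an
       out-of-bounds access hits a PROT_NONE page and faults, rather
       than touching memory outside the sandbox. */
    uint8_t *reserve_linear_memory(size_t initial_size) {
        uint8_t *membase = mmap(NULL, REGION, PROT_NONE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (membase == MAP_FAILED)
            return NULL;
        /* Make only the module's current memory usable. */
        if (mprotect(membase, initial_size, PROT_READ | PROT_WRITE) != 0)
            return NULL;
        return membase;
    }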
  Small trade-off with VM-based bounds checking: not quite deterministic.
    Depending on the runtime, a program might or might not stop on a
    slightly out-of-bounds store.

No memory safety within a module.
  Entirely possible for a C program compiled to WebAssembly to have a
  buffer overflow, e.g., corrupting a heap-allocated memory region.
  WebAssembly doesn't prevent the overflow, so the C program could
  misbehave arbitrarily.
  But it does ensure that the WebAssembly code can't affect the rest
  of the system.

Why is value type checking important?  (i32 vs i64)
  The compiler may want to know that a value is limited to 32 bits.
  Allows some optimizations, e.g., in how memory loads/stores are compiled.

Why is function type checking important?
(I.e., how many arguments does a function take?)
  A WebAssembly function can access (and modify) its arguments.
    They are just like local vars.
  Thus, a wasm function can modify some number of runtime stack locations,
  depending on how many arguments it declares itself as taking.
  Need to make sure the caller put enough arguments on the stack.
    Otherwise the function can read/write other items on the runtime stack!
    Could be some native return address, which would be bad.

How to achieve control flow integrity?
  Direct jumps: compiler ensures the native jump goes to proper compiled code.
  Function calls: set up the stack the way the target function expects.
    Namely, args + return address should be at the bottom of the stack
    after the call.
    Type safety for function types ensures we pass the right number
    of arguments.
  Indirect function calls: table of jump targets.
    Compiler ensures that all jump targets are valid functions.
    WebAssembly call_indirect emits native code that jumps to a function
    from the table.

Widely supported in browsers now.
  Firefox, Chrome, Safari, Edge, mobile variants, ...
  Used in serious applications, like Adobe Photoshop.
    [[ Ref: https://web.dev/ps-on-the-web/ ]]

Quite a bit of activity, new features.
  [[ Ref: https://github.com/WebAssembly/proposals ]]
  E.g., work on adding support for threads, vector instructions, etc.

Also used outside of web browsers.
  E.g., Fastly and Cloudflare CDNs use WebAssembly.
    Goal: run code on a server near the requesting user.
    Similar goals to AWS Lambda, perhaps even more focus on
    low-overhead starts.
    Run the compilation step once; the binary is then fast to start.
    [[ Ref: https://docs.fastly.com/products/compute-at-edge ]]
    [[ Ref: https://blog.cloudflare.com/webassembly-on-cloudflare-workers/ ]]
  Cloudflare also uses Javascript-level sandboxing.
    [[ Ref: https://blog.cloudflare.com/cloud-computing-without-containers/ ]]
  Ongoing work to define a standard interface to the system outside
  of a module.
    [[ Ref: https://github.com/WebAssembly/WASI ]]

Why the formalization?
  Clear definition of how to run WebAssembly programs.
  Important for consistency across implementations (browsers, runtimes).
  Also a good starting point for arguing about correctness and security.

What do security vulnerabilities in WebAssembly sandboxes look like?
  [[ Ref: https://www.fastly.com/blog/defense-in-depth-stopping-a-wasm-compiler-bug-before-it-became-a-problem ]]
  The compiler relied on a 32-bit type optimization for memory accesses.
    It knows that certain values are 32 bits only.
    If accessing memory with a known-32-bit offset, omit the truncation.
    Otherwise, emit an instruction that force-zeroes the upper 32 bits
    of the 64-bit value.
  The compiler stored values in CPU registers (rather than stack
  locations) for performance.
    When it ran out of registers, it spilled registers to the runtime stack.
  Bug: when reloading a 32-bit value from the stack, the compiler
  accidentally sign-extended it.
    I.e., 0x80000000 was loaded as 0xffffffff80000000 instead of
    0x0000000080000000: just a wrong instruction used to restore the
    32-bit value from the stack to a register.
    Now it's not safe to add this (supposedly known-32-bit) value in a
    memory access: it ends up subtracting from the memory base instead
    of adding to it.

Possible to reduce the TCB: rely on a verifier rather than the entire
compiler.
  Add a verifier that takes the binary output of the compiler and
  answers yes/no.
  The verifier can be much smaller than the compiler.
  E.g., VeriWasm for Lucet (now Wasmtime).
    [[ Ref: https://cseweb.ucsd.edu/~dstefan/pubs/johnson:2021:veriwasm.pdf ]]
    [[ Ref: https://github.com/bytecodealliance/wasmtime ]]

Possible to reduce the cost of transitions between isolation domains
even further.
  [[ Ref: https://cseweb.ucsd.edu/~dstefan/pubs/kolosick:2022:isolation.pdf ]]

Summary.
  Language-level isolation / software fault isolation.
  Powerful idea for building strong, OS/HW-independent sandboxes.
  Can be even lower overhead than OS/HW isolation: lower
  context-switching costs.
  A correct compiler / runtime is critical to achieving security.
  WebAssembly is designed to make the compiler/runtime's job easier,
  security-wise.