Verification for security
=========================

Why are we discussing this paper?
  Formal verification can eliminate bugs (prove a strong security guarantee).
  Big deal for security: many vulnerabilities are due to bugs/mistakes.
  This paper presents a fairly complete story for a meaningful security proof.
  Research prototype, but getting close to being useful.

What is formal verification?
  Goal: ensure there are no bugs.
  What does this even mean, and how to get there?
  From last lecture:
    Need to know when a bug occurs (or what counts as an incorrect execution).
    Need to consider all possible executions and check that there are no bugs.
  Both parts (definition and checking) seem tricky.

  One correctness definition: compiler and/or developer puts in lots of assertions.
    This is what EXE from last lecture used as its correctness goal.
    Assertions indicate bugs, but is it enough to avoid assertions?

  Typically, want to prove some strong property about correctness or security.
    Even if the program doesn't crash, want the end result to be right.
    This is typically called a specification (spec for short).
    Example: sort() produces a permutation of its input in sorted order.
    Example: checkpw() returns OK if the password is correct for the user logging in.

  Contrast with bug-finding: focus on overall correctness rather than the bug.
    Doesn't matter what the bug is: our goal isn't to pinpoint the bug.
    No need to flag bugs early like in EXE.
      We will need to reason about execution all the way to the end.
    If we prove correctness, the end result is correct for all runs.
      Doesn't matter if some code looks buggy.
      We know it's correct the way it's used to achieve the spec!

How to consider possible executions?
  Symbolic execution might work, if we can enumerate all possible paths.
  Works well for "straight-line" code.
  What about loops?

    while True:
        req = get_request()
        if req is not None:
            break
    .. process req ..

  There are arbitrarily many possible executions (one for every number of iterations).
  Cannot reason purely with symbolic execution: arbitrarily many paths.
  Need some more sophisticated plan to reason about any number of iterations:
    1. Describe the state of being in the loop using some invariant.
       E.g., loop iteration counter is arbitrary, but everything else is unchanged.
    2. Prove that the loop invariant holds under iteration.
    3. Use the loop invariant to reason about execution paths that exit the loop.
  Every possible execution now is a combination of steps 2 + 3 (example below).
  Different proof systems handle this loop-invariant-style problem differently.
    In some cases, the developer writes the loop invariant.
    In some cases, the verification tool guesses the loop invariant.
    Knox has a combination of both.
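  As an illustration (my own sketch, not from the paper), here is how the
  three steps might look for a simple C loop; first_nonzero and its
  invariant are hypothetical:

    #include <stddef.h>

    /* Find the first nonzero entry in buf.  Symbolic execution alone
       would have one path per possible iteration count, so a verifier
       reasons with an invariant instead. */
    size_t first_nonzero(const int *buf, size_t n) {
        size_t i = 0;
        /* Loop invariant: 0 <= i <= n and buf[0..i-1] are all zero.
             1. Holds on entry: i == 0, so the prefix is empty.
             2. Preserved by iteration: i only advances past a zero.
             3. Invariant + exit condition give the postcondition: either
                buf[i] != 0 with all earlier entries zero, or i == n and
                the whole buffer is zero. */
        while (i < n && buf[i] == 0) {
            i++;
        }
        return i;  /* == n when no nonzero entry exists */
    }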
What's the setting where this paper wants to apply formal verification?
  Hardware security modules (HSMs).
    Example: USB security key.
    Example: secure enclave from the iPhone.
  Paper running example: PIN-protected backup.
    Want to store some secret key, protected by a short PIN (password).
    Want to prevent adversary from guessing all possible PINs.
    Limit number of guesses (e.g., 10 guesses max), like the iPhone does.

Threat model: compromised host.
  Adversary might be running malware, possibly as root, on the host computer.
  Want to ensure security (e.g., limit PIN guesses).
  Similar to iOS threat model: secure enclave doesn't trust main CPU.

What kinds of bugs is the paper trying to eliminate?
  Buffer overflows.
    Long PIN guess overflows a buffer: arbitrary code execution on HSM.
  Hardware bugs.
    JTAG debug interface allows interrupting HSM execution.
  Timing channels.
    Check one PIN digit at a time, return early.
    Important in our threat model: host CPU can precisely time HSM behavior.
  Hardware/software mismatch.
    Perhaps software developer misunderstood hardware interface.
    E.g., wrong assumption about some instructions (timing guarantees).
    E.g., wrong assumption about data durability on crash.
    E.g., wrong assumption about peripheral device behavior.

Why might formal verification be useful here?
  Hard to reason about timing channels, side channels.
  Hard to reason about bugs that span the hardware/software boundary.
  HSMs are critical for security, so important to avoid bugs.

Why might formal verification be viable here?
  Relatively small implementations: simple CPU, not much software.
  Relatively simple functionality: specific well-defined operations.
  Relatively straightforward code: not much looping, no threads, etc.

How to define correctness?
  Write the expected behavior of the HSM: Figure 2.
  One tricky part: HSM doesn't implement function calls, it has wires.
    Driver translates between the functional API and wire-level interaction.
  Correctness: using the driver against the HSM gives the same results as the spec.
  Correctness is with respect to a driver (so, it's part of the spec).
  If the host doesn't communicate correctly with the HSM, the guarantee doesn't hold.

How to define security?
  Informally: adversary should not be able to obtain anything from the HSM
    beyond what it should provide (like the behavior from Figure 2).
  Formalizing this notion: emulator.
    Has access to an oracle for what the spec allows.
    But otherwise doesn't have secrets: stateless.
    Cannot remember anything from previous invocations, aside from spec.
  Why does the emulator definition capture security?
    Emulator has no access to additional information.
    If the HSM's behavior can be emulated given just the ideal spec,
      the adversary learns nothing beyond what the spec allows.
  How could we construct the emulator?
    Clone implementation, minus the secrets.
    Same CPU, same code, etc.
    Interrupt execution at certain points to query the specification oracle.
    Inject the specification result somewhere in the circuit.

IPR combines functionality and security: Figure 4.
  Intuition: real HSM should be indistinguishable from the spec.
  But that's not quite right: real HSM is hardware, spec is abstract.
    Driver and emulator do the translation between HW + functional API.
  Can't talk about both functional and HW access simultaneously.
    Mode switch: can switch modes between functional invocations.
    Cannot switch while a functional invocation is "in progress".
    (But could have gone back and done the same thing at the physical level.)
  IPR applies to the entire device: hardware + software.

Does IPR mean we can't have the bugs we were trying to prevent?
  What would go wrong if we had some bug?
  Buffer overflow: adversary could do any operation by overflowing buffer.
    Physical interface lets adversary do anything they want (e.g., extract key).
    But emulator can't match that by accessing the spec oracle.
  Hardware bug, hw/sw mismatch.
    Could be a correctness issue: driver can't match ideal spec.
    Or could be leaking, a security issue: emulator can't match physical behavior.
  Timing issue (e.g., digit-by-digit PIN comparison; see the sketch below).
    Real implementation returns the result on an earlier cycle depending on the digits.
    Emulator cannot determine when to return the result: not in spec oracle.
    No faithful emulator exists.
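  To make the digit-by-digit timing channel concrete, a sketch (mine, not
  the paper's firmware; both function names are hypothetical):

    #include <stddef.h>

    /* Leaky version: returns as soon as a digit differs, so the number
       of loop iterations (and hence timing) reveals how many leading
       digits matched.  An adversary who can time the HSM precisely can
       guess the PIN one digit at a time. */
    int pin_check_leaky(const char *guess, const char *pin, size_t n) {
        for (size_t i = 0; i < n; i++) {
            if (guess[i] != pin[i])
                return 0;   /* early return: timing depends on secret */
        }
        return 1;
    }

    /* Constant-time version: always examines every digit, accumulating
       differences with XOR/OR, so control flow is independent of the PIN. */
    int pin_check_ct(const char *guess, const char *pin, size_t n) {
        unsigned char diff = 0;
        for (size_t i = 0; i < n; i++) {
            diff |= (unsigned char)(guess[i] ^ pin[i]);
        }
        return diff == 0;
    }

  Note that source-level constant time is not sufficient by itself: the
  compiler or CPU can reintroduce variable timing (see the TOTP division
  bug below), which is why Knox verifies cycle-level behavior of the
  actual hardware.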
How to reason about possible executions?  Section 4.
  Similar to symbolic execution from the EXE paper.
  Input: hardware implementation (RTL), with software as ROM contents.
  Can run one cycle of the hardware, for a particular state and inputs.
    State: RAM contents, and internal CPU state (e.g., registers).
    Inputs: wire signals from the host computer.
    Result: new state, and outputs (wire signals back to the host computer).
  Some of the inputs and state could be symbolic, as in EXE.
    Can also run the driver and emulator over symbolic state.
    As a result, some of the outputs and new state could also be symbolic.
  Symbolic inputs allow Knox to reason about many executions at once!

Demo:
  cd ~/proj/knox-hsm/lockbox/impl
  cat lockbox.v
  cd ../spec
  cat spec.rkt
  cat driver.rkt
  cd ../proof
  make
  cat lockbox.smt2
  cat shared.rkt
  racket correctness.rkt
  cat emulator.rkt
  racket security.rkt

  cd ~/proj/knox-hsm/pin-protected-backup/impl
  cat fw/pin.c
  make
  cd ../spec
  cat spec.rkt
  cd ../proof
  make
  racket correctness.rkt
  [.. takes a long time ..]
  racket security.rkt
  [.. also takes a long time ..]

Aside: traditional hardware verification with "model checking".
  Run the circuit on specific inputs.
  Run every possible input.
  Viable for some small circuits (e.g., 2^32 inputs for a 16-bit-by-16-bit multiplier).
  But our model has too many inputs: input wires at every cycle.

How does Knox consider all possible executions?
  Meta-level plan: break up execution into various parts (sub-proofs).
  Prove each part's sub-proof, mostly using symbolic execution.
  On-paper (or in Coq) proof that the parts (sub-proofs) are complete.

Why does Knox need the refinement relation R?
  Sort of like a loop invariant for the overall system loop that keeps calling the HSM.
  Don't want to keep reasoning about arbitrarily long executions from scratch.
  Instead, want to reason about short stretches of execution (sub-proofs).
    Start with R, run some operation, prove that R holds again.
    Can inductively prove that all of the sub-proofs chain together.
  R connects implementation and spec state: captures history on both sides.
  For the PIN-protected backup example, Figure 5 has the R.

Parts (sub-proofs) for functional equivalence (Sections 4.2, 4.3, 4.5):
  Prove that the initial state matches R.
  Prove that, starting from R, running the driver gives the correct result (+ R).  Figure 7.
  Prove that, starting from R, running the driver and crashing is atomic (+ R).  Figure 9.

Parts (sub-proofs) for physical equivalence (Section 4.4):
  Start from some implementation state that matches R (connected to spec state).
  Prove that providing any inputs to the real impl behaves the same as the emulator.
  Reset at the end ("recovery"), at which point R has to hold again.

What kinds of bugs does Knox find?
  PIN backup: strcmp for PIN, one character at a time.
  PIN backup: persistence.
    Adversary can implement a "reset guess count" operation:
      Send PIN guess 0000.
      Wait long enough for guesses=0 (but not long enough for guesses++).
      Reset device, repeat with PIN guess 0001, ..., 9999.
    Can now keep guessing PINs without a guess limit.
      Guess a few PINs, then do the above reset to get more guesses.
  Hash HSM: parity.
    Implementation stores two copies of state, for atomic commit.
    Code had an if statement that took longer to access one copy than the other.
    Implementation took longer when the active state was one copy vs the other.
    Spec doesn't give any hints to help the emulator produce this behavior.
  TOTP: division time.
    Spec says to reveal the last 6 digits of a hash.
    Naive code in C: int result = h % 1000000;
    But the CPU divide instruction (used for %) is variable-time!
      CPU implements divide by repeated shifts, subtraction, etc.
    Had to implement a manual constant-time modulo in assembly (sketch below).
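  To illustrate the underlying shift-and-subtract idea, here is a
  hypothetical C version of a constant-time modulo (the paper's actual
  fix is hand-written assembly, precisely because C alone can't stop the
  compiler from reintroducing a branch or a divide instruction):

    #include <stdint.h>

    /* Constant-time h % 1000000 for a 32-bit h: long division with a
       fixed iteration count and a branchless update, so neither the
       loop trip count nor the control flow depends on the secret h. */
    uint32_t mod_1000000_ct(uint32_t h) {
        uint64_t rem = h;
        /* 1000000 << 13 > 2^32, so starting at shift 12 always suffices. */
        for (int s = 12; s >= 0; s--) {
            uint64_t d = (uint64_t)1000000 << s;
            /* mask = all ones iff rem >= d, computed without branching */
            uint64_t mask = (uint64_t)0 - (uint64_t)(rem >= d);
            rem -= d & mask;
        }
        return (uint32_t)rem;
    }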
=== Notes on running Knox on Arch Linux:

  pacman -S riscv64-linux-gnu-gcc yosys racket
  python -m pip install bin2coe
  ( cd ~/proj/knox && raco pkg install )
  cd ~/proj/knox-hsm/.../proof
  make
  racket correctness.rkt
  racket security.rkt