6.858 Lecture Symbolic Execution

today's topics: finding bugs, symbolic execution, EXE
  can we automate the search for security vulnerabilities?
  basically the same as finding bugs; let's borrow bug-finding ideas
  intro, then symbolic execution
    very powerful idea
    you'll use it in lab 3

the overall picture
  we have some code that might have bugs
  it has inputs, which the attacker can provide (e.g. web requests)
  we add checks (e.g. assertions) for bad situations that might arise
    the checks could be too expensive to use in the real system
  we want to see if there are inputs the attacker could provide
    that would cause the code to fail the checks
  can the attacker use this too?
    yes!
  
what kinds of bugs?
  divide by zero, null-pointer dereference, out-of-bounds array access
  application-specific bugs (through asserts)
    bank transaction changed sum of all accounts
    X's transaction decreased Y's balance
    program opened a file with name determined by input
    request sets cookie to impersonate another user
  application-specific asserts in programs often turned off during deployment
    they are sometimes expensive
    we rely on testing with asserts turned on

why not directly prove that the program is correct?
  this can be done 
  but it's a lot of work
  not yet practical for big programs
  someday

you could write a suite of tests (and you should)
  each test runs the program with some input
    the test knows the expected output
    the asserts in the code get a chance to catch errors
  pro: good for intended functionality
  pro: good for bugs that are already known
  con: not so good for unintended functionality == vulnerabilities
  con: not so good for as-yet-unknown bugs
  con: hard to write tests that get complete "coverage"
       == cause every line of code to execute
       if no test causes some line of code to execute,
         you can't spot bugs in it
  con: takes effort to write tests

can we automate test-case generation?
can we automate search for as-yet-unknown bugs?
can we automate generation of tests that achieve good coverage?

fuzzers
  idea: execute program on lots of randomly-generated inputs
  1. find input sources
     command-line arguments
     HTTP requests
  2. write input generation code
     completely random inputs unlikely to get very "deep"
       if method == "GET" ...
     usually need smart input generator:
       generate syntactically correct input
       with random content where freedom is allowed
       GET /xxx
       GET /transfer?from=xxx&to=yyy&amount=zzz
     use test suite to help generate inputs
  3. execute with random inputs until you get bored
     maybe some random inputs will trigger an assert that
       you didn't think an attacker could violate
     e.g. maybe bank code doesn't check that "from" account
          has enough balance to cover a transfer;
          fuzzing may discover relevant arguments, triggering
          an assert that checks for negative balances or
          conservation of balances.

do fuzzers work?
  they are widely used and have found lots of bugs
    particularly good at bugs like buffer overflow that
    may not need a specific input value to trigger them
    many out-of-bound values are likely to generate crash
    don't even have to write assert
  advantage: better than programmer at testing for unexpected behavior
  advantage: no need to control entire execution env
    can fuzz black-box systems over the network
    as long as there's a way to detect errors
  advantage: may not need source code
  disadvantage: uses lots of CPU time
  disadvantage: hard to cover everything
    can miss bugs because didn't happen to try a particular input
      e.g. if command == "credit"
    or if inputs must have complex structure
      e.g. hard to test a compiler with purely random input

symbolic execution: a more sophisticated testing scheme e.g. EXE
  new execution semantics
  builds expressions that relate inputs to variable values
  calculates an input that drives program down each side of each "if"
  an assert is an "if"
    so EXE will try to find inputs that cause each assert to fail
  
example:
1. read x, y
2. if x > y:
3.   x = y
4. if x < y:
5.   x = x + 1
6. if x + y == 7
7.   error()

line 6+7 is the expansion of assert x + y != 7

this is a simple example of code that's hard for humans or fuzzers to test
  i.e. to decide if attacker could trigger the error
  requires some careful thought by human
  requires a lot of luck in a random fuzzer

EXE executes with symbolic values
  i.e. variable/memory holds an *expression* in terms of inputs
  x and y hold the example's inputs
    they hold symbolic values -- not concrete values
    let's say x=alpha and y=beta
  EXE remembers which memory locations hold symbolic values
    and what each location's current symbolic value is
  EXE remembers "constraints" imposed by executed if statements
    the "path constraints" (pc)
  EXE views execution as a tree, which splits at each if
  EXE executes down one side of each if, then down the other
    the "if" adds the condition to one execution's pc, and "not" to the other
  when EXE gets to the error() call,
    it checks whether the current path constraints can be satisfied
    EXE uses the STP constraint solver
    if yes, STP indicates the satisfying values of input variables
    if attacker inputs those values, program will call error()

one path:
1. pc = { }, x = a, y = b (alpha, beta)
2. pc = { a > b }, x = a, y = b (fork)
3. pc = { a > b }, x = b, y = b
4. same (no fork)
6. cannot satisfy b + b == 7
   skips if, continues executing

another path:
2. pc = { a <= b }, x = a, y = b
4. pc = { a <= b AND a < b }, x = a, y = b
5. pc = { a <= b AND a < b }, x = a + 1, y = b
6. (a + 1 + b) == 7 for a < b?
   lots of solutions e.g. a=0 b=6
   EXE would report an assert failure w/ inputs e.g. a=0 b=6

symbolic execution is very powerful
  it looks at the program to figure out bug-provoking inputs!
  it understands what "x + y == 7" implies about the input

EXE is a C-to-C translator -- it transforms C code, then compiles w/ gcc
  can handle all of C except floating point
  state:
    table indicating which memory ranges are symbolic
    value for symbolic range is
      a symbolic array of bytes
      each byte is 8 symbolic bits
      one representation independent of C type
    path constraint
  1. EXE adds code to every assignment, expression, and branch
     if any argument symbolic, mark result symbolic, record sym value
     if all arguments concrete, execute faster ordinary operation
  2. fork() at each branch
     add if-condition constraint (or "not") to pc in each process

Figure 9 has a fragment of a real example
  packet filter
    this is what tcpdump and many other network monitoring apps use
    user (attacker) supplies an interpreted filter in a simple language
    kernel interprets filter to decide whether user wants to see each packet
    we're worried about evil user supplying filter that tricks the kernel
  Figure 9 is called when filter wants to read "len" bytes at "offset"
    there is code to check that filter isn't reading beyond the end of the packet
    what is the problem?
    how does EXE spot it?

EXE generates a tree of processes corresponding to execution tree
  by fork()ing at each "if"

Q: how big will execution trees grow? is it a problem?

Q: can we merge all tree nodes that refer to same if?
   i.e. does one symbolic execution represent all possible executions
     of that line?

Q: how long might it take EXE to run?

how does EXE handle all those fork()ed processes?
  each contacts "search server" and waits
  which process should search server allow to run?
  depth-first search?
    pro: executes deep into program
    con: can get stuck in loops with symbolic bounds
         thus may never execute many lines of code
  breadth-first search?
    pro: doesn't get stuck, since tries many paths a little bit
    con: may never get very far into the program
  EXE search server uses "best-first" heuristic:
    line of code that's been run the fewest times (much like breadth-first)
    use DFS on that process and children "for a while"

The Question
  Each time EXE adds a branch constraint it queries STP to check that
  there exists at least one solution for the current path's constraints.
  What would go wrong if EXE did not use STP, and instead tried all
  branches? What would go wrong if EXE randomly selected a branch to
  check? (i.e. followed just one, not both)

how does a constraint solver work?
  this is the really hard part of symbolic execution
  constraint solver solves sets of equations
  easy: x + y = 10 AND x = y
  hard: 900 = x*x -- requires a trick -- STP knows many tricks
  too hard: 10 = sha1(x) -- will time out

arrays can be tough for a constraint solver
  EXE turns many C constructs into arrays (strings, ptrs, structs?)
  s[c] -- concrete index lets STP treat s[c] as a specific symbolic value
    this is the easiest case -- and the most common
    e.g. looping over an input string
  c[s] -- could refer to any element, since s is symbolic
    equivalent to a big disjunction (c[0] or c[1] or ...)
  *p -- if p is symbolic, which array? i.e. which disjunction?

what if the constraint solver times out?
  effectively treat it as "unsatisfiable"
  if solving at termination (e.g. error()) -- unsat means print nothing
  if solving at division/dereference -- unsat means assume there's no error
  if solving at some "if" branch -- unsat means prune

very slow, so optimizations are critical
  ordinary concrete operations/operands when possible
  don't bother with if-branch if no solution
  cache+share constraint solutions (4.1)
  solve and cache independent constraint fragments (4.2)
  solver knows a lot about arrays (3.3)

are there situations that EXE can't handle?
  no floating point
  no interaction with O/S e.g. open(symbolic-file-name)

if EXE finds no bugs -- does that mean there are no bugs?
  no: EXE doesn't know what a bug is
    it finds crashes + asserts, but no other problems
  no: STP might run out of time before finding a solution
    i.e. some input could cause assert to fail, but STP can't find it
  no: EXE may not explore all paths
    there may be a vast # of paths, programmer may give up before
      EXE tries them all

Evaluation
  does EXE find real bugs?
  how fast?

EXE finds real bugs in smallish UNIX utility code
  packet filter vs evil filters
  udhcpd vs evil packets
  pcre (perl compatible regular expressions) vs evil regular expressions
  kernel file system vs corrupt file system disk images
  impressive -- real C programs, real bugs!

mostly buffer overflow / illegal memory references
  these are errors EXE can find w/o programmer help
  would take more programmer help to find application-specific bugs
    e.g. missing permission checks

how fast?
  Table 2 gives run-time for above programs (bpf, udhcpd, pcre)
  tens of minutes -- not so bad
  but complexity might be exponential in program size...

Lab 3 uses "concolic execution", a variant of symbolic execution
  problem: what if there are functions that you can't look inside?
  as when layering symbolic execution on top of a complex language
    for Lab 3, adding symb exe to Python w/o modifying Python
  example:
    read x, u
    ok = DBlookup(u)
    if x == "GET":
      if ok == True:
        ...
      else
        ...
  if we don't have a symbolic DB, we cannot execute this symbolically

concolic execution
  execute with concrete inputs -- e.g. empty string
    so we can execute the DBlookup in the example
    it's an ordinary concrete (non-symbolic) execution
  while executing:
    record symbolic values of variables derived from inputs
      when possible
    maintain path constraint of executed path
      just one path, since concrete inputs only explore one side of each "if"
  after execution finishes:
    negate an "if" condition in the pc
    solve modified pc (up to that "if"), yielding new concrete inputs
    re-execute on new concrete inputs
    new execution will follow a different path than first
  keep re-executing with different "if" conditions negated
    eventually can drive execution down lots of different paths
    and perhaps find inputs that trigger assertion failures

concolic pro/con
  pro: much easier to add to a language like Python
       "proxy" concolic data types replace int, string
  pro: an easy way to tolerate opaque functions
  con: will miss some constraints, e.g. relation of ok to u
       thus may not be able to execute down some "if" branches

conclusion
  symbolic execution is powerful and productive
  but not so practical as programs grow large
  it's a promising research area as well as a useful tool
    research tools: S2E (symbolic execution for qemu)
    used in practice at bug-finding companies (e.g., GrammaTech)
    used at Microsoft internally (e.g., SAGE)