6.858 Lecture Symbolic Execution today's topics: finding bugs, symbolic execution, EXE can we automate the search for security vulnerabilities? basically the same as finding bugs; let's borrow bug-finding ideas intro, then symbolic execution very powerful idea you'll use it in lab 3 the overall picture we have some code that might have bugs it has inputs, which the attacker can provide (e.g. web requests) we add checks (e.g. assertions) for bad situations that might arise the checks could be too expensive to use in the real system we want to see if there are inputs the attacker could provide that would cause the code to fail the checks can the attacker use this too? yes! what kinds of bugs? divide by zero, null-pointer dereference, out-of-bounds array access application-specific bugs (through asserts) bank transaction changed sum of all accounts X's transaction decreased Y's balance program opened a file with name determined by input request sets cookie to impersonate another user application-specific asserts in programs often turned off during deployment they are sometimes expensive we rely on testing with asserts turned on why not directly prove that the program is correct? this can be done but it's a lot of work not yet practical for big programs someday you could write a suite of tests (and you should) each test runs the program with some input the test knows the expected output the asserts in the code get a chance to catch errors pro: good for intended functionality pro: good for bugs that are already known con: not so good for unintended functionality == vulnerabilities con: not so good for as-yet-unknown bugs con: hard to write tests that get complete "coverage" == cause every line of code to execute if no test causes some line of code to execute, you can't spot bugs in it con: takes effort to write tests can we automate test-case generation? can we automate search for as-yet-unknown bugs? can we automate generation of tests that achieve good coverage? fuzzers idea: execute program on lots of randomly-generated inputs 1. find input sources command-line arguments HTTP requests 2. write input generation code completely random inputs unlikely to get very "deep" if method == "GET" ... usually need smart input generator: generate syntactically correct input with random content where freedom is allowed GET /xxx GET /transfer?from=xxx&to=yyy&amount=zzz use test suite to help generate inputs 3. execute with random inputs until you get bored maybe some random inputs will trigger an assert that you didn't think an attacker could violate e.g. maybe bank code doesn't check that "from" account has enough balance to cover a transfer; fuzzing may discover relevant arguments, triggering an assert that checks for negative balances or conservation of balances. do fuzzers work? they are widely used and have found lots of bugs particularly good at bugs like buffer overflow that may not need a specific input value to trigger them many out-of-bound values are likely to generate crash don't even have to write assert advantage: better than programmer at testing for unexpected behavior advantage: no need to control entire execution env can fuzz black-box systems over the network as long as there's a way to detect errors advantage: may not need source code disadvantage: uses lots of CPU time disadvantage: hard to cover everything can miss bugs because didn't happen to try a particular input e.g. if command == "credit" or if inputs must have complex structure e.g. hard to test a compiler with purely random input symbolic execution: a more sophisticated testing scheme e.g. EXE new execution semantics builds expressions that relate inputs to variable values calculates an input that drives program down each side of each "if" an assert is an "if" so EXE will try to find inputs that cause each assert to fail example: 1. read x, y 2. if x > y: 3. x = y 4. if x < y: 5. x = x + 1 6. if x + y == 7 7. error() line 6+7 is the expansion of assert x + y != 7 this is a simple example of code that's hard for humans or fuzzers to test i.e. to decide if attacker could trigger the error requires some careful thought by human requires a lot of luck in a random fuzzer EXE executes with symbolic values i.e. variable/memory holds an *expression* in terms of inputs x and y hold the example's inputs they hold symbolic values -- not concrete values let's say x=alpha and y=beta EXE remembers which memory locations hold symbolic values and what each location's current symbolic value is EXE remembers "constraints" imposed by executed if statements the "path constraints" (pc) EXE views execution as a tree, which splits at each if EXE executes down one side of each if, then down the other the "if" adds the condition to one execution's pc, and "not" to the other when EXE gets to the error() call, it checks whether the current path constraints can be satisfied EXE uses the STP constraint solver if yes, STP indicates the satisfying values of input variables if attacker inputs those values, program will call error() one path: 1. pc = { }, x = a, y = b (alpha, beta) 2. pc = { a > b }, x = a, y = b (fork) 3. pc = { a > b }, x = b, y = b 4. same (no fork) 6. cannot satisfy b + b == 7 skips if, continues executing another path: 2. pc = { a <= b }, x = a, y = b 4. pc = { a <= b AND a < b }, x = a, y = b 5. pc = { a <= b AND a < b }, x = a + 1, y = b 6. (a + 1 + b) == 7 for a < b? lots of solutions e.g. a=0 b=6 EXE would report an assert failure w/ inputs e.g. a=0 b=6 symbolic execution is very powerful it looks at the program to figure out bug-provoking inputs! it understands what "x + y == 7" implies about the input EXE is a C-to-C translator -- it transforms C code, then compiles w/ gcc can handle all of C except floating point state: table indicating which memory ranges are symbolic value for symbolic range is a symbolic array of bytes each byte is 8 symbolic bits one representation independent of C type path constraint 1. EXE adds code to every assignment, expression, and branch if any argument symbolic, mark result symbolic, record sym value if all arguments concrete, execute faster ordinary operation 2. fork() at each branch add if-condition constraint (or "not") to pc in each process Figure 9 has a fragment of a real example packet filter this is what tcpdump and many other network monitoring apps use user (attacker) supplies an interpreted filter in a simple language kernel interprets filter to decide whether user wants to see each packet we're worried about evil user supplying filter that tricks the kernel Figure 9 is called when filter wants to read "len" bytes at "offset" there is code to check that filter isn't reading beyond the end of the packet what is the problem? how does EXE spot it? EXE generates a tree of processes corresponding to execution tree by fork()ing at each "if" Q: how big will execution trees grow? is it a problem? Q: can we merge all tree nodes that refer to same if? i.e. does one symbolic execution represent all possible executions of that line? Q: how long might it take EXE to run? how does EXE handle all those fork()ed processes? each contacts "search server" and waits which process should search server allow to run? depth-first search? pro: executes deep into program con: can get stuck in loops with symbolic bounds thus may never execute many lines of code breadth-first search? pro: doesn't get stuck, since tries many paths a little bit con: may never get very far into the program EXE search server uses "best-first" heuristic: line of code that's been run the fewest times (much like breadth-first) use DFS on that process and children "for a while" The Question Each time EXE adds a branch constraint it queries STP to check that there exists at least one solution for the current path's constraints. What would go wrong if EXE did not use STP, and instead tried all branches? What would go wrong if EXE randomly selected a branch to check? (i.e. followed just one, not both) how does a constraint solver work? this is the really hard part of symbolic execution constraint solver solves sets of equations easy: x + y = 10 AND x = y hard: 900 = x*x -- requires a trick -- STP knows many tricks too hard: 10 = sha1(x) -- will time out arrays can be tough for a constraint solver EXE turns many C constructs into arrays (strings, ptrs, structs?) s[c] -- concrete index lets STP treat s[c] as a specific symbolic value this is the easiest case -- and the most common e.g. looping over an input string c[s] -- could refer to any element, since s is symbolic equivalent to a big disjunction (c[0] or c[1] or ...) *p -- if p is symbolic, which array? i.e. which disjunction? what if the constraint solver times out? effectively treat it as "unsatisfiable" if solving at termination (e.g. error()) -- unsat means print nothing if solving at division/dereference -- unsat means assume there's no error if solving at some "if" branch -- unsat means prune very slow, so optimizations are critical ordinary concrete operations/operands when possible don't bother with if-branch if no solution cache+share constraint solutions (4.1) solve and cache independent constraint fragments (4.2) solver knows a lot about arrays (3.3) are there situations that EXE can't handle? no floating point no interaction with O/S e.g. open(symbolic-file-name) if EXE finds no bugs -- does that mean there are no bugs? no: EXE doesn't know what a bug is it finds crashes + asserts, but no other problems no: STP might run out of time before finding a solution i.e. some input could cause assert to fail, but STP can't find it no: EXE may not explore all paths there may be a vast # of paths, programmer may give up before EXE tries them all Evaluation does EXE find real bugs? how fast? EXE finds real bugs in smallish UNIX utility code packet filter vs evil filters udhcpd vs evil packets pcre (perl compatible regular expressions) vs evil regular expressions kernel file system vs corrupt file system disk images impressive -- real C programs, real bugs! mostly buffer overflow / illegal memory references these are errors EXE can find w/o programmer help would take more programmer help to find application-specific bugs e.g. missing permission checks how fast? Table 2 gives run-time for above programs (bpf, udhcpd, pcre) tens of minutes -- not so bad but complexity might be exponential in program size... Lab 3 uses "concolic execution", a variant of symbolic execution problem: what if there are functions that you can't look inside? as when layering symbolic execution on top of a complex language for Lab 3, adding symb exe to Python w/o modifying Python example: read x, u ok = DBlookup(u) if x == "GET": if ok == True: ... else ... if we don't have a symbolic DB, we cannot execute this symbolically concolic execution execute with concrete inputs -- e.g. empty string so we can execute the DBlookup in the example it's an ordinary concrete (non-symbolic) execution while executing: record symbolic values of variables derived from inputs when possible maintain path constraint of executed path just one path, since concrete inputs only explore one side of each "if" after execution finishes: negate an "if" condition in the pc solve modified pc (up to that "if"), yielding new concrete inputs re-execute on new concrete inputs new execution will follow a different path than first keep re-executing with different "if" conditions negated eventually can drive execution down lots of different paths and perhaps find inputs that trigger assertion failures concolic pro/con pro: much easier to add to a language like Python "proxy" concolic data types replace int, string pro: an easy way to tolerate opaque functions con: will miss some constraints, e.g. relation of ok to u thus may not be able to execute down some "if" branches conclusion symbolic execution is powerful and productive but not so practical as programs grow large it's a promising research area as well as a useful tool research tools: S2E (symbolic execution for qemu) used in practice at bug-finding companies (e.g., GrammaTech) used at Microsoft internally (e.g., SAGE)