Buffer overflows
================

Administrivia
  Introduce TA: Xi Wang.
  TA office hours on Thursday 3-5pm.
  Any questions about lab 1?
  There were a few (unintended) bugs in lab 1.  To get fixes, run "git pull".

Today's lecture: memory safety errors, and in particular, buffer overflows.

Recall what's going on with buffer overflows (last lecture, 6.033, ..)
  Consider the following example code from, say, a web server:

    void read_req(void) {
      char buf[128];
      int i;
      gets(buf);
      .. parse buf ..
    }

  What does the compiler generate in terms of memory layout?
    x86 stack:
      Stack grows down.
      %esp points to the last (bottom-most) valid thing on the stack.
      %ebp points to the caller's %esp value.

                       +------------------+
      entry %ebp ----> | .. prev frame .. |
                       |                  |
                       |                  |
                       +------------------+
      entry %esp ----> |  return address  |
                       +------------------+
      new %ebp ------> |   saved %ebp     |
                       +------------------+
                       |     buf[127]     |
                       |       ...        |
                       |      buf[0]      |
                       +------------------+
      new %esp ------> |        i         |
                       +------------------+

    Caller's code (say, main):
      call read_req

    read_req's code:
      push %ebp
      mov %esp -> %ebp
      sub $168, %esp    # stack vars + some other junk
      ...
      mov %ebp -> %esp
      pop %ebp
      ret

  What's the threat model, policy?
    Assume the adversary can connect to the web server, supply any inputs.
    Policy is a bit loose: only perform operations intended by the programmer?
    In practice: don't want the adversary to send spam, steal data, or
      install a bot.

  How does the adversary take advantage of this code?
    Supply a long input, overwrite data on the stack past the buffer.
    Interesting bit of data: the return address, which gets used by 'ret'.
    Can set the return address to the buffer itself, include some code there.

  How does the adversary know the address of the buffer?
    What if one machine has twice as much memory?
    Luckily for the adversary, virtual memory makes things more deterministic.
    For a given OS and program, addresses will often be the same.

  What happens if the stack grows up, instead of down?
    Look at the stack frame for gets.

  What can the adversary do once they are executing code?
    Use any privileges of the process.
    Often leverage the overflow to gain easier access into the system.
      Originally on Unix, run the shell /bin/sh (thus, "shell code").
    If the process is running as root or Administrator, can do anything.
    Even if not, can still send spam, read files (web server, database), ..
    Can attack other machines behind a firewall.

  Why would programmers write such code?
    Legacy code, wasn't exposed to the internet.
    Programmers were not thinking about security.
    Many standard functions used to be unsafe (strcpy, gets, scanf).
    Even safe versions have gotchas (strncpy does not null-terminate).
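    As a concrete illustration (a hedged sketch, not from the original
    notes): read_req can avoid the overflow by using fgets, which takes an
    explicit size bound and null-terminates what it reads:

      #include <stdio.h>

      void read_req(void) {
        char buf[128];
        /* fgets writes at most sizeof(buf)-1 bytes plus a NUL;
           gets has no way of knowing buf's size. */
        if (fgets(buf, sizeof(buf), stdin) == NULL)
          return;  /* EOF or error: nothing to parse */
        /* .. parse buf .. */
      }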
More generally, any memory error can translate into a vulnerability.
  Using memory after it has been deallocated (use-after-free).
    If writing, overwrite the new data structure, e.g. a function ptr.
    If reading, might call a corrupted function pointer.
  Freeing the same memory twice (double-free).
    Might cause malloc to later return the same memory twice.
  Might not even need to overwrite a return address or function pointer.
    Can suffice to read sensitive data like an encryption key.
    Can suffice to change some bits (e.g. int isLoggedIn, int isRoot).

How to prevent buffer overflows?
  Two things going on in a "traditional" buffer overflow:
    1: Adversary gains control over execution (program counter).
    2: Adversary executes some malicious code.
  What are the difficulties of these two steps?
    1: Requires overwriting a code pointer (which is later invoked).
       Common target is a return address using a buffer on the stack.
       Any memory error could potentially work, in practice.
         Function pointers, C++ vtables, exception handlers, ..
    2: Requires some interesting code in the process's memory.
       Often easier than #1.  Process already contains a lot of code.
       Process accepts inputs that the adversary can supply.
       (But, the adversary needs to find a predictable location.)

Approach 0: use a type-safe language (Java, C#, Python?, ..).
  Good: does prevent memory corruption errors.
    Except that low-level bindings still need to be correct.
      E.g. the language runtime itself, or generic bindings like Python ctypes.
  Bad: still have a lot of legacy code in unsafe languages.
  Bad: might not perform as well as a fine-tuned C application?
    Used to be a bigger problem.
    Hardware & high-level languages are getting better.

Approach 1: try to find bugs (analysis tools, fuzzers, ..).
  Works in practice; will look at one particular analysis tool next week.
  Generally, hard to prove the absence of bugs, esp. for unsafe code like C.
  But even partial analysis is useful (e.g. for Baggy bounds checking).

Approach 2: canaries (StackGuard, gcc's SSP, ..).
  Idea: OK to overwrite a code ptr, as long as we catch it before invocation.
  One of the earlier systems: StackGuard.
    Place a canary on the stack upon entry, check its value before return.
    Usually requires source code; the compiler inserts the canary checks.
  Where is the canary on the stack diagram?
  Make the canary hard to forge:
    "Terminator canary": four bytes (0, CR, LF, -1).
      Idea: many C functions treat these characters as terminators.
      As a result, if the canary matched, then further writes didn't happen.
    Random canary: much more common today (but, need good randomness!).
  What kinds of vulnerabilities will a stack canary not catch?
    Overwrites of function pointer variables before the canary.
    Heap object overflows (function pointers, C++ vtables).
    Overwrite a data pointer, then leverage it to do arbitrary memory writes:

      int *ptr = ...;
      char buf[128];
      gets(buf);     /* overflow can overwrite ptr without
                        touching the canary */
      *ptr = 5;      /* now writes to an adversary-chosen address */

    Write directly to the return address past the canary (for some buggy code).
  How could you trick the canary?
    Guess or obtain the canary value (if random).
      Maybe the application leaks random values from its memory?
      Remove the null terminator from a string; the app will read later bytes.
    Overwrite the authentic canary value?
      If the vulnerability allows arbitrary memory writes, can overwrite it.

Approach 3: bounds checking.
  Overall goal: prevent pointer misuse by checking that pointers are in range.
  Sometimes hard to tell in C what's an overflow vs. what's legitimate.
    Suppose the program allocates an integer array, int x[1024];
      and also creates a pointer, e.g. int *y = &x[107];
    Is it OK to increment y to access subsequent elements?
      If x is meant to be used like a string buffer, maybe yes.
      If x is meant to store bank account balances of many users, no.
    Gets even more difficult with mixed structs or unions.
  Usually, the plan is to at least enforce bounds between malloc'ed objects,
    or variables allocated by the compiler, either global or on the stack.

  Electric fence: simple debugging tool from a while back.
    Align each allocated memory object with a guard page at its end or
      beginning.
    Use page tables to ensure that accesses to the guard page cause a crash.
    Reasonably effective as a debugging technique.
      Can prevent some buffer overflows for heap objects, but not for the stack.
    Advantage: works without source code, no compiler changes.
      Need to be able to replace the default malloc with efence's malloc.
    Problem: huge overhead (only 1 object per page, + guard pages).
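    To make the guard-page idea concrete, here is a minimal sketch of an
    efence-style allocation (illustrative only; guarded_alloc is a made-up
    name, and this is not efence's actual code, which also handles
    alignment).  The object is placed so it ends exactly at a page
    boundary, and the following page is mapped PROT_NONE, so any overflow
    off the end faults immediately:

      #include <stdlib.h>
      #include <sys/mman.h>
      #include <unistd.h>

      void *guarded_alloc(size_t len) {
        size_t page = (size_t) sysconf(_SC_PAGESIZE);
        size_t npages = (len + page - 1) / page;
        char *p = mmap(NULL, (npages + 1) * page,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
          return NULL;
        /* Make the page right after the object inaccessible. */
        mprotect(p + npages * page, page, PROT_NONE);
        /* Place the object so it ends at the guard page. */
        return p + npages * page - len;
      }

    The space overhead is visible here: even a 16-byte allocation
    consumes two pages of memory.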
  Fat pointers.
    Idea: modify the pointer representation to include bounds information.
      Requires a different compiler, access to source code.
      Called a "fat pointer" because the pointer becomes much larger.
    E.g. a simple scheme: 4-byte pointer, 4-byte obj base, 4-byte obj end.
    The compiler generates code that aborts when the program dereferences
      a pointer whose address is outside of its own base..end range.
    Problem 1: can be expensive to check all pointer dereferences.
    Problem 2: incompatible with a lot of existing code.
      Cannot pass a fat pointer to an unmodified library.
      Cannot use fat pointers in fixed-size data structures.
      Fat pointer writes are not atomic (some code assumes ptr writes are).

  Shadow data structures to keep track of bounds (Jones and Kelly, Baggy).
    Requires a significant amount of compiler support.
    For each allocated object, keep track of how big the object is.
      E.g. record the value passed to malloc: char *p = malloc(256);
      Or, for static variables, determined by the compiler: char p[256];
    For each pointer, need to interpose on two operations:
      1. pointer arithmetic:    char *q = p + 256;
      2. pointer dereferencing: char ch = *q;
      Need to ensure that out-of-bounds pointers cannot touch other data.
    Challenge 1: looking up limit information for a regular pointer.
      Naive: hash table or interval tree keyed by address.  Slow lookup.
      Naive: array w/ limit info for every memory addr.  Memory overhead.
    Challenge 2: causing out-of-bounds pointer dereferences to fail.
      Naive: interpose on every pointer dereference.  Expensive.
    What's the trick from the paper?
      1. Align and round up allocations to powers of 2: 5 bits of limit info.
      2. Store limit info in a linear array: fast lookup.
      3. Allocate memory at slot granularity: fewer array entries.
      4. Use the virtual memory system to prevent out-of-bounds derefs.
    What's the cost of these tricks?
    What's a buddy allocator?
    Why powers of 2?  Makes the bounds easy to check:
      base = p & ~(size-1)
      p' is in bounds iff ((p ^ p') >> table[p >> log2(slot_size)]) == 0
    What is the data structure that Baggy has in memory?
      Array at a fixed virtual address, one byte per 16-byte memory "slot".
      Uses virtual memory to allocate this array on demand.
    Why don't these guys throw an error when arithmetic goes out-of-bounds?
      Applications simulate 1-indexed arrays.
      Applications want some value to represent one-past-the-end.
    What does their pointer look like?
      Normal pointer for in-bounds values.
      High bit set for out-of-bounds values (within half a slot).
        Typically, the OS kernel lives in the upper half and protects itself.
      Why only half a slot of out-of-bounds range?
    What happens when pointer arithmetic goes out-of-bounds?
      Instrumented code jumps to a slow path, computes the OOB pointer value.
      The new pointer value points to "kernel memory", so dereferencing it
        will cause a crash.
      If an OOB pointer is brought back in bounds, the high bit is cleared.
      In the common case (all in bounds), no extra overhead.
    So what's the answer to the homework problem?
      char *q's pointer value has the high bit set (out-of-bounds).
      Dereferencing it will cause a crash ("Segmentation fault").
    Does Baggy instrument every memory address computation & access?
      Static analysis could prove some addresses are always safe (no details).
      What's considered an "unsafe variable"?
        Address of the variable is taken, and not all uses can be proved safe.
      What does Baggy do with function call arguments?  Why?
        Cannot change their layout; the x86 calling convention is fixed.
        Copy unsafe args to a separate area, which is aligned & protected.
    How do they get binary compatibility with existing libraries?
      Normal pointers for in-bounds values.
      Set the bounds info to 31 (a 2^31 bound) for memory not managed by
        Baggy, e.g. allocated by uninstrumented code, so checks on it pass.
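    Pulling the pieces above together, a hedged C sketch of the Baggy-style
    check on pointer arithmetic (baggy_add, baggy_table, and SLOT_SIZE are
    illustrative names; this is not the paper's exact code):

      #include <stdint.h>
      #include <stdlib.h>

      #define SLOT_SIZE 16          /* one table byte per 16-byte slot */
      extern uint8_t *baggy_table;  /* linear array at a fixed address */

      char *baggy_add(char *p, long off) {
        char *q = p + off;
        /* The table stores log2 of the allocation size; allocations
           are power-of-2 sized and aligned. */
        uint8_t log_size = baggy_table[(uintptr_t) p / SLOT_SIZE];
        /* p and q lie in the same allocation iff they agree on all
           bits above log_size. */
        if ((((uintptr_t) p ^ (uintptr_t) q) >> log_size) != 0)
          abort();  /* Baggy itself would instead take the slow path and
                       mark q out-of-bounds by setting its high bit */
        return q;
      }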
    Why do they instrument strcpy and memcpy?  Do they need to?
    How does their 64-bit scheme work?

  Can bounds checking still be bypassed?
    Could exploit a vulnerability in an un-instrumented library.
    Could exploit temporal vulnerabilities (use-after-free).
    Mixed buffers and code pointers:

      struct my_type {
        char buf[256];
        void (*f) (void);
      };
      struct my_type s;

    An overflow of s.buf will corrupt s.f, but not flag a bounds error.
      Would re-ordering f and buf help?
        Might break applications that depend on the struct layout.
        Might not help if this is an array of (struct my_type)'s.

  What are the costs of bounds checking (e.g. Baggy)?
    Space overhead for limit information (fat pointer or extra table).
    Space overhead for the extra memory used by padding (Baggy).
    CPU overheads for pointer arithmetic, dereferencing.

Approach 4: non-executable memory (AMD's NX bit, Windows DEP, W^X, ..).
  Modern hardware allows specifying read, write, and execute perms for memory.
    R and W permissions have been around for a long time; execute is recent.
  Can mark the stack non-executable, so the adversary cannot run their code.
  More generally, some systems enforce "W^X": all memory is either writable
    or executable, but not both.  (Of course, OK to be neither.)
  Advantage: potentially works without any application changes.
  Shortcoming: harder to dynamically generate code (esp. with W^X).
    JITs like Java runtimes and Javascript engines generate x86 on the fly.
    Can work around it by first writing the code, then marking it executable.
    The Java runtime used to mark all memory W+X (compatible, but not ideal).
  Can this still be exploited?
    Programs already contain a lot of code; might not need to inject new code.
    Arc injection / return-to-libc attacks.
      Particularly useful functions: system, execl, unlink, strcpy, memcpy, ..
    General technique: "return-oriented programming".

Approach 5: randomized memory addresses (ASLR, stack randomization, ..).
  Make it difficult for the adversary to guess a valid code pointer.
  Stack randomization: move the stack to random locations, random offsets.
    The adversary doesn't know what to put for the start of buf in the
      return address.
  Randomize the entire address space (Address Space Layout Randomization):
    Rely on the fact that a lot of code is relocatable.
    The dynamic loader can choose a random address for each library, program.
    The adversary doesn't know the address of system(), etc.
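  As a quick demo (a sketch, not from the notes): with ASLR enabled,
  running this program several times should print different addresses on
  each run; with randomization disabled, the addresses repeat:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
      int stack_var;
      /* Stack, heap, and (for position-independent executables)
         code addresses all shift under ASLR. */
      printf("stack: %p\n", (void *) &stack_var);
      printf("heap:  %p\n", malloc(16));
      printf("code:  %p\n", (void *) &main);
      return 0;
    }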
  Can this still be exploited?
    The adversary might guess the randomness.
      Especially on 32-bit machines, not a lot of random bits:
        1 bit goes to the kernel/user split, 12 bits to the page offset,
        so at most 19 bits are left.
      Most systems have more restrictions, leaving 8-16 random bits.
      More practical on 64-bit machines (easily 32 bits of randomness).
    The adversary might extract the randomness.
      Programs might print a stack trace or error message w/ a pointer.
      If the adversary can run some code, they might get real addresses.
      Cute address leak in Flash's Dictionary (hash table):
        Get the victim to visit your Flash-enabled page (e.g. buy an ad).
        The hash table internally computes the hash value of keys.
          The hash value of an integer is the integer itself.
          The hash value of an object is its address.
        Iterating over a hash table is done in hash bucket order.
        Can compare an object's position to integers, guess its address.
        Now, can exploit some code pointer overwrite and bypass ASLR.
    The adversary might not care exactly where to jump.
      "Heap spraying": fill memory w/ shellcode so that a random jump is OK.
    The adversary might exploit some code that's not randomized (if any exists).

  Some other interesting uses of randomization elsewhere:
    System call randomization (each process has its own system call numbers).
    Instruction set (perhaps machine or SQL language) randomization.

Are these techniques usable?  Should we use them?
  gcc (and MSVC) enable stack canaries by default.
  Linux and Windows include ASLR and NX by default.
  Fuzzing / testing is done a fair amount.
  Bounds checking not as common (performance overheads, false positives).
  Buffer overflows aren't #1 anymore; the web has "won" (SQL injection, XSS).

References:
  http://www.semantiscope.com/research/BHDC2010/BHDC-2010-Paper.pdf