Static Analysis
===============

What's the goal of this paper?
  Static analysis of Java applications.
  Find vulnerabilities by analyzing application code, without running it.
  Vulnerabilities here are some known class of programming mistakes.

What are the advantages/disadvantages of static analysis, and what are the alternatives?
  Alternative: better development practices, careful coding (hard for people).
  Alternative: write test cases, or regression tests (good idea, incomplete).
  Alternative: fuzz-testing, probe with random inputs (effective, incomplete).
  More general forms: dynamic analysis, symbolic execution, ...
  + for static analysis: can be complete, guarantee no errors of a certain kind.
  + for static analysis: can run easily at development time, and fix errors
    (as opposed to dynamic analysis much later, when it may be too late to fix).
  - for static analysis: false positives, doesn't generate a real exploit.
  - for static analysis: may be expensive (CPU time) to run.

What kinds of vulnerabilities is this paper interested in?
  "Unchecked input vulnerabilities."
  Basic idea: input came from the user (could be an adversary), used in an unsafe way.
  Example attacks that fit this pattern:
    SQL injection
    Cross-site scripting
    Path traversal
    HTTP response splitting
  For each attack:
    Where is the input coming from?
    Where is it getting used in an unsafe way?
  One interesting corner case: whois-based XSS attack in phpBB.

Why are unchecked input vulnerabilities particularly common?
  Input comes from the user in some function A().
  Input value gets used (potentially unsafely) in another function B().
  In A(), the programmer forgets or doesn't know the value will be used unsafely.
  In B(), the programmer forgets or doesn't know the value came from the user.
  Why not always check in A()?  Why not always check in B()?

How would programs avoid these vulnerabilities?
  White-listing.
  Black-listing.
  Sanitization.
  Look at examples of each approach for the 4 attacks we consider above.

High-level plan for static analysis:
  Notion of "tainted" objects: objects that contain unchecked inputs.

What does a specification of a vulnerability look like?
  Three "descriptors" (pattern-matching rules) define the vulnerability:
    Sources
    Sinks
    Derivations
  Source descriptors specify what data is assumed "tainted" to start with.
  Derivation rules: how one tainted object can impart taint to another object.
  Goal is to determine if a sink descriptor can ever be tainted.

Can you share these specifications between vulnerabilities?
  Sources: probably a good idea to use the same rules.
    Base sources: reading POST data, GET arguments, request path, headers, ...
    Want to get all untrusted sources in an application.
    Adversary controls the same inputs regardless of the vulnerability.
  Sinks: probably application-specific.
    E.g., opening a file for path traversal,
    a SQL query for SQL injection,
    outputting HTML to the user for cross-site scripting.
  Derivations: good to share, but might need to be app-specific.
    Base derivation statements:
      StringBuffer.append(String), StringBuffer.toString()
    What happens to the standard Java string concatenation with +?
      Compiler rewrites s1+s2 into:
        new StringBuffer().append(s1).append(s2).toString()
    What if you implement your own string copy function?
    When would you need to write specialized derivation functions?
      Byte-level copying between strings (base derivations don't track it).
      Native code that copies data around (maybe for efficiency?).
      Control-flow-based copying of tainted data.
        Common cases:
          switch statement on each character of a string
          array lookup for each character of a string
        Need a derivation statement for any function that does this
        (see the sketch below).
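As an illustration, here is a minimal sketch (hypothetical code, not from the
paper) of control-flow-based copying: every output character is an untainted
constant from a lookup table, selected by the tainted input, so the base
derivation rules never see taint flow from the argument to the result.

    // Copies a string one character at a time through a lookup table.
    // The appended chars are untainted constants; taint flows only
    // through the array index, which the base rules don't track.
    static final char[] UPPER = new char[128];
    static {
        for (char c = 0; c < 128; c++)
            UPPER[c] = Character.toUpperCase(c);
    }

    static String toUpper(String in) {         // assumes ASCII input
        StringBuffer out = new StringBuffer();
        for (int i = 0; i < in.length(); i++)
            out.append(UPPER[in.charAt(i)]);   // array lookup per character
        return out.toString();
    }

A specialized derivation rule would have to state: the return value of
toUpper() is tainted whenever its argument is.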
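Putting the pieces together, a minimal end-to-end sketch of the kind of flow
the analysis hunts for (hypothetical servlet; the class name and DB_URL are
made up, not from the paper):

    import java.io.IOException;
    import java.sql.*;
    import javax.servlet.ServletException;
    import javax.servlet.http.*;

    public class LookupServlet extends HttpServlet {
        static final String DB_URL = "jdbc:mysql://localhost/app";  // made up

        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            String user = req.getParameter("user");      // SOURCE: tainted
            // DERIVATION: javac compiles this + into
            //   new StringBuffer().append(...).append(user).toString(),
            // so the base derivation rules propagate taint to 'query'.
            String query = "SELECT * FROM users WHERE name='" + user + "'";
            try {
                Connection c = DriverManager.getConnection(DB_URL);
                Statement st = c.createStatement();
                st.executeQuery(query);                  // SINK: SQL injection
            } catch (SQLException e) {
                throw new ServletException(e);
            }
        }
    }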
What does this static analysis tool analyze?
  This paper's tool analyzes Java bytecode.
  Java's bytecode is sufficiently type-safe to make it a good target.
  For other languages (like C, C++, ...), you may want source code.  Why?
    Harder to find (likely) invariants in assembly, easier in source.
    E.g., if/then/else branches, distinct struct members, ...

What happens during static analysis?
  Analyzes the Java bytecode in a function.
  Considers the set of objects that might exist at runtime.
  Some objects are tainted (defined by sources, updated via derivations).
  Some objects are sinks (defined by sinks).
  For each bytecode instruction, update what we know about our objects.
  Within a function, may need to keep track of local variables.
  Within a class, may need to keep track of object fields.

What's the pointer analysis problem?
  Analyzing statically: no actual objects, and no actual pointer values.
  Need to make up names to talk about objects that might exist at runtime.
  Given a pointer variable, need to know what objects it might point to.

Compare this tool with XFI's verifier.
  More localized analysis: no heap to reason about, just the scoped stack.
  Easy way out: program does not verify, so it's not a legal XFI module.

How does static analysis name objects?
  Typically, by allocation site.  E.g.:
    StringBuffer sb = new StringBuffer();
    String s = new String("abc");
  For the analysis, all StringBuffer objects allocated on the 1st line are identical.

What kinds of problems can this lead to?
  Might be allocating objects that end up both tainted & untainted.
  Will get false positives as a result.
  Example: containers allocate objects internally.
    As a result, objects allocated inside the container are all identical.
    Paper says: Vector, HashMap, ...

How does this paper propose augmenting the naming scheme?  Why?
  For objects allocated in a container, name them by the caller of the container.
  Why not augment naming for every kind of object, not just containers?
    Presumably cost of analysis: many more objects to consider!
    Fundamentally, static analysis relies on summarizing to scale.
    Precision is probably not going to be affected: the analysis is sound.

What's context-sensitive vs. context-insensitive pointer analysis?
  Context-insensitive: one set of assumptions about the execution of each function.
    E.g., what objects the variables in a function point to.
  Context-sensitive: clone the function based on caller (maybe recursively).
    E.g., during analysis, a separate Vector constructor for each caller.

How might you get false positives?
  Checks that are not on the data path.
    E.g., white-listing or black-listing.
    Paper actually has an almost-example: blojsom.
      Programmer tried to check, but didn't get it quite right.
  False or imprecise sinks (which are sometimes not a threat).
  Impossible-to-exploit vulnerability due to imprecise analysis.
    E.g., objects are aliased due to the same allocation site (sketch below).
    E.g., static analysis cannot reason about control flow.
    Reused objects like StringBuffer.
    Imprecise object names.
      E.g., more containers or helper functions making allocations.
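A minimal sketch (hypothetical code, not one of the paper's benchmarks) of
how allocation-site naming produces such a false positive:

    // Both callers get a StringBuffer from the single allocation site
    // inside newBuffer(), so the analysis gives them one shared name:
    // taint from buildQuery() "leaks" into buildGreeting().
    static StringBuffer newBuffer() {
        return new StringBuffer();        // one site => one object name
    }

    static String buildQuery(String tainted) {
        StringBuffer b = newBuffer();
        b.append(tainted);                // taints the shared name
        return b.toString();
    }

    static String buildGreeting() {
        StringBuffer b = newBuffer();     // same name => "tainted" too
        b.append("hello");
        return b.toString();              // false positive at any sink
    }

Naming the buffer by newBuffer()'s caller (the paper's improved naming), or
analyzing newBuffer() context-sensitively, gives each call site its own
object name and eliminates this false positive.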
How might you get false negatives?
  Missing sinks.
  Missing derivation rules.
    E.g., control-flow-based derivation.
    E.g., derivation through int or char objects (individual string chars).
    E.g., data flowing into & out of the application:
      unchecked input written to the database, later exploits a path traversal attack
      unchecked input written to the database, later exploits cross-site scripting
  Sanitization function only safe when applied in a certain order.
    E.g., need to perform HTTP-escaping or HTML-escaping on some pathname.
    Sanitization function removes all instances of '..',
      but if you sanitize before decoding, the attacker can hide the '..' bytes
      (see the sketch at the end of these notes).
    Might need to write more complex rules to track this.
      Perhaps need two kinds of taint (pre-decoding and post-decoding).
      Then, the sanitization function propagates only pre-decoding taint.
  Vulnerabilities that are not unchecked-input errors.
  Vulnerabilities that are outside the checked application.
    E.g., cross-site scripting bugs in the HTTP server itself.

Do you think buffer overflows are an unchecked input problem?
  Could we formulate a description of buffer overflows with this tool?
  (Assuming this tool worked for something like C, which seems very hard...)

Do their techniques matter?  I.e., context-sensitive analysis and better naming?
  Figure 10(b) seems to suggest yes.
  What are the remaining false positives?
    Another aliased allocation site.
    After adding improved naming for it, no false positive.

What kinds of programs can you analyze with this tool?
  Moderate-size Java applications -- about 1000 Java classes.

Performance
  Performance seems OK, probably can do even better now (5 years later).
  Why does the analysis run faster with improved naming?
  Why does the analysis sometimes run slower or faster with context-sensitivity?
    Authors say a more precise analysis means fewer things to track.
    Context-sensitivity adds overhead: considers more function copies.
    Improved naming in itself mostly just improves precision.

What do you think about this paper / system?
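Returning to the sanitize-before-decode pitfall above, a minimal sketch
(hypothetical code, not from any of the paper's benchmarks):

    // sanitize() strips literal ".." but runs before URL-decoding,
    // so percent-encoded dots slip through.
    static String sanitize(String path) {
        return path.replace("..", "");     // removes literal ".." only
    }

    public static void main(String[] args) throws Exception {
        String input = "%2e%2e/%2e%2e/etc/passwd";    // attacker-supplied
        String cleaned = sanitize(input);              // unchanged: no ".." yet
        String decoded = java.net.URLDecoder.decode(cleaned, "UTF-8");
        System.out.println(decoded);                   // ../../etc/passwd
    }

Decoding first and sanitizing second (or tracking pre-decoding and
post-decoding taint separately, as suggested above) closes the hole.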