Static Analysis
===============

What's the goal of this paper?
  Static analysis of Java applications.
  Find vulnerabilities by analyzing application code, without running it.
  Vulnerabilities here are some known class of programming mistakes.

What are the advantages/disadvantages of static analysis, and what are the alternatives?
  Alternative: better development practices, careful coding (hard for people).
  Alternative: write test cases, or regression tests (good idea, incomplete).
  Alternative: fuzz-testing, probe with random inputs (effective, incomplete).
  More general forms: dynamic analysis, symbolic execution, ...
  + for static analysis: can be complete, guarantee no errors of a certain kind.
  + for static analysis: can run easily at development time, and fix errors
    (as opposed to dynamic analysis much later, when it may be too late to fix).
  - for static analysis: false positives, doesn't generate a real exploit.
  - for static analysis: may be expensive (CPU time) to run.

What kinds of vulnerabilities is this paper interested in?
  "Unchecked input vulnerabilities."
  Basic idea: input came from the user (could be an adversary), used in an unsafe way.
  Example attacks that fit this pattern:
    SQL injection
    Cross-site scripting
    Path traversal
    HTTP response splitting
  For each attack:
    Where is the input coming from?
    Where is it getting used in an unsafe way?
  One interesting corner case: whois-based XSS attack in phpBB.

Why are unchecked input vulnerabilities particularly common?
  Input comes from the user in some function A().
  Input value gets used (potentially unsafely) in another function B().
  In A(), the programmer forgets or doesn't know the value will be used unsafely.
  In B(), the programmer forgets or doesn't know the value came from the user.
  Why not always check in A()?  Why not always check in B()?

How would programs avoid these vulnerabilities?
  White-listing.
  Black-listing.
  Sanitization.
  Look at examples of each approach for the 4 attacks we consider above.

High-level plan for static analysis:
  Notion of "tainted" objects: objects that contain unchecked inputs.

What does a specification of a vulnerability look like?
  Three "descriptors" (pattern-matching rules) define the vulnerability:
    Sources
    Sinks
    Derivations
  Source descriptors specify what data is assumed "tainted" to start with.
  Derivation rules: how one tainted object can impart taint to another object.
  Goal is to determine if a sink descriptor can ever be tainted.

Can you share these specifications between vulnerabilities?
  Sources: probably a good idea to use the same rules.
    Base sources: reading POST data, GET arguments, request path, headers, ...
    Want to get all untrusted sources in an application.
    Adversary controls the same inputs regardless of the vulnerability.
  Sinks: probably application-specific.
    E.g., opening a file for path traversal,
    a SQL query for SQL injection,
    outputting HTML to the user for cross-site scripting.
  Derivations: good to share, but might need to be app-specific.
    Base derivation statements:
      StringBuffer.append(String), StringBuffer.toString()
    What happens to the standard Java string concatenation with +?
      Compiler rewrites s1+s2 into:
        new StringBuffer().append(s1).append(s2).toString()
    What if you implement your own string copy function?
    When would you need to write specialized derivation functions?
      Byte-level copying between strings (base derivations don't track it).
      Native code that copies data around (maybe for efficiency?).
      Control-flow-based copying of tainted data.
        Common cases:
          switch statement on each character of a string
          array lookup for each character of a string
        Need a derivation statement for any function that does this
        (see the sketch below).
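As an illustration, here is a minimal sketch (hypothetical code, not from the
paper) of control-flow-based copying: every output character is an untainted
constant from a lookup table, selected by the tainted input, so the base
derivation rules never see taint flow from the argument to the result.

    // Copies a string one character at a time through a lookup table.
    // The appended chars are untainted constants; taint flows only
    // through the array index, which the base rules don't track.
    static final char[] UPPER = new char[128];
    static {
        for (char c = 0; c < 128; c++)
            UPPER[c] = Character.toUpperCase(c);
    }

    static String toUpper(String in) {         // assumes ASCII input
        StringBuffer out = new StringBuffer();
        for (int i = 0; i < in.length(); i++)
            out.append(UPPER[in.charAt(i)]);   // array lookup per character
        return out.toString();
    }

A specialized derivation rule would have to state: the return value of
toUpper() is tainted whenever its argument is.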
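Putting the pieces together, a minimal end-to-end sketch of the kind of flow
the analysis hunts for (hypothetical servlet; the class name and DB_URL are
made up, not from the paper):

    import java.io.IOException;
    import java.sql.*;
    import javax.servlet.ServletException;
    import javax.servlet.http.*;

    public class LookupServlet extends HttpServlet {
        static final String DB_URL = "jdbc:mysql://localhost/app";  // made up

        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            String user = req.getParameter("user");      // SOURCE: tainted
            // DERIVATION: javac compiles this + into
            //   new StringBuffer().append(...).append(user).toString(),
            // so the base derivation rules propagate taint to 'query'.
            String query = "SELECT * FROM users WHERE name='" + user + "'";
            try {
                Connection c = DriverManager.getConnection(DB_URL);
                Statement st = c.createStatement();
                st.executeQuery(query);                  // SINK: SQL injection
            } catch (SQLException e) {
                throw new ServletException(e);
            }
        }
    }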
What does this static analysis tool analyze?
  This paper's tool analyzes Java bytecode.
  Java's bytecode is sufficiently type-safe to make it a good target.
  For other languages (like C, C++, ...), you may want source code.  Why?
    Harder to find (likely) invariants in assembly, easier in source.
    E.g., if/then/else branches, distinct struct members, ...

What happens during static analysis?
  Analyzes the Java bytecode in a function.
  Considers the set of objects that might exist at runtime.
  Some objects are tainted (defined by sources, updated via derivations).
  Some objects are sinks (defined by sinks).
  For each bytecode instruction, update what we know about our objects.
  Within a function, may need to keep track of local variables.
  Within a class, may need to keep track of object fields.

What's the pointer analysis problem?
  Analyzing statically: no actual objects, and no actual pointer values.
  Need to make up names to talk about objects that might exist at runtime.
  Given a pointer variable, need to know what objects it might point to.

Compare this tool with XFI's verifier.
  More localized analysis: no heap to reason about, just the scoped stack.
  Easy way out: program does not verify, so it's not a legal XFI module.

How does static analysis name objects?
  Typically, by allocation site.  E.g.:
    StringBuffer sb = new StringBuffer();
    String s = new String("abc");
  For the analysis, all StringBuffer objects allocated on the 1st line are identical.

What kinds of problems can this lead to?
  Might be allocating objects that end up both tainted & untainted.
  Will get false positives as a result.
  Example: containers allocate objects internally.
    As a result, objects allocated inside the container are all identical.
    Paper says: Vector, HashMap, ...

How does this paper propose augmenting the naming scheme?  Why?
  For objects allocated in a container, name them by the caller of the container.
  Why not augment naming for every kind of object, not just containers?
    Presumably cost of analysis: many more objects to consider!
    Fundamentally, static analysis relies on summarizing to scale.
    Precision is probably not going to be affected: the analysis is sound.

What's context-sensitive vs. context-insensitive pointer analysis?
  Context-insensitive: one set of assumptions about the execution of each function.
    E.g., what objects the variables in a function point to.
  Context-sensitive: clone the function based on caller (maybe recursively).
    E.g., during analysis, a separate Vector constructor for each caller.

How might you get false positives?
  Checks that are not on the data path.
    E.g., white-listing or black-listing.
    Paper actually has an almost-example: blojsom.
      Programmer tried to check, but didn't get it quite right.
  False or imprecise sinks (which are sometimes not a threat).
  Impossible-to-exploit vulnerability due to imprecise analysis.
    E.g., objects are aliased due to the same allocation site (sketch below).
    E.g., static analysis cannot reason about control flow.
    Reused objects like StringBuffer.
    Imprecise object names.
      E.g., more containers or helper functions making allocations.
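A minimal sketch (hypothetical code, not one of the paper's benchmarks) of
how allocation-site naming produces such a false positive:

    // Both callers get a StringBuffer from the single allocation site
    // inside newBuffer(), so the analysis gives them one shared name:
    // taint from buildQuery() "leaks" into buildGreeting().
    static StringBuffer newBuffer() {
        return new StringBuffer();        // one site => one object name
    }

    static String buildQuery(String tainted) {
        StringBuffer b = newBuffer();
        b.append(tainted);                // taints the shared name
        return b.toString();
    }

    static String buildGreeting() {
        StringBuffer b = newBuffer();     // same name => "tainted" too
        b.append("hello");
        return b.toString();              // false positive at any sink
    }

Naming the buffer by newBuffer()'s caller (the paper's improved naming), or
analyzing newBuffer() context-sensitively, gives each call site its own
object name and eliminates this false positive.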
How might you get false negatives?
  Missing sinks.
  Missing derivation rules.
    E.g., control-flow-based derivation.
    E.g., derivation through int or char objects (individual string chars).
    E.g., data flowing into & out of the application:
      unchecked input written to the database, later exploits a path traversal attack
      unchecked input written to the database, later exploits cross-site scripting
  Sanitization function only safe when applied in a certain order.
    E.g., need to perform HTTP-escaping or HTML-escaping on some pathname.
    Sanitization function removes all instances of '..',
      but if you sanitize before decoding, the attacker can hide the '..' bytes
      (see the sketch at the end of these notes).
    Might need to write more complex rules to track this.
      Perhaps need two kinds of taint (pre-decoding and post-decoding).
      Then, the sanitization function propagates only pre-decoding taint.
  Vulnerabilities that are not unchecked-input errors.
  Vulnerabilities that are outside the checked application.
    E.g., cross-site scripting bugs in the HTTP server itself.

Do you think buffer overflows are an unchecked input problem?
  Could we formulate a description of buffer overflows with this tool?
  (Assuming this tool worked for something like C, which seems very hard...)

Do their techniques matter?  I.e., context-sensitive analysis and better naming?
  Figure 10(b) seems to suggest yes.
  What are the remaining false positives?
    Another aliased allocation site.
    After adding improved naming for it, no false positive.

What kinds of programs can you analyze with this tool?
  Moderate-size Java applications -- about 1000 Java classes.

Performance
  Performance seems OK, probably can do even better now (5 years later).
  Why does the analysis run faster with improved naming?
  Why does the analysis sometimes run slower or faster with context-sensitivity?
    Authors say a more precise analysis means fewer things to track.
    Context-sensitivity adds overhead: considers more function copies.
    Improved naming in itself mostly just improves precision.

What do you think about this paper / system?
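Returning to the sanitize-before-decode pitfall above, a minimal sketch
(hypothetical code, not from any of the paper's benchmarks):

    // sanitize() strips literal ".." but runs before URL-decoding,
    // so percent-encoded dots slip through.
    static String sanitize(String path) {
        return path.replace("..", "");     // removes literal ".." only
    }

    public static void main(String[] args) throws Exception {
        String input = "%2e%2e/%2e%2e/etc/passwd";    // attacker-supplied
        String cleaned = sanitize(input);              // unchanged: no ".." yet
        String decoded = java.net.URLDecoder.decode(cleaned, "UTF-8");
        System.out.println(decoded);                   // ../../etc/passwd
    }

Decoding first and sanitizing second (or tracking pre-decoding and
post-decoding taint separately, as suggested above) closes the hole.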