6.858: TAINT TRACKING

What problem does the paper try to solve?
-Applications can exfiltrate a user's private data and send it to some server.
-High-level approach: keep track of which data is sensitive, and prevent it
 from leaving the device!
-Why aren't Android permissions enough?
    *Android permissions control whether an application can read/write data,
     or access devices or resources (e.g., the Internet).
    *Using Android permissions, it's hard to specify a policy about
     *particular* types of data (Ex: "Even if the app has access to the
     network, it should never be able to send user data over the network").

Q: Aha! What if we never install apps that both read data *and* have
   network access?
A: This would prevent some obvious leaks, but it would also break many
   legitimate apps! [Ex: email app]
   Information can still leak via side channels.
       [Ex: browser cache leaks whether an object has been fetched in the past]
   Apps can collude!
       [Ex: An app without network privileges can pass data to an app that
        does have network privileges.]
   A malicious app might trick another app into sending data.
       [Ex: Sending an intent to the Gmail app?]

What does Android malware actually do?
-Use location or IMEI for advertisements.
    [IMEI is a unique per-device identifier.]
-Credential stealing: send your contact list, IMEI, phone number to a
 remote server.
-Turn your phone into a bot, use your contact list to send spam emails/SMS
 messages!
    [Ref: http://www.bbc.com/news/technology-30143283]
Preventing data exfiltration is useful, but taint tracking by itself is
insufficient to keep your device from getting hacked!

TaintDroid tracks sensitive information as it propagates through the system.
-TaintDroid distinguishes between information sources and information sinks.
    Sources generate sensitive data: Ex: sensors, contacts, IMEI.
    Sinks expose sensitive data:     Ex: the network.
-TaintDroid uses a 32-bit bitvector to represent taint, so there can be at
 most 32 distinct taint sources.
-Roughly speaking, taint flows from the rhs to the lhs of assignments.
      int lat = gps.getLatitude(); //The lat variable is now
                                   //tainted!
 The Dalvik VM is a register-based machine, so taint assignment happens
 during the execution of Dalvik opcodes [see Table 1].
      move_op dst src          //dst receives src's taint
      binary_op dst src0 src1  //dst receives union of src0
                               //and src1's taint
-Interesting special case: arrays.
      char c = //. . . get c somehow.
      char uppercase[] = ['A', 'B', 'C', . . .];
      char upperC = uppercase[c]; //upperC's taint is the
                                  //union of c and uppercase's
                                  //taint.
 To minimize storage overheads, an array receives a single taint tag, and
 all of its elements share that one tag.

Q: Why is it safe to associate just one label with arrays or IPC messages?
A: It should be safe to *over*-estimate taint. This may lead to false
   positives, but not false negatives.

-Another special case: native methods (i.e., internal VM methods like
 System.arraycopy(), and native code exposed via JNI).
    *Problem: Native code doesn't go through the Dalvik interpreter, so
     TaintDroid can't automatically propagate taint!
    *Solution: Manually analyze the native code and provide a summary of
     its taint behavior.
        -Effectively, we need to specify how to copy taints from arguments
         to return values.
    Q: How well does this scale?
    A: The authors argue this works OK for internal VM functions (e.g.,
       arraycopy). For "easy" calls, the analysis can be automated---if only
       integers or strings are passed, assign the union of the input taints
       to the return value.
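To make these rules concrete, here is a minimal sketch in plain Java of the
union-based propagation described above (move, binary op, array read, and the
default policy for "easy" native calls). This is only an illustration---
TaintDroid implements this inside the Dalvik interpreter, not as Java code,
and the source ids below are made up.

    class TaintDemo {
        static final int TAINT_CLEAR    = 0;
        static final int TAINT_LOCATION = 1 << 0;  // hypothetical source ids
        static final int TAINT_IMEI     = 1 << 1;

        // move_op dst src: dst receives src's taint.
        static int moveTaint(int srcTaint) {
            return srcTaint;
        }

        // binary_op dst src0 src1: dst receives the union of the sources' taint.
        static int binaryOpTaint(int src0Taint, int src1Taint) {
            return src0Taint | src1Taint;
        }

        // Array read: the result is tainted by both the index and the array,
        // since the whole array shares a single taint tag.
        static int arrayGetTaint(int arrayTaint, int indexTaint) {
            return arrayTaint | indexTaint;
        }

        // Default policy for an "easy" native call: the return value gets
        // the union of all of the arguments' taints.
        static int nativeCallTaint(int... argTaints) {
            int t = TAINT_CLEAR;
            for (int a : argTaints) t |= a;
            return t;
        }
    }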
-IPC messages are treated like arrays: each message is associated with a
 single taint tag that is the union of the taints of its constituent parts.
    *Data extracted from an incoming message is assigned the taint of that
     message.
-Each file is associated with a single taint tag that is stored in the
 file's metadata.
    *As with arrays and IPC messages, this is a conservative scheme that may
     lead to false positives.

How are taint flags represented in memory?
-Five kinds of things need to have taint tags:
    1) Local variables in a method -+
    2) Method arguments            -+--> live on the stack
    3) Object instance fields
    4) Static class fields
    5) Arrays
-Basic idea: Store the flags for a variable near the variable itself.
 Q: Why?
 A: It preserves spatial locality---this hopefully improves caching behavior.
 For method arguments and local variables that live on the stack, allocate
 the taint flags immediately next to the variable.
             .
             .
    +------------------+
    | local0           |
    +------------------+
    | local0 taint tag |
    +------------------+
    | local1           |
    +------------------+
    | local1 taint tag |
    +------------------+
             .
             .
 TaintDroid uses a similar approach for class fields, object fields, and
 arrays---put the taint tag next to the associated data.

So, given all of this, the basic idea in TaintDroid is simple: taint
sensitive data as it flows through the system, and raise an alarm if that
data tries to leave via the network!
-The authors find various ways that apps misbehave. Ex:
    *Sending location data to advertisers
    *Sending a user's phone number to the app servers

TaintDroid's rules for information flow might lead to counterintuitive/
interesting results.
-Imagine that an application implements its own linked list class.
      class ListNode{
          Object data;
          ListNode next;
      }
 Suppose that the application assigns tainted values to the "data" field.
 If we calculate the length of the list, is the length value tainted?
-Adding to a linked list involves:
    1) Allocating a ListNode
    2) Assigning to the "data" field
    3) Patching up "next" pointers
 Note that Step 3 doesn't involve tainted data! So, "next" pointers are
 *not* tainted, meaning that counting the number of elements in the list
 does not generate a tainted value for the length.

What are the performance overheads of TaintDroid?
-Additional memory to store taint tags.
-Additional CPU cost to assign, propagate, and check taint tags.
-Overheads seem to be moderate: ~3--5% memory overhead, 3--29% CPU overhead.
    *However, on phones, users are very concerned about battery life: 29%
     less CPU performance may be tolerable, but 29% less battery life is bad.

Q: Why not track taint at the level of x86 instructions or ARM instructions?
A: It's too expensive, and there are too many false positives.
    *Ex: If kernel data structures are improperly assigned taint, then the
     taint will improperly flow to user-mode processes. This results in
     taint explosion: it's impossible to tell which state has *truly* been
     affected by sensitive data.
     One way that this might happen is if the stack pointer or the base
     pointer is incorrectly tainted. Once this happens, taint rapidly
     explodes:
        -Local variable accesses are specified as offsets from the base
         pointer.
        -Stack instructions like pop use the stack pointer.
        -[Ref: http://www.ssrg.nicta.com.au/publications/papers/Slowinska_Bos_09.pdf]

Q: Taint tracking seems expensive---can't we just examine inputs and outputs
   to look for values that are known to be sensitive?
A: This might work as a heuristic, but it's easy for an adversary to get
   around it.
    *There are many ways to encode data, e.g., URL-quoting, binary versus
     text formats, etc.
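As a (hypothetical) illustration, the sketch below sends the same IMEI value
in three encodings; only the first would match a scanner looking for the
literal digits, yet all three leak exactly the same information.

    import java.math.BigInteger;
    import java.util.Base64;

    class EncodingEvasion {
        public static void main(String[] args) {
            String imei = "490154203237518";  // made-up IMEI value

            String plain  = imei;
            String hex    = new BigInteger(imei).toString(16);
            String base64 = Base64.getEncoder().encodeToString(imei.getBytes());

            // All three strings carry the same information, but only "plain"
            // matches a scan for the literal IMEI digits.
            System.out.println(plain);
            System.out.println(hex);
            System.out.println(base64);
        }
    }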
As described, taint tracking cannot detect implicit flows.
-Implicit flows happen when a tainted value affects another variable without
 directly assigning to that variable.
      if(imei > 42){
          x = 0;
      }else{
          x = 1;
      }
 x is never assigned from the IMEI, but its value reveals information about
 the IMEI---so we could leak information about the IMEI over the network
 without ever sending a tainted value!
-Implicit flows often arise because of tainted values affecting control flow.
    *Can try to catch implicit flows by assigning a taint tag to the PC,
     updating it with the taint of each branch test, and assigning the PC's
     taint to values assigned inside the if-else clauses, but this can lead
     to a lot of false positives. Ex:
      if(imei > 42){
          x = 0;
      }else{
          x = 0;
      }
      //The taint tracker thinks that
      //x should be tagged with imei's
      //taint, but there is no information
      //flow!

Interesting application of taint tracking: keeping track of data copies.
-We often want to make sure sensitive data (keys, passwords) is erased
 promptly.
-If we're not worried about performance, we can use x86-level taint tracking
 to see how sensitive information flows through a machine.
    [Ref: http://www-cs-students.stanford.edu/~blp/taintbochs.pdf]
-Basic idea: Create an x86 simulator that interprets each x86 instruction in
 a full system (OS + applications).
-You'll find that software often keeps data for longer than necessary. For
 example, keystroke data stays around in:
    *The keyboard device driver's buffers
    *The kernel's random number generator
    *The X server's event queue
    *Kernel socket/pipe buffers used to pass messages containing the keystroke
    *tty buffers for terminal apps
    *. . . etc . . .

TaintDroid detects leaks of sensitive data, but requires support from the
language runtime---the Java VM must implement taint tags. Can we track
sensitive information leaks without support from a managed runtime? What if
we want to detect leaks in legacy C or C++ applications?
-One approach: use doppelganger processes, as introduced by the TightLip
 system.
    [Ref: https://www.usenix.org/legacy/event/nsdi07/tech/full_papers/yumerefendi/yumerefendi.pdf]
-Step 1: Periodically, TightLip runs a daemon which scans a user's file
 system and looks for sensitive information like mail files, word processing
 documents, etc.
    *For each of these files, TightLip generates a shadow version of the
     file. The shadow version is non-sensitive, and contains scrubbed data.
    *TightLip associates each type of sensitive file with a specialized
     scrubber. Ex: the email scrubber overwrites to: and from: fields with
     an equivalent number of dummy characters.
-Step 2: At some point later, a process starts executing. Initially, it
 touches no sensitive data. If it touches sensitive data, then TightLip
 spawns a doppelganger process.
    *The doppelganger is a sandboxed version of the original process.
        -It inherits most state from the original process . . .
        -. . . but reads the scrubbed data instead of the sensitive data.
    *TightLip lets the two processes run in parallel, and observes the
     system calls that the two processes make.
    *If the doppelganger makes the same system calls with the same arguments
     as the original process, then with high probability, the outputs do not
     depend on sensitive data.
-Step 3: If the system calls diverge, and the doppelganger tries to make a
 network call, TightLip flags a potential leak of sensitive data.
    *At this point, TightLip or the user can terminate the process, fail the
     network write, or do something else.
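A toy sketch of the comparison in Steps 2 and 3, written in Java only to stay
consistent with the rest of these notes (the real TightLip does this at the
system-call layer inside the OS, and "send" below is just a stand-in for any
network-write call):

    import java.util.List;

    class DivergenceCheck {
        record Syscall(String name, String args) {}  // simplified trace entry

        static boolean potentialLeak(List<Syscall> original,
                                     List<Syscall> doppel) {
            // Walk the two traces in lockstep until they disagree.
            int n = Math.min(original.size(), doppel.size());
            int diverged = -1;
            for (int i = 0; i < n; i++) {
                if (!original.get(i).equals(doppel.get(i))) {
                    diverged = i;
                    break;
                }
            }
            if (diverged < 0) return false;  // same calls, same args: likely ok
            // After the divergence, does the doppelganger hit the network?
            for (int i = diverged; i < doppel.size(); i++) {
                if (doppel.get(i).name().equals("send")) return true;
            }
            return false;
        }
    }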
-Nice things about TightLip:
    *Works with legacy applications.
    *Requires only minor changes to standard OSes (to compare the order of
     system calls and their arguments).
    *Low overhead (basically, the overhead of running an additional process).
-Limitations of TightLip:
    *Scrubbers are in the trusted computing base.
        -They have to catch all instances of sensitive data.
        -They also have to generate reasonable dummy data---otherwise, a
         doppelganger might crash on ill-formed inputs!
    *If a doppelganger reads sensitive data from multiple sources, and a
     system call divergence occurs, TightLip can't tell why.

TaintDroid and TightLip assume no assistance from the developer . . . but
what if developers were willing to explicitly add taint labels to their code?
      int {Alice --> Bob} x;  //Means that x is controlled
                              //by the principal Alice, who
                              //allows that data to be seen
                              //by Bob.
 Input channels:  values that are read get the label of the channel.
 Output channels: the label on the channel must match the label on the value
                  being written.
-Static (i.e., compile-time) checking can catch many bugs involving
 inappropriate data flows.
    *Loosely speaking, labels are like strong types which the compiler can
     reason about.
    *Static checks are much better than dynamic checks: runtime failures (or
     their absence) can be a covert channel!
-For more details, see the Jif paper:
 http://pmg.csail.mit.edu/papers/iflow-sosp97.pdf
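To make the channel rules concrete, here is a rough *dynamic* analogue in
plain Java (Jif enforces this statically at compile time, which is exactly
what avoids the runtime-failure covert channel mentioned above; the names
here are illustrative, not Jif's API):

    import java.util.Set;

    class LabelDemo {
        // A labeled value: the set of principals allowed to read it.
        record Labeled<T>(T value, Set<String> readers) {}

        // A write is allowed only if everyone who can read the channel is
        // also allowed to read the value (the channel is at least as
        // restrictive as the value's label).
        static <T> void writeToChannel(Labeled<T> v, Set<String> channelReaders) {
            if (!v.readers().containsAll(channelReaders)) {
                throw new SecurityException("flow not allowed: label mismatch");
            }
            System.out.println(v.value());  // stand-in for the actual channel
        }

        public static void main(String[] args) {
            Labeled<Integer> x = new Labeled<>(42, Set.of("Alice", "Bob"));
            writeToChannel(x, Set.of("Bob"));             // ok: Bob may read x
            writeToChannel(x, Set.of("Bob", "Charlie"));  // throws: Charlie may not
        }
    }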