Privacy
=======

Administrivia.
  Quiz this coming Monday, moved to Walker (50-340).
    Covers lectures 1-12 (up to ForceHTTPS) and labs 1-3.
    Open laptop, notes, etc.  No Internet access allowed.
  Quiz review this evening, in 34-101.
  Final project blurbs due in Piazza tomorrow.
    Doesn't have to be fully baked, and you don't have to commit to it yet.
    Comment on other students' ideas that you find interesting.
    Feel free to switch topics if you like another idea better.
  Relevant research talk today.
    Wil Robertson (NEU): PrivExec, an extension of Linux to enforce privacy.
    4pm in 32-G882.
    [ Ref: http://nms.csail.mit.edu/sys-security/details.html#Oct2313 ]

What is the goal of privacy?
  Ideal: the activity of a given user is indistinguishable from the
    activity of many other users.

Paper looks at browser support for "private browsing".
  Recent feature, introduced by many browsers several years ago.
  One example of how "security" features are introduced in browsers.
    No universal agreement on what the feature means.
    No formal guarantees; "best-effort" in some sense.
    Evolves over time in response to user demand and what other browsers do.
      Safari now has a more visible UI and tries to address the web attacker.
    As users rely on it more and expect strong guarantees, problems show up.

Goal:
  User 1 -- Browser -- Computer -- ISP --\
                                          Internet -- Web site
  User 2 -- Browser -- Computer -- ISP --/

Reality:
  User -- Browser with 2 modes -- Computer -- ISP -- Internet -- Web site

What do the browsers mean by "private browsing"?
  Paper formalizes this as two threat models / attacks.
  Why split the threat model into two?
    The ideal goal remains the same, but it is hard to achieve.
    If the two attackers collude, it becomes easier for them to identify
      the user.
      E.g., the local attacker asks the server to check for the user's IP
        address in its access logs.
    Still, there is practical value to security against these two attacks
      in isolation.

Threat 1: local attacker.
  Assumptions:
    Gets control of the user's machine (after the private session is over).
    Wants to learn what the user did in private browsing mode.
  Goal: attacker should not learn what the user did in past private
    browsing sessions.
  Non-goal: achieve privacy for future private browsing sessions.
    Attacker can modify software on the machine and log all future browsing.
    E.g., install a keystroke logger.
  Stronger goal: hide the fact that private browsing was used at all.
    Often called "plausible deniability".
    Why does the paper claim this is more difficult to achieve?
      Not fully clear: maybe access times on program files?

Data lifetime demo.
    host% cd ubuntu-vm/vm
    host% ./run.sh
    Login: demo/demo
    Open Firefox, visit http://www.stanford.edu/
    Ctrl-Shift-P, visit http://pdos.csail.mit.edu/
    Close both browsers.
    host% gcore $(pgrep qemu-system)
    host% strings core.* | grep -i stanford
    host% strings core.* | grep -i pdos
    host% strings temp.qcow2 | grep -i stanford
    host% strings temp.qcow2 | grep -i pdos
  Try to force data out of memory:
    guest% ./memuse
    Re-run gcore, strings core.*; is there still data left?
    What if we force data out of memory while private browsing is running?
    Re-run strings temp.qcow2; is there data on disk now?

Data lifetime is a broader problem than just private browsing.
  E.g., cryptographic keys or passwords might be problematic if disclosed.
  [ Ref: http://css.csail.mit.edu/6.858/2010/readings/chow-shredding.pdf ]

Demo 2:
    host% cd ubuntu-vm
    host% cat memclear.c
    host% cat secret.txt
    host% make memclear
    host% ./memclear &
    host% gcore $(pgrep memclear)
    host% strings core.* | grep secret
    host% objdump -d memclear | less    ## look for read_secret
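Neither demo helper is reproduced in these notes.  The memuse tool run in
the first demo presumably just dirties as much memory as it can get, to
pressure other data out of RAM (and, under enough pressure, out to swap).
A plausible sketch -- the real tool may differ:

    /* memuse.c (sketch): allocate and touch memory until allocation
     * fails.  On Linux, with overcommit, the OOM killer may end the
     * process instead -- either way, other processes' pages get
     * evicted, which is the point of the demo. */
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        size_t chunk = 64 * 1024 * 1024;
        for (;;) {
            char *p = malloc(chunk);
            if (p == NULL)
                break;
            memset(p, 0xaa, chunk);   /* touch every page for real */
        }
        return 0;
    }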
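Similarly, a minimal sketch of what Demo 2's memclear.c might look like.
The file name secret.txt and the function name read_secret come from the
demo commands above; the body is our guess at the demo's intent:

    /* memclear.c (sketch): read a secret, "erase" it, then idle so we
     * can gcore the process.  A core dump may still contain the secret:
     * stdio keeps its own copy of the line in the FILE's buffer (freed
     * but not zeroed by fclose), and the compiler may delete the memset
     * entirely, since buf is never read afterwards. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void read_secret(void) {
        char buf[128];
        FILE *f = fopen("secret.txt", "r");
        if (f == NULL)
            return;
        if (fgets(buf, sizeof buf, f))     /* secret now in buf... */
            memset(buf, 0, sizeof buf);    /* "clear" it (may be elided) */
        fclose(f);                         /* ...and in f's stdio buffer */
    }

    int main(void) {
        read_secret();
        pause();
        return 0;
    }

Compiling with optimization (e.g., -O2) makes it likely that the memset
is removed as a dead store; the objdump step in the demo is exactly how
to check.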
How does data persist?
  Process memory: heap, stack.
    Terminal scrollback.
    I/O buffers, X event queues, DNS cache, proxy servers, ...
    Language runtime makes copies (immutable strings in Python).
  Files, file backups.
  Swapped memory, hibernate files on laptops.
  Kernel memory.
    I/O buffers: keyboard, mouse inputs.
    Freed memory pages.
    Network packet buffers.
    Pipe buffers contain data sent between processes.
    Random number generator inputs (including keystrokes again).

How could an adversary get a copy of leftover data?
  Files themselves contain many versions (e.g., Word documents).
  Programs reveal uninitialized memory.
    Forgetting to zero out allocated memory.
    Word files used to contain uninitialized portions, often old text.
  Core dumps.
  Direct access to the machine.
    Flash SSDs implement logging; they don't erase old data right away.
    Stolen disks, or just disposing of old disks.
    [ Ref: http://news.cnet.com/2100-1040-980824.html ]

How to deal with data lifetime problems?
  Zero out unused memory.  (Some performance hit.)
  Encrypt data in places where zeroing out is difficult (e.g., on an SSD).
    Securely deleting the key means the data cannot be decrypted anymore.
    E.g., OpenBSD swap uses encryption, with a new key at each bootup.
    CPU cost of encryption is modest compared to the cost of disk I/O.
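"Zero out unused memory" is trickier than it sounds: as in the memclear
demo, a compiler may delete a memset whose result is never read.  A
minimal sketch of one standard workaround (the name memset_v and the
program around it are ours, not from the lecture):

    #include <stdlib.h>
    #include <string.h>

    /* Call memset through a volatile function pointer: the compiler
     * cannot prove the call has no effect, so it cannot remove it.
     * (On modern systems, explicit_bzero() or C11's memset_s() do
     * the same job.) */
    static void *(*const volatile memset_v)(void *, int, size_t) = memset;

    int main(void) {
        char *key = malloc(32);
        if (key == NULL)
            return 1;
        /* ... use the key ... */
        memset_v(key, 0, 32);   /* really zero it before it's reused */
        free(key);
        return 0;
    }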
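And a toy illustration of the second idea -- crypto-erasure, as in
OpenBSD's encrypted swap: encrypt data headed to stable storage under a
random per-boot key that lives only in RAM, so destroying the key makes
the on-disk ciphertext useless.  This sketch uses OpenSSL's EVP API
purely for illustration; it is not how OpenBSD implements it.  Build
with cc crypterase.c -lcrypto.

    #include <openssl/crypto.h>
    #include <openssl/evp.h>
    #include <openssl/rand.h>

    int main(void) {
        unsigned char key[16], iv[16];
        unsigned char page[4096] = "pretend this is a swapped-out page";
        unsigned char enc[4096];
        int n;

        RAND_bytes(key, sizeof key);   /* fresh key at each "bootup" */
        RAND_bytes(iv, sizeof iv);

        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        EVP_EncryptInit_ex(ctx, EVP_aes_128_ctr(), NULL, key, iv);
        EVP_EncryptUpdate(ctx, enc, &n, page, sizeof page);
        EVP_CIPHER_CTX_free(ctx);
        /* enc (plus the iv) is what would go to the swap device */

        OPENSSL_cleanse(key, sizeof key);   /* "secure delete": with the
                                               key gone, the ciphertext
                                               can never be decrypted */
        return 0;
    }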
Threat 2: web attacker.
  Assumptions:
    Controls the web sites that the user visits.
    Does not control the user's machine.
    Wants to identify the user.
  Goal: attacker should not be able to identify the user.
  Goal: attacker should not be able to determine whether the user is
    using private browsing.
  Most browsers fail to satisfy the second goal: sites can detect
    private browsing.
    Hard to reconcile with how browsers originally implemented private
      browsing: a site can detect that history or cookies aren't working.
    Possible to solve in principle.
      Create a temporary history database, cookie database, etc.
      Chrome is closer to this design now for some state (e.g., cookies).
      Still no history within a single private browsing session.
  What does it mean to identify a user?
    Link visits by the same user from different private browsing sessions.
    Link visits by the same user from private and public browsing sessions.
  Easy way to identify a user: IP address.
    With reasonable probability, requests from the same IP address are
      the same user.
    In two weeks, we will talk about Tor.
      Tor protects the privacy of the source of a TCP connection
        (i.e., the user's IP address).
      Tor must still worry about the other problems that private
        browsing tries to address.
  Browser fingerprinting demo.
    Open Chrome, go to http://panopticlick.eff.org/
    Ctrl-Shift-N, open the same web site in private browsing mode.
  Good way to think of privacy: the anonymity set of a user.
    What's the largest set of users among which some user is
      indistinguishable?

How to provide strong guarantees for private browsing?
  (Ignoring IP address privacy for now, or assume we'll combine this
   with Tor.)

Separate computers, or VM-level privacy.
  Overall plan:
    Run each private browsing session in a separate VM.
    Ensure that the VM is deleted after private browsing is done.
    Somehow make sure that no VM state ends up on disk.
  Advantages:
    Strong guarantee against both a local attacker and a web attacker.
    No changes required to the application; just need secure deletion
      of the VM.
  Going back to the earlier picture:
           /-- Browser -- VM --\
      User                      Computer -- ISP -- Internet -- Web site
           \-- Browser -- VM --/
  Drawbacks:
    Spinning up a separate VM for private browsing is heavyweight.
    Harder for the user to save files from private browsing, use
      bookmarks, etc.
    In this sense, there is an inherent trade-off between usability
      and privacy.

OS-level privacy.
  Implement similar guarantees at the OS kernel level.
  A process can run in a "privacy domain", which is deleted afterwards.
  Advantage over VMs: lighter-weight.
  Drawback: harder to get right (the OS kernel manages lots of state).
  (The PrivExec talk later today is about such a design.)

Are there ways to de-anonymize a user with the VM-level approach?
  Maybe the VM itself is unique.
    Need to ensure the VM is similar to other users' VMs.
  Maybe the host computer introduces some uniqueness.
    TCP fingerprinting, especially if the host is doing NAT.
    Effects introduced by the virtual machine monitor?
  The user is still shared between sessions in the VM picture.
    Detect the user's keystroke timing (see the sketch below).
    Detect the user's writing style, usually called stylometry.
    [ Ref: http://33bits.org/2012/02/20/is-writing-style-sufficient-to-deanonymize-material-posted-online/ ]
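To get a feel for the keystroke-timing vector: inter-keystroke delays
are measurably user-specific, and a web attacker can collect the same
timings from JavaScript key events.  A hypothetical terminal-side
sketch (ours, not from the paper):

    /* keytime.c: print the interval between successive keystrokes.
     * Distributions of these intervals can fingerprint a typist. */
    #include <stdio.h>
    #include <termios.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        struct termios old, raw;
        tcgetattr(0, &old);
        raw = old;
        raw.c_lflag &= ~(ICANON | ECHO);   /* byte-at-a-time, no echo */
        tcsetattr(0, TCSANOW, &raw);

        struct timespec prev, now;
        clock_gettime(CLOCK_MONOTONIC, &prev);
        char c;
        while (read(0, &c, 1) == 1 && c != 'q') {
            clock_gettime(CLOCK_MONOTONIC, &now);
            long ms = (now.tv_sec - prev.tv_sec) * 1000
                    + (now.tv_nsec - prev.tv_nsec) / 1000000;
            printf("+%ld ms\n", ms);
            prev = now;
        }
        tcsetattr(0, TCSANOW, &old);   /* restore terminal settings */
        return 0;
    }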
Why do browsers implement their own private browsing support?
  Compatibility: no need to rely on VM or OS mechanisms.
    Similar to the motivations for Native Client.
  Usability: want to allow some state changes to persist across sessions.
    Inherently dangerous plan!

Overall picture:
           /-- Browser session 1 --\
      User                          Computer -- ISP -- Internet -- Web site
           \-- Browser session 2 --/
  Key problem: deciding how to manage state across the different sessions.

How does the paper propose to think about these state changes?
  Depends on who initiated the state change (section 2.1).  Categories:
  1. Initiated by web site, no user interaction: stays within session.
     Cookies, history, cache.
  2. Initiated by web site, requires user interaction: unclear.
     Client certificates, password manager, custom protocol handlers.
  3. Initiated by user: unclear.
     Bookmarks, file downloads.
  4. Unrelated to a session: treat as a single global state.
     Browser updates, certificate revocation list updates.

What do browsers actually implement?
  Each browser is, of course, different.
  Moreover, some state "bleeds over" in one direction but not the other!
    Not a strict partitioning between private-mode and public-mode state.

What happens if public state bleeds over into private state?
  Easier for the web attacker to succeed: link a private session to a
    public session.
  E.g., all of the items in Table 1.  Why are they set to "yes"?
    Mostly things in categories 2 and 3.
    Requires the user to expose that state in private mode.
    Assumes the user is aware of the implication (e.g., when using an
      autocomplete value).
  One bad case: Firefox's custom protocol handlers, which a site can
    read without user involvement.

What happens if private state bleeds over into public state?
  Both the web attacker and the local attacker can succeed.
    Web attacker: observe that the public session contains state from
      the private session.
    Local attacker: observe leftover state from the private session.
  Examples shown in Table 2.  Why are there "yes" entries?
    Requires user interaction to make the state change.
    Assumes the user is aware of the implication (e.g., saving a bookmark).
  Possibly bad case: downloaded items (Chrome might auto-download).
  Why are some items asymmetric with Table 1?
    Reasonable user involvement in reading state, less involvement in
      updating it.  E.g., autocomplete, search box history.

What should happen to state while the user remains in a private-mode session?
  Most browsers allow state to persist within a private-mode session.
  Table 3.  What does a "no" entry mean?
    A web site can detect that the user is in private browsing mode.
  Why is it OK to allow cookies within a private browsing session?

What should happen to state across private-mode sessions?
  Should we expect another Table 4?
    Probably not: just a combination of Table 2 followed by Table 1.
    Think of each private-mode session as sharing state with a single
      public mode.

How to think of the IE integration with the SMB bug?
  The browser will send the user's Windows username to any computer on
    the Internet.
  The Windows username effectively becomes part of the browser's state.

Why are certificates a special case?
  Analogy with passwords: the user can choose to log in with their
    password while in private browsing mode, if they want to, or to
    create a new account.
  With certificates, the browser MUST be involved: the user alone
    cannot supply a certificate.
    Browsers require user interaction either to create one (modify
      state) or to use one (allow a web site to read state).
  E.g., Chrome is much more aggressive about asking the user whether
    it is OK to use a certificate.
    In normal mode, it automatically sends the certificate if the site
      was visited before.
    In private mode, Chrome prompts whether to send the certificate on
      every HTTP request.

Why are extensions / plugins special?
  Privileged code.
    Can access shared state.
    Not subject to the same-origin policy or other browser security checks.
  Developed by someone other than the browser vendor.
    Might not be aware of private mode, or might mis-implement the
      intended policy.

How to test for bugs in private-mode implementations of browsers and
extensions?
  Try to find a way in which public state is affected from private mode
    (see the snapshot sketch at the end of these notes).
    Save a copy of the browser's state (files).
    Run various workloads while in private mode.
      Browser unit tests, load pages that use some extension/plugin, etc.
    Exit private mode and see if the resulting state differs.
      A difference might signal that private mode was used (easy to
        check once).
      A difference might also reveal what happened within private mode.
  What kinds of bugs does this catch?
    Problems from category 1 (initiated by web site, no user interaction).
    E.g., discovered that Firefox stores CA chains observed in private mode.
  What kinds of bugs does this fail to catch?
    Won't catch subtle ways an adversary might try to affect public state.
    Won't catch cases where user involvement is needed and the user
      makes mistakes.
  How to find cases with user involvement?
    Manual audit of calls that update state in the browser or extension
      source code.
    Found some examples; whether each is a problem depends on the user.
      Setting custom protocol handlers in Firefox (e.g., mailto:).
      Popup-blocking or ad-blocking preferences.
      Getting SSL client certificates.
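A minimal sketch of the state-snapshot step in the testing approach
above.  This is our illustration, not the paper's actual harness: it
records only each file's size and modification time, so it flags that
state changed but not how; a real test should also diff file contents.

    /* snapshot.c: walk a directory tree, print one line per file
     * (path, size, mtime).  Diff two runs to spot state changes. */
    #define _XOPEN_SOURCE 500
    #include <ftw.h>
    #include <stdio.h>

    static int visit(const char *path, const struct stat *sb,
                     int type, struct FTW *ftwbuf) {
        (void)ftwbuf;
        if (type == FTW_F)
            printf("%s %lld %lld\n", path,
                   (long long)sb->st_size, (long long)sb->st_mtime);
        return 0;
    }

    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "usage: %s dir\n", argv[0]);
            return 1;
        }
        return nftw(argv[1], visit, 64, FTW_PHYS) != 0;
    }

Usage against, e.g., a Firefox profile directory (~/.mozilla is where
Firefox keeps profile state on Linux):

    host% ./snapshot ~/.mozilla > before.txt
      (run a private browsing session, then exit)
    host% ./snapshot ~/.mozilla > after.txt
    host% diff before.txt after.txt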