Security architecture
=====================

This lecture: how to think about systems security, at an architectural level.
  Real-world example: Google's security architecture.
    Paper written with a focus on Google's cloud platform offering.
    Does not describe all security aspects of all Google services.
    Nonetheless, an impressive, pithy, and broad overview of Google's security.
    Many paragraphs could each be a full lecture / paper in their own right.
  Good overview of what a security architecture can look like.
  Overall principles: Butler Lampson's talk.
    [[ Ref: http://css.csail.mit.edu/6.858/2015/lec/lampson.pdf ]]

Should Google publish a document like this?
  Is it a good idea to assume the adversary doesn't know the design?
  Hard to keep a design secret (contrast with a key, password, ...).
    Hard to scale to many engineers with a secret design.
      Can't tell too many engineers about the design!
    Hard to recover security after compromise.
      Would need to come up with a new secret design.

What are the security goals here?
  Avoid disclosure of customer data (e.g., Gmail).
    And especially so for deleted data.
  Ensure availability of Google's applications.
  Track down what went wrong if a compromise occurs.
  Help Google engineers build secure applications.
    This document focuses on the infrastructure.
  Broadly, ensure customers trust Google.

Worried about many threats:
  Software bugs.
  Configuration mistakes.
  Compromised networks (customer, Internet, Google's internal).
  Physical compromises of some data centers.
  Guessable user passwords.
  Targeted phishing attacks (against users, against employees).
  Compromised developer workstations ("employee client devices").
  Insider attacks (bribing an engineer or data center operator).
  Hardware trojans.
  Governments.
  Adversaries getting access to old disks.

Threat model (assumptions about the world) not precisely pinned down.
  Not the goal of this document.
  In a large system like Google's, likely a range of assumptions.
    Probably a more precise, smaller set of assumptions for user auth.
    Less pinned down for Google search, maybe?
  Can probably guess some assumptions:
    Someone trustworthy audits code changes.
    Base isolation mechanisms (e.g., VMs) are correctly implemented.
      Even with isolation, a range of reliance on that assumption.
        Key management runs on separate machines.
        Avoids both VM bugs and dependency issues when a data center reboots.

Rest of the lecture: what are the policies and mechanisms?

What is Google's model of running applications?
  Data centers.
  Physical machines.
  Virtual machines.
  Service: composed of several VMs implementing the same logical API.
  RPC between services, with authentication / authorization checks.
  Front-end servers convert HTTP/HTTPS into RPC.

Basis of security: isolation.
  Crucial to have a host that provides an isolated execution environment.
  Host is trusted to implement isolation correctly.
  Examples of isolation in Google's design?
    [[ "Service Identity, Integrity, and Isolation" ]]
    Linux user separation.  [[ A toy sketch appears at the end of this section. ]]
    Language sandboxes.
    Kernel sandboxes.
    Hardware virtualization.
    Physical isolation (separate machines), for particularly sensitive services.
  What attacks are they worried about?
    Isolation failures: host bugs, side channels, ...
  What is the trusted host for each of these isolation plans?
    Even physical isolation is not without trust assumptions.
      Shared network.
      Physical access (some tidbits about Google's physical security).
      Correct hardware (vet hardware designers).
    Even air gaps have a non-trivial (and imperfect) host: TEMPEST.
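As a toy illustration of Linux user separation: fork a worker process and drop
it to an unprivileged uid/gid before it touches untrusted work.  A minimal
sketch, not how Google's stack is actually built; it assumes a Unix host,
Python 3.9+, root privileges to start, and that uid/gid 65534 ("nobody")
exists on the system.

    import os

    def run_sandboxed(worker, uid=65534, gid=65534):
        # Fork a child, drop to an unprivileged uid/gid, then run worker().
        # 65534 is conventionally "nobody"; adjust for your system.
        # Must start as root, or setgroups/setgid/setuid fail with EPERM.
        pid = os.fork()
        if pid == 0:                      # child
            os.setgroups([])              # drop supplementary groups
            os.setgid(gid)                # drop group first, then user
            os.setuid(uid)                # after this, no way back to root
            try:
                worker()
                os._exit(0)
            except Exception:
                os._exit(1)
        _, status = os.waitpid(pid, 0)    # parent: wait for the child
        return os.waitstatus_to_exitcode(status)

    if __name__ == "__main__":
        # The child can no longer read root-only files:
        code = run_sandboxed(lambda: open("/etc/shadow").read())
        print("worker exit code:", code)  # expect 1 (PermissionError)

The trusted host here is the Linux kernel: the sandbox is only as good as the
kernel's enforcement of uid checks.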
We will explore a range of isolation mechanisms in later lectures.

Sharing: guard / reference monitor.

                     +--------------------------------+
                     |  Policy                        |
                     |    |                           |
            request  |    V                           |
Subject/principal ---|--> GUARD --> Object/resource   |
                     |      |                         |
                     |      V                         |
                     |  +-----------+                 |
                     |  | Audit log |                 |
                     |  +-----------+                 |
                     +--------------------------------+
                        HOST enforcing isolation

  Requires a secure channel between the subject and the guard.
    One way to do this is to use crypto.
    Can also rely on the host (e.g., OS kernel) for a secure channel within a machine.

What does the guard do?  "Gold standard": authenticate, authorize, audit.
  [[ A toy guard is sketched at the end of this section. ]]

Authenticate: determine who is issuing the request.
  Principal can refer to a person.
    Can also refer to other entities: machines, services, admins, ...
  How to authenticate a person?
    Passwords.
    Two-factor authentication.
      Phishing attacks?  How does U2F defeat phishing?
      Naive sketch of a possible U2F-like protocol:
        S -> C: random secret token
        C -> S: Sign_U2Fdongle(random secret token || server principal)
      Trust the browser to identify the server principal.
    Adaptive authentication.
    Biometrics.
      Good in physical-access situations: trusted scanner.
      Not great for remote authentication.
  Next lecture will talk a lot more about how to authenticate people.

Authorize: determine if the request should be allowed.
  Policy function: permissions = POLICY(subject, object).
    Equivalent: access matrix.
  Two typical ways to store policy: slice the matrix by object or by subject.
    ACLs: per-object slice (the list of subjects allowed to access an object).
    Capabilities: per-subject slice (the set of objects a subject can access).
    ACLs widely used for long-term policy.
      File permissions, list of users with access to a shared doc, ...
      Good at answering "who can access this resource?"
    Capabilities useful for short-term policy.
      File descriptors in an OS; objects in languages; cryptographic tokens in distributed protocols.
      Not great for long-term policy:
        Can't answer "who has access?"
        Revocation is tricky.
      Nice property: decouples object/resource from policy.
        Useful in distributed systems.
  Examples of authorization plans in Google's design?
    [[ "Inter-Service Access Management" ]]
    Service owner writes policy for what principals can invoke each API.
      Other services, engineers, machines, ...
      Services implement custom access control.
      Google provides centralized ACL and group databases to help with this.
    Must pass an "end-user permission ticket" (e.g., to access the Contacts service).
      Capability to perform operations on behalf of an end-user.
      Ticket is short-lived.
      User credential is long-lived but not passed to the Contacts service (why?).
  Who can set the policy?
    Mandatory access control (MAC): administrator sets policy.
      E.g., set ACLs on what services can invoke what APIs.
      Even if a service is buggy or misconfigured, won't run unwanted RPCs.
    Discretionary access control (DAC): object owner can set policy.
      E.g., end-user can share a Google Docs document.
      Service implements the check of the "end-user permission ticket".
    [[ Refinement: RBAC: app designer helps the administrator set policy. ]]

Audit: keep track of what happened.
  Useful in recovery after a compromise.
  Important that the audit log not be tampered with in case of compromise.
  Google: RPC system logs request approvals.
    [[ "Inter-Service Access Management" ]]

Distributed systems: secure channels.
  Need to communicate between isolated components.
  On a single machine, the OS kernel can provide secure channels.
  Over the network, usually use cryptography: encrypt/decrypt, sign/verify.
  Example: RPCs encrypted.
    In particular, RPCs that go across data centers are always encrypted.
    In response to the NSA tapping Google's inter-data-center links.
      [[ Ref: http://www.theverge.com/2013/10/30/5046958/nsa-secretly-taps-into-google-yahoo-networks-to-collect-information ]]
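To make the gold standard concrete, a toy guard in Python.  Everything here is
hypothetical (the ACL table, credential store, and log are just dicts and
lists); a real guard authenticates over cryptographic channels and writes to a
tamper-evident log.

    import time

    # Hypothetical policy: object -> set of principals allowed to access it.
    ACL = {"contacts-db": {"gmail-service", "contacts-admin"}}

    # Toy credential store; real systems map channels to principals with crypto.
    CREDENTIALS = {"secret123": "gmail-service"}

    AUDIT_LOG = []   # in production: append-only, tamper-evident

    def guard(credential, obj, operation):
        # 1. Authenticate: map the credential to a principal.
        principal = CREDENTIALS.get(credential)
        # 2. Authorize: consult the policy (here, an ACL per object).
        allowed = principal is not None and principal in ACL.get(obj, set())
        # 3. Audit: record the decision, allowed or denied.
        AUDIT_LOG.append((time.time(), principal, obj, operation, allowed))
        return allowed

    print(guard("secret123", "contacts-db", "read"))   # True
    print(guard("bogus", "contacts-db", "read"))       # False

Note that the guard spends work (and log space) even on requests it denies;
this is exactly the resource-consumption problem that DoS attacks exploit
(more on this below).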
Powerful technique: encryption of stored data.
  Think of this as sending a message to yourself (but storing the message at a server).
  Google's Chrome browser implements synchronization (bookmarks, ...) this way.
    The server's security does not matter (for confidentiality and integrity).
  This paper: encrypt data at rest to defend against malicious disk firmware.
    Also helpful when decommissioning disks.
    (Where is secure deletion in our model otherwise?  Isolation guarantee.)
  With data encryption, the challenges shift to key management and sharing.
  [[ Some interesting possibilities with computing on encrypted data (FHE, CryptDB). ]]

Decentralized systems: reasoning about trust.
  In a large system, trusting all components / machines is usually not a good plan.
  Need to reason carefully about what principal issued a request, and why it should be authorized.
    Request might pass through several intermediaries.
  Example: a service receives a network packet containing a request.
    Should this request be executed?
    Where did the request come from?
      Secure channel over the network.
      Who is on the other end of this secure channel?
      Need to authenticate endpoints!
      Google: each service has a principal name.
    Also important to authenticate what service a client connects TO.
      Otherwise, might inadvertently connect to the wrong service, even if the channel is secure.
  Least privilege.
    Compromise of a privileged service can be damaging.
    Limiting privileges reduces the potential damage.
    Google example: "end-user permission ticket".
      [[ A toy ticket scheme is sketched at the end of this section. ]]
      Gmail service not authorized to issue arbitrary fetches to the Contacts service.
      Must present proof that the Gmail service is operating on behalf of an end-user.
      Trust the combination of Gmail's service principal and the end-user ticket.
      Partial motivation: limit insiders' ability to access arbitrary user data.
        [[ Ref: https://googleblog.blogspot.com/2010/01/new-approach-to-china.html ]]
      Looks like tickets are also tied to data encryption in Google's architecture!
  Trusting some machine in a data center.
    How does Google know it's safe to start a service on a particular machine?
      Specifically, need to give the service's private key to that machine.
    What would we want to know?
      The hardware is a machine that Google purchased from its chosen vendor.
      The hardware is running the expected Google software.
      The software is up-to-date.
        ... this one is easy to get from the previous one:
        hardware checks that the software is signed by Google's private key.
    How would a remote client be convinced of this check?
      "Security chip": not described in any detail, unfortunately.
      Will look at some trusted hardware papers later in the semester.
      Likely guess: the security chip has a private key tied to a particular machine.
        Google knows the corresponding public keys for all machines it purchased.
      Another guess: the security chip generates a fresh public/private key pair
        each time the machine boots, gives the private key to the BIOS, along
        with a certificate covering the BIOS code and the new public key.
        Why is this certificate convincing to anyone?
          The security chip gave this key only to the BIOS.
          The certificate says what code the BIOS ran.
          The BIOS will run subsequent code only if it's signed by Google.
          So someone who knows this key must have gotten it from the BIOS.
  This kind of trust reasoning is formalized with "speaks-for" and "says".
    [[ Ref: http://dl.acm.org/citation.cfm?id=138874 ]]
    [[ Ref: http://dl.acm.org/citation.cfm?id=174614 ]]
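A toy version of the "end-user permission ticket", built from an HMAC.  The
paper does not describe Google's actual ticket format; the shared key, field
layout, and 5-minute lifetime below are assumptions for illustration only.

    import hmac, hashlib, time

    # Hypothetical key shared between the ticket issuer (login service)
    # and the target service; real systems would use per-service keys.
    TICKET_KEY = b"demo-key-not-for-production"
    TICKET_LIFETIME = 300   # seconds; tickets are deliberately short-lived

    def issue_ticket(user, service):
        # Issued by the login service after authenticating the user.
        expires = int(time.time()) + TICKET_LIFETIME
        msg = f"{user}|{service}|{expires}".encode()
        mac = hmac.new(TICKET_KEY, msg, hashlib.sha256).hexdigest()
        return f"{user}|{service}|{expires}|{mac}"

    def check_ticket(ticket, service):
        # Checked by the target service (e.g., Contacts) on each RPC.
        user, svc, expires, mac = ticket.rsplit("|", 3)
        msg = f"{user}|{svc}|{expires}".encode()
        good = hmac.new(TICKET_KEY, msg, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(mac, good):
            return None                  # forged or corrupted
        if svc != service or int(expires) < time.time():
            return None                  # wrong service, or expired
        return user                      # the end-user being acted for

    t = issue_ticket("alice", "contacts")
    print(check_ticket(t, "contacts"))   # 'alice'
    print(check_ticket(t, "calendar"))   # None: ticket bound to one service

The ticket acts as a capability: possession (plus a valid MAC) authorizes the
request, the short lifetime bounds the damage if it leaks, and the long-lived
user credential never has to reach the Contacts service.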
Availability: DoS attacks.
  At some level, availability is just another kind of guard decision.
    Allow a request to use resources (or not).
  But pragmatically, DoS is different from confidentiality / integrity.
    Basic problem 1: need to spend resources for the guard to make a decision.
    Basic problem 2: no strong authentication on the Internet, so the guard is
      stuck without a strong notion of an authenticated principal.
  Many kinds of resources might be the target of a DoS attack.
    Network bandwidth.
    Router CPU / memory.
      Small packets, unusual packet options, routing protocols.
    Server memory.
      Protocol state (SYN floods, ...).
    Server CPU.
      Expensive application-level logic.
  Some broad principles to mitigate DoS attacks.
    Authenticate as soon as possible.
      Avoid resource commitments before authentication.
      Avoid asymmetric resource use before authentication.
        DNS amplification.
        Selective resource challenges.
      Factor out components that handle requests before authentication.
        Google: GFE, login service.
    Limit / prioritize resource use after authentication.
      Legitimate authenticated users should get priority.
      [[ A toy rate limiter is sketched at the end of these notes. ]]
    Google also implements various heuristics to filter out requests in the GFE.

Implementation.
  Trusted Computing Base (TCB): the code responsible for security.
    Keep it small.
    What counts as the TCB depends on what the security goal is.
      Sharing physical machines in Google's cloud: KVM.
  Verification.
    Design reviews.
    Fuzzing, bug-finding.
    Red team / rewards program.
  Safe libraries to avoid common classes of bugs.
  "Band-aids" / "defense in depth": increase the attack cost.
    Firewalls, memory safety, intrusion detection, ...

Configuration: even if the implementation is bug-free, the system can be misconfigured.
  Groups.
  Roles.
  Expert management.
  Tension: fine-grained policy for flexibility vs. coarse-grained for manageability.

Summary of security architecture.
  Isolation.
  Guard model.
    Gold standard: authenticate / authorize / audit.
  Secure channels.
  Trust in decentralized systems.
  Privilege separation, least privilege.
  Small TCB; verification / bug-finding.
  Simplicity.
  Perfect is the enemy of the good.
    Lower aspirations.
    Security vs. inconvenience.
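A toy token-bucket rate limiter illustrating "limit / prioritize resource use
after authentication": each authenticated principal gets its own generous
bucket, while all unauthenticated traffic shares one small bucket, so a flood
of anonymous requests is shed before it can starve legitimate users.  The
rates and structure here are made up for illustration.

    import time

    class TokenBucket:
        def __init__(self, rate, burst):
            self.rate = rate            # tokens added per second
            self.burst = burst          # bucket capacity
            self.tokens = burst
            self.last = time.monotonic()

        def allow(self):
            now = time.monotonic()
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False                # out of tokens: shed the request

    buckets = {}                        # per-principal buckets
    ANON = TokenBucket(rate=10, burst=20)   # shared anonymous bucket

    def admit(principal):
        if principal is None:           # unauthenticated: shared bucket
            return ANON.allow()
        b = buckets.setdefault(principal, TokenBucket(rate=100, burst=200))
        return b.allow()

    print(admit("alice"))   # True: alice has her own bucket
    print(admit(None))      # may go False under an anonymous flood

Even this check costs CPU per request, which is why Google factors such
filtering into the GFE, in front of the application servers.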