Security architecture
=====================

What is "security architecture"?
  Structuring of entire systems in order to:
    Defend against large classes of attacks
    Prevent as-yet-unknown attacks
    Contain damage from successful attacks
  We want to get ahead of attackers
    And not just react by e.g. applying patches

Security architecture consists of:
  Ways of analyzing the security situation
    What are we defending? Credit cards? Crypto keys? Trade secrets? Everything?
    Who is the attacker? Spammers? Employees? Vendors? Customers? Competitors?
    What powers are we assuming the attacker does/doesn't have?
    (All of this is usually called the Threat Model.)
  Principles
    Minimize trust
  Techniques
    Isolation, authentication, privilege separation, secure channels, &c

Case study: Google Security Architecture paper
  Paper focuses on Google's cloud platform offering.
    Does not describe all security aspects of all Google services.
  The paper touches on many interesting, complex topics.
  Good overview of what a security architecture can look like.
  See Butler Lampson's talk for discussion of principles.
    [[ Ref: http://css.csail.mit.edu/6.858/2015/lec/lampson.pdf ]]

Why would Google publish a document like this?

What are the security goals in the Google paper?
  Avoid disclosure of customer data (e.g., e-mail).
  Ensure availability of Google's applications.
  Track down what went wrong if a compromise occurs.
  Help Google engineers build secure applications.
  Broadly, ensure customers trust Google.

Worried about many threats; examples:
  Bugs in Google's software
  Compromised networks (customer, Internet, Google's internal)
  Stolen employee passwords
  Malware on employee workstations / smartphones
  Insider attacks (bribing an engineer or data center operator)
  Malicious server hardware
  Data on discarded disks

What's the Google server environment?
  Data centers.
  Physical machines.
  Virtual machines.
  Services in VMs.
  Applications (both Google's and customers') in VMs.
  RPC between applications and services.
  Front-end servers convert HTTP/HTTPS into RPC.

Isolation: the starting point for security
  The goal: by default, activity X cannot affect activity Y,
    even if X is malicious,
    even if Y has bugs.
  Without isolation, there's no hope for security.
  With isolation, we can allow interaction (if desired) and control it.

Examples of isolation in Google's design?
  [[ "Service Identity, Integrity, and Isolation" ]]
  Linux user separation. (See the sketch at the end of this section.)
  Language sandboxes.
  Kernel sandboxes.
  Virtual machines.
  Dedicated machines, for particularly sensitive services.

What is isolation doing for Google? Let's look at virtual machines.
  Each physical machine has a host VMM, which supervises many guest VMs.
  Each guest VM runs an O/S &c.
  Allows sharing of machines between unrelated activities.
    Many activities need only a fraction of a machine.
  One point: the VMM helps keep attackers in other guest VMs *out*.
    Google storage server in one VM, customer in another VM; or
    storage in one VM, compromised Google Contacts service in another VM.
  Another point: the VMM helps keep attackers *in* -- confinement.
    It is safe to run almost any code in a VM guest, even in the guest kernel,
    as long as we can control who it talks to over the network.
  Do VMs provide perfect isolation between guests?
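To make "Linux user separation" concrete, here is a minimal sketch (not Google's
code) of running an untrusted task under a dedicated unprivileged Unix account.
It assumes a Unix system, a parent process running as root, and a made-up
UID/GID; Python is used only for illustration.

  import os

  UNTRUSTED_UID = 20001   # hypothetical account dedicated to untrusted work
  UNTRUSTED_GID = 20001

  def run_isolated(task):
      pid = os.fork()
      if pid == 0:                       # child: shed privileges, then run task
          try:
              os.setgid(UNTRUSTED_GID)   # group first: after setuid we would
              os.setuid(UNTRUSTED_UID)   #   no longer be allowed to change it
              task()
              os._exit(0)
          except Exception:
              os._exit(1)
      _, status = os.waitpid(pid, 0)     # parent: collect the child's status
      return os.WEXITSTATUS(status)

Once the child has called setuid, the kernel's ordinary UID checks keep it away
from the parent's files and processes; network access still has to be limited
by other means.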
Sharing: the Reference Monitor
  100% isolation is usually not what we want:
    we need controlled sharing/interaction as well.

Here's a model for sharing:

               +-----------------------+
               |        Policy         |
               |           |           |
               |  request  V           |
   principal --|-------> GUARD --------|--> resource
               |           |           |
               |           V           |
               |     +-----------+     |
               |     | Audit log |     |
               |     +-----------+     |
               +-----------------------+
                HOST enforcing isolation

This model has been very influential.
  Principals? Person, device, program, service, &c.
  Resources? Services, like Google's internal Contacts service.
    Items inside services, like files in a storage service.
  What does the guard do? Authenticate! Authorize! Audit!
    (A toy guard in this shape is sketched at the end of this section.)

Authenticate: who/what is issuing the request?
  How to authenticate a person? Passwords.
    What does the Guard learn from a correct password?
  Can we do better? Two-factor authentication,
    e.g. send a number to the user via SMS.
    What does the Guard learn from correct SMS two-factor authentication?
  Can we do better?
    E.g. what if the user clicked on g00gle.com, owned by an attacker,
    and g00gle.com then forwarded the user's password to google.com?
    Paper mentions this was a serious problem for employees, even with 2FA.
  Next lecture will talk a lot more about how to authenticate people.

Authorize: determine if the request should be allowed.
  Policy function: permissions = POLICY(principal, resource).
  Equivalent: access matrix.
            Alice   Bob   ...
    File1     Y      N
    File2     Y      Y
    ...
  Two typical ways to store the policy:
    ACLs: store a row (slice by resource).
    Capabilities: store a cell (specific rights/permissions).
  ACLs widely used for long-term policy.
    File permissions, list of users with access to a shared doc, ...
    Typically stored with the protected object.
    Good at answering "who can access this resource?"
  Capabilities useful for short-term policy.
    File descriptors in an OS; object references in languages;
      cryptographic tokens in distributed protocols.
    Typically stored with the principal.
    Flexible, since applications can typically grant each other capabilities.
    Not great for long-term policy:
      can't answer "who has access?", and revocation is tricky.

Examples of authorization plans in Google's design?
  [[ "Inter-Service Access Management" ]]
  ACL: administrator white-lists who can use each service.
    Principals = other services, engineers.
    Guard = automatic enforcement by the RPC infrastructure.
  Capability: "end-user permission ticket" (e.g., to access the Contacts service).
    Capability to perform operations on behalf of an end-user.
    Ticket is short-lived, to limit damage if stolen.
  Both are particularly slick and unusual aspects of Google's architecture.

Why the specific Reference Monitor structure?
  Separates policy from resource implementation,
    to ease reasoning,
    to ease evolution of policy.
  Implication: avoid embedding security checks in resource code,
    even though embedding is often convenient!
  Note: relies on isolation, i.e. no access other than via the guard.

Is the Reference Monitor always the best model?
  Sometimes decisions are data-dependent,
    and must be made by the resource, not by a separate policy+guard.
    E.g. I can see your bid only after I place a higher bid.
  Sometimes attacks are not about access to data.
    E.g. DoS attacks.
  Sometimes it's not clear how to apply the model in a straightforward way.
    E.g. inside data centers.
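The toy guard promised above: the policy (here an ACL table) lives outside the
resource code, and the guard authenticates, authorizes, and audits every
request before it reaches the resource. The token scheme and all names are
invented for illustration; this is a sketch of the model, not Google's
mechanism.

  import time

  ACLS = {"contacts-db": {"gmail", "calendar"}}   # resource -> allowed principals
  TOKENS = {"tok-gmail-123": "gmail"}             # stand-in for real authentication
  AUDIT_LOG = []

  def guard(token, resource, operation):
      principal = TOKENS.get(token)                     # 1. authenticate
      allowed = principal in ACLS.get(resource, set())  # 2. authorize via policy
      AUDIT_LOG.append((time.time(), principal, resource, allowed))  # 3. audit
      if not allowed:
          raise PermissionError(f"{principal!r} may not access {resource!r}")
      return operation()                                # only path to the resource

  # e.g. guard("tok-gmail-123", "contacts-db", lambda: "fetch alice's contacts")

Note that the resource (the operation) contains no security checks at all;
changing the policy means editing the ACL table, not the resource code.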
How to apply the Reference Monitor idea to a whole set of machines?
  E.g. how to apply it to a company's computers?
  Old idea: perimeter defense.
    Isolate "inside the company" from "outside the company".
    Only one (logical) connection to the Internet.
    Firewall acts as a guard: stops attacks, removes malware, &c.
    Full access allowed inside the firewall.
  Perimeter defense with firewalls worked well for many years.
  Much of Google's paper is a reaction to weaknesses of perimeter defense:
    no story for anything going wrong inside,
    i.e. no second line of defense after a successful penetration.

Paper essentially applies the Reference Monitor per service.
  Guard = accept RPCs only from approved client services.
    e.g. GMail can talk to Contacts, but other services can't.
  This is an example of Least Privilege.
    Split up the activities, isolate them.
    Give each activity only the privileges it needs.

Google's end-user permission tickets reduce privilege even further.
  RPC must be from an approved service, AND the user must actually be logged in!
  Motivation: limit damage from buggy requesting services asking for the wrong data.
  Motivation: limit insiders' ability to access arbitrary user data.
  Perhaps tickets are also tied to data encryption in Google's architecture!
  (See the signed-ticket sketch at the end of this section.)

How does one service authenticate a request from another?
  We need "secure channels" over the network.
  Cryptography: encrypt/decrypt, sign/verify.
    A signature proves who sent a message (integrity).
    Encryption ensures only the intended recipient can read it (confidentiality).
  Thus:
    Google servers (probably) sign RPCs to other servers.
    The RPC system automatically limits who can talk to whom.
    RPCs are encrypted between data centers, over the Internet.
    Maybe also encrypted within a data center (why?).
  Cryptography shifts the challenge to key management.
    E.g. when the GMail service talks to the Contacts service,
      the RPC sender needs to know what key to encrypt with, and
      the RPC receiver needs to know who corresponds to the signing key.
    Google clearly runs a name service, mapping service names to public keys.

How does Google know it's safe to use a particular computer as a server?
  Using a server means Google has to trust its hardware/BIOS/&c:
    sensitive data, crypto keys, RPC authorization.
  Why might there be a problem -- what attacks?
    Attacker physically swaps their own server for one of Google's.
    Attacker breaks into a good server, changes the O/S on disk.
    Vendor ships Google a machine with a corrupt BIOS.
    Attacker breaks into a good server, "updates" the BIOS to something bad.
  What's Google's defense strategy?
    (Lots of guesses here; we'll look at real designs later.)
    They design their own motherboards, and their own "security chip".
    The security chip intervenes during the boot process.
    The security chip checks that the BIOS and O/S are signed by Google
      (i.e. it verifies signatures made with Google's private key).
    The security chip has a unique private key tied to that particular machine.
    The security chip is willing to sign statements when asked by the software.
      A statement includes the identity (hash) of the booted BIOS and O/S.
    Google services require authentication including the chip's signed statement.
    Google services can check a client's signed statement:
      Google has a DB of security-chip public keys for all machines it purchased.
      Google has a DB of acceptable BIOS and O/S hashes.
    So: what if a data center employee inserts a machine with the correct IP address?
    So: what if a vendor ships Google a machine with a BIOS that snoops?
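The paper doesn't give the ticket format, but here is one plausible sketch of
a short-lived end-user permission ticket. An HMAC with a shared key stands in
for Google's internal key infrastructure; the field layout, key handling, and
lifetime are guesses, and fields are assumed not to contain '|'.

  import hmac, hashlib, time

  TICKET_KEY = b"demo-key"   # hypothetical; really fetched from a key service
  LIFETIME = 300             # seconds: short-lived, to limit damage if stolen

  def make_ticket(user, service):
      body = f"{user}|{service}|{int(time.time()) + LIFETIME}".encode()
      tag = hmac.new(TICKET_KEY, body, hashlib.sha256).hexdigest().encode()
      return body + b"|" + tag

  def check_ticket(ticket, service):
      body, _, tag = ticket.rpartition(b"|")
      want = hmac.new(TICKET_KEY, body, hashlib.sha256).hexdigest().encode()
      if not hmac.compare_digest(tag, want):
          raise PermissionError("bad ticket signature")
      user, svc, expiry = body.split(b"|")
      if svc.decode() != service or int(expiry) < time.time():
          raise PermissionError("wrong service or expired ticket")
      return user.decode()   # the end-user this request may act on behalf of

A receiving service (e.g. Contacts) would check the ticket in addition to the
RPC-level service ACL, so even an approved caller can only touch the data of a
user who is actually logged in.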
Availability: DoS attacks.
  The problem:
    Attacker wants to take your web site off the air, or blackmail you.
    They assemble a "botnet" of 10,000 random Internet machines.
    They send vast quantities of requests to your web site.
  Many kinds of resources might be the target of a DoS attack:
    Network bandwidth.
    Router CPU / memory.
      Small packets, unusual packet options, routing protocols.
    Server memory.
      Protocol state (SYN floods, ...).
    Server CPU.
      Expensive application actions.
  A core DoS problem: it's hard to distinguish attack traffic from real traffic.

Some broad principles to mitigate DoS attacks:
  Massive server-side resources, with load spreading/balancing.
  Authenticate as soon as possible.
    Minimize resource consumption before authentication
      (e.g. minimize server-side TCP connection-setup state;
       see the sketch at the end of these notes).
    Factor out components that handle requests before authentication.
      Google: GFE, login service.
  Limit / prioritize resource use after authentication.
    Legitimate authenticated users should get priority.
  Google also implements various heuristics in the GFE to filter out requests.

Implementation.
  Trusted Computing Base (TCB): the code responsible for security.
    Keep it small.
    What counts depends on what the security goal is.
    Sharing physical machines in Google's cloud: KVM.
  Verification.
    Design reviews.
    Fuzzing, bug-finding.
    Red team / rewards program.
  Safe libraries to avoid common classes of bugs.
  "Band-aids" / "defense in depth" increase the attacker's cost.
    Firewalls, memory safety, intrusion detection, ...

Configuration: even if the implementation is bug-free, the system can be misconfigured.
  Groups. Roles. Expert management.
  Tension: fine-grained policy for flexibility vs coarse-grained for manageability.

Summary of security architecture.
  Isolation.
  Reference Monitor model. Authorize / Authenticate / Audit.
  Secure channels.
  Privilege separation, least privilege.
  Small TCB; verification / bug-finding.
  Simplicity.
  Perfect is the enemy of the good:
    lower aspirations; security vs inconvenience.
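The sketch referenced from the DoS section above: minimize work done before
authentication. A cheap authentication check and a small fixed-size per-source
counter run before any expensive work is started; the threshold and the
authenticate/expensive_work callables are placeholders, not Google's design.

  from collections import defaultdict

  MAX_UNAUTH_PER_SOURCE = 20          # arbitrary demo threshold
  unauth_counts = defaultdict(int)    # tiny fixed state per source address

  def handle(request, source_ip, authenticate, expensive_work):
      if not authenticate(request):            # cheap check before any real work
          unauth_counts[source_ip] += 1        # spend almost nothing on failures
          if unauth_counts[source_ip] > MAX_UNAUTH_PER_SOURCE:
              return "drop"                    # stop responding to this source
          return "reject"
      return expensive_work(request)           # only authenticated requests pay

The point is the ordering: unauthenticated traffic can at most pay the cost of
the check and one counter increment, so the expensive path is reserved for
requests that have already proved who they are.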