Security architecture
=====================

What is "security architecture"?
  Structuring of entire systems in order to:
    Defend against large classes of attacks
    Prevent as-yet-unknown attacks
    Contain damage from successful attacks
  We want to get ahead of attackers
    And not just react by e.g. applying patches

Security architecture consists of:
  Ways of analyzing the security situation
    What are we defending?
      Credit cards? Crypto keys? Trade secrets? Everything?
    Who is the attacker?
      Spammers? Employees? Vendors? Customers? Competitors?
    What powers are we assuming the attacker does/doesn't have?
    (all this is usually called the Threat Model)
  Principles
    Minimize trust
  Techniques
    Isolation, authentication, privilege separation, secure channels, &c

Case study: Google Security Architecture paper
  Paper focuses on Google's cloud platform offering.
    Does not describe all security aspects of all Google services.
  The paper touches on many interesting, complex topics.
  Good overview of what a security architecture can look like.
  See Butler Lampson's talk for discussion of principles.
    [ Ref: http://css.csail.mit.edu/6.858/2015/lec/lampson.pdf ]

Google is unique, but the paper is super useful even if you're not Google.
  Google's uniqueness: designed and developed many ideas and infrastructure.
    Also physical / hardware security that's costly to replicate.
  Many components are now available for anyone to use.
    Can benefit from Google's data center security by using a cloud provider.
    Google and other cloud providers (AWS, Azure) have similar plans.
    Cloud providers offer isolation, controlled sharing, authentication, etc.
      [ Ref: https://aws.amazon.com/blogs/security/building-fine-grained-authorization-using-amazon-cognito-api-gateway-and-iam/ ]
    Many components are open-source.
      Kubernetes, gRPC, security libraries, bug-finding tools, fuzzing, ...
    Integrated development / deployment services, "devops" / CI-CD.
  Paper illustrates systematic thinking about a wide range of possible threats.
  Useful separately from whether you're building or using Google's tools.

Why would Google publish a document like this?

What are the security goals in the Google paper?
  Avoid disclosure of customer data (e.g., e-mail).
  Ensure availability of Google's applications.
  Track down what went wrong if a compromise occurs.
  Help Google engineers build secure applications.
  Broadly, ensure customers trust Google.

Worried about many threats; examples:
  Bugs in Google's software
  Compromised networks (customer, Internet, Google's internal)
  Stolen employee passwords
  Malware on employee workstations / smartphones
  Insider attacks (bribing an engineer or data center operator)
  Malicious server hardware
  Data on discarded disks

What's the Google server environment?
  Data centers.
  Physical machines.
  Virtual machines.
  Services in VMs.
  Applications (both Google's and customers') in VMs.
  RPC between applications and services.
  Front-end servers convert HTTP/HTTPS into RPC.

Isolation: the starting point for security
  The goal: by default, activity X cannot affect activity Y
    even if X is malicious
    even if Y has bugs
  Without isolation, there's no hope for security.
  With isolation, we can allow interaction (if desired) and control it.

Examples of isolation in Google's design?
  [ "Service Identity, Integrity, and Isolation" ]
  Linux user separation.
  Language sandboxes.
  Kernel sandboxes.
  Virtual machines.
  Dedicated machines, for particularly sensitive services.

What is isolation doing for Google? Let's look at virtual machines.
  Each physical machine has a host VMM, which supervises many guest VMs.
  Each guest VM runs an O/S &c.
  Allows sharing of machines between unrelated activities.
    Many activities need only a fraction of a machine.
  One point: VMM helps keep out attackers in other guest VMs.
    Google storage server in one VM, customer in another VM, or
    storage in one VM, compromised Google Contacts service in another VM.
  Another point: VMM helps keep attackers *in* -- confinement.
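Confinement ultimately reduces to a default-deny policy on who a workload may talk to over the network. A minimal sketch of that idea (the workload and service names here are invented for illustration; this is not Google's actual policy mechanism):

```python
# Hypothetical sketch of network confinement: a default-deny egress
# policy deciding whether a guest workload may open a connection.
# Names ("gmail-vm", "contacts-svc", etc.) are made up for illustration.

ALLOWED_EGRESS = {
    # workload -> set of destinations it may talk to
    "gmail-vm": {"contacts-svc", "storage-svc"},
    "untrusted-customer-vm": set(),   # confined: no internal destinations
}

def may_connect(workload: str, destination: str) -> bool:
    """Default deny: only explicitly listed destinations are reachable."""
    return destination in ALLOWED_EGRESS.get(workload, set())
```

The key design choice is the default: an unknown workload, or an unlisted destination, is denied without any special-case code.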
  It is safe to run almost any code in a VM guest, even in the guest kernel,
    as long as we can control who it talks to over the network.
  Do VMs provide perfect isolation between guests?

Controlled sharing on top of isolation.
  Need to decide what operations should be allowed.
  Three steps for controlled sharing ("gold standard"):
    Authenticate: determine which principal is sending the request.
    Authorize: determine which operations are allowed.
    Audit: log the operation for later auditing.
      Google: "Access Transparency".
  Authentication principals in Google's design:
    RPC system: services, engineers.
    Also end-user identity, but above RPC.

Examples of authorization plans in Google's design?
  [ "Inter-Service Access Management" ]
  Administrator white-lists who can use each service.
    Principals = other services, engineers.
    Automatic enforcement by the RPC infrastructure.
  "End-user permission ticket" (e.g., to access the Contacts service).
    Rights to perform operations on behalf of an end-user.
    Ticket is short-lived to limit damage if stolen.
  Both are particularly slick and unusual aspects of Google's architecture.

Much of Google's paper is a reaction to weaknesses of perimeter defense.
  No story for anything going wrong inside.
    i.e. no second line of defense after a successful penetration.
  Motivated by some early attacks against Google.
    [ Ref: https://en.wikipedia.org/wiki/Operation_Aurora ]
    [ Ref: https://www.youtube.com/playlist?list=PL590L5WQmH8dsxxz7ooJAgmijwOz0lh2H ]

Paper essentially does controlled sharing at a service level.
  Guard = accept RPCs only from approved client services.
    e.g. GMail can talk to Contacts, but other services can't.
  This is an example of Least Privilege.
    Split up the activities, isolate them.
    Give each activity only the privileges it needs.
  Google's end-user permission tickets reduce privilege even further.
    RPC must be from an approved service, AND the user must actually be logged in!
    Motivation: limit damage from buggy services requesting the wrong data.
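The service-level guard described above can be sketched as code: authenticate the caller, authorize it against the administrator's whitelist, check a short-lived end-user ticket, and audit the operation. Service names, the signing key, and the ticket format are invented for illustration; Google's real RPC infrastructure differs.

```python
# Hypothetical sketch of a service-level guard combining the "gold
# standard" (authenticate / authorize / audit) with short-lived
# end-user permission tickets. All names and keys are made up.
import hashlib
import hmac
import time

TICKET_KEY = b"login-service-secret"                 # illustrative only
ALLOWED_CALLERS = {"contacts-svc": {"gmail-svc"}}    # admin's whitelist
AUDIT_LOG = []

def mint_ticket(user: str, lifetime_s: int = 300, now: float = None) -> str:
    """Login service signs (user, expiry); short expiry limits theft damage."""
    expiry = int((now if now is not None else time.time()) + lifetime_s)
    msg = f"{user}:{expiry}"
    tag = hmac.new(TICKET_KEY, msg.encode(), hashlib.sha256).hexdigest()
    return f"{msg}:{tag}"

def handle_rpc(target: str, caller: str, ticket: str, now: float = None) -> bool:
    now = now if now is not None else time.time()
    # 1. Authorize: is the (already-authenticated) caller whitelisted?
    if caller not in ALLOWED_CALLERS.get(target, set()):
        return False
    # 2. End-user ticket: valid signature AND not expired.
    user, expiry, tag = ticket.split(":")
    expected = hmac.new(TICKET_KEY, f"{user}:{expiry}".encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(tag, expected) or now > int(expiry):
        return False
    # 3. Audit: record who did what on whose behalf.
    AUDIT_LOG.append((target, caller, user))
    return True
```

Note that a request fails if either check fails: a whitelisted service without a valid user ticket gets nothing, and a valid ticket presented by a non-whitelisted service gets nothing.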
  A second motivation: limit insiders' ability to access arbitrary user data.
  Perhaps tickets are also tied to data encryption in Google's architecture!

How does one service authenticate a request from another?
  We need "secure channels" over the network.
  Cryptography: encrypt/decrypt, sign/verify.
    Signatures prove who sent a message (integrity).
    Encryption ensures only the intended recipient can read (confidentiality).
  Thus: Google servers (probably) sign RPCs to other servers.
    RPC system automatically limits who can talk to whom.
  RPCs are encrypted between data centers, over the Internet.
    Maybe also encrypted within a data center (why?).
  Cryptography shifts challenges to key management.
    E.g. when the GMail service talks to the Contacts service:
      RPC sender needs to know what key to encrypt with.
      RPC receiver needs to know who corresponds to the signing key.
    Google clearly runs a name service, mapping service names to public keys.

How does Google know it's safe to use a particular computer as a server?
  Using a server means Google has to trust its hardware/BIOS/&c:
    Sensitive data, crypto keys, RPC authorization.
  Why might there be a problem -- what attacks?
    Attacker physically swaps their own server for one of Google's.
    Attacker breaks into a good server, changes the O/S on disk.
    Vendor ships Google a machine with a corrupt BIOS.
    Attacker breaks into a good server, "updates" the BIOS to something bad.
  What's Google's defense strategy?
    (Lots of guesses here; we'll look at real designs later.)
    They design their own motherboards, and their own "security chip".
    Security chip intervenes during the boot process.
    Security chip checks that BIOS and O/S are signed by Google's private key.
    Security chip has a unique private key tied to a particular machine.
    Security chip is willing to sign statements when asked by the software.
      Statement includes identity (hash) of booted BIOS and O/S.
    Google services require authentication including a chip-signed statement.
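One way to picture the chip-signed boot statement and its verification, as a runnable sketch. A real security chip (e.g. Titan) would use an asymmetric key pair; here a per-machine HMAC key stands in so the sketch runs with only the standard library, and all names and hashes are made up.

```python
# Hypothetical sketch of a chip-signed boot statement and its check.
# HMAC with a per-machine key stands in for the chip's asymmetric key.
import hashlib
import hmac

MACHINE_KEYS = {"machine-17": b"key-installed-at-purchase"}  # made up
ACCEPTABLE_HASHES = {"bios-v3-hash", "os-2024.2-hash"}       # made up

def chip_sign(machine: str, bios_hash: str, os_hash: str) -> str:
    """What the chip attests: the identity of the booted BIOS and O/S."""
    msg = f"{machine}|{bios_hash}|{os_hash}".encode()
    return hmac.new(MACHINE_KEYS[machine], msg, hashlib.sha256).hexdigest()

def verify_statement(machine: str, bios_hash: str, os_hash: str,
                     sig: str) -> bool:
    key = MACHINE_KEYS.get(machine)        # DB of chips Google purchased
    if key is None:
        return False                       # unknown machine: reject
    msg = f"{machine}|{bios_hash}|{os_hash}".encode()
    expected = hmac.new(key, msg, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False                       # statement forged or altered
    # DB of acceptable BIOS and O/S hashes:
    return bios_hash in ACCEPTABLE_HASHES and os_hash in ACCEPTABLE_HASHES
```

Two databases do the work: one says "this chip is in a machine we bought," the other says "this software is a version we approve of." A machine that fails either check can't authenticate as a server.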
  Google services can check a client's signed statement:
    Google has a DB of security chip public keys for all machines it purchased.
    Google has a DB of acceptable BIOS and O/S hashes.
    [ Ref: https://cloud.google.com/blog/products/identity-security/titan-in-depth-security-in-plaintext ]
  So: what if a data center employee inserts a machine with the correct IP address?
  So: what if a vendor ships Google a machine with a BIOS that snoops?

Availability: DoS attacks.
  The problem:
    Attacker wants to take your web site off the air, or blackmail you.
    They assemble a "botnet" of 10,000 random Internet machines.
    They send vast quantities of requests to your web site.
  Many kinds of resources might be the target of a DoS attack:
    Network bandwidth.
    Router CPU / memory.
      Small packets, unusual packet options, routing protocols.
    Server memory.
      Protocol state (SYN floods, ...)
    Server CPU.
      Expensive application actions.
  A core DoS problem: hard to distinguish attack traffic from real traffic.
  Some broad principles to mitigate DoS attacks:
    Massive server-side resources, with load spreading/balancing.
    Authenticate as soon as possible.
    Minimize resource consumption before authentication.
      E.g. minimize server-side TCP connection setup state.
    Factor out components that handle requests before authentication.
      Google: GFE, login service.
    Limit / prioritize resource use after authentication.
      Legitimate authenticated users should get priority.
    Google also implements various heuristics to filter out requests in the GFE.

Implementation.
  Trusted Computing Base (TCB): code responsible for security.
    Keep it small.
    Depends on what the security goal is.
    Sharing physical machines in Google's cloud: KVM.
  Verification.
    Design reviews.
    Fuzzing, bug-finding.
    Red team / rewards program.
  Safe libraries to avoid common classes of bugs.
  "Band-aids" / "defense in depth" increase attack cost.
    Firewalls, memory safety, intrusion detection, ...
  How to know these source code rules are applied to actual services?
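One way such enforcement might look in practice (a hypothetical sketch, not Borg's actual interface): the deployer refuses to run any binary whose hash does not appear in a build record signed by the trusted build system, and logs every version it does deploy.

```python
# Hypothetical sketch: deployment refuses binaries that lack a build
# record signed by the trusted build system. All names are made up.
import hashlib
import hmac

BUILD_SYSTEM_KEY = b"build-system-signing-key"   # illustrative only
DEPLOY_LOG = []                                  # versions ever deployed

def sign_build(binary_hash: str, source_rev: str) -> str:
    """Build system attests: this binary came from this reviewed source."""
    msg = f"{binary_hash}|{source_rev}".encode()
    return hmac.new(BUILD_SYSTEM_KEY, msg, hashlib.sha256).hexdigest()

def deploy(binary_hash: str, source_rev: str, sig: str) -> bool:
    msg = f"{binary_hash}|{source_rev}".encode()
    expected = hmac.new(BUILD_SYSTEM_KEY, msg, hashlib.sha256).hexdigest()
    ok = hmac.compare_digest(sig, expected)
    if ok:
        # Keep a log of deployed versions, in case a bug is found later.
        DEPLOY_LOG.append((binary_hash, source_rev))
    return ok
```

The point is that a hand-built or tampered binary simply cannot be deployed, because only the build system can produce an acceptable signature, and the log makes it possible to find every affected service when a bug in some source revision is discovered.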
  Integrated build / deployment system, Borg.
    Open-source offshoot: Kubernetes.
  Borg is in charge of running services across machines.
  Borg is also in charge of building source code to produce binaries.
    Operators specify policy in terms of what software should run.
    Borg ensures policies are followed for source code.
    Borg takes care of building, testing, deploying the resulting binary, etc.
    Keeps a log of deployed versions, in case a bug is later discovered.

Summary of security architecture.
  Isolation.
  Sharing: Authenticate / Authorize / Audit.
  Secure channels.
  Hardware root of trust for servers registering with the cluster.
  Integrated software development / build / deployment.
    Translates software policies into running services.
  Privilege separation, least privilege.
  Small TCB, verification / bug-finding.