Side-channel attacks
====================

quiz 1 solutions posted

how does RSA work?
  pick two primes, p and q; let n = p*q
  idea: encrypt a message by exponentiating it mod n,
    and decrypt by another exponentiation mod n
  so, generate a public key (e, for encrypt) and a private key (d, for decrypt)
  property we want: (m^e)^d = m^(e*d) = m (mod n)
  how do we pick e and d with this property?
    turns out e*d = 1 mod (p-1)*(q-1) gives us such a pair of exponents
    if you know p and q, it's easy to compute (p-1)*(q-1), and thus to invert e and get d
  typical trick: pick e to be small (e.g. 3 or 0x10001 = 65537), then compute d
    makes it cheap to encrypt messages

RSA can also be used for signing
  private key owner generates a signature s = m^d mod n
  anyone can verify it came from the private key owner (check that s^e = m mod n)
  others should not be able to generate s, because they don't know d

some other practical considerations for RSA
  RSA is multiplicative: an attacker can synthesize encryptions/signatures of products
    of messages whose encryptions/signatures it already has
  use fixed & random padding to prevent such attacks

what does the key owner have to hide?
  e and n are public
  keep d secret, obviously
  crucial to keep the factors p and q secret too!
    otherwise the attacker computes d from e
  big assumption: factoring n is hard
  turns out p and q are helpful in making RSA fast,
    so almost all RSA implementations keep them around

how is RSA implemented?
  main operation: modular exponentiation
    given ciphertext c, decrypt it as m = c^d mod n
    naive impl: take c, multiply it by itself d times, reduce mod n
    encryption is relatively easy: e is a small number
    two problems for d:
      d is very large (repeated multiplication would take forever)
      c^d grows very large

  make the problem half as hard by knowing p and q
    Chinese remainder theorem (CRT): given a mod x and a mod y (x, y coprime),
      can efficiently find a mod x*y
    in RSA, we have just this situation:
      compute m1 = c^d1 mod p and m2 = c^d2 mod q, where
        d1 = e^-1 mod (p-1)
        d2 = e^-1 mod (q-1)
      then use the CRT to recover m = c^d mod (p*q) from m1 and m2
    so most RSA impls keep (p, q, d1, d2) and perform modular exponentiations mod p and mod q

  exponentiation: repeated squaring
    can perform exponentiation by considering each bit of d in turn
    key observations:
      c^(2x)   = (c^x)^2
      c^(2x+1) = (c^x)^2 * c
    start with the most-significant bit of d and work down to the least-significant bit
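    a minimal square-and-multiply sketch, in Python (illustrative only, not
    OpenSSL's actual code): scan d's bits from most significant to least
    significant, square at every bit, and multiply by c only when the bit is 1;
    the per-step "mod q" keeps the numbers small, and making that reduction
    cheap is what Montgomery's trick (below) is about

        def modexp(c, d, q):
            """Compute c^d mod q by repeated squaring, scanning d MSB-first."""
            result = 1
            for i in range(d.bit_length() - 1, -1, -1):
                result = (result * result) % q       # c^(2x)   = (c^x)^2
                if (d >> i) & 1:
                    result = (result * c) % q        # c^(2x+1) = (c^x)^2 * c
            return result

        # sanity check against Python's built-in modular exponentiation
        assert modexp(7, 65537, 101) == pow(7, 65537, 101)

    note the data-dependent work: a 1 bit costs an extra multiply beyond the
    square, which is exactly the kind of difference the attacks below exploit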
  sliding window
    instead of doing one bit at a time, do several bits at a time
    OpenSSL does 5 bits at a time (for 1024-bit keys)
    pre-compute c, c^3, c^5, c^7, ..., c^31
    c^(2x)    = (c^x)^2
    c^(32x+z) = (c^x)^32 * c^z    (z odd, from 1 to 31)

  Montgomery representation
    problem: c^d grows very large before we reduce it mod q at the end
      if c is 512 bits, even c^512 is already 256K bits;
      and d is itself a 512-bit number, so c^d is astronomically larger
    solution: reduce mod q at each step
    problem 2: reducing mod q is very expensive!
      it's basically a division, and division is far more expensive than multiplication
    Peter Montgomery's trick:
      shift numbers into a different representation up front,
      and shift them back into the original representation afterwards
      representation: multiply by some factor R (divide by R to recover)
        a  ->  a*R
        b  ->  b*R
        c = a*b  ->  aR * bR = ab * R^2
      so whenever we multiply two Montgomery-form numbers, we need to divide by R
        that puts us back in the 512-bit ballpark; no mod-q division needed
      real trick: make R a power of two (the smallest one larger than q)
        then dividing by R is easy: just bit-shifting
        if only the low-order bits were zero whenever we had to divide by R..
        but everything is mod q, so we can add any multiple of q:
          add q, 2q, 4q, ... as needed to "shoot down" the low-order bits
        (there's a more efficient variant that uses a fixed number of multiplications)
      remaining step: if the result cR (= ab*R) is greater than q, subtract q
        called an "extra reduction"
      Schindler's observation: when computing c^d mod q,
        Pr[extra reduction] = (c mod q) / 2R    [with c in Montgomery form]
        so the probability grows as c approaches q; see figure 1
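    to make the "shift, shoot down the low bits, then divide by R" idea concrete,
    here is a small Montgomery-reduction (REDC) sketch in Python; it is
    illustrative only (the names mont_params/redc/mont_mul and the extra-reduction
    counter are made up, not OpenSSL's), but it shows that the mod-q division
    disappears and that the "extra reduction" is a conditional, input-dependent
    subtraction

        def mont_params(q):
            # R = 2^k is the smallest power of two larger than the (odd) modulus q
            k = q.bit_length()
            R = 1 << k
            q_inv = (-pow(q, -1, R)) % R      # q' with q*q' = -1 (mod R); needs Python 3.8+
            return k, R, q_inv

        def redc(T, q, k, R, q_inv, extra):
            """Return T * R^-1 mod q for 0 <= T < q*R, counting extra reductions."""
            m = ((T & (R - 1)) * q_inv) & (R - 1)   # pick m so T + m*q has k low zero bits
            t = (T + m * q) >> k                    # "divide by R" is just a shift, no real division
            if t >= q:                              # the "extra reduction": happens only for some inputs
                t -= q
                extra[0] += 1
            return t

        def mont_mul(aR, bR, q, k, R, q_inv, extra):
            # multiply two Montgomery-form numbers; the result is (a*b)*R mod q
            return redc(aR * bR, q, k, R, q_inv, extra)

        # toy demo with a tiny modulus (not a real RSA prime)
        q = 101
        k, R, q_inv = mont_params(q)
        extra = [0]
        aR, bR = (7 * R) % q, (9 * R) % q                       # shift into Montgomery form
        abR = mont_mul(aR, bR, q, k, R, q_inv, extra)
        assert redc(abR, q, k, R, q_inv, extra) == (7 * 9) % q  # shift back out of Montgomery form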
multiplication routines
  cannot multiply 1024-bit numbers natively
  normal variant: pairwise multiplication of the 32-bit words in each number
    takes O(n*m) time, where n and m are the number of words in each operand
    about O(n^2) when n and m are close
  Karatsuba multiplication: works when the operands are the same size, takes O(n^1.58) time

Brumley's attack plan
  time the overall crypto operations
  two factors:
    1. Montgomery extra reductions (figure 1) take more time for some c's
    2. Karatsuba multiplication takes less time when the numbers are exactly the same size
  1: as ciphertext c approaches q, more extra Montgomery reductions, so a longer time
     when c exceeds q, the extra reductions disappear, and the time drops
  2: as ciphertext c approaches q, multiplications use Karatsuba (c ~= q), so they're fast
     when c exceeds q, (c mod q) becomes very small, so normal multiplication, slow
  the two effects work in opposite directions
    turns out you can still recover enough information
  a typical SSL web server will perform a decryption using its private key
    the server will respond with something (e.g. a padding failure, or the next handshake message)
    so we can measure the time it took the server to decrypt our supplied ciphertext

attack details
  guess the i'th bit of the factor q (assume we already know the higher-order bits)
  since we know the higher-order bits, we have an upper & lower bound on q
    q = XXXX ????    [ XXXX is known ]
    pick two guesses:
      g    = XXXX 0000
      g_hi = XXXX 1000
    use Montgomery extra reductions to find where, between the two guesses, the time drops
  can't send our guesses of q directly; the server will multiply them by R for Montgomery form
    but R is known (the next-highest power of two, so 512 bits)
    so divide the guesses by R, send them to the server, and see if q is above or below each guess
  what are the possible time differences between decrypting g and g_hi?
    small difference: we did not cross q, so q is above g_hi (the i'th bit of q is 1)
    large difference: we crossed q; Montgomery reductions decrease the time, multiplication increases it
  what dominates the "large" time difference?
    Montgomery extra reductions dominate for the first 32 bits
    multiplication time (normal vs. Karatsuba) dominates for the bits after 32
    flips again when (guess mod q) becomes very small:
      normal multiplication becomes more efficient than Karatsuba (n*m < n^1.58); see figure 3a
  neighborhood: probe a few values around each guess to get a stronger signal

defenses against the timing attack
  RSA blinding (see the sketch at the end of these notes)
    pick a random r
    multiply the ciphertext by r^e mod n before decrypting
    divide the result by r after decrypting
    ensures the attacker does not control the bits actually being decrypted
    need to protect r itself from disclosure/timing analysis, and avoid reusing it
    results: figure 8
  ensure the code performs the same amount of work in all cases
    always use the same algorithms; perform the extra reductions in all cases
  overall version: quantize the total time of each decryption
    inefficient and brittle

Percival's attack
  timing, but not of the crypto code itself
  uses timing to extract information about cache access patterns
  figure 2: cost of accessing cache lines
    what does the figure show?  what do the grayscale values mean?
    why are there a few different grayscale values?
    the figure is generated by running the code in figure 1
    it shows what offsets the other process is using
  squaring vs. multiply uses different memory: can tell zero bits from one bits
  can also figure out which sliding-window multiples are in use
  potential shared resources:
    the execution units themselves (can tell how many multiplies are done?)
    L1 and L2 caches, branch predictor, BTB, TLB
    memory bus

defenses against cache attacks
  hardware cache eviction policies
  OS scheduling policies
  cache-aware applications that hide their cache usage patterns?
  noisy clock / no clock: only helps if there's a single CPU and no external observer with a clock

how applicable are these attacks?
  cache attacks: shared systems
    recent results suggest you might be able to do it from within JavaScript!
    shared VM hosting systems like Amazon EC2
    remote login systems like the Athena dialups
  timing attacks: need low-noise timing info and many requests
    does the server log all the requests?
    in some cases it suffices to watch the server's network traffic
    things are easier in a shared VM infrastructure: can rent a nearby VM

other attacks?
  Tempest: RF emissions
    displays leak enough RF to reconstruct the image
    keyboards leak enough RF to reconstruct keystrokes from within 50 feet
  power usage: especially bad for smartcards
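returning to the RSA blinding defense above: a small sketch in Python,
illustrative only (the rsa_decrypt argument is a hypothetical stand-in for the
real CRT/Montgomery decryption path).  the server decrypts r^e * c instead of
c and then strips off r, so the attacker never controls the value that
actually flows through the Montgomery/Karatsuba code paths

    import math
    import secrets

    def blinded_decrypt(c, e, n, rsa_decrypt):
        # pick a fresh random r that is invertible mod n; never reuse it
        while True:
            r = secrets.randbelow(n)
            if r > 1 and math.gcd(r, n) == 1:
                break
        blinded = (pow(r, e, n) * c) % n         # blind: r^e * c mod n
        m_blinded = rsa_decrypt(blinded)         # (r^e * c)^d = r * c^d mod n
        return (m_blinded * pow(r, -1, n)) % n   # unblind: divide by r

    # usage: blinded_decrypt(c, e, n, lambda x: pow(x, d, n))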