Side-channel attacks
====================

quiz 1 solutions posted

how does RSA work?
  pick two primes, p and q; let n = p*q
  idea: encrypt a message by exponentiating it mod n,
    and decrypt by another exponentiation mod n
  so, generate a public key (e, for encrypt) and a private key (d, for decrypt)
  property we want: (m^e)^d = m^(e*d) = m (mod n)
  how do we pick e and d with this property?
    turns out e*d = 1 mod (p-1)*(q-1) gives us such a pair of exponents
    if you know p and q, it's easy to compute (p-1)*(q-1), and thus to invert e and get d
  typical trick: pick e to be small (e.g. 3 or 0x10001 = 65537), then compute d
    makes it cheap to encrypt messages

RSA can also be used for signing
  private key owner generates a signature s = m^d mod n
  anyone can verify it came from the private key owner (check that s^e = m mod n)
  others should not be able to generate s, because they don't know d

some other practical considerations for RSA
  RSA is multiplicative: an attacker can synthesize encryptions/signatures of products
    of messages whose encryptions/signatures it already has
  use fixed & random padding to prevent such attacks

what does the key owner have to hide?
  e and n are public
  keep d secret, obviously
  crucial to keep the factors p and q secret too!
    otherwise the attacker computes d from e
  big assumption: factoring n is hard
  turns out p and q are helpful in making RSA fast,
    so almost all RSA implementations keep them around

how is RSA implemented?
  main operation: modular exponentiation
    given ciphertext c, decrypt it as m = c^d mod n
    naive impl: take c, multiply it by itself d times, reduce mod n
    encryption is relatively easy: e is a small number
    two problems for d:
      d is very large (repeated multiplication would take forever)
      c^d grows very large

  make the problem half as hard by knowing p and q
    Chinese remainder theorem (CRT): given a mod x and a mod y (x, y coprime),
      can efficiently find a mod x*y
    in RSA, we have just this situation:
      compute m1 = c^d1 mod p and m2 = c^d2 mod q, where
        d1 = e^-1 mod (p-1)
        d2 = e^-1 mod (q-1)
      then use the CRT to recover m = c^d mod (p*q) from m1 and m2
    so most RSA impls keep (p, q, d1, d2) and perform modular exponentiations mod p and mod q

  exponentiation: repeated squaring
    can perform exponentiation by considering each bit of d in turn
    key observations:
      c^(2x)   = (c^x)^2
      c^(2x+1) = (c^x)^2 * c
    start with the most-significant bit of d and work down to the least-significant bit
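    a minimal square-and-multiply sketch, in Python (illustrative only, not
    OpenSSL's actual code): scan d's bits from most significant to least
    significant, square at every bit, and multiply by c only when the bit is 1;
    the per-step "mod q" keeps the numbers small, and making that reduction
    cheap is what Montgomery's trick (below) is about

        def modexp(c, d, q):
            """Compute c^d mod q by repeated squaring, scanning d MSB-first."""
            result = 1
            for i in range(d.bit_length() - 1, -1, -1):
                result = (result * result) % q       # c^(2x)   = (c^x)^2
                if (d >> i) & 1:
                    result = (result * c) % q        # c^(2x+1) = (c^x)^2 * c
            return result

        # sanity check against Python's built-in modular exponentiation
        assert modexp(7, 65537, 101) == pow(7, 65537, 101)

    note the data-dependent work: a 1 bit costs an extra multiply beyond the
    square, which is exactly the kind of difference the attacks below exploit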
  sliding window
    instead of doing one bit at a time, do several bits at a time
    OpenSSL does 5 bits at a time (for 1024-bit keys)
    pre-compute c, c^3, c^5, c^7, ..., c^31
    c^(2x)    = (c^x)^2
    c^(32x+z) = (c^x)^32 * c^z    (z odd, from 1 to 31)

  Montgomery representation
    problem: c^d grows very large before we reduce it mod q at the end
      if c is 512 bits, even c^512 is already 256K bits;
      and d is itself a 512-bit number, so c^d is astronomically larger
    solution: reduce mod q at each step
    problem 2: reducing mod q is very expensive!
      it's basically a division, and division is far more expensive than multiplication
    Peter Montgomery's trick:
      shift numbers into a different representation up front,
      and shift them back into the original representation afterwards
      representation: multiply by some factor R (divide by R to recover)
        a  ->  a*R
        b  ->  b*R
        c = a*b  ->  aR * bR = ab * R^2
      so whenever we multiply two Montgomery-form numbers, we need to divide by R
        that puts us back in the 512-bit ballpark; no mod-q division needed
      real trick: make R a power of two (the smallest one larger than q)
        then dividing by R is easy: just bit-shifting
        if only the low-order bits were zero whenever we had to divide by R..
        but everything is mod q, so we can add any multiple of q:
          add q, 2q, 4q, ... as needed to "shoot down" the low-order bits
        (there's a more efficient variant that uses a fixed number of multiplications)
      remaining step: if the result cR (= ab*R) is greater than q, subtract q
        called an "extra reduction"
      Schindler's observation: when computing c^d mod q,
        Pr[extra reduction] = (c mod q) / 2R    [with c in Montgomery form]
        so the probability grows as c approaches q; see figure 1
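    to make the "shift, shoot down the low bits, then divide by R" idea concrete,
    here is a small Montgomery-reduction (REDC) sketch in Python; it is
    illustrative only (the names mont_params/redc/mont_mul and the extra-reduction
    counter are made up, not OpenSSL's), but it shows that the mod-q division
    disappears and that the "extra reduction" is a conditional, input-dependent
    subtraction

        def mont_params(q):
            # R = 2^k is the smallest power of two larger than the (odd) modulus q
            k = q.bit_length()
            R = 1 << k
            q_inv = (-pow(q, -1, R)) % R      # q' with q*q' = -1 (mod R); needs Python 3.8+
            return k, R, q_inv

        def redc(T, q, k, R, q_inv, extra):
            """Return T * R^-1 mod q for 0 <= T < q*R, counting extra reductions."""
            m = ((T & (R - 1)) * q_inv) & (R - 1)   # pick m so T + m*q has k low zero bits
            t = (T + m * q) >> k                    # "divide by R" is just a shift, no real division
            if t >= q:                              # the "extra reduction": happens only for some inputs
                t -= q
                extra[0] += 1
            return t

        def mont_mul(aR, bR, q, k, R, q_inv, extra):
            # multiply two Montgomery-form numbers; the result is (a*b)*R mod q
            return redc(aR * bR, q, k, R, q_inv, extra)

        # toy demo with a tiny modulus (not a real RSA prime)
        q = 101
        k, R, q_inv = mont_params(q)
        extra = [0]
        aR, bR = (7 * R) % q, (9 * R) % q                       # shift into Montgomery form
        abR = mont_mul(aR, bR, q, k, R, q_inv, extra)
        assert redc(abR, q, k, R, q_inv, extra) == (7 * 9) % q  # shift back out of Montgomery form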
multiplication routines
  cannot multiply 1024-bit numbers natively
  normal variant: pairwise multiplication of the 32-bit words in each number
    takes O(n*m) time, where n and m are the number of words in each operand
    about O(n^2) when n and m are close
  Karatsuba multiplication: works when the operands are the same size, takes O(n^1.58) time

Brumley's attack plan
  time the overall crypto operations
  two factors:
    1. Montgomery extra reductions (figure 1) take more time for some c's
    2. Karatsuba multiplication takes less time when the numbers are exactly the same size
  1: as ciphertext c approaches q, more extra Montgomery reductions, so a longer time
     when c exceeds q, the extra reductions disappear, and the time drops
  2: as ciphertext c approaches q, multiplications use Karatsuba (c ~= q), so they're fast
     when c exceeds q, (c mod q) becomes very small, so normal multiplication, slow
  the two effects work in opposite directions
    turns out you can still recover enough information
  a typical SSL web server will perform a decryption using its private key
    the server will respond with something (e.g. a padding failure, or the next handshake message)
    so we can measure the time it took the server to decrypt our supplied ciphertext

attack details
  guess the i'th bit of the factor q (assume we already know the higher-order bits)
  since we know the higher-order bits, we have an upper & lower bound on q
    q = XXXX ????    [ XXXX is known ]
    pick two guesses:
      g    = XXXX 0000
      g_hi = XXXX 1000
    use Montgomery extra reductions to find where, between the two guesses, the time drops
  can't send our guesses of q directly; the server will multiply them by R for Montgomery form
    but R is known (the next-highest power of two, so 512 bits)
    so divide the guesses by R, send them to the server, and see if q is above or below each guess
  what are the possible time differences between decrypting g and g_hi?
    small difference: we did not cross q, so q is above g_hi (the i'th bit of q is 1)
    large difference: we crossed q; Montgomery reductions decrease the time, multiplication increases it
  what dominates the "large" time difference?
    Montgomery extra reductions dominate for the first 32 bits
    multiplication time (normal vs. Karatsuba) dominates for the bits after 32
    flips again when (guess mod q) becomes very small:
      normal multiplication becomes more efficient than Karatsuba (n*m < n^1.58); see figure 3a
  neighborhood: probe a few values around each guess to get a stronger signal

defenses against the timing attack
  RSA blinding (see the sketch at the end of these notes)
    pick a random r
    multiply the ciphertext by r^e mod n before decrypting
    divide the result by r after decrypting
    ensures the attacker does not control the bits actually being decrypted
    need to protect r itself from disclosure/timing analysis, and avoid reusing it
    results: figure 8
  ensure the code performs the same amount of work in all cases
    always use the same algorithms; perform the extra reductions in all cases
  overall version: quantize the total time of each decryption
    inefficient and brittle

Percival's attack
  timing, but not of the crypto code itself
  uses timing to extract information about cache access patterns
  figure 2: cost of accessing cache lines
    what does the figure show?  what do the grayscale values mean?
    why are there a few different grayscale values?
    the figure is generated by running the code in figure 1
    it shows what offsets the other process is using
  squaring vs. multiply uses different memory: can tell zero bits from one bits
  can also figure out which sliding-window multiples are in use
  potential shared resources:
    the execution units themselves (can tell how many multiplies are done?)
    L1 and L2 caches, branch predictor, BTB, TLB
    memory bus

defenses against cache attacks
  hardware cache eviction policies
  OS scheduling policies
  cache-aware applications that hide their cache usage patterns?
  noisy clock / no clock: only helps if there's a single CPU and no external observer with a clock

how applicable are these attacks?
  cache attacks: shared systems
    recent results suggest you might be able to do it from within JavaScript!
    shared VM hosting systems like Amazon EC2
    remote login systems like the Athena dialups
  timing attacks: need low-noise timing info and many requests
    does the server log all the requests?
    in some cases it suffices to watch the server's network traffic
    things are easier in a shared VM infrastructure: can rent a nearby VM

other attacks?
  Tempest: RF emissions
    displays leak enough RF to reconstruct the image
    keyboards leak enough RF to reconstruct keystrokes from within 50 feet
  power usage: especially bad for smartcards
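returning to the RSA blinding defense above: a small sketch in Python,
illustrative only (the rsa_decrypt argument is a hypothetical stand-in for the
real CRT/Montgomery decryption path).  the server decrypts r^e * c instead of
c and then strips off r, so the attacker never controls the value that
actually flows through the Montgomery/Karatsuba code paths

    import math
    import secrets

    def blinded_decrypt(c, e, n, rsa_decrypt):
        # pick a fresh random r that is invertible mod n; never reuse it
        while True:
            r = secrets.randbelow(n)
            if r > 1 and math.gcd(r, n) == 1:
                break
        blinded = (pow(r, e, n) * c) % n         # blind: r^e * c mod n
        m_blinded = rsa_decrypt(blinded)         # (r^e * c)^d = r * c^d mod n
        return (m_blinded * pow(r, -1, n)) % n   # unblind: divide by r

    # usage: blinded_decrypt(c, e, n, lambda x: pow(x, d, n))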