6.824 2014 Lecture 8: Egalitarian Paxos
=======================================

broader picture: how to build a complete, fault-tolerant distributed system?
  as we are seeing, there are lots of issues to worry about
    consensus (paxos)
    persistence (harp)
    performance (epaxos)
    ...

why are we reading these papers?
  you might end up building large-scale distributed systems
  papers will help you understand many of these issues
    some clear techniques / protocols that you can just use
    some design patterns or approaches
    still a hard problem
  papers focus on different aspects of system design
    e.g., harp focuses on persistence, recovery, restoring state
    epaxos does not explain how to address these in detail
  in practice, will often need to combine ideas from multiple papers

epaxos is a high-performance paxos-based RSM system
  to understand epaxos, will first discuss several simpler designs

one approach: harp: primary-backup, use paxos to agree on views
  what's wrong with this plan?
  - recovering from a primary or backup failure takes a long time
  - throughput limited by the slowest of primary / backup

another approach: lab 3b: use paxos to agree on each operation
  simple
  doesn't deal with any of the problems that harp faced (persistence)
  doesn't care about performance almost at all

how does lab 3b work?
  work through example scenarios
  how to pick a log position for new ops?
    simple plan: px.Max+1
    what happens if we guess too low?
      learn the decided value, which increases px.Max; try again
    what happens if we guess too high?
      paxos succeeds just fine
  when can we reply to the client?
    if the operation requires no response, reply after paxos agreement
    if the operation requires a response, have to apply the request's log entry
  when can we apply the log's entries?
    cannot apply an operation as soon as it's agreed upon by paxos:
      must apply all log entries in order
    replica has to remember the last applied op
    periodically check if the next op is decided; if so, apply it
    what if the next op will never be decided (someone guessed too high)?
      know that a client is waiting for a response to a higher op
      propose a nop for this slot, after waiting for a while
    what if the proposer fails?
      same: propose a nop after waiting for a while
      might learn the nop, or the real op
  what happens to the client if it doesn't hear back from the server?
    request contains client/reqid
    server state is both the kv-map and the reply-map
    ensures exactly-once across all replicas

performance cost for lab 3b
  client -> any server; server acts as proposer
  proposer -> acceptors: prepare (2f+1)
  acceptors -> proposer: prepare_ok (at least f+1)
  proposer -> acceptors: accept (2f+1)
  acceptors -> proposer: accept_ok (at least f+1)
  proposer -> learners: decided (2f+1)
  what if two servers try to execute an operation in the same slot?
    paxos ensures only one will be committed in the slot
    but can take a large number of messages (unbounded)
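[ aside: a minimal sketch, in Go, of the lab-3b-style server loop discussed above:
  guess px.Max()+1 as the next slot, retry if another op wins the slot, and apply
  log entries strictly in order, with a reply map for exactly-once.  The Paxos
  interface and Op type below are stand-in assumptions, not the lab's actual API. ]

  package kvpaxos

  import "time"

  type Op struct {
      ClientID int64
      ReqID    int64
      Kind     string // "Put" or "Get"
      Key      string
      Value    string
  }

  // Paxos is a stand-in for the lab's paxos peer.
  type Paxos interface {
      Start(seq int, v interface{})       // propose v for slot seq
      Status(seq int) (bool, interface{}) // (decided?, decided value)
      Max() int                           // highest slot this peer has seen
  }

  type KVServer struct {
      px        Paxos
      kv        map[string]string // the kv-map
      lastReply map[int64]int64   // clientID -> highest applied ReqID
                                  // (simplified reply-map; a real one also stores the reply)
      applied   int               // last log slot applied to kv
  }

  // propose keeps trying slots until some slot decides exactly this op,
  // applying every earlier slot along the way.
  func (s *KVServer) propose(op Op) {
      for {
          seq := s.px.Max() + 1 // simple plan: guess the next free slot
          s.px.Start(seq, op)
          decided := s.waitDecided(seq)
          s.applyUpTo(seq) // never skip a slot
          if decided == op {
              return // our op owns this slot; safe to reply to the client
          }
          // guessed too low: another op won the slot, try again higher
      }
  }

  func (s *KVServer) waitDecided(seq int) Op {
      for sleep := 10 * time.Millisecond; ; {
          if ok, v := s.px.Status(seq); ok {
              return v.(Op)
          }
          time.Sleep(sleep) // a real server would eventually propose a nop here
          if sleep < time.Second {
              sleep *= 2
          }
      }
  }

  func (s *KVServer) applyUpTo(seq int) {
      for s.applied < seq {
          next := s.applied + 1
          op := s.waitDecided(next)
          if op.ReqID > s.lastReply[op.ClientID] { // drop duplicate client requests
              if op.Kind == "Put" {
                  s.kv[op.Key] = op.Value
              }
              s.lastReply[op.ClientID] = op.ReqID
          }
          s.applied = next
      }
  }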
system-level optimization: designated leader ("multi-paxos")
  avoid the prepare/prepare_ok phase: prepare all future instances upfront
  avoid dueling leaders
  how does designated leader election work?
  what happens when the designated leader fails?
  what if we get two designated leaders?

what are the performance costs of multi-paxos?
  client -> designated leader
  leader -> acceptors: accept (2f+1)
  acceptors -> leader: accept_ok (at least f+1)
  leader -> learners: decided (2f+1)
  in the common case, no dueling proposers

[ paxos-level optimization: acceptors send accept_ok directly to learners ]
  learners can infer decided
  lower latency until all learners get the result
  more messages sent overall (each acceptor to each learner)
  about the same client latency (leader learns the outcome, replies to the client)

[ paxos-level optimization: clients directly send accept to acceptors ]
  called "fast paxos"
  2 message delays in the common case of no conflicts
  need a recovery protocol (another 2 message delays) for conflict resolution

main limitation of multi-paxos: all requests go to the same designated leader
  leader must process many paxos messages for each client request
  leader will be bottlenecked on CPU, network before the other replicas
  [ fast paxos is not a great fit when many clients issue requests: conflicts ]

[ system-level optimization: batching ]
  put many ops into a single paxos slot
  better throughput: just one paxos exchange for N ops
  worse latency: ops wait for a batch to fill up
  reduces but does not eliminate the leader bottleneck

system-level / paxos-level optimization: mencius
  stripe the log positions across all of the replicas
  each replica is the designated leader for its positions in the log
  replicas that have no pending operations will "skip" their slot: propose a nop
  - when to propose a nop?
  - what if some node dies?  another node times out, proposes a nop

main limitations of mencius
  fixed assignment of slots to nodes: good when all nodes process ~even load
  all nodes must contact all other nodes: can't advance the log otherwise
    might not be on the critical path for a single client operation
    shows up when the next client operation comes in:
      must hear about the slots of the other n-1 replicas
  slow failover: need to time out on slots assigned to the failed replica

does this matter?
  maybe, if your paxos servers have different hardware, or have other load
  maybe, if your replicas fail or get disconnected with some frequency
  maybe, if your replicas are geographically distributed
    what's the latency between data centers?
      east coast -- west coast: ~80 msec
      west coast -- japan:      ~100 msec
      east coast -- europe:     ~80 msec
      west coast -- australia:  ~150 msec

goals for epaxos  [ Ref: Iulian Moraru's SOSP13 talk ]
  1. high throughput, low latency
  2. fast failure recovery
  3. load balancing
  4. use fastest replicas
  5. use closest replicas (latency)

how do the other plans stack up?
  lab 3b:      good 2-5, bad 1
  multi-paxos: good 1, bad 2-5
  mencius:     bad 2, unclear 3+4 (if some replica is slow), good 1+5

system-level / paxos-level optimization: epaxos
  each replica has its own log of ops
  a replica has "pre-prepared" all slots in its own log, like multi-paxos
    one accept/accept-ok round trip is sufficient to agree on an op in a slot
  no longer a single sequential log: order is unclear
  each op comes with a dependency list: other ops that have to run before it

where do the dependencies come from in epaxos?
  the issuing replica might know of the most recent conflicting ops
  during the first round of messages, acceptors might inform the issuer about conflicts
    an acceptor looks at OTHER instances to detect conflicts
    because of majority overlap, guaranteed to learn about conflicting ops
  first round of messages: pre-accept / pre-accept-ok
    the pre-accept-ok response includes the conflict list
    if everyone agrees on the conflict list, done; send decided msgs asynchronously

when do two commands conflict?
  interference: application-specified
    e.g., kv store: same key
    for different keys, the order does not matter
  a promise made by the application to the epaxos system
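[ aside: a minimal sketch, in Go, of the pre-accept round just described, with
  hypothetical types (not the authors' code): each replica computes the interfering
  instances it knows about, and the command leader takes the fast path only if every
  pre-accept-ok reply matches its proposed deps.  Sequence numbers, ballots, and the
  latest-conflict-per-replica optimization are omitted. ]

  package epaxos

  // InstanceID names a slot in some replica's own log.
  type InstanceID struct {
      Replica int
      Slot    int
  }

  type Command struct {
      Key string
      Op  string // "Put" or "Get"
  }

  type Instance struct {
      Cmd  Command
      Deps map[InstanceID]bool
  }

  type Replica struct {
      Log map[InstanceID]*Instance // every instance this replica has seen
  }

  // conflicts is the application-specified interference check:
  // for a kv store, commands on the same key interfere.
  func conflicts(a, b Command) bool {
      return a.Key == b.Key
  }

  // depsFor scans the local log for interfering instances; both the command
  // leader and each pre-accept acceptor do this.
  func (r *Replica) depsFor(cmd Command) map[InstanceID]bool {
      deps := map[InstanceID]bool{}
      for id, inst := range r.Log {
          if conflicts(cmd, inst.Cmd) {
              deps[id] = true
          }
      }
      return deps
  }

  // handlePreAccept records the instance and returns the union of the leader's
  // proposed deps and the locally known conflicts (the pre-accept-ok reply).
  func (r *Replica) handlePreAccept(id InstanceID, cmd Command, proposed map[InstanceID]bool) map[InstanceID]bool {
      deps := r.depsFor(cmd)
      for d := range proposed {
          deps[d] = true
      }
      r.Log[id] = &Instance{Cmd: cmd, Deps: deps}
      return deps
  }

  // fastPath reports whether every reply matches the proposed deps, in which case
  // the leader commits without the accept round; otherwise it must run the second
  // phase with the union of the replies.
  func fastPath(proposed map[InstanceID]bool, replies []map[InstanceID]bool) bool {
      for _, reply := range replies {
          if !sameSet(proposed, reply) {
              return false
          }
      }
      return true
  }

  func sameSet(a, b map[InstanceID]bool) bool {
      if len(a) != len(b) {
          return false
      }
      for k := range a {
          if !b[k] {
              return false
          }
      }
      return true
  }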
example of normal operation (5 replicas S1..S5)
  client->S1: A
    S1->S2,S3: pre-accept(A, deps_A={}); responses deps_A={}
    S1->client: A done
  client->S5: B
    S5->S3,S4: pre-accept(B, deps_B={}); responses deps_B={} from S4, deps_B={A} from S3
    S5->S3,S4: accept(B, deps_B={A})
    S5->client: B done
  client->S1: C
    S1->S2,S3: pre-accept(C, deps_C={A}); responses deps_C={A} from S2, deps_C={A,B} from S3
    S1->S2,S3: accept(C, deps_C={A,B})
    S1->client: C done

why is the second phase (in figure 2) sometimes necessary and sometimes not?
  if all responses are the same, the second phase is not needed
    a quorum of servers remembers the pre-accept-ok response
    if the proposer crashes, recovery ensures that the operation will go into the slot
  if some responses differ, need to agree on what the slot holds
    in particular, deps and seq
    the proposer must ensure a quorum learns the unique outcome
    otherwise recovery might produce a different outcome

why is it called a "fast path" when its quorum is bigger than f+1?
  because we can be done in a single RTT

how big can the deps list get?
  epaxos assumes dependencies are transitive
  safe to include just the latest conflicting op from each replica

what if messages get delayed, and two conflicting ops switch their dependencies?
  not a problem: the clients issued them concurrently, so either order is allowed
  the dependencies will ensure a consistent execution order on all replicas

what if more dependencies arise after the second round of messages for op A?
  they must have been told about A in a response from at least one replica
    so they will include A in their dependency list
  interference must be symmetric: if A conflicts with B, then B conflicts with A

can the dependency graph have cycles?
  yes:
    S1->S2: pre-accept A; response deps empty
    S4->S2: pre-accept B; response deps A
    S4->S3: pre-accept B; response deps empty
    S1->S3: pre-accept A; response deps B
  now deps_A={B} and deps_B={A}

how to execute ops?  assume no cycles for now
  linearize the dependency graph: topological sort
  execute ops in order of dependencies
  is it possible that a dependency has not been decided yet?
    yes: the proposer sent the op in a single pre-accept message, then crashed
  what to do if the dependency is missing?
    maybe wait a little bit, to see if it shows up
    run the explicit prepare algorithm, to agree on a nop (or the real op)
    what if the proposer was just slow, not really crashed?
      epaxos's modified paxos protocol will detect that it has to agree on the nop
      the proposer must retry, putting the desired op in the next available slot
  what if different replicas arrive at a different topological sort order?
    fine: the app promised that order does not matter for non-conflicting ops

how to deal with cycles?
  compute strongly-connected components, execute entire SCCs in dependency order
  how to order operations within a strongly-connected component?
    a sequence number is attached to each operation (in addition to the deps list)
  what if the SCC grows after some replica has already executed some ops in that SCC?
    cannot happen: the deps list of all SCC members is fixed
      before executing, must wait for agreement on all dependent ops
      each of those ops has a fixed deps list
    new ops can depend on the SCC, but that's fine: they come later
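[ aside: a minimal sketch, in Go, of the execution step, with hypothetical types and
  assuming every dependency is already committed and present in the graph: Tarjan's
  algorithm emits each strongly-connected component only after the SCCs it depends on,
  and seq numbers break ties inside an SCC. ]

  package epaxos

  import "sort"

  type Node struct {
      Seq  int      // sequence number fixed at commit time
      Deps []string // ids of commands this one depends on (edges point at deps)
  }

  // executionOrder returns command ids in the order they should be applied.
  // Tarjan finishes an SCC only after every SCC reachable from it (i.e., its
  // dependencies), so appending SCCs as they are found gives execution order.
  func executionOrder(graph map[string]*Node) []string {
      index, lowlink := map[string]int{}, map[string]int{}
      onStack := map[string]bool{}
      var stack, order []string
      next := 0

      var strongconnect func(v string)
      strongconnect = func(v string) {
          index[v], lowlink[v] = next, next
          next++
          stack = append(stack, v)
          onStack[v] = true
          for _, w := range graph[v].Deps {
              if _, seen := index[w]; !seen {
                  strongconnect(w)
                  if lowlink[w] < lowlink[v] {
                      lowlink[v] = lowlink[w]
                  }
              } else if onStack[w] && index[w] < lowlink[v] {
                  lowlink[v] = index[w]
              }
          }
          if lowlink[v] == index[v] { // v is the root of an SCC: pop it
              var scc []string
              for {
                  w := stack[len(stack)-1]
                  stack = stack[:len(stack)-1]
                  onStack[w] = false
                  scc = append(scc, w)
                  if w == v {
                      break
                  }
              }
              // inside an SCC, order by sequence number
              sort.Slice(scc, func(i, j int) bool {
                  return graph[scc[i]].Seq < graph[scc[j]].Seq
              })
              order = append(order, scc...)
          }
      }
      for v := range graph {
          if _, seen := index[v]; !seen {
              strongconnect(v)
          }
      }
      return order
  }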
can the SCC grow after some op in the SCC has been committed and replied to the client?
  yes:
    S1->S2: pre-accept(A, deps_A={}); reply deps_A={}
    S2->S3: pre-accept(B, deps_B={A}); reply deps_B={A}
    S1->S3: pre-accept(A, deps_A={}); reply deps_A={B}
    S1->S2,S3: accept(A, deps_A={B})
    S1 responds to the client, A is done
    client issues op C to S3
    S3->S4: pre-accept(C, deps_C={A,B}); reply deps_C={A,B}
    S2->S4: pre-accept(B, deps_B={A}); reply deps_B={A,C}
    ..
  now A, B, C form an SCC, and this can continue to grow the SCC
  can we guarantee that A will run before C in this case? (linearizability)
    seq numbers
    before A replies to the client, A is committed, so it has a stable seq#
    C must observe A's seq# (by majority overlap), so C's seq# will be higher
    when the SCC is ordered by seq#, A will go first

what's the answer to the lecture question (when does epaxos execute an op)?
  in principle, no need to execute an op until it's needed for some read
    write ops can be put into a slot and acknowledged; no result needed
    might want to proactively execute ops to reduce eventual read latency
  must wait for a command to be committed
    means a stable command, and a stable dependency list
  execute all of its dependencies first

why does epaxos need this weird f+floor((f+1)/2) fast-path quorum?
  has to do with the recovery procedure (somewhat complicated, in the tech report)
  need 7 servers and 2 clients to demonstrate the problem
  suppose we did pre-accept (fast path) with a traditional f+1 quorum:
    C1 -> S1: operation A
    S1 -> S2,S3,S4: pre-accept A; response deps_A={}
    C1 <- S1: ok
    S1 crashes before sending the decided messages
    C1 -> C2: look, A executed
    C2 -> S7: operation B
    S7 -> S6,S5,S4: pre-accept B; response from S6,S5: deps_B={}
                                  response from S4:    deps_B={A}
    S7 crashes
    S4 crashes
  must execute A then B
  but the state at the remaining 4 servers (S2,S3,S5,S6) is symmetric
    S2,S3: pre-accepted A, empty deps
    S5,S6: pre-accepted B, empty deps
    no idea which to execute first
    might end up executing B then A: wrong!
  the f+floor((f+1)/2) quorum allows epaxos recovery to determine which goes first
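[ aside: a tiny worked example, in Go, of the quorum sizes for N = 2f+1 replicas;
  for N=7 (f=3) the fast-path quorum is 5, which is why the traditional f+1 = 4
  pre-accept quorum in the scenario above is not safe. ]

  package main

  import "fmt"

  func main() {
      for _, n := range []int{3, 5, 7} {
          f := (n - 1) / 2
          fast := f + (f+1)/2 // integer division is floor
          fmt.Printf("N=%d f=%d classic quorum=%d fast-path quorum=%d\n", n, f, f+1, fast)
      }
  }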
figure 4, right half (5 replicas)
  EPaxos 0%
    why is the latency in VA, CA, OR lower than in JP and EU?
    what sites does each replica have to talk to, in order to commit a command?
      VA: CA, OR
      CA: OR, VA
      OR: CA, VA
      EU: VA, CA  [150ms]
      JP: CA, OR  [130ms]
  EPaxos 100%
    why is the latency double?  need the second round after updating deps
    why is the latency not even higher?
      might think that JP needs to execute ops from EU [300ms]
      write operations: no need to execute before replying to the client
      execute latency is indeed a little higher: wait for the decided msg on the deps
  why does multi-paxos sometimes have high latency?
    clients have to talk to the designated leader, which might be far away
    but for clients that are lucky (next to the designated leader), latency is low
  when does mencius have high latency?
    clients can talk to their nearby replica
    but the replica still has to know what happened in the other log slots
      if all replicas are running commands, they will run somewhat in sync
      if only one replica is issuing commands, it has to poke the other replicas

figure 10: dealing with replica failure
  why does multi-paxos drop to 0?  lost the leader
  why does mencius spike after a while?
    each replica committed a bunch of commands in its own slots
    after a timeout, the failed replica's slots get filled with nops
  why does epaxos remain flat?
    clients issue 10k req/s, servers can handle much more
      system not saturated
    must be that clients have a very short timeout, try another server
  can we shorten the timeout in multi-paxos, mencius?
    risk: false positives when a replica didn't actually crash
    multi-paxos: dueling leaders, an undesirable situation
    mencius: nops end up replacing real ops, slowing down the replaced ops

does epaxos matter if your replicas are all within a single datacenter?
  figure 5: matters for small commands (16 bytes)
    epaxos 0% wins over mencius due to sending fewer msgs
    epaxos wins over multi-paxos because of the leader CPU bottleneck

references:
  http://www.pdl.cmu.edu/PDL-FTP/associated/CMU-PDL-13-111.pdf
  https://www.youtube.com/watch?v=KxoWlUZNKn8