6.824 2014 Lecture 10: Distributed transactions
===============================================

topic: coordinating execution across multiple machines
  subtly different goal from last few lectures
    previously: how to tolerate faults using replicas + paxos
    will see shortly why these are different topics

plan:
  1. (simplified) review of two-phase commit / two-phase locking
  2. lynx

typical scenario: airline, hotel, bank
  user wants to get airline ticket and hotel room, pay from his bank account
  straightforward approach:
    client reserves airline ticket
    airline transfers money from bank account
    client reserves hotel room
    hotel transfers money from bank account
  what might go wrong?
    no more hotel rooms after booking airline ticket
      undesirable for customer
    airline system crashes after transferring money but before reserving seat
      undesirable for customer
    or airline system crashes after reserving seat but before transferring money
      undesirable for airline

goal: all-or-nothing atomicity
  i.e., either all parts executed, or none of them did
  challenging because parts run on different machines

achieving all-or-nothing atomicity: two-phase commit (2PC)
  coordinator is in charge of orchestrating user's transaction
  coordinator sends Prepare message to all servers
  each server decides if their part of the transaction can execute
    send back PrepareOK or PrepareAbort response to coordinator
    e.g., abort causes in our example:
      no seats left on flight
      no rooms left in hotel
      insufficient money in bank account
  if any server says PrepareAbort, coordinator sends Abort to all servers
    servers don't perform their part of transaction
  if all servers say PrepareOK, coordinator sends Commit to all servers
    servers execute their part of transaction
  at least in the simple case, 2PC achieves all-or-nothing atomicity
    transaction commits only if all servers are willing to execute it
    coordinator sends commit messages to all servers, so all should execute
  two complications: concurrency and failures

2PC with concurrency
  what if another op tries to execute on a server between Prepare + Commit?
  if server said PrepareOK, then promised to commit later
    cannot execute any operation that can preclude later commit
  Prepare makes a reservation, Commit actually commits it
    e.g., reserve seat/room, place a hold for some amount of $ in bank account
    concurrent operations cannot violate this reservation
    often done using locks: lock bank account, lock seat assignment, ..
  will say more about concurrency later on: two-phase locking (2PL)
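to make the message flow concrete, here is a minimal coordinator sketch in Go;
the Participant interface and the toy resource type are hypothetical stand-ins
for RPC stubs and server state, not taken from any real system:

  package main

  import "fmt"

  // Hypothetical Participant interface: each method stands in for an RPC.
  type Participant interface {
          Prepare(txn int) bool // true = PrepareOK, false = PrepareAbort
          Commit(txn int)
          Abort(txn int)
  }

  // runTwoPhaseCommit is the coordinator: phase 1 collects votes; phase 2
  // sends the unanimous decision to every server.
  func runTwoPhaseCommit(txn int, servers []Participant) bool {
          for _, s := range servers {
                  if !s.Prepare(txn) {
                          for _, s2 := range servers { // any abort vote aborts everyone
                                  s2.Abort(txn)
                          }
                          return false
                  }
          }
          for _, s := range servers { // all PrepareOK: decision is commit
                  s.Commit(txn)
          }
          return true
  }

  // Toy participant: Prepare reserves one unit of a resource and holds the
  // reservation until the coordinator's decision arrives.
  type resource struct {
          free     int
          prepared map[int]bool
  }

  func (r *resource) Prepare(txn int) bool {
          if r.free == 0 {
                  return false
          }
          r.free--
          r.prepared[txn] = true
          return true
  }

  func (r *resource) Commit(txn int) { delete(r.prepared, txn) }

  func (r *resource) Abort(txn int) {
          if r.prepared[txn] { // release the reservation, if one was made
                  r.free++
                  delete(r.prepared, txn)
          }
  }

  func main() {
          airline := &resource{free: 1, prepared: map[int]bool{}}
          hotel := &resource{free: 0, prepared: map[int]bool{}} // no rooms left
          ok := runTwoPhaseCommit(1, []Participant{airline, hotel})
          fmt.Println("committed:", ok, "| airline seats free again:", airline.free)
  }

note this coordinator never fails; handling coordinator crashes requires
logging the commit decision durably, as discussed next.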
2PC with network failures
  server doesn't hear from coordinator (Prepare)
    no problem -- as if nothing happened at that server
    coordinator can't commit without hearing from every server
  coordinator doesn't hear back from some server (PrepareOK or PrepareAbort)
    coordinator can retransmit Prepare request
      coordinator picks a unique txn id, which goes into all messages
    always safe for coordinator to abort (e.g., in case of a timeout)
  server doesn't hear from coordinator (Abort / Commit)
    if sent PrepareAbort, no problem: the only possible outcome is abort
    if sent PrepareOK, have to block: must be prepared to commit
      can contact coordinator to learn outcome
      cannot release reservation/locks until committed or aborted

2PC with server crashes
  if server is part of an outstanding transaction, must remember state on disk
    i.e., if replied PrepareOK but haven't heard back either Abort or Commit
    must remain prepared to Commit: continue to hold reservation/locks
    (a participant-side sketch appears at the end of this 2PC discussion)
  if server eventually comes up, has to learn outcome from coordinator
    coordinator must remember transaction outcome until acked by all servers

2PC with coordinator crashes
  don't need to remember transactions that were aborted: safe to abort
  don't need to remember transactions that didn't get decided: safe to abort
  must remember committed transactions that haven't been ACKed by all servers
    cannot just abort them: might have sent Commit to some servers already
    server might not have heard whether transaction committed or aborted

why do we need 2PC when we already have Paxos?
  could three servers decide to commit if a majority is up and commits?
    no: cannot commit if hotel has no rooms, even if airline + bank are ok
  Paxos useful when servers hold the same state (replicas)
    all servers execute same operation, in the same way (commit or abort)
    any server can predict what any other server will do
    no need to consult every server -- there's just one operation
    if majority agrees, rest of servers will follow, and obtain the same result
  2PC useful when servers hold different state
    all servers execute different operations
    one server cannot predict what another server will do
    have to consult every server
  not practical for all servers to hold same state
    fundamentally, cannot have all data live on a single server
    either administrative/policy reasons or capacity/performance reasons
  if operation involves many servers with different state, may need 2PC

2PC in practice
  some unfortunate properties:
    servers have to hold reservations/locks until abort/commit
    coordinator has to be highly available (otherwise servers wait forever)
  rarely used across systems (like our airline/hotel/bank example)
    cross-system dependencies are not a great design
      e.g., airline cannot sell seats if third-party coordinator is slow/failed
    typically cross-system operations rely on undo/compensation/reconciliation
      e.g., airline allows free ticket cancellation within 24 hours
  sometimes used within a single distributed system
    e.g., spanner: data stored in shards, 2PC used for cross-shard transactions
    coordinator can be made highly available using paxos
  lab4 also has a sharded key/value store but does not involve 2PC
    all client operations involve just one key (put/get)
    just one shard/server affected by a given operation
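a participant-side sketch in Go of the crash handling described above, assuming
an in-memory stand-in for an on-disk log (not any real system's API); the key
rule: a PrepareOK vote must be durable before it is sent, because the server
promises it can still commit after a restart:

  package main

  import "fmt"

  type logRecord struct {
          txn   int
          state string // "prepared", "committed", or "aborted"
  }

  type participant struct {
          log   []logRecord    // stands in for an fsync'd on-disk log
          state map[int]string // volatile state, lost on a crash
  }

  func (p *participant) persist(txn int, state string) {
          p.log = append(p.log, logRecord{txn, state}) // imagine fsync() here
          p.state[txn] = state
  }

  // handlePrepare logs the vote BEFORE replying to the coordinator.
  func (p *participant) handlePrepare(txn int, canExecute bool) string {
          if !canExecute {
                  p.persist(txn, "aborted")
                  return "PrepareAbort"
          }
          p.persist(txn, "prepared")
          return "PrepareOK"
  }

  // handleDecision records the outcome; only now may locks be released.
  func (p *participant) handleDecision(txn int, commit bool) {
          if commit {
                  p.persist(txn, "committed")
          } else {
                  p.persist(txn, "aborted")
          }
  }

  // recover replays the log; txns still "prepared" are in limbo, so the
  // server must ask the coordinator for their outcome and keep their
  // reservations/locks held until it hears back.
  func (p *participant) recover() (stillPrepared []int) {
          p.state = map[int]string{}
          for _, r := range p.log {
                  p.state[r.txn] = r.state
          }
          for txn, s := range p.state {
                  if s == "prepared" {
                          stillPrepared = append(stillPrepared, txn)
                  }
          }
          return
  }

  func main() {
          p := &participant{state: map[int]string{}}
          fmt.Println(p.handlePrepare(7, true)) // "PrepareOK"
          // crash before hearing Commit/Abort; after restart:
          fmt.Println(p.recover()) // [7]: contact coordinator, keep locks
  }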
what about ordering of distributed transactions?
  suppose we have a bank with accounts stored on two shards / servers
  transaction 1: xfer $100 from A's account to B's account (different shards)
  transaction 2: compute sum of A's and B's accounts
  possible problem: T2 reads A, then T1 xfers $100, then T2 reads B
    sum includes the $100 twice

goal: before-or-after atomicity
  "as if" entire transaction 1 ran before (or after) transaction 2

traditional solution: two-phase locking (2PL)
  lock every object accessed in a transaction
  release all locks after transaction commits or aborts
  when T2 reads A, it acquires A's lock
    prevents T1 from acquiring A's lock
    T2 finishes reading B's account, then releases A's lock
    now T1 can continue executing
  opposite order enforced when T1 acquires A's lock first
  achieves linearizability (or "strict serializability")
    as if transactions ran serially, one after another
    if T1 finished before T2 started, then T1 will appear to run before T2

2PC/2PL achieve strong guarantees for distributed transactions
  all-or-nothing atomicity
  before-or-after atomicity
  nice to have (easy to program with), but costly (performance)
    coordinator must contact all servers: high latency

lynx: case study of how to reconcile this tension in distributed transactions
  sharded database
    SQL-like relational table structure
    rows in a single table might be spread over many machines / data centers
    configuration service tells clients what server maintains what shard
    much like lab4
  why shard the database?
    capacity: all users' data cannot fit on a single server
    locality: users don't want to wait for round-trip to single data center
      place user's data in data center near that user's physical location
  ambitious goals:
    distributed transactions
      all-or-nothing atomicity
      before-or-after atomicity
    low latency
  why atomic cross-shard transactions?
    auction application: atomically place bid on item
      affects multiple tables: Bids, Items, Users, ..
    social networking application: atomically add friend
      affects friends table, activities table (announce new friend)
  research system prototype exploring ideas
    no real deployment
    smaller-scale evaluation, compared to Spanner
    but nonetheless design contains some interesting ideas
    some aspects not implemented:
      failure recovery
      fallback to full 2PC/2PL
      shard-server mapping is static

example application: auction [ based on Figures 1, 5, 7 ]
  Bids table: bidder, seller, itemid, price
    sharded by bidder
  Items table: seller, itemid, price
    sharded by seller

what does a distributed transaction look like in lynx?
  transaction chain: sequence of local operations (hops) on individual servers
    each local operation executes atomically (e.g., locks the database)
  first hop acts as the coordinator for the rest of the transaction
    executes it hop-by-hop on appropriate servers
  e.g., placing a bid [ based on Figure 7 ]
    insert a bid into the Bids table
    update price in the Items table (if new bid is higher)
    (see the sketch of this chain below)

how does lynx achieve low latency?
  don't wait for all servers involved in transaction: so, not 2PC/2PL
  only the first participant can choose to abort (others cannot)
  the before-or-after atomicity guarantee is slightly weaker
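a sketch of the place_bid chain in Go, using hypothetical Shard/Hop/Chain
types (the paper's actual API is not shown in these notes); each hop is one
local operation against exactly one shard:

  package main

  import "fmt"

  type Shard struct {
          name string
          data map[string]int
  }

  // A hop is one local operation on one shard, executed atomically there.
  type Hop struct {
          shard *Shard
          op    func(s *Shard)
  }

  // The first hop's server acts as coordinator: it runs hop 1, replies to
  // the client, then forwards the rest of the chain hop-by-hop.
  type Chain struct{ hops []Hop }

  func (c *Chain) Run() {
          for i, h := range c.hops {
                  h.op(h.shard) // each hop commits locally before the next starts
                  if i == 0 {
                          fmt.Println("first hop committed; client gets its reply here")
                  }
          }
  }

  func main() {
          bids := &Shard{"Bids (by bidder)", map[string]int{}}
          items := &Shard{"Items (by seller)", map[string]int{"item42": 100}}

          const newBid = 150
          placeBid := &Chain{hops: []Hop{
                  // hop 1: insert the bid into the Bids table
                  {bids, func(s *Shard) { s.data["alice/item42"] = newBid }},
                  // hop 2: update price in the Items table (if new bid is higher)
                  {items, func(s *Shard) {
                          if newBid > s.data["item42"] {
                                  s.data["item42"] = newBid
                          }
                  }},
          }}
          placeBid.Run()
          fmt.Println("item42 price:", items.data["item42"]) // 150
  }

note that the client conceptually gets its reply after hop 1; the remaining
hops continue later, which is exactly where the weaker guarantee comes from.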
what if another transaction accesses the same data, concurrent with place_bid?
  scenario: place_bid || (look up current list of bids by a given user (bidder))
    no problem: place_bid returns after first hop commits
    reading Bids table will return new bid
    if lookup runs before place_bid, does not observe new bid
  scenario: place_bid || (look up current price of an item)
    maybe the second hop of place_bid hasn't executed yet
    may observe the old price in Items table, even though Bids was updated
  why is this legal?
    serializability: execution is equivalent to some serial order
    this serial order need not be the order in which transactions were issued
      re-ordering: later transaction "appears" to run first
    not linearizability / strict serializability (as achieved by 2PL)
  what to do about this problem?
    lynx: "read-my-writes" guarantee
      chain guaranteed to observe prior writes from same session
    can still observe inconsistency: two users talk to each other out-of-band
      does this matter?
  how big can this inconsistency window get?
    potentially arbitrarily large
    if intermediate hop server/network is slow, chain will get stuck there
  suppose Alice places bid, checks item's price, then tells Bob about item
    will Bob see the new price? yes, because:
      read-my-writes ensures Alice's read will run after Alice's write to Items
      after this, any read from Bob will also observe new item price

scenario: place_bid || (look up item price (Items table), then find highest bid (Bids table))
  should return the same thing, if txns have before-or-after atomicity
  naively running this transaction as a chain, could observe atomicity violation
    first hop: saw old price in Items table
    second hop: saw new, higher bid in Bids table

how does lynx deal with this atomicity violation?
  analyze transactions for conflicts
    assumes all transactions are known ahead of time (why is this reasonable?)
    but does not require knowing the arguments to transactions
  consider all pairs of transactions, and the hops in those transactions
    S (sibling) edges between hops: hops are part of the same txn
    C (conflict) edges between hops: hops conflict
  is a "conflict" the same thing in lynx and epaxos?
    yes, same notion: order of operations matters
  SC analysis helps determine if it's safe to run hops individually (no 2PL)
    problem if there exists an SC cycle (cycle containing both S & C edges)
    e.g.: reading Items table: no SC cycle, serializable if run hop-by-hop
    e.g.: reading Items then Bids: SC cycle, not serializable if run hop-by-hop
  what to do if there's an SC cycle?
    in principle, paper says could use 2PC/2PL
    in practice, lynx did not implement 2PC/2PL
    application developer should try hard to avoid these situations
  e.g., two place_bid invocations: SC cycle, might require 2PC/2PL
    authors suggest Bids inserts do not conflict: commutative

is it always sufficient to wait for the first hop?
  application-dependent
  example from section 4.3: changing privacy settings across many shards
    don't want subsequent posts to appear before privacy setting change
    wait for entire privacy-changing chain to complete

is it necessary to wait for even the first hop to commit?
  should be enough if first hop simply records the transaction chain
    acts as coordinator for the entire chain
    lynx application cannot tell whether server is doing this
  all txns on that hop will execute in order
    later txns will wait for prior txns to finish executing

derived tables
  might want to maintain another view of the same table
  why? different sharding plan: efficient lookups by a different key
    e.g., Bids table sharded by seller instead of bidder, or by time
  when base table updated, lynx creates "sub-chains" to update derived tables

why is origin ordering necessary?
  avoids SC cycle problems due to updating multiple derived tables
    e.g., updating Bids-by-seller, Bids-by-time, etc

how is origin ordering implemented?
  first hop is in charge of executing all later hops
  first hop server maintains counters for messages sent to other servers
  when new chain arrives at first hop, counter values assigned to all hops
  other servers execute incoming messages in sequential order
    if server received a hop out-of-order, will wait
    e.g., above example with Bids-by-seller, Bids-by-time
  (see the sketch of the sequencing logic below)
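a sketch of this sequencing logic in Go, under assumed message/server types
(not the lynx codebase): the origin stamps each outgoing hop with a
per-destination sequence number; a destination executes hops strictly in that
order, buffering any that arrive early:

  package main

  import "fmt"

  type Msg struct {
          Seq int    // per-(origin, destination) sequence number
          Op  string // the hop's local operation, stubbed out as a string
  }

  type Origin struct {
          next map[string]int // next sequence number, per destination server
  }

  func (o *Origin) Stamp(dest, op string) Msg {
          o.next[dest]++
          return Msg{Seq: o.next[dest], Op: op}
  }

  type Dest struct {
          executed int         // highest sequence number executed so far
          pending  map[int]Msg // buffered out-of-order arrivals
  }

  func (d *Dest) Deliver(m Msg) {
          d.pending[m.Seq] = m
          // execute as many consecutive hops as possible; on a gap, the
          // server waits (here: returns) until the missing hop shows up
          for {
                  next, ok := d.pending[d.executed+1]
                  if !ok {
                          return
                  }
                  delete(d.pending, next.Seq)
                  d.executed = next.Seq
                  fmt.Println("execute:", next.Op)
          }
  }

  func main() {
          origin := &Origin{next: map[string]int{}}
          bidsBySeller := &Dest{pending: map[int]Msg{}}

          m1 := origin.Stamp("bids-by-seller", "insert into Bids-by-seller (chain 1)")
          m2 := origin.Stamp("bids-by-seller", "insert into Bids-by-seller (chain 2)")

          bidsBySeller.Deliver(m2) // arrived early: buffered, nothing runs
          bidsBySeller.Deliver(m1) // gap filled: both hops run, in order
  }

per-destination counters give a total order on the hops one origin sends to
one destination, which is all origin ordering needs.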
how is read-my-writes implemented?
  per-session origin (first hop of txns in that session) + origin ordering
  session's first hop (origin) sees all transaction chains from that session
  ensures all of their hops execute in order they arrive at first hop
  can we get linearizability by putting everyone in the same session?
    read-my-writes will ensure linearizability
    but we will lose scalability: single session's server will be the bottleneck

derived tables
  sub-chains to update the derived tables
  why wait for entire derived-table update sub-chain to complete?
    what if another chain reads derived table instead of primary table?
    could get old data in derived table, even though primary table updated
    to detect this inconsistency, would need an SC cycle -- analysis would catch it
    otherwise, at most a linearizability violation (but still serializable)

when might transaction chains work well?
  need all-or-nothing atomicity
  need serializability
  don't need linearizability
  transactions are relatively simple / conflict-free (no SC cycles)

when might transaction chains not work well?
  don't know transactions ahead of time (cannot analyze for SC cycles)
  need linearizability
  lots of chains converge on a single non-first-hop server
    first-hop servers queue up lots of pending chains
    all of them block waiting to execute hop on bottleneck server

Spanner vs. Lynx
  is Spanner strictly better?
    simpler to program, but higher latency for cross-shard transactions
    how important is this? depends entirely on the application
  how would you implement auction in spanner?
    option 1: transaction modifying Bids and Items
      wait to contact both bidder's and seller's data centers
      might be far away: australia, etc
      so, high latency
    option 2: separately modify Bids and Items
      no all-or-nothing atomicity: might insert bid, crash before updating price
      no before-or-after atomicity: might have unexpected interleavings
      so, difficult to get consistency
  with lynx, somewhat easier to achieve low latency and consistency
    but does require programmer to think a bit more about consistency

references:
  http://en.wikipedia.org/wiki/Two-phase_commit_protocol
  http://www.youtube.com/watch?v=-5pLcTaDYEg