6.824 2014 Lecture 16: PNUTS
============================

Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam
Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel
Weaver and Ramana Yerneni. PNUTS: Yahoo!'s Hosted Data Serving
Platform. Proceedings of VLDB, 2008.

Going to talk about PNUTS today.  
  Distributed data storage at Yahoo.
  Published after Dynamo (tomorrow), well before Spanner  
  [ PNUTS: Platform for Nimble Universal Table Storage ]

what is PNUTS' overall goal?
  [world, browsers, data centers]
  overall story similar to that of Spanner
  10s of data centers ("sites") all over the world; geographically diverse
  web applications, e.g. mail, shopping, social net
    each app probably runs at all sites
  PNUTS keeps state for apps
    per-user: profile, shopping cart, friend list
    per-item: book popularity, news articles
  App might need any piece of data at any data center
  Need to handle lots of concurrent updates to different data
    e.g. lots of users must be able to update profile at same time
    thus 1000s of PNUTS servers
  Must cope well with partial failures
    1000s of servers => crashes must be frequent

Overview
  [diagram: 2 datacenters, user browser, web apps, PNUTS generic boxes, SF and Virgina]
  Copy of data at each site
  Geographic distribution
  How to keep copies of data synchronized?
  Sites need to apply same updates

Major Interesting Points
  Copy of data everywhere, so
  Try to be local for all reads and writes
    Reads and writes are fast and asynchronous
    Don't have to do cross-datacenter traffic
  Not serializability -- consistency model interesting.  
    Can see stale data
    App controls semantics and hence performance

Data model
  table
    record  [ key attr      ]
      primary key + attributes (columns)
      Blobs, application specific data (parsed JSON)
  Queries only by primary key, which must be unique
  Also supports range scan on primary key
  Reads/writes can probably be by column
    So a write might replace just one column, not whole record
  Application reading and writing at one site (datacenter)

Consistency Semantics
  per-record timeline consistency
  Version # stored in each record
  [ key attr version    ]
  Example:  Alice, mom, photos, spring break
    Alice does permission change from app in SF
    Alice does photo upload from app in Virginia
    Update1 and Update2.  Cannot see just Update2, no matter where Mom is.

More consistency semantics
  ONLY per record.  Alice's stuff must all be in same record.
  Nothing guaranteed across records (example if it was different records being updated)
    Mom can see just U2
  Can read stale data

Interesting choice
  Contrast with serializability:  Is this feasible for applications?
  Certainly while I'm updating and reading my profile, it's ok if I see a stale version of yours.
  OK for Mom to see old data, just shouldn't see pictures without permission change

How to do this?
  App server gets web request, needs to write data in PNUTS
  Need to update every site!
  Why not just have app logic send update to every site?
    What if app crashes after updating only some sites?
    What if concurrent updates to same record?

PNUTS has a "record master" for each record
  all updates must go through that site
    each record has a hidden column indicating site of record master
    draw key:  [ key attrs version RM   ]
  Cool part:  This can change online

Architecture of a single datacenter
  [diagram: Fill in: tablet ctlrs, routers, storage units, YMBs]
  Each table partitioned by key over storage units
    Table chunks called "tablets", correspond to a range of keys from table
      ~1GB tablet size
    Storage unit stores many tablets
    Tablet controller + routers know the partition plan
    Routers stateless, controller has real copy.  Primary/Backup like Lab2
  YMB is this other service, slightly mysterious
    In charge of:
      1. Logs writes to disk (commit point)
      2. Ordering
      3. Sending updates to other replicas
    Likely API pieces:
      Messages are addressed to topics (e.g., "/table7/tablet23")
      YMB keeps track of all clients, and the topics they subscribe to
      Clients have a list of pending messages; must explicitly ACK them
  How a write happens:
    responsible storage unit (master) executes updates one at a time per record
    tells MB to broadcast update to all sites

So the complete update story (some guesswork):
  app wants to update some columns of a record, knows key.  
  (Update Alice's record in VA, record master in SF; SI indicates site)
  1. app sends key and update to local SU1 (VA)
  2. SU1 looks up record master for key: SI2 (VA)
  3. SU1 sends update request to router at SI2 (VA->SF)
  4. router at SI2 forwards update to local SU2 for key (SF)
  6. SU2 sends update to local Message Broker (MB) (SF)
  7. MB stores on disk + backup MB, sends vers # to original app (SF->VA)
     [ how does MB know the vers #? maybe SU2 told it
       or perhaps SU2 (not MB) replies to original app ]
  8. MB sends update to router at every site (SF->VA)
  9. every site updates local copy (VA)

What about records that don't have a master yet (new inserts)?
  Can we insert in local datacenter?
  Problem: need to ensure we don't end up with two record w/ same key.
  Tablet master: initial master for all inserts.

Puzzles:
  paper says MB is commit point
  does SU2 perform update before or after talking to MB?
    what if SU2 crashes at a bad time?
    maybe they are serious that they never recover a crashed SU?
  who assigns version #, and when?
    if MB: how does it know the version#?
    if SU: what if tablet controller can't talk to SU, re-assigns tablet?
      old SU might still think it's a master.
      publishes MB updates with interleaved version #s.
  who replies to the app?

Writes seem like they'd be slow -- why does it make sense? (VA->SF->VA)
  Hopefully record master is local. 
    On average, 85% of a record's writes come from one site.  Pick the right site.
    PNUTS smart enough to move it around!
  MB distribution is async (app does not wait)
  down side: readers at non-master sites may see stale data

How does a read-only query execute?
  Multiple kinds of reads (section 2.2) so that **Application controls consistency semantics**
  how does each work?
  why is each needed?
  read-any(k)
    read from local SU
    might return stale data (even if you just wrote!)
    why: fast!  local!
  read-critical(k, required_version)
    maybe read from local SU if it has vers >= required_version
    otherwise read from master SU
    why: reflects what app has already seen or written; maybe fast
  read-latest(k)
    probably always read from master SU (? "if local copy too stale")
    slow if master is remote!
    why: app needs fresh data

Problem: What if you need to increment a counter stored in a record?
  Counter X.  SF trying to increment, Virgina as well.
  app reads old value, increments locally, writes new value
  what if the local read produced stale data?
  what if read was OK, but concurrent updates?

test-and-set-write(version#, new value) gives you atomic update to one record
  master rejects the write if current version # != version#
  so if concurrent updates, one will lost and retry 
    while(1):
      (x, ver) = read-latest(k)
      if(t-a-s-w(k, ver, x+1))
        break
  "Optimistic concurrency control" (OCC)
    Tentatively assume things will work out, check at the end
  Problem:  SF could get starved.  Virginia could just keep updating.  
    Describe how (Virginia constantly pre-empting, SF failing t-a-s)    

The Question
  how does PNUTS cope with Example 1 (page 2)
  Initially Alice's mother is in Alice's ACL, so mother can see photos
  1. Alice removes her mother from ACL
  2. Alice posts spring-break photos
  could her mother see update #2 but not update #1?
    really, could mother's app server see updates in wrong order
    esp if mother uses different site than Alice
  ACL and photo list must be in the same record
    since PNUTS guarantees order only for updates to same record
  Alice sends updates to her record's master site in order
    master site broadcasts via MB in order
    MB tells other sites to apply updates in order
  What if Alice's mother:
    reads the old ACL, that includes mother
    reads the new photo list
    answer: just one read of Alice's record, has both ACL and photo list
      if record doesn't have new ACL, order says it can't have new photos either

What if we want to track friends in a social application (e.g., Twitter)?
  Proposal 1: one record per user, has a list of followers in it
  Proposal 2: one record per user pair ("friendship"), named by (user1,user2)
  What's the difference?  Which design makes more sense?
    Paper proposes design 2.

What if we wanted to do bank transfers?
  from one account (record) to another
  can t-a-s-w be used for this?
    Seems tricky.  Not in the obvious way (with 2 records, Alice, Bob)
    not in any direct way (but maybe to update a log)
  nothing like 2pc for updating multiple records atomically
  multi-record updates are not atomic
    other readers can see intermediate state
    other writers are not locked out
    writer might crash after applying only part of an update
  multi-record reads are not atomic
    might read one account before xfer, other account after xfer

Difference here from Spanner.

Is lack of general transactions a problem for web applications?
  maybe not, if programmers know to expect it.  Seems ok for the Alice/Mom problem.

What about tolerating failures?
  want to keep going even if some parts are broken

The main players are
  app servers
  storage units
  YMB
  a record's master (owning site, SU)
  routers
  tablet controller

App server crashes midway through a set of updates
  not a transaction, so only some of writes will happen
  but master SU/MB either did or didn't get each write
    so each write happens at all sites, or none

SU crashes:  paper indicates it doesn't come back.
  Did it make it to YMB?
    If so, write went through.  But client possibly got an error.  
      It's the client's job to check, and deal with duplicate writes.
    If not, write didn't happen.
  How does the tablet controller re-assign tablets to other servers?
    Paper doesn't say
    Might need to be sure that SU is really down, not just partitioned,
      so it doesn't publish any more YMB updates for records it owns
  How does the new SU recover state?
    For each tablet:
      Subscribe to YMB messages for that tablet
      Find SU responsible for this tablet in another data center
      Ask it to send a copy of the state
      Publish a "checkpoint" message via YMB (in that other data center)
      Wait for remote replica to send state copy, up to "checkpoint" msg
      Resume applying YMB messages since the "checkpoint" message as usual

MB crashes after accepting update
  logs to disks on two MB server before ACKing
  recovery looks at log, (re)sends logged msgs
  record master may re-send an update if MB crash before ACK
    record version #s will allow SUs to ignore duplicate

MB is a neat idea
  atomic: updates all replicas, or none
    so failure of app srvrs isn't a problem
  reliable: keeps trying, to cope with temporarily SU/site failure
  async: apps don't have to wait for write to complete, good for WAN
  ordered: keeps replicas identical even w/ multiple writers

Record's master site loses network connection
  Can other sites designate a replacement RM?
    no: original RM may still be processing updates
    don't want *two* RMs!
  Do other sites have to wait indefinitely?
    Seems like other sites could continue to serve stale data for reads, and process writes for other data
  this is what the end of section 2.2 is about -- Dynamo envy

how to change record's master if no failures?
  e.g. Alice moves from SF->East Coast
  Update the record, via old master
    since ID of master site is stored in the record
  A few subsequent updates might go to the old master
    it will reject them, app retries and finds new master

Router failure
  Stateless
  Just start caching mappings from tablet controller after recovery

Tablet controller failure
  Primary/backup like lab2
  No discussion of what update requests it accepts (if any)

Evaluation focuses on latency and scaling

5.2: time for an insert while busy
  depends on how far away Record Master is
  RM local: 75.6 ms
  RM nearby: 131.5 ms
  RM other coast: 315.5 ms

what is 5.2 measuring? from what to what?
  maybe web server starts insert, to RM replies w/ new version?
  probably not time for MB to propagate to all sites
    since then local RM wouldn't be < remote

Why 75 ms?

Is it 75 ms of network speed-of-light delay?
  no: local

Is the 75 ms mostly queuing, waiting for other client's operations?
  no: they imply 100 clients was max that didn't cause delay to rise

End of 5.2 suggests 40 ms of 75 ms in in SU
  how could it take 40 ms?
    each key/value is one file?
    creating a file takes 3 disk writes (directory, inode, content)?
  what's the other 35 ms?
    YMB logging to disk?

But only 33 ms (not 75) for "ordered table" (MySQL/Innodb)
  closer to the one disk write we'd expect

5.3 / Figure 3: effect of increasing request rate
  what do we expect for graph w/ x-axis req rate, y-axis latency?
    system has some inherent capacity, e.g. total disk seeks/second
    for rates less than that, constant latency
    for rates higher than that, growing queues, divergent average latency
  blow-up should be at max capacity of h/w
    e.g. # disk arms / seek time
  we don't see that in Figure 3
    end of 5.3 says clients too slow
    at >= 75 ms/op, 300 clients -> about 4000/sec
  text says max possible rate was about 3000/second
    that's higher than 1300 from section 5.2 -- why?
    probably 5.3 has lots of reads as well as writes

Later papers / posts describe how PNUTS evolved
  http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6060790
  http://www.cs.georgetown.edu/~wzhou/publication/sherdoop-sigmod11.pdf
  https://developer.yahoo.com/blogs/ydn/sherpa-grows-scales-2011-50931.html

No papers about YMB, but it's similar to other pub-sub systems:
  https://cwiki.apache.org/confluence/display/BOOKKEEPER/HedWig

stepping back, what were PNUTS key design decisions?
  1. async replication
     fast reads, but stale
     fast writes if near master, but not visible for a while
  2. Per record timeline consistency
       relaxed consistency / no transactions
       App controls staleness
  3. primary-backup replication: sequence all writes thru master site
     pro: keeps replicas identical,
          enforces serial order on updates,
          easy to reason about
     con: no progress if master site disconnected

Next: Dynamo, a very different design
  async replication, but no master
  eventual consistency
  always allow updates
  tree of versions if network partitions
  readers must reconcile versions