6.824 2014 Lecture 4: FDS Case Study

Flat Datacenter Storage
Nightingale, Elson, Fan, Hofmann, Howell, Suzue
OSDI 2012

why are we looking at this paper?
  Lab 2 wants to be like this when it grows up
    though details are all different
  fantastic performance -- world record cluster sort
  good systems paper -- details from apps all the way to network

what is FDS?
  a cluster storage system
  stores giant blobs -- 128-bit ID, multi-megabyte content
  clients and servers connected by network with high bisection bandwidth
  for big-data processing (like MapReduce)
    cluster of 1000s of computers processing data in parallel

high-level design -- a common pattern
  lots of clients
  lots of storage servers ("tractservers")
  partition the data
  master ("metadata server") controls partitioning
  replica groups for reliability

why is this high-level design useful?
  1000s of disks of space
    store giant blobs, or many big blobs
  1000s of servers/disks/arms of parallel throughput
  can expand over time -- reconfiguration
  large pool of storage servers for instant replacement after failure

motivating app: MapReduce-style sort
  a mapper reads its split -- 1/Mth of the input file (e.g., a tract)
  map emits a <key, value> for each record in its split
  map partitions keys among R intermediate files
    (M*R intermediate files in total)
  a reducer reads 1 of the R intermediate files produced by each mapper
    so it reads M intermediate files (each of 1/R size)
  sorts its input
  produces 1/Rth of the final sorted output file (R blobs)

FDS sort
  FDS sort does not store the intermediate files in FDS
    a client is both a mapper and a reducer
  FDS sort is not locality-aware
    in MapReduce, the master schedules workers on machines that are close to the data
      e.g., in the same cluster
  later versions of FDS sort use more fine-grained work assignment
    e.g., a mapper doesn't get 1/Mth of the input file but something smaller
    deals better with stragglers

The abstract's main claims are about performance.
  They set the world record for disk-to-disk sorting in 2012 (MinuteSort)
    1,033 disks and 256 computers (136 tractservers, 120 clients)
    1,401 gigabytes in 59.4 seconds

Q: does the abstract's 2 GByte/sec per client seem impressive?
   how fast can you read a file from Athena AFS? (about 10 MB/sec)
   how fast can you read a typical hard drive?
   how fast can typical networks move data?

Q: the abstract claims recovery from a lost disk (92 GB) in 6.2 seconds
   that's 15 GByte / sec
   impressive?
   how is that even possible? that's 30x the speed of a disk!
   who might care about this metric?

what should we want to know from the paper?
  API?
  layout?
  finding data?
  adding a server?
  replication?
  failure handling?
  failure model?
  consistent reads/writes? (i.e. does a read see the latest write?)
  config mgr failure handling?
  good performance?
  useful for apps?

* API
  Figure 1
  128-bit blob IDs
  blobs have a length
  only whole-tract reads and writes -- 8 MB

Q: why are 128-bit blob IDs a nice interface?
   why not file names?

Q: why do 8 MB tracts make sense?
   (Figure 3...)

Q: what kinds of client applications is the API aimed at?
   and not aimed at?

* Layout: how do they spread data over the servers?
  Section 2.2
  break each blob into 8 MB tracts
  TLT maintained by the metadata server
    has n entries
    for blob b and tract t, i = (hash(b) + t) mod n
    TLT[i] contains the list of tractservers w/ a copy of the tract
  clients and servers all have copies of the latest TLT

  Example four-entry TLT with no replication:
    0: S1
    1: S2
    2: S3
    3: S4
  suppose hash(27) = 2
  then the tracts of blob 27 are laid out:
    S1: 2 6
    S2: 3 7
    S3: 0 4 8
    S4: 1 5 ...
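  A minimal sketch of the client-side lookup implied by this layout.
  The hash function, Go types, and names are stand-ins rather than the
  paper's code, so the assignment won't match the hash(27) = 2 example
  exactly, but the striping pattern is the same:

    package main

    import (
      "crypto/sha1"
      "encoding/binary"
      "fmt"
    )

    // A toy TLT with one tractserver per entry (no replication),
    // like the four-entry example above.
    var tlt = []string{"S1", "S2", "S3", "S4"}

    // hashBlob stands in for hashing the 128-bit blob GUID; any
    // stable hash gives the same striping behavior.
    func hashBlob(blob string) uint64 {
      sum := sha1.Sum([]byte(blob))
      return binary.BigEndian.Uint64(sum[:8])
    }

    // locate maps (blob, tract) to a tractserver:
    // i = (hash(b) + t) mod n, then look up TLT[i].
    func locate(blob string, tract uint64) string {
      i := (hashBlob(blob) + tract) % uint64(len(tlt))
      return tlt[i]
    }

    func main() {
      // consecutive tracts of one blob stripe across consecutive TLT entries
      for t := uint64(0); t < 9; t++ {
        fmt.Printf("blob 27, tract %d -> %s\n", t, locate("27", t))
      }
    }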
FDS is "striping" blobs over servers at tract granularity

Q: why have tracts at all? why not store each blob on just one server?
   what kinds of apps will benefit from striping?
   what kinds of apps won't?

Q: how fast will a client be able to read a single tract?

Q: where does the abstract's single-client 2 GB number come from?

Q: why not the UNIX i-node approach?
   store an array per blob, indexed by tract #, yielding the tractserver
   so you could make per-tract placement decisions
     e.g. write each new tract to the most lightly loaded server

Q: why not hash(b + t)?

Q: how many TLT entries should there be?
   how about n = number of tractservers?
   why do they claim this works badly? Section 2.2

The system needs to choose server pairs (or triplets &c) to put in TLT entries
  for replication
  Section 3.3

Q: how about
     0: S1 S2
     1: S2 S1
     2: S3 S4
     3: S4 S3
     ...
   Why is this a bad idea?
   How long will repair take?
   What are the risks if two servers fail?

Q: why is the paper's n^2 scheme better?
   TLT with n^2 entries, with every server pair occurring once
     0: S1 S2
     1: S1 S3
     2: S1 S4
     3: S2 S1
     4: S2 S3
     5: S2 S4
     ...
   How long will repair take?
   What are the risks if two servers fail?

Q: why do they actually use a minimum replication level of 3?
   same n^2 table as before, the third server is randomly chosen
   What effect on repair time?
   What effect on two servers failing?
   What if three disks fail?

* Adding a tractserver
  To increase the amount of disk space / parallel throughput
  Metadata server picks some random TLT entries
  Substitutes the new server for an existing server in those TLT entries

Q: how long will adding a tractserver take?

Q: what about client writes while tracts are being transferred?
   receiving tractserver may have copies from client(s) and from the old srvr
   how does it know which is newest?

Q: what if a client reads/writes but has an old tract table?

* Replication
  A writing client sends a copy to each tractserver in the TLT entry.
  A reading client asks one tractserver.

Q: why don't they send writes through a primary?

Q: what problems are they likely to have because of the lack of a primary?
   why weren't these problems show-stoppers?

* What happens after a tractserver fails?
  Metadata server stops getting heartbeat RPCs
  Picks a random replacement for each TLT entry the failed server was in
  New TLT gets a new version number
  Replacement servers fetch copies

  Example of the tracts each server holds:
    S1: 0 4 8 ...
    S2: 0 1 ...
    S3: 4 3 ...
    S4: 8 2 ...

Q: why not just pick one replacement server?

Q: how long will it take to copy all the tracts?

Q: if a tractserver's net breaks and is then repaired, might the srvr serve old data?

Q: if a server crashes and reboots with its disk intact, can the contents be used?
   e.g. if it only missed a few writes?
   3.2.1's "partial failure recovery"
   but won't it have already been replaced?
   how to know what writes it missed?

Q: when is it better to use 3.2.1's partial failure recovery?
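  A sketch of the metadata server's reaction to a lost tractserver,
  following the recipe above. The TLT struct, the repair method, and the
  single per-table version number are illustrative assumptions, not FDS's
  actual code:

    package main

    import (
      "fmt"
      "math/rand"
    )

    // A toy tract locator table: each entry lists the tractservers
    // replicating the tracts that map to it; the version changes
    // whenever membership changes.
    type TLT struct {
      entries [][]string
      version uint64
    }

    // repair substitutes a randomly chosen live server for every slot the
    // failed server occupied, then bumps the version. Because different
    // entries usually get different replacements, the re-replication
    // traffic is spread over many source/destination pairs and runs in
    // parallel. (Assumes more live servers than replicas per entry.)
    func (t *TLT) repair(failed string, live []string) {
      for i, entry := range t.entries {
        for j, s := range entry {
          if s != failed {
            continue
          }
          repl := live[rand.Intn(len(live))]
          for contains(entry, repl) { // don't duplicate an existing replica
            repl = live[rand.Intn(len(live))]
          }
          t.entries[i][j] = repl
          // the replacement server would now fetch copies of the affected
          // tracts from the entry's surviving replicas
        }
      }
      t.version++
    }

    func contains(xs []string, x string) bool {
      for _, v := range xs {
        if v == x {
          return true
        }
      }
      return false
    }

    func main() {
      t := &TLT{
        entries: [][]string{{"S1", "S2"}, {"S1", "S3"}, {"S2", "S3"}},
        version: 7,
      }
      t.repair("S1", []string{"S2", "S3", "S4", "S5"})
      fmt.Println(t.entries, "version", t.version)
    }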
* What happens when the metadata server crashes?

Q: while the metadata server is down, can the system proceed?

Q: is there a backup metadata server?

Q: how does a rebooted metadata server get a copy of the TLT?

Q: does their scheme seem correct?

* Random issues

Q: is the metadata server likely to be a bottleneck?

Q: why do they need the scrubber application mentioned in 2.3?
   why don't they delete the tracts when the blob is deleted?
   can a blob be written after it is deleted?

* Performance

Q: how do we know we're seeing "good" performance?
   what's the best you can expect?

Q: what's the limiting resource for 2 GB / second single-client?

Q: Figure 4a: why does it start low? why does it go up?
   why does it level off?
   why does it level off at that particular performance?

Q: Figure 4b shows random r/w as fast as sequential (Figure 4a).
   is this what you'd expect?

Q: why are writes slower than reads with replication in Figure 4c?

Q: where does the 92 GB in 6.2 seconds come from?
   Table 1, 4th column
   that's 15 GB / second, both read and written
   1000 disks, triple replicated, 128 servers?
   what's the limiting resource? disk? cpu? net?

How big is each sort bucket? i.e. is the sort of each bucket in-memory?
  1400 GB total
  128 compute servers
  between 12 and 96 GB of RAM each
  hmm, say 50 GB on average, so total RAM may be 6400 GB
  thus the sort of each bucket is in memory; it does not write passes to FDS
  thus total time is just four transfers of 1400 GB
    client limit: 128 * 2 GB/s = 256 GB / sec
    disk limit: 1000 * 50 MB/s = 50 GB / sec
  thus the bottleneck is likely to be disk throughput
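  The closing arithmetic as a tiny runnable sketch; the 2 GB/s-per-client
  and 50 MB/s-per-disk figures are the assumptions used in the notes above:

    package main

    import "fmt"

    func main() {
      // Back-of-the-envelope limits from the notes' numbers:
      // 128 compute clients at ~2 GB/s each vs ~1000 disks at ~50 MB/s each.
      clientLimit := 128 * 2.0          // GB/s aggregate across clients
      diskLimit := 1000 * 50.0 / 1000.0 // GB/s aggregate across disks

      fmt.Printf("client (network) limit: %v GB/s\n", clientLimit) // 256
      fmt.Printf("disk limit:             %v GB/s\n", diskLimit)   // 50
      // disk throughput is the smaller of the two, so it is the likely
      // bottleneck for the 1,400 GB sort
    }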