6.824 2013 Lecture 1: Introduction and lab overview

6.824: Distributed Systems Engineering

What is a distributed system?
  multiple networked cooperating computers
  Examples: Internet E-Mail, Athena file server, Google MapReduce, Dropbox, etc.

Why distribute?
  to connect physically separate entities
  to achieve security via physical isolation
  to tolerate faults via replication at separate sites
  to increase performance via parallel CPUs/mem/disk/net
But: complex, hard to debug
  new classes of problems, e.g. partial failure (did he accept my e-mail?)
  Lamport: A distributed system is one in which the failure of a computer
    you didn't even know existed can render your own computer unusable.
Advice: don't distribute if a central system will work

Why take this course?
  interesting -- hard problems, non-obvious solutions
  active research area -- lots of progress + big unsolved problems
  used by real systems -- unlike 10 years ago
    driven by the rise of big Web sites
  hands-on -- you'll build a real system in the labs

COURSE STRUCTURE

http://pdos.csail.mit.edu/6.824

Course components:
  Lectures about big ideas, papers, labs
  Readings: research papers as case studies
    please read papers before class; otherwise lectures are boring,
      and you can't pick the material up just by listening
    paper for today: the MapReduce paper
    each paper has a question for you to answer and one for you to ask (see web site)
    submit question & answer before class, one or two paragraphs
  Mid-term quiz in class, and final exam
  Labs: build increasingly sophisticated fault-tolerant services
    first lab is due on Monday
  Project: design and build a distributed system of your choice,
    or the system we pose in the last month of the course
    teams of two or three
    project meetings with course staff
    demo in last class meeting

Christopher Johnson, Yandong Mao, and Jonathan Perry are the TAs; office hours on the web.

MAIN TOPICS

Example: a shared file system, so users can cooperate, like Dropbox
  but this lecture isn't about Dropbox specifically; it's just an example
  goal: get a feel for distributed-system problems
  lots of client computers
  [diagram: clients, network, vague set of servers]

Topic: architecture
  Choice of interfaces
    Monolithic file server?
    Block server(s) -> FS logic in clients?
    Separate naming + file servers?
    Separate FS + block servers?
  Single machine room or unified wide-area system?
    Wide-area is dramatically more difficult.
  Client/server or peer-to-peer?
  Interacts w/ performance, security, fault behavior.

Topic: implementation
  How do clients/servers communicate?
    Direct network communication is pretty painful
    Want to hide network stuff from application logic
  Most systems organize distribution with some structuring framework(s)
    RPC, RMI, DSM, MapReduce, &c

Topic: performance
  Distribution can hurt: network b/w and latency bottlenecks
    Lots of tricks, e.g. caching, threaded servers
  Distribution can help: parallelism, pick server near client
  Idea: scalable design
    Nx servers -> Nx total performance
  Need a way to divide the load by N == divide the state by N
    Split by user
    Split by file name
    "Sharding" or "partitioning" (see the Go sketch after this topic)
  Rarely perfect -> only scales so far
    Global operations, e.g. search
    Load imbalance
      One very active user
      One very popular file
      -> one server at 100%, added servers mostly idle
      -> Nx servers -> 1x performance
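A minimal Go sketch of key-based sharding, to make "divide the state by N" concrete.
The server list and the shardFor helper are hypothetical (not from the labs); a real
design would also have to handle servers joining, servers failing, and hot keys.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // servers is a hypothetical fixed list of storage servers.
    var servers = []string{"srv0:8000", "srv1:8000", "srv2:8000"}

    // shardFor maps a file name to one server by hashing the key.
    // Every client that uses the same hash function and server list
    // agrees on the placement, so no central lookup is needed.
    func shardFor(fileName string) string {
        h := fnv.New32a()
        h.Write([]byte(fileName))
        return servers[h.Sum32()%uint32(len(servers))]
    }

    func main() {
        for _, f := range []string{"grades.txt", "notes.txt", "lab1.go"} {
            fmt.Println(f, "->", shardFor(f))
        }
    }

Note how a single very popular file still lands on just one server -- the load-imbalance
problem above.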
Topic: fault tolerance
  Dropbox: ~10,000 servers; some fail
    http://www.datacenterknowledge.com/archives/2013/10/23/how-dropbox-stores-stuff-for-200-million-users/
  Can I use my files if there's a failure?
    Some part of the network, some set of servers
  Maybe: replicate the data on multiple servers
    Perhaps the client sends every operation to both
    Maybe it only needs to wait for one reply
  Opportunity: operate from two "replicas" independently if partitioned?
  Opportunity: can 2 servers yield 2x availability AND 2x performance?

Topic: consistency
  == contract w/ apps/users about the meaning of operations
    e.g. "read yields most recently written value"
  hard due to partial failure, replication/caching, concurrency
  Problem: keep replicas identical
    If one is down, it will miss operations
      Must be brought up to date after reboot
    If the net is broken, *both* replicas may be live, and see different ops
      Delete a file, still visible via the other replica
      "split brain" -- usually bad
  Problem: clients may see updates in different orders
    Due to caching or replication
    I make grades.txt unreadable, then the TA writes grades to it
    What if the operations run in different orders on different replicas?
  Consistency often hurts performance (communication, blocking)
    Many systems cut corners -- "relaxed consistency"
    Shifts the burden to applications

LABS

Focus: fault tolerance and consistency -- central to distributed systems
  lab 1: MapReduce
  labs 2/3/4: storage servers
    progressively more sophisticated (tolerate more kinds of faults)
    progressively harder too!
    patterned after real systems, e.g. MongoDB
    Lab 4 has the core of a real-world design for 1000s of servers

What you'll learn from the labs
  easy to listen to a lecture / read a paper and think you understand
    building forces you to really understand
    "I hear and I forget, I see and I remember, I do and I understand" (Confucius?)
  you'll have to do some design yourself
    we supply skeleton, requirements, and tests
    but we leave you substantial scope to solve problems your own way
  you'll get experience debugging distributed systems
    test cases simulate failure scenarios
    tricky to debug: concurrency and failures
      many clients and servers operating in parallel
      test cases make servers fail at the "most" inopportune time
  think first before starting to code!
    otherwise your solution will be a mess, and/or it will take you a lot of time
  code review
    learn from others
    judge other solutions

We've tried to ensure that the hard problems have to do w/ distributed systems,
  not e.g. fighting against the language, libraries, &c
  thus Go (type-safe, garbage collected, slick RPC library)
  thus fairly simple services (MapReduce, key/value store)

Lab 1: MapReduce
  helps you get up to speed on Go and distributed programming
  first exposure to some fault tolerance
    motivation for better fault tolerance in later labs
  motivating app for many papers
  popular distributed programming framework
    many descendant frameworks

Computational model
  aimed at document processing
  split doc -> run Map(k1, v1) on each split
    -> run Reduce(k2, list(v2)) on each partition -> list(v2)
    -> merge results
  you write a map function and a reduce function
  the framework takes care of parallelism, distribution, and fault tolerance
  what computations are not targeted?
    anything that updates a document

Example: grep (see the Go sketch after this example)
  map phase
    master splits the input into M partitions
    calls Map on each partition
      map(partition) -> list(k1, v1)
      search the partition for the word
      produce a list with one item if the word shows up, nil if not
    partition results among R reducers
  reduce phase
    each Reduce job collects 1/R of the output from each Map job
      all map jobs have completed!
    reduce(k1, v1) -> v2
      identity function: v1 in, v1 out
  merge phase: master merges the R outputs
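A minimal Go sketch of what grep's Map and Reduce might look like, run sequentially on
one split. The KeyValue type, the function signatures, and the hard-coded search word
are illustrative assumptions, not the lab's actual interfaces; this version emits one
pair per matching line rather than one item per partition, but the idea is the same.

    package main

    import (
        "fmt"
        "strings"
    )

    // KeyValue is a hypothetical intermediate pair type; the lab defines its own.
    type KeyValue struct {
        Key   string
        Value string
    }

    // Map for grep: emit one pair per line of the split that contains
    // the target word, keyed by the word itself.
    func Map(split string) []KeyValue {
        const word = "needle" // hypothetical search term
        var out []KeyValue
        for _, line := range strings.Split(split, "\n") {
            if strings.Contains(line, word) {
                out = append(out, KeyValue{Key: word, Value: line})
            }
        }
        return out
    }

    // Reduce for grep is essentially the identity: it just passes the
    // matching lines through (here joined into one output string).
    func Reduce(key string, values []string) string {
        return strings.Join(values, "\n")
    }

    func main() {
        kvs := Map("no match here\na needle in a haystack\nstill nothing")
        var lines []string
        for _, kv := range kvs {
            lines = append(lines, kv.Value)
        }
        fmt.Println(Reduce("needle", lines))
    }

Both functions are deterministic, which is what makes re-executing a failed or duplicated
job safe (see the fault-tolerance discussion below).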
Performance
  number of jobs: M map jobs, R reduce jobs
  how much speedup do we get on N machines?
    ideal: Nx
    bottlenecks:
      stragglers
      network to collect a Reduce partition
      network to interact with the FS
      disk I/O

Fault-tolerance model
  the master is not fault tolerant
    assumption: this single machine won't fail while running a MapReduce app
  but there are many workers, so we have to handle their failures
  assumption: workers are fail-stop
    they fail and stop (e.g., don't send garbled weird packets after a failure)
    they may reboot

What kinds of faults might we want to tolerate?
  network:
    lost packets
    duplicated packets
    temporary network failure
    server disconnected
    network partitioned
  server:
    server crash+restart (master versus worker?)
    server fails permanently (master versus worker?)
    all servers fail simultaneously -- power/earthquake
    bad case: crash mid-way through a complex operation
      what happens if we fail in the middle of a map or reduce?
  bugs -- but not in this course
    what happens when there's a bug in map or reduce?
    same bug in Map over and over?
    management software kills the app
  malice -- but not in this course

Tools for dealing with faults?
  retry -- e.g. if a pkt is lost, or a server crashes+restarts
    packets (TCP) and MR jobs
    may execute an MR job twice!
  replicate -- e.g. if one server or part of the net has failed
    next labs
  replace -- for long-term health
    e.g., a worker

Retry jobs
  network failure: oops, execute the job twice
    ok for MapReduce, because map/reduce produces the same output
    map/reduce are "functional" or "deterministic"
    how about intermediate files? atomic rename
  worker failure: may have executed the job or not
    so, we may execute a job more than once!
    but ok for MapReduce, as long as the Map and Reduce functions are deterministic
    what would make Map or Reduce not deterministic?
  is executing a request twice in general ok?
    no. in fact, often not.
    unhappy customer if you execute one credit card transaction several times

Adding servers
  easy in MapReduce -- just tell the master
  hard in general
    server may have lost state (need to get new state)
    server may have rebooted quickly
      may need to recognize that to bring the server up to date
    server may have a new role after reboot (e.g., not the primary)
  you would have to deal with these harder issues to make the master fault tolerant
    topic of later labs

The lab 1 app (see main/wc.go)
  stubs for Map and Reduce
  you fill them out to implement word count (wc)
  how would you write grep?

The lab 1 sequential implementation (see mapreduce/mapreduce.go)
  demo: run wc.go
  code walkthrough: start with RunSingle()

The lab 1 worker (see mapreduce/worker.go)
  the remote procedure calls (RPCs)
    arguments and replies (see mapreduce/common.go)
  Server side of RPC
    RPC handlers have a particular signature
      DoJob
      Shutdown
    RunWorker
      rpcs.Register: register named handlers -- so Call() can find them
      Listen: create the socket on which to listen for RPC requests
        for a distributed implementation, replace "unix" w/ "tcp"
        and replace "me" with a host:port name
      ServeConn: runs in a separate thread (why?)
        serve RPCs concurrently
        an RPC may block
  Client side of RPC
    Register()
    call() (see common.go)
      makes an RPC (see the sketch at the end of these notes)
      the lab code dials for each request
        typical code uses one network connection for several requests
        but real code must be prepared to redial anyway:
          a network connection failure doesn't imply a server failure!
        we also do this to introduce failure scenarios easily
          intermittent network failures
          just losing the reply, but not the request

The lab 1 master (see mapreduce/master.go)
  You write it
  You will have to deal with distributing jobs
  You will have to deal with worker failures
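To make the client side of RPC concrete, here is a minimal, self-contained Go sketch
using the standard net/rpc package: it dials for each request, issues one call, and
reports success or failure. The DoJobArgs/DoJobReply structs, the socket path, and the
call helper are illustrative stand-ins, not the actual code in mapreduce/common.go.

    package main

    import (
        "fmt"
        "net/rpc"
    )

    // Hypothetical argument/reply structs, standing in for those in common.go.
    type DoJobArgs struct {
        File      string
        JobNumber int
    }
    type DoJobReply struct {
        OK bool
    }

    // call dials the worker for each request, issues one RPC, and reports
    // whether it succeeded.  Returning false only means the RPC failed;
    // the worker itself may still be alive, so the caller must decide
    // whether to retry or to hand the job to another worker.
    func call(addr, rpcName string, args interface{}, reply interface{}) bool {
        c, err := rpc.Dial("unix", addr) // Unix-domain socket; use "tcp" for a real network
        if err != nil {
            return false
        }
        defer c.Close()
        return c.Call(rpcName, args, reply) == nil
    }

    func main() {
        args := &DoJobArgs{File: "input.txt", JobNumber: 0}
        var reply DoJobReply
        ok := call("/tmp/mr-worker-0", "Worker.DoJob", args, &reply)
        fmt.Println("RPC succeeded:", ok)
    }

Because a failed call can mean a dead worker, a slow worker, or just a lost reply, the
master has to treat "RPC failed" carefully -- exactly the ambiguity the labs make you
confront.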