CS-522: Principles of Computer Systems (Fall 2019)
A modern computer system spans many layers: applications, libraries, operating systems, networks, and hardware devices. Building a good system entails making the right trade-offs (e.g., between performance, durability, and correctness) and understanding emergent behaviors—the difference between great system designers and average ones is that the really good ones make these trade-offs in a principled fashion, not by trial-and-error. In this course, we identify some of the key principles underlying successful systems, and learn how to solve problems in computing using ideas, techniques, and algorithms from operating systems, networks, databases, programming languages, and computer architecture. The basic courses on these topics teach how the elemental parts of modern systems work; POCS picks up where those courses leave off and focuses on how the pieces come together to form useful, efficient systems.
This course is targeted primarily at students who wish to acquire a deep understanding of computer system design or pursue research in systems. It is an intellectually challenging, fast-paced course, in which survival requires a good background in operating systems, databases, networking, programming languages, and computer architecture. Please see the syllabus for more information.
We assign readings for each principle we cover; these are typically classic Computer Systems papers that embody the principle and have stood the test of time. For those who need additional material to compensate for a lack of background, we recommend the textbook Principles of Computer System Design: An Introduction by J. H. Saltzer and M. F. Kaashoek, which is available in the EPFL library. Chapters 7-11 are also freely available online as a PDF. You can also get the Kindle version or a 20th-century-style paperback for your own library.
We hold in-class interactive sessions on Tuesdays 12:15-14:00 and Thursdays 13:15-15:00, both in INM10. Some of these sessions take the form of classic lectures, while others are recitations in which we discuss the week’s readings. The goal of all in-class sessions is to understand in depth the principle(s) of that week and how they connect to the concrete instantiations in the assigned readings.
We regularly assign 1-page writeups called one-pagers (OPs); they are due Wednesdays at 10pm. See details here.
Your grade in the class is broken down approximately as follows: 40% OPs, 40% final, 10% presentations, 10% participation and contribution to the in-class discussions.
POCS is a heavyweight course carrying 7 units of ECTS credit (according to the Conférence universitaire suisse, this means 210 learning hours/semester, i.e., 15 hours/week). This course is meant primarily for students who intend to pursue research in the area of systems, so you must have a solid systems background. One way to acquire this background is to take at least the following courses:
- CS-208: Computer architecture
- COM-208: Computer networks
- CS-320: Computer language processing
- CS-323: Introduction to operating systems
- CS-322: Introduction to database systems
Without a solid systems background, it is hard to succeed in POCS. If you wish to brave it out despite an incomplete background, please be ready to spend at least 2x more time than the other students in order to acquire, on the side, the necessary background on your own.
We encourage you to discuss the reading materials with your peers, but every assignment you turn in must be your own work. You are not permitted to discuss the topic of the OP with other students prior to your or their final submission. Cheating, plagiarism, and any form of dishonesty will be handled with maximum severity. If you are ever in doubt about whether an action on your part may constitute unacceptable collaboration, please ask the course staff before proceeding—doing so afterward is too late.
OP6 (due 12-Dec @ 22:00)
Human operator error is an important source of downtime in computer systems. One potential solution is to keep operators away from the systems as much as possible. Unfortunately, today’s software is unable to function on its own—it needs to be configured, recovered, upgraded, etc.
Imagine a world where most computing occurs in zero-administration environments: zero-operator data centers, wireless motes spread in the jungle, etc. An important requirement for software running on zero-administration compute nodes is that it be able to self-recover from failures. Identify one (or maybe two) fundamental properties that software will need to have in order to achieve this goal of being able to self-recover fully autonomously. Describe a concrete vision of how that property can be achieved.
Feel free to go beyond the material we covered in class. No fluff please!
OP5 (due 20-Nov @ 22:00)
System designers and administrators are constantly faced with the challenge of simultaneously achieving high performance and high dependability in the software they build or run. For example, a database can checkpoint more or less often, depending on what trade-offs the DBA wants to make between steady-state performance and availability (i.e., recovery performance).
Choose a component X of dependability (e.g., reliability, availability, security, safety) and describe a quantitative method that would serve a system architect in achieving the best possible mix of performance and X. Take a specific type of system (e.g., filesystem) as an example, and show what you would measure and how you would combine the results to evaluate whether one instance (e.g., ext4) of the system achieves a better mix than another instance (e.g., ZFS) of the same type of system.
Your analysis should be insightful, so please avoid stating the obvious: ideally, nobody else will have thought of your line of argument. Having one in-depth, brilliant idea is better than being comprehensive but banal. Feel free to go beyond the material we covered in class. Note that this OP is not about designing a system but rather about designing a method, or “thinking framework.”
OP4 (due 6-Nov @ 22:00)
In class we defined bandwidth flooding and discussed reactive solutions, which block flooding traffic after the attack has started and has been detected.
For this OP, propose a proactive solution, which prevents bandwidth flooding from happening in the first place. Describe the main components of your solution: who does what, how much state is needed, who stores that state, and (if there is room) what the incentives behind your solution are. Alternatively, you can argue that no proactive solution can exist.
You are free to propose a clean-slate solution, e.g., force all end-systems to use a new piece of hardware or a new TCP/IP stack. However, if you do that, you need to pay particular attention to the incentives (who would convince end-systems to deploy the new stuff and how).
OP3 (due 11-Oct @ 22:00)
For this OP, we ask you to produce a 2-page PDF, as follows:
On page 1: Consider Lampson’s global name service discussed in class. Describe in one paragraph (100 words or less) what kind of fault tolerance this system achieves and how, and what kind of data consistency this system provides to its users and how.
On page 2: Describe in one sentence the one key idea put forth by the Exokernel.
OP2 (due 4-Oct @ 22:00)
For this OP, we ask you to produce a 2-page PDF, as follows:
On page 1: Consider Lampson’s global name service discussed in class. Describe in your own words what kind of fault tolerance this system achieves and how, and what kind of data consistency this system provides to its users and how. (Please follow the standard rules for writing this.)
On page 2: Describe in one paragraph (100 words or less) the one key idea put forth by the Exokernel. Support the description of this idea as well as you can. Make sure the paragraph is self-contained.
OP1 (due 27-Sep @ 22:00)
Describe in your own words the one key idea put forth by the Exokernel. Support the description of this idea with points/facts from the Exokernel paper. Make it crystal clear what the key idea is. Please follow the rules below “to the letter.”
These are the rules for an OP submission. Not respecting these rules will be severely penalized.
Use a maximum of 500 words for the body of your writeup; in the words of Antoine de Saint-Exupéry, “perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.” OPs require in-depth consideration of the assigned materials, along with good technical writing. Spellcheck.
Your submission must be in PDF and consist of one single-spaced A4 page, including all figures, tables, and references. References should have complete citations at the bottom of the OP, which in turn are hyperlinked to electronic versions of the cited materials, whenever possible. The OP should be single-column and use Times Roman (or equivalent serif) font with 10-point type or greater. The title and references do not count toward the 500-word limit.
Submit your OP through the POCS submission system by the indicated due date and time. The submission’s title must be of the form “OPn: Title”, where n is the single-digit number of the OP (e.g., “OP1: Exokernel summary” for the first week’s OP). Do not write your name on the OP.
If you do not have an account on the system, you have to create one. Registration is straightforward: simply enter your email address and choose “I’m a new user and want to create an account using this email address”. The system will shortly send you an email with instructions on how to complete your registration. Once you register, you should fill in the “new paper” form.
You are not allowed to discuss the topic of the OP with anyone else until after the submission deadline. The OP must be 100% your own work.
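Before submitting, it may help to self-check the two rules that are easy to verify mechanically: the 500-word limit on the body and the “OPn: Title” form of the submission title. The following is a minimal sketch and not part of the POCS submission system. The function names `count_words` and `check_op` are hypothetical. It assumes that n ranges over 1-9 and that a "word" is any whitespace-separated token, so pass in only the body text, without the title or references.

```python
import re

MAX_BODY_WORDS = 500  # body limit stated in the rules; title and references excluded
# Assumption: "single-digit number" is taken to mean 1-9.
TITLE_PATTERN = re.compile(r"^OP[1-9]: \S.*$")


def count_words(body: str) -> int:
    """Count whitespace-separated tokens in the body text (an assumed definition of 'word')."""
    return len(body.split())


def check_op(title: str, body: str) -> list[str]:
    """Return a list of rule violations; an empty list means both checks pass."""
    problems = []
    if not TITLE_PATTERN.match(title):
        problems.append('title must have the form "OPn: Title"')
    words = count_words(body)
    if words > MAX_BODY_WORDS:
        problems.append(f"body has {words} words; the limit is {MAX_BODY_WORDS}")
    return problems


if __name__ == "__main__":
    # A well-formed title with a body exactly at the limit passes both checks.
    print(check_op("OP1: Exokernel summary", "word " * 500))
    # A title without the "OPn: " prefix and an over-long body fail both checks.
    print(check_op("Exokernel summary", "word " * 501))
```

The sketch does not check the remaining rules (PDF format, single A4 page, font, anonymity); those still need to be verified by hand.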