CSx55: Distributed Systems


Schedule

Last updated on Tuesday, December 2, 2025 3:51 PM
Professor

Shrideep Pallickara
Office: Room 364, CS Building
Office Hours: 1:00-2:00 pm Friday [in-person and via Zoom]
E-mail: compsci_csx55 {aT} colostate.edu (with the obvious change)
Tel: 970.492.4209

Lecture Coordinates

TTh: 12:30 - 1:45 pm
Wednesdays: 11:00-11:59 am
[CSB-130]

Graduate Teaching Assistants

Gabriele Maurina

Videep Venkatesha

Jackson Volesky
E-mail: compsci_csx55 {aT} colostate.edu
(with the obvious change)

Readings will be based on the following textbooks.

[TvS] Distributed Systems: Principles and Paradigms. Andrew S. Tanenbaum and Maarten van Steen. 3rd Edition. CreateSpace. ISBN: 978-1530281756.
[CDKB] Distributed Systems: Concepts and Design. George Coulouris, Jean Dollimore, Tim Kindberg, and Gordon Blair. 5th Edition. Addison-Wesley. ISBN: 978-0132143011.
[KS] Distributed Computing: Principles, Algorithms, and Systems. Ajay Kshemkalyani and Mukesh Singhal. 1st Edition. Cambridge University Press. ISBN: 0521876346/978-0521876346.
[GPB] Java Concurrency in Practice. Brian Goetz, Tim Peierls, Joshua Bloch, Joseph Bowbeer, David Holmes, and Doug Lea. Addison-Wesley Professional. ISBN: 0321349601/978-0321349606.
[OW] Java Threads. Scott Oaks and Henry Wong. 3rd Edition. O'Reilly Press. ISBN: 0-596-00782-5/978-0-596-00782-9.
[TW] Hadoop: The Definitive Guide. Tom White. 3rd Edition. Early Access Release. O'Reilly Press. ISBN: 978-1-449-31152-0.
[KKWZ] Learning Spark: Lightning-Fast Big Data Analysis. Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. 1st Edition. O'Reilly. 2015. ISBN-13: 978-1449358624.
[KW] High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark. Holden Karau and Rachel Warren. O'Reilly Media. 2017. ISBN-13: 978-1491943205.
[NL] Distributed Algorithms. Nancy Lynch. 1st Edition. Morgan Kaufmann. ISBN: 1558603484/978-1558603486.
[GR] Cloud Application Architectures: Building Applications and Infrastructure in the Cloud. George Reese. 1st Edition. O'Reilly. ISBN: 0596156367/978-0596156367.
[PD] Computer Networks: A Systems Approach. Larry Peterson and Bruce Davie. 4th Edition. Morgan Kaufmann. ISBN: 978-0-12-370548-8.
[FS] Practical Cryptography. Niels Ferguson and Bruce Schneier. 1st Edition. Wiley Publishing. ISBN: 0-471-22894-X/0-471-22357-3.
[WS] Cryptography and Network Security: Principles and Practice. William Stallings. 5th Edition. Prentice Hall. ISBN: 0136097049/978-0136097044.
[RR] Unix Systems Programming. Kay Robbins and Steve Robbins. 2nd Edition. Prentice Hall. ISBN: 978-0-13-042411-2.
[SGG] Operating System Concepts. Avi Silberschatz, Peter Galvin, and Greg Gagne. 8th Edition. John Wiley & Sons, Inc. ISBN-13: 978-0-470-12872-5.
   
Introduction [CS455 and CS555] References and HW
This module introduces students to the course, logistics, and the set of topics that are to be covered.

[HW1 released 08/25]
 

Objectives:
[1] Clarify logistics
[2] Clarify expectations for the CS455 and CS555 variants of the course.

 
08/26



Lecture 1
   
Threads, Thread Safety, and Concurrent Programming [CS455 and CS555] Readings

This module introduces the core ideas of threads, thread safety, and concurrent programming. It begins with a comparison of threads and processes, followed by an exploration of the thread lifecycle: how stacks and heaps are used, and how threads are created and managed. The focus then shifts to the challenges of concurrency. Topics include data synchronization, race conditions, intrinsic locks, and reentrancy. Compound actions, object sharing and confinement, and the preservation of multivariable invariants are examined as essential ingredients of thread safety.

Building on these foundations, we then turn to techniques for writing correct and reliable concurrent code. This includes making classes thread-safe, extending thread-safe classes with new functionality, and working with synchronized and concurrent collections. Different locking strategies are also analyzed, highlighting their advantages and trade-offs.

Ultimately, the goal of this module is to provide a clear framework for understanding both the mechanics and the principles that govern programs running many tasks at the same time.
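To make the idea of a compound action concrete, here is a minimal, illustrative sketch (the HitCounter class and its fields are hypothetical, not taken from the lecture materials): two fields change together under one intrinsic lock, so the multivariable invariant survives concurrent updates.

// Illustrative sketch: guarding a compound action so that a multivariable
// invariant survives concurrent access.
public class HitCounter {
    // Invariant: lastValue always holds the most recently returned count.
    private long count = 0;
    private long lastValue = 0;

    // count++ is a read-modify-write compound action; without the lock,
    // two threads can interleave, lose updates, and break the invariant.
    public synchronized long increment() {
        count++;              // guarded by the object's intrinsic lock
        lastValue = count;    // both variables change under the same lock
        return lastValue;
    }

    public synchronized long last() {
        return lastValue;
    }

    public static void main(String[] args) throws InterruptedException {
        HitCounter counter = new HitCounter();
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) counter.increment();
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println(counter.last()); // 200000 every run with the lock in place
    }
}

Dropping the synchronized keyword lets the two threads interleave their read-modify-write steps, which is exactly the kind of race condition this module teaches you to diagnose.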

[OW] Ch {1, 2, 3, 4}
[SGG] Ch {4}
[GPB] Ch {5, 11}

  Objectives:
  1. Design thread-safe programs that scale and leverage concurrency.
  2. Reason about program correctness in multi-threaded settings.
  3. Design scalable server-side programs.
  4. Implement basic synchronization mechanisms (e.g., locks, synchronized collections) to manage shared resources.
  5. Diagnose concurrency bugs by tracing execution and identifying violations of invariants.
  6. Compare and critique different locking and synchronization strategies in terms of scalability, fairness, and performance.
  7. Develop novel concurrency solutions that balance safety, performance, and maintainability in complex systems.
 
08/27

08/28

09/02

09/03

09/04

09/09

09/10

09/11



Lecture 2

Lecture 3

Lecture 4

Lecture 5

Lecture 6


Lecture 7

Lecture 8

Lecture 9
CS455 HW2 (09/10)


CS555 HW2 (09/10)
   
Architectures & Topology [CS455 and CS555] Readings
This module examines the architectural styles used in designing complex systems. It covers layered architectures, object-oriented designs, data-centric models, and event-driven approaches; we will highlight the strengths and trade-offs of each. We then pivot to system topologies, i.e., the patterns by which components are connected, and the ways in which those choices affect throughput, scalability, fault tolerance, resiliency, and latency. This module concludes with a focus on scaling itself: what scaling means in practice, how scalable systems are designed, and how scale can be measured and evaluated.

[TvS] Ch {6}
[CDKB] Ch {15}
[KS] Ch {9}

[TvS] Ch {2}
  Objectives:
  1. Explain the role topologies play in scaling and failure recovery.
  2. Identify differences between topologies based on random graphs, regular graphs, power law, and small-world networks.
  3. Describe how architectural styles (layered, object-based, data-centric, event-driven) influence system design.
  4. Compare and critique scaling strategies while highlighting trade-offs between performance, fault tolerance, and complexity.

  09/16

09/17

09/18

Lecture 10

Lecture 11

Lecture 12

   
Peer-to-Peer Systems & Distributed Hashtables [CS455 and CS555]  

This module examines the fascinating world of peer-to-peer (P2P) systems where every machine is a peer that can act both as a client and a server. We will begin by laying out the core characteristics of P2P systems and tracing their evolution across generations. We will identify the middleware and requirements that make such systems possible. Next, we will compare and contrast structured and unstructured approaches. Structured systems (like Chord, Pastry, and Tapestry) use carefully designed algorithms to organize peers and guarantee efficient lookups. Unstructured systems (like Napster and Gnutella) rely on looser, more ad hoc connections. BitTorrent, with its swarming downloads, shows how these ideas can power real-world applications at scale. We will explore the time and space complexity of P2P algorithms, understanding the trade-offs that emerge when scaling to millions of peers. Another way to think of these exemplar systems is not just as case studies, but as landmarks in the evolution of distributed thinking.
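As a rough illustration of the structured approach, the sketch below (class and peer names are made up) shows the placement rule that Chord-like systems build on: keys and node IDs live on one identifier ring, and a key belongs to its clockwise successor. Real systems add per-node finger tables so this lookup resolves in O(log N) routing hops rather than a single local map query.

import java.util.TreeMap;

// Illustrative sketch of the placement rule behind structured P2P systems such as
// Chord: node IDs and keys share one identifier ring, and a key is stored on the
// first node whose ID is >= the key's ID (wrapping around).
public class RingLookup {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int idSpace;              // size of the identifier space (2^m)

    public RingLookup(int idSpace) { this.idSpace = idSpace; }

    public void addNode(String name) {
        ring.put(hash(name), name);
    }

    // Successor of a key: first node clockwise from the key's position.
    public String lookup(String key) {
        int id = hash(key);
        Integer nodeId = ring.ceilingKey(id);
        if (nodeId == null) nodeId = ring.firstKey();   // wrap around the ring
        return ring.get(nodeId);
    }

    private int hash(String s) {
        return Math.floorMod(s.hashCode(), idSpace);    // stand-in for SHA-1 in real DHTs
    }

    public static void main(String[] args) {
        RingLookup dht = new RingLookup(1 << 16);
        dht.addNode("peerA"); dht.addNode("peerB"); dht.addNode("peerC");
        System.out.println("cs555-notes.pdf stored at " + dht.lookup("cs555-notes.pdf"));
    }
}

Adding or removing a node only moves the keys between that node and its neighbors, which is why consistent hashing scales so gracefully compared to rehashing everything.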

[TvS] Chap {5}
[CDKB] Chap {7, 10}
[KS] Chap {18}
[GPB] Chap {1,2,11}



CS555 HW3 (10/29)
 

Objectives:

  1. Explain structured and unstructured P2P systems
  2. Explain bounded routing in structured P2P systems
  3. Analyze the trade-offs between structured and unstructured P2P systems in terms of scalability, fault tolerance, and lookup efficiency.
  4. Evaluate exemplar P2P systems (e.g., Chord, Pastry, BitTorrent) and justify design choices in terms of routing complexity and performance.
  5. Build systems based on structured P2P concepts

 
09/23

09/24

09/25

09/30

Lecture 13

Lecture 14

Lecture 15

Lecture 16


   
Programming models for Cloud Computing: MapReduce [CS455 and CS555]  
This module introduces MapReduce, the programming model that reshaped how we think about large-scale data processing. We will start by placing it in context, comparing it with other paradigms such as relational databases, high-performance computing, grid computing, and even volunteer computing projects like SETI@home. Each offers its own methodology for distributing work. But MapReduce brings something distinctive: a simple way to scale computations across thousands of machines.

At the heart of MapReduce lies its core architectural framework. Instead of moving massive datasets across a network, we push computations to where the data lives; this preserves locality and avoids costly transfers. Programs are expressed in terms of two high-level functions: (1) Map, which breaks the problem into pieces, and (2) Reduce, which gathers and summarizes the results. We will also see how the system orchestrates tasks behind the scenes: handling failures, balancing loads, and coordinating across machines. Along the way, we will explore the roles of partitioning functions, refinements, and combiners, which allow MapReduce programs to run more efficiently and at greater scale.

By the end of this module, my hope is that you will not just know how MapReduce works but also be able to think with it: to recognize when it is the right tool, and to design programs that scale gracefully as the data grows.
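To fix the shape of the model before we meet Hadoop, here is a toy, single-machine sketch (class and method names are illustrative): map() emits (word, 1) pairs, an explicit grouping step plays the role of the shuffle, and reduce() sums each group.

import java.util.*;

// Conceptual sketch only (no cluster, no framework): the shape of a MapReduce
// word count. map() emits (word, 1) pairs; the "shuffle" groups pairs by key;
// reduce() sums each group. A real runtime distributes these steps across machines.
public class TinyMapReduce {
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("the cat sat", "the cat ran");

        // Shuffle: group intermediate values by key, which is what the framework
        // does between the map and reduce phases.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> p : map(line)) {
                groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
            }
        }
        groups.forEach((word, counts) ->
            System.out.println(word + "\t" + reduce(word, counts)));  // cat 2, ran 1, sat 1, the 2
    }
}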

[MapReduce-Paper]


CS455 HW3 (10/29)
 

Objectives:

  1. Explain the differences between MapReduce and other paradigms (RDBMS, HPC, Grid, volunteer computing).
  2. Express computations using the Map and Reduce functions.
  3. Apply partitioning, combiner, and refinement functions to improve performance.
  4. Analyze the trade-offs of pushing computation to the data versus moving data to the computation.
  5. Design systems that leverage the MapReduce framework to scale gracefully with growing datasets.
 
10/1

10/2

Lecture 17

Lecture 18

   
Hadoop MapReduce and HDFS [CS455 and CS555]  
This module introduces the Hadoop ecosystem, the open-source framework that brought Google’s ideas about MapReduce and distributed storage into the wider world. We will see how Hadoop weaves together two central components: (1) MapReduce, the programming model that lets you process massive datasets in parallel, and (2) the Hadoop Distributed File System (HDFS), which stores those datasets reliably across thousands of machines.

We will look at how to develop MapReduce programs using Hadoop and explore what happens under the hood when a job is configured, submitted, and run. We will trace the data flow through the system, understand how tasks are split across machines, and study the role of combiner functions, along with the conditions that combiners must satisfy to be valid. Finally, we will look at YARN, the resource manager that orchestrates Hadoop’s runtime environment, scheduling jobs and balancing workloads across the cluster. By the end of this module, you will see how all the moving parts (storage, computation, orchestration) work together to make Hadoop a powerful platform for big data.
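For a sense of what the programming interface looks like, below is a sketch along the lines of the canonical WordCount example for the org.apache.hadoop.mapreduce API (class names, paths, and the job name are placeholders). Note that the reducer can double as a combiner only because summation is associative and commutative, which is exactly the validity condition this module examines.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch in the style of the canonical Hadoop WordCount: a Mapper emits (word, 1),
// a Reducer sums the counts, and the same Reducer is reused as a combiner.
public class WordCount {
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) { word.set(token); context.write(word, ONE); }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // runs map-side to shrink shuffle traffic
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}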


[TW] Ch {1, 2}


 

Objectives:

  1. Express programs using MapReduce.
  2. Design systems that push computations to the data to preserve data locality.
  3. Design programs that leverage the Hadoop framework to scale.
  4. Understand the architecture and design principles of the Hadoop Distributed File System (HDFS).
  5. Analyze how job configuration, task splitting, and combiners affect performance.
  6. Evaluate the role of YARN in managing resources and scheduling jobs.
 
10/7

10/8

10/9

10/21

10/22




Lecture 19

Lecture 20

Lecture 21


Lecture 25

Lecture 26

   
Term Project Pitches and Midterm [CS455 and CS555]  




10/14

10/15

10/16






Midterm


 
   
   
Spark [CS455 and CS555]  
Spark grew out of the limitations of Hadoop MapReduce. While MapReduce was a breakthrough for processing voluminous datasets, it often meant reading and writing to disk between steps, slowing everything down. Spark reimagined this model by keeping much of the spirit of MapReduce but redesigning it to have the data live in memory. The result: interactive speeds, richer programming interfaces, and a framework that feels more like a language than a pipeline. We will start with the Spark software stack and its interactive shells, which let you experiment quickly with data. From there, we will explore Spark’s core abstraction, the Resilient Distributed Dataset (RDD); these are immutable collections of data that are spread across a cluster, built for fault tolerance and parallelism. We will look at how Spark uses lazy evaluation, only executing computations when results are actually needed.

As part of our discussions, we will distinguish between transformations and actions. We will also see how Pair RDDs enable key/value computations that echo MapReduce but with far more flexibility. We will look at dependencies, partitioning schemes, and the critical difference between narrow and wide transformations, which determines how data moves through the cluster. By the end of this module, my hope is that Spark will feel less like a black box and more like a fluent environment, one where: (1) you can keep data in memory, (2) chain transformations elegantly, and (3) scale from your laptop to a cluster without changing your code.
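As a small illustration, the sketch below (Spark 2.x-style Java API, local mode, with a hypothetical input.txt) chains transformations lazily and only runs when the collect action is invoked; reduceByKey is the wide transformation that forces a shuffle.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Illustrative sketch: transformations (flatMap, mapToPair, reduceByKey) only build
// a lineage graph; nothing executes until the action (collect) forces evaluation.
public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("input.txt");                  // transformation: lazy

        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // narrow
            .mapToPair(word -> new Tuple2<>(word, 1))                      // narrow
            .reduceByKey(Integer::sum);                                    // wide: shuffles by key

        counts.collect()                                                   // action: triggers the job
              .forEach(pair -> System.out.println(pair._1() + "\t" + pair._2()));

        sc.stop();
    }
}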


[KKWZ] Chap {1-4}
 

Objectives:

  1. Understand memory residency and its implications for performance.
  2. Express processing functionality concisely using lambda expressions.
  3. Differentiate between transformations and actions in Spark.
  4. Analyze narrow versus wide transformations and their overall impact on program execution.
  5. Design Spark programs that leverage RDDs and partitioning to scale efficiently.
 
10/23

10/28

10/29

10/30



Lecture 27

Lecture 28

Lecture 29

Lecture 30
   
Spark Streaming [CS555]  
In our discussions so far, Spark was still rooted in a batch model: you feed in a dataset, run transformations, and get results back. What if the data never stops arriving? That is the challenge Spark Streaming was designed to solve. Spark Streaming extends Spark’s architecture to handle continuous flows of data (examples of these include Twitter feeds, sensor readings, clickstreams) without abandoning the familiar Spark programming model. Instead of rewriting everything from scratch, Spark Streaming introduces the idea of micro-batching: slicing the data stream into small batches and processing each with the same abstractions we already know from Spark.

We will explore the architecture and abstractions behind the Spark Streaming model. In particular, we will trace how data flows through the system and how tasks are executed. We will examine both stateless and stateful transformations; stateful transformations allow information to persist across batches. We will also look at windowed operations, which let us zoom in on slices of time, and discuss the performance considerations that arise when results must be updated in near real-time. My hope is that by the end of this module, Spark Streaming will feel like a natural extension of Spark: the same building blocks, now applied to data that arrives continually.
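A minimal sketch of the micro-batching idea, assuming a text stream on a local socket (localhost:9999 is a placeholder source, and the Spark 2.x-style DStream API is assumed): one-second batches feed a 30-second window that slides every 10 seconds, reusing the pair operations we already know from batch Spark.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

// Illustrative sketch: the stream is cut into 1-second micro-batches, and a
// 30-second window sliding every 10 seconds is recomputed with the same
// pair operations used in batch Spark.
public class WindowedWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("streaming-wordcount").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        JavaPairDStream<String, Integer> windowedCounts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKeyAndWindow(Integer::sum, Durations.seconds(30), Durations.seconds(10));

        windowedCounts.print();   // stateless per window; updateStateByKey would carry state across batches
        jssc.start();
        jssc.awaitTermination();
    }
}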
[KKWZ] Chap {5,6}
 

Objectives:

  1. Explain the differences between traditional batch-style computations and streaming.
  2. Explain the role of micro-batching in Spark Streaming.
  3. Design programs that support incremental computation of results.
  4. Analyze the trade-offs between stateful and stateless transformations.
  5. Evaluate performance considerations in streaming applications.
 
11/4

11/5

Lecture 31

Lecture 32 [Example; not on the exam or quizzes]


   
Unstructured P2P Systems [CS555]
 
Unstructured P2P systems (like Napster and Gnutella) rely on looser, more ad hoc connections than their structured counterparts. We looked at Napster earlier; in this module we examine Gnutella, alongside the strategies used in unstructured P2P systems to scale "search".
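The sketch below is a toy model of Gnutella-style search (peer count, TTL, and the random overlay are invented for illustration): a query floods to neighbors with a time-to-live, and peers remember query IDs so duplicates die out. The TTL is what keeps an unstructured search from swamping the network, at the cost of possibly missing rare content.

import java.util.*;

// Toy sketch of flooding search in an unstructured overlay: forward a query to
// all neighbors, decrement the TTL, and drop queries already seen.
public class FloodingSearch {
    final int id;
    final List<FloodingSearch> neighbors = new ArrayList<>();
    final Set<UUID> seen = new HashSet<>();
    final Set<String> files = new HashSet<>();

    FloodingSearch(int id) { this.id = id; }

    void query(UUID queryId, String file, int ttl) {
        if (ttl == 0 || !seen.add(queryId)) return;    // TTL expired or duplicate query
        if (files.contains(file)) System.out.println("hit at peer " + id);
        for (FloodingSearch n : neighbors) n.query(queryId, file, ttl - 1);  // flood onward
    }

    public static void main(String[] args) {
        List<FloodingSearch> peers = new ArrayList<>();
        for (int i = 0; i < 20; i++) peers.add(new FloodingSearch(i));
        Random rnd = new Random(42);
        for (FloodingSearch p : peers)                 // random, unstructured overlay
            for (int k = 0; k < 3; k++) p.neighbors.add(peers.get(rnd.nextInt(peers.size())));
        peers.get(7).files.add("song.mp3");

        peers.get(0).query(UUID.randomUUID(), "song.mp3", 4);  // small TTL bounds the flood
    }
}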

 




11/6


Lecture 33

 
   


 
Time & Logical Clocks [CS555] [NS] Ch {3}
[JS] Ch {3}

In distributed systems, few aspects are more fundamental (or more slippery!) than time. On a single machine, we take it for granted that events can be ordered by the system "clock". But spread those events across thousands of machines, each with its own imperfect sense of time, and the problem becomes much trickier. Synchronizing clocks turns out to be one of the core challenges of modern distributed computing. In this module, we will explore how time is kept and shared in distributed environments. We will start with the most concrete anchors: global positioning systems (GPS) and their role in providing precise time signals. Next, we will move to time synchronization algorithms such as the Berkeley and Cristian algorithms, along with methods tailored for wireless settings where communication is much less reliable.

But physical time can only take us so far. To capture the order of events without relying on perfectly synchronized clocks, distributed systems introduced the idea of logical clocks. We will be looking at Lamport clocks, vector clocks, and even matrix clocks. Each of these offers a richer picture of causality in distributed computations. To summarize, distributed systems measure time in two ways: (1) the imperfect but practical signals of the physical world, and (2) the elegant constructs of logical clocks that let us reason about order even when the real clocks disagree.
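As a preview, here is a minimal sketch of Lamport's scalar clock rules (the class is illustrative): advance the clock before every local or send event, and on receive adopt the maximum of the local and received timestamps before advancing.

// Minimal sketch of Lamport's scalar clock. Vector clocks extend the same idea
// to one counter per process.
public class LamportClock {
    private long time = 0;

    // Rule 1: before a local event or a send, advance the clock and stamp the event.
    public synchronized long tick() {
        return ++time;
    }

    // Rule 2: on receive, adopt max(local, sender) and then advance.
    public synchronized long onReceive(long senderTimestamp) {
        time = Math.max(time, senderTimestamp);
        return ++time;
    }

    public static void main(String[] args) {
        LamportClock p1 = new LamportClock(), p2 = new LamportClock();
        long send = p1.tick();            // P1 stamps its send with 1
        long recv = p2.onReceive(send);   // P2 stamps the receive with max(0, 1) + 1 = 2
        System.out.println("send=" + send + ", receive=" + recv + "  (send happens-before receive)");
    }
}

Scalar clocks guarantee that causally related events get increasing timestamps, but not the converse; that gap is what vector and matrix clocks close.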

 

Objectives:

  1. Explain the importance of time synchronization in distributed systems.
  2. Describe the role of physical time sources (such as GPS) in providing synchronization.
  3. Explain how synchronization algorithms like Berkeley’s and Cristian’s operate, including their assumptions and limitations.
  4. Contrast the capabilities of different logical clocks: scalar, vector, and matrix.
  5. Analyze the trade-offs between physical and logical time in terms of accuracy, scalability, and system complexity.
  6. Given a set of choices, evaluate which clock model or synchronization approach is appropriate for a given distributed application.
 

11/11



Lecture 34
   
   
Replication, Consistency and Coherence [CS555] [TvS] Chap {7}

[Amazon-Consistency Paper]

Replication is one of the most powerful methods in distributed systems. By keeping multiple copies of data, we improve availability and performance. But replication brings its own challenges: (1) how do we make sure all the copies stay consistent, and (2) what trade-offs are we forced to accept? These questions go to the heart of distributed computing. In this module, we will explore the performance and correctness implications of replication and consistency. We will look at data-centric and client-centric models of consistency. We will also explore how different guarantees, such as sequential consistency, causal consistency, monotonic reads, monotonic writes, read-your-writes, and writes-follow-reads, shape the behavior of applications.

Next, we will dive into consistency protocols and explore how systems decide where to place replicas for efficiency and fault tolerance. We will also look at one of the most foundational concepts in distributed systems, Brewer’s CAP theorem, which formalizes the unavoidable trade-offs between consistency, availability, and partition tolerance, and consider its implications for system design. The CAP theorem underpins the core design decisions in large-scale distributed systems. Hopefully, once we have wrapped up this module, replication will no longer seem like a simple knob to turn up or down. My hope is that you will understand replication/consistency as a balancing act between correctness, speed, and resilience: a balance that every distributed system must strike.
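To make one client-centric guarantee concrete, the sketch below (Replica, Client, and the version counters are hypothetical simplifications, not any specific system) enforces read-your-writes by having the client remember the version of its last write and refuse answers from replicas that have not yet caught up.

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch of the client-centric "read your writes" guarantee: the
// client tracks the version of its last write and rejects stale replicas.
public class ReadYourWrites {
    // Each replica applies writes asynchronously; 'version' says how far it has caught up.
    static class Replica {
        final Map<String, String> data = new HashMap<>();
        long version = 0;
        void apply(String key, String value, long newVersion) {
            data.put(key, value);
            version = newVersion;
        }
    }

    static class Client {
        long lastWriteVersion = 0;

        long write(Replica primary, String key, String value) {
            lastWriteVersion = primary.version + 1;
            primary.apply(key, value, lastWriteVersion);
            return lastWriteVersion;
        }

        // A read is accepted only from a replica that has seen this client's latest write.
        Optional<String> read(Replica replica, String key) {
            if (replica.version < lastWriteVersion) return Optional.empty(); // too stale: retry elsewhere
            return Optional.ofNullable(replica.data.get(key));
        }
    }

    public static void main(String[] args) {
        Replica primary = new Replica(), laggingCopy = new Replica();
        Client client = new Client();
        client.write(primary, "profile", "v2");
        System.out.println("primary: " + client.read(primary, "profile"));      // Optional[v2]
        System.out.println("lagging: " + client.read(laggingCopy, "profile"));  // Optional.empty: not yet safe
    }
}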

 

Objectives:

  1. Explain the motivation for replication and the challenges it introduces.
  2. Differentiate between data-centric and client-centric consistency models.
  3. Describe guarantees such as sequential consistency, causal consistency, monotonic reads/writes, read-your-writes, and writes-follow-reads.
  4. Analyze the performance and correctness implications of different consistency protocols.
  5. Evaluate strategies for replica placement in terms of fault tolerance, latency, and scalability.
  6. Design systems that respect the constraints of the CAP theorem, making informed trade-offs between consistency, availability, and partition tolerance.

 

11/13

11/14


Lecture 35

Lecture 36 {informational only; not on quizzes or the Final}

   
Extreme Scale Distributed Storage Systems [CS555]  

At small scales, building a storage system feels straightforward: put data on a disk, add redundancy, and manage access. But at the scale of Amazon and Google, these simple solutions break down. The need to store petabytes of information across thousands of machines forces a rethinking of the very foundations of storage design. In this module, we will examine two landmark systems that transformed the field of extreme-scale storage. Amazon’s Dynamo was designed for high availability and resilience under failure. We will look at its assumptions and requirements, its partitioning and replication strategies, its use of versioning, and the architectural choices that made it robust in practice. You will also be pleasantly surprised by how Dynamo weaves together so many of the concepts that we have covered so far.

We will then turn to the Google File System (GFS) which took a very different path. Built to support massive data-processing frameworks, GFS introduced new abstractions: (1) files chunked into large blocks, (2) centralized metadata management, and (3) novel operations such as atomic appends and snapshots. We will also explore how it handles replication, consistency, and garbage collection. Together, Dynamo and GFS highlight not only the diversity of approaches to scaling storage but also the trade-offs between availability, consistency, simplicity, and performance. By comparing Dynamo and GFS side by side, you will see how design decisions reflect both the technical constraints and the unique needs of the applications they serve.
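One Dynamo idea is easy to preview with arithmetic: with N replicas per key, a read of R copies is guaranteed to intersect the latest write of W copies whenever R + W > N. The sketch below just prints what each tuning buys you; the parameter choices are illustrative, with (3, 2, 2) being the configuration the Dynamo paper cites as common.

// Back-of-the-envelope sketch of Dynamo-style quorum tuning: with N replicas,
// a read of R copies overlaps the most recent write of W copies iff R + W > N.
public class QuorumCheck {
    static void describe(int n, int r, int w) {
        boolean overlapping = r + w > n;
        System.out.printf("N=%d R=%d W=%d -> %s%n", n, r, w,
            overlapping ? "read quorum overlaps the latest write quorum"
                        : "quorums can miss each other; reads may be stale");
    }

    public static void main(String[] args) {
        describe(3, 2, 2);   // the (3, 2, 2) configuration the Dynamo paper cites as common
        describe(3, 1, 1);   // favors latency and availability over freshness
        describe(3, 3, 1);   // cheap writes, expensive reads
    }
}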


[Dynamo-Paper]


[GFS Paper]
 

Objectives:

  1. Explain the motivations behind extreme-scale storage systems.
  2. Describe the key design decisions in Amazon Dynamo and Google File System.
  3. Analyze how assumptions and requirements shape system architectures.
  4. Contrast different approaches to scaling as exemplified by Dynamo and GFS.
  5. Evaluate trade-offs in partitioning, replication, and consistency strategies.
  6. Design informed strategies for storage in distributed environments, applying lessons from Dynamo and GFS.
 
11/18

11/20

12/2

12/4


Lecture 37

Lecture 38

Lecture 39

Lecture 40
   
   
Term Project Presentations [CS455 and CS555]  

Each group will present their term projects to the entire class. The format of these term project presentations will be prescriptive so that the core achievements can be highlighted. The term project presentation guidelines will be posted in late October.

 
 

Objectives:

  1. Present research work coherently and objectively to a peer audience.
  2. Highlight and assess key contributions of a work. In particular, identify which works are innovative and what makes these works so compelling.
  3. Frame questions effectively during research presentations.

 



Term Project Presentation Guidelines

Schedule A

Schedule B


   

Comprehensive Final Exam in CSB-130
Monday, December 15th, 2:00-4:00 pm


 
   

 

 

 


Department of Computer Science, Colorado State University,
Fort Collins, CO 80523 USA
© 2025 Colorado State University