Discussion Notes

Session 1: The Promise and the Peril of Really Big Models

Date: June 16 8:30PM EDT/June 17 8:30AM HKT

Reading: Bender, et al. (2021) On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

Questions and Discussion Points:

Model Architectures

  • How do 1-shot and 0-shot learning work?
  • How important is attention in model training?
  • How do multilingual models work?
  • What are some good approaches for multiple machine learning models working together?

“Good Models”

  • What are some different ways of scoring machine learning models?
  • Is there a way to make toxicity scoring work better?
  • When is a machine learning model too small? Is there such a thing?

Data Sourcing

  • What are the economies of training data?
  • If not the internet, where is a better place to get data?

Energy Efficiency

  • How can machine learning be made more energy efficient?
  • Is there a more efficient way to do consensus (e.g. emergent consensus)?

Distributed Systems

  • What is a distributed system?
  • What are the foundations of distributed systems?
  • What are the roots of consensus algorithms?
  • How is reinforcement learning used in distributed systems?
  • What is an “intelligent distributed system”?

Session 2: What Archeology Can Teach Us about Distributed Systems

Date: June 23 8:30PM EDT/June 24 8:30AM HKT

Reading: Lamport, Leslie. “The part-time parliament.” Concurrency: the Works of Leslie Lamport. 2019. 277-317.

Questions and Discussion Points:

The Consensus Metaphor

  • To what extent did Lamport mean to tell the Paxos story in earnest versus an extended allegory?
  • How well has the consensus metaphor served the distributed systems community?
  • What is concurrency and what does it look like in the context of the part-time parliament metaphor?

Consistency of the Distributed Log

  • Consistency is about synchronization, which repairs entropy (i.e. divergence between replicas).
  • What are the different types of consistency and which is illustrated by Lamport’s part-time parliament?
  • What is the difference between eventual consistency and monotonic reads?
  • What is the difference between a “log” and the idea of a distributed ledger? Are these the same?

Paxos & Optimizations

  • Do any real world applications use Paxos?
  • Paxos requires two rounds of communication; this can be optimized using leader optimization (removing the Prepare phase), ballot optimization, or optimistic consensus (e.g. fast paths, EPaxos).
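
As a concrete reference for the two rounds, here is a minimal single-decree Paxos sketch in Python (in-memory, synchronous, no failure handling; ballot numbers are supplied by the caller, and the class and function names are ours, not from any real implementation):

```python
# Phase 1 (Prepare/Promise) and Phase 2 (Accept/Accepted) of single-decree Paxos.

class Acceptor:
    def __init__(self):
        self.promised = -1      # highest ballot promised so far
        self.accepted = None    # (ballot, value) or None

    def prepare(self, ballot):
        # Phase 1b: promise not to accept lower ballots; report any prior accept.
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, ballot, value):
        # Phase 2b: accept unless we promised a higher ballot.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "accepted"
        return "nack"

def propose(acceptors, ballot, value):
    quorum = len(acceptors) // 2 + 1
    # Phase 1a: Prepare.
    promises = [a.prepare(ballot) for a in acceptors]
    grants = [acc for tag, acc in promises if tag == "promise"]
    if len(grants) < quorum:
        return None
    # If any acceptor already accepted a value, we must propose the one with
    # the highest ballot -- this is what makes Paxos safe.
    prior = [acc for acc in grants if acc is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2a: Accept.
    votes = [a.accept(ballot, value) for a in acceptors]
    return value if votes.count("accepted") >= quorum else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, ballot=1, value="A"))  # "A" is chosen
print(propose(acceptors, ballot=2, value="B"))  # still "A": later ballots preserve it
```

The second call illustrates safety: once “A” is chosen, a later ballot proposing “B” still decides “A”. The leader optimization mentioned above amounts to a stable leader skipping the prepare() round for subsequent slots.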

Session 3: Fantastic Failures and Where to Find Them

Date: June 30 8:30PM EDT/July 1 8:30AM HKT


Reading: Muralidhar, et al. (2014) f4: Facebook’s Warm BLOB Storage System

Questions and Discussion Points:

Optimization vs. Flexibility

  • Is optimization (e.g. using data age, application-level requirements) enough to make a distributed system “intelligent”?
  • Where do heuristics about data placement start to break down? For instance, do the rules that work for a big content creator with millions of followers also apply to everyday social media users?
  • What features could you use in a distributed system to train a model that would nominate data to be moved from hot to warm or cold storage?
  • Could you build a distributed system that would be able to adapt to new blob types, e.g. know how to store them most efficiently?
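
As a thought experiment for the feature question above, here is a hedged sketch of a scoring heuristic (the feature names and thresholds are entirely invented; the f4 paper itself relies mainly on age-based rules):

```python
# Score BLOBs as candidates for hot -> warm migration from a few features.

def migration_score(age_days, reads_last_week, owner_follower_count):
    """Higher score = better candidate for warm storage."""
    score = 0.0
    score += min(age_days / 90, 1.0)                    # older content cools off
    score += 1.0 if reads_last_week < 10 else 0.0       # rarely read lately
    score -= 0.5 if owner_follower_count > 1e6 else 0.0 # big creators stay hot longer
    return score

candidates = [
    ("fresh_selfie", 2, 500, 300),
    ("old_vacation_album", 400, 3, 300),
    ("old_celebrity_post", 400, 3, 5_000_000),
]
ranked = sorted(candidates, key=lambda c: -migration_score(*c[1:]))
print([name for name, *_ in ranked])  # old_vacation_album ranks first
```

A trained model would replace the hand-set weights, but the feature engineering question is the same: which signals predict that a blob has gone cold?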

The UX of Failure

  • As users we often experience failures from eventual consistency:
    • Buying plane tickets - when is the purchase final? How do we know?
    • Food ordering apps - the fries get cold while we’re waiting for the system to get consistent.
    • We thought we won the game, but a few seconds later, we’re told it was another player.
  • We also experience failures in data systems that are in transition:
    • Systems that experience massive (e.g. seasonal) spikes in usage that break consistency.
    • Systems that are being changed to adapt to new market conditions (e.g. conflict between point of sales systems and e-commerce systems).
  • We have also experienced correlated failures, such as all of our drives failing simultaneously.

Planning for Failure

  • What is the robustness model when failure happens? What’s the “intelligent” approach to recover from failure?
  • How does data encryption work in a distributed system with correlated failures? What’s the relationship between fault tolerance and encryption?

Related Papers:

Lu, et al. (2015) Existential Consistency: Measuring and Understanding Consistency at Facebook

Santry, et al. (1999) Elephant: The File System that Never Forgets

Session 4: What is “Time” in a Distributed System?

Date: July 7 8:30PM EDT/July 8 8:30AM HKT

Reading (select one of these):

Lamport (1978) Time, Clocks, and the Ordering of Events in a Distributed System

Corbett, et al (2012) Spanner: Google’s Globally-Distributed Database

Questions and Discussion Points:

“Happened Before”

  • In the Lamport paper, the “happened before” relation evolves from a simple relation between events (->), to an arbitrary total ordering of events across a system of logical clocks (==>), and finally to an even stronger relation, defined via physical clocks, that also orders events occurring outside the system.
  • The system of clocks relies on two things to synchronize: a mechanism to coordinate via monotonic counters, and a method for arbitrarily ordering system processes (e.g. a globally unique identifier for each process).
  • Colloquially we are used to thinking of the word “concurrent” to mean something like “simultaneous”. However, Lamport introduces a novel way of thinking about concurrency with respect to the “happened before” relation; namely that two distinct events a and b are concurrent if a did not happen before b and b did not happen before a. From the system’s perspective, these two events may as well have happened simultaneously, since we don’t have a way to order them.
  • How well does the system of clocks scale as system size increases?
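
The bullets above fit in a few lines of Python; a minimal Lamport clock sketch (the Process class and its method names are illustrative):

```python
# Lamport logical clocks: tick on every local event, merge on receive,
# break ties in the total order by process id.

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.clock = 0

    def event(self):
        self.clock += 1
        return (self.clock, self.pid)   # timestamp for this event

    def send(self):
        return self.event()             # the message carries the timestamp

    def receive(self, timestamp):
        # Clock condition: advance past the sender's counter before ticking.
        self.clock = max(self.clock, timestamp[0])
        return self.event()

p, q = Process("P"), Process("Q")
a = p.event()      # (1, "P")
m = p.send()       # (2, "P")
b = q.event()      # (1, "Q") -- concurrent with a: neither happened before the other
c = q.receive(m)   # (3, "Q") -- ordered after the send

# Comparing (counter, pid) tuples yields the arbitrary total order (==>).
assert a < m < c
assert sorted([a, b, m, c]) == [a, b, m, c]
```

Breaking ties on (counter, pid) tuples is exactly the arbitrary per-process ordering the second bullet describes; events a and b remain concurrent in the happened-before sense even though the total order places a first.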

Synchronization for the Rich and Famous

  • Google’s globally distributed system Spanner demonstrates an evolution from the Lamport clock for the age of big data.
  • Spanner leverages TrueTime, an API that represents time as an interval and which accounts for variations across geographically remote components of a system. TrueTime is not source available, but there are some details about how it works in this paper and in follow-on publications.
  • Spanner depends on expensive hardware, including both atomic clocks and GPS devices, apparently colocated with every single rack in the system.
  • Spanner is commercially available but does not seem to have much marketing behind it. It is very expensive. Could it be made cheaper by increasing the size of the TrueTime window?
  • How well does Spanner work as the system nodes get further apart geographically? Although the phrase “global” comes up a lot in the paper, the examples seem mostly to detail coordination across regions in North America.
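
To make the TrueTime discussion concrete, here is a toy sketch of interval time and commit wait (EPS is an invented uncertainty bound; the real bounds come from the GPS and atomic-clock infrastructure, and the real API is richer than this):

```python
# TrueTime-style intervals: "now" is [earliest, latest] with uncertainty EPS.
# A writer commit-waits until its chosen timestamp is safely in the past.
import time

EPS = 0.005  # assumed uncertainty bound in seconds (invented for the sketch)

def tt_now():
    t = time.monotonic()
    return (t - EPS, t + EPS)   # (earliest, latest)

def commit():
    # Choose the commit timestamp at the latest possible "now" ...
    ts = tt_now()[1]
    # ... then wait until every correct clock must read later than ts.
    while tt_now()[0] <= ts:
        time.sleep(EPS / 10)
    return ts

ts = commit()
assert tt_now()[0] > ts  # after commit wait, ts is unambiguously in the past
```

This also answers the cost question mechanically: widening EPS makes the hardware cheaper but lengthens every commit wait by roughly 2×EPS.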

To Tick or Not to Tick

  • Lamport’s paper ends with a proof that physical clocks exist, meaning clocks that are totally differentiable, with no intervals and no ticks. This was likely Lamport’s goal, since at the time he was working to devise a computer-based clock at SRI.
  • However, 34 years later, the Spanner paper seems to call the existence of physical clocks into question. If Google can’t do it without intervals and fancy hardware, can it be done?

Related Resources:

SE-Radio (2019) Episode 377: Heidi Howard on Distributed Consensus

Kleppmann (2020) Distributed Systems 8.2: Google’s Spanner

Session 5: Can Distributed Systems Learn?

Date: July 14 8:30PM EDT/July 15 8:30AM HKT


Reading: Bengfort, et al. (2019) Anti-Entropy Bandits for Geo-Replicated Consistency

Questions and Discussion Points:

Punishment and Reward

  • The authors apply reinforcement learning (RL) to optimize anti-entropy in a distributed system.
  • Coming up with a good reward function is a common blocker to implementing RL. The clarity of the reward function (rewards for fast, productive sessions) is interesting; this is a great application for RL.
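
A minimal epsilon-greedy bandit in the spirit of that reward function (the peer names and latencies below are invented, not the paper’s measurements):

```python
# A replica chooses an anti-entropy peer; fast, productive sessions earn
# higher rewards, so the bandit learns to prefer nearby peers.
import random

random.seed(42)

PEERS = {"virginia": 0.01, "ohio": 0.02, "montreal": 0.08}  # assumed mean latencies (s)

def reward(peer):
    # Higher reward for faster sessions, plus some noise.
    return 1.0 / (PEERS[peer] + random.uniform(0, 0.01))

values = {p: 0.0 for p in PEERS}   # running value estimate per arm
counts = {p: 0 for p in PEERS}
EPSILON = 0.1                      # fraction of sessions spent exploring

for _ in range(2000):
    if random.random() < EPSILON:            # explore a random peer
        peer = random.choice(list(PEERS))
    else:                                     # exploit the current best
        peer = max(values, key=values.get)
    r = reward(peer)
    counts[peer] += 1
    values[peer] += (r - values[peer]) / counts[peer]  # incremental mean

print(max(values, key=values.get))  # the bandit converges on the fast, nearby peer
```

The epsilon term keeps a trickle of exploration going, which is also what lets a bandit eventually adapt when the world changes (a point that comes up again below).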

Lessons Learned

  • There are a few human-detectable patterns from the RL, such as data center co-location. The model has learned that physical distance is a good measure to improve anti-entropy, but it’s not a perfect correlation. 
  • It’s difficult to tell the differences in performance between the nodes.
  • Another result is that not all data centers are created equal, which you can see in the data, especially when they are in the same geographic area, e.g. Montreal performs far worse than VA or Ohio in terms of moving data across the network.
  • One notable absence is that we didn’t see a hierarchy emerge; we expected to see super-spreaders in different regions, but that wasn’t prevalent.
  • It seems like there would be a great deal of overlap with routing. There are some very interesting routing algorithms (e.g. for the traveling salesman problem), though the requirements differ. One significant difference is that routing is local-only vs. the global system the authors experimented on.

Experimental Design

  • The authors selected all regions that were available at the time (except Beijing), with the caveat that some of these regions have many availability zones, and they selected zones A, B, and C.
  • The accesses were balanced and consistent, which made for good experimental conditions, but are admittedly not representative of the dynamism of throughput in real applications.
  • It was useful to do a thought experiment and think of the system as a phone tree that disseminates downward, which is similar to the super-spreader analogy. The authors controlled for many variables; in the real world, you would most likely see more variability and a heterogeneous topology. If your network optimized for cost, would it favor the cheapest nodes over the fastest or most compute-rich ones?
  • With reinforcement learning in general, one issue is: What happens when the world changes? When a system is dynamic? There is some research on this but it often comes down to hard resets; it’s often 100-150 time steps to learn the new system, which is not very long, but it doesn’t allow for much flexibility.

Achieving Fairness

  • One interesting concept is “stomping”, i.e. when there are concurrent writes, which one gets stomped out? Is there any discussion in the community about equitable stomping? Does it depend on a region that is faster, or has more storage? Is there any discussion about how to make it as random as possible beyond Precedence IDs?
  • Precedence IDs are human-assigned by sysadmins, and it’s an opportunity to inject human bias into the system, e.g. VA > CA > OH etc.; in the system Rotational just deployed, we assigned Precedence IDs using a round-robin method.
  • This raises an interesting question: What other ways can human bias be injected into a system that we may not understand or be conscious of?
  • In some ways, it’s reasonable to assume any intelligent system will have some human bias injected e.g. a system could learn bias based on user engagement (heavy user); we may not be able to remove bias but can we find ways to combat or adapt the system for a bias?
  • Randomization may be one way to combat bias; it’s an open question whether bias is worth worrying about as long as the system has been consciously set up to combat it.
  • Another thought experiment: Imagine a linear network that propagates writes. There’s no way to know how far a write has propagated, so the world could read a write that has a lower precedence but is the last to know; however, consensus does not allow that to happen; it totally orders the writes.

Future Research

  • Can you optimize for cost? For example, in Google Cloud, you incur a cost between continents; you really only want one node and not all nodes in the transfers; on the other side of the coin, you want bandwidth; knowing that you’ll always have a transfer vector, you can ask questions such as: Can you minimize the transfer cost? Are there intersectional differences inside the cloud? 
  • The other thing to think about is user experience. Might we include inconsistencies in the reward function?
  • Another possible optimization is objects edited together vs objects edited separately; can we increase their mutual visibility based on this variable? 
  • What would happen if we injected chaos; how does that impact the system?
  • How could this network be more responsive, more flexible? What happens when accesses change over time? 

Session 6: Languages of Distributed Systems

Date: July 21 8:30PM EDT/July 22 8:30AM HKT

Reading: gRPC Authors (2021) What is gRPC?

Questions and Discussion Points:

HTTP vs HTTP2: A Major Upgrade

  • Much of gRPC’s functionality is enabled by HTTP2
  • One of the biggest benefits of HTTP2 is that it allows for full-duplex communication: the client and server can stream data to each other simultaneously over a single connection (server to client + client to server)
  • A lot of things got jammed into HTTP that it wasn’t necessarily designed for e.g. web sockets, long-term connections, etc.
  • HTTP2 is a binary format vs plaintext in HTTP; a lot of the issues in HTTP are due to plaintext encoding
  • Because of its binary framing, HTTP2 massively improves performance, e.g. header compression, multiplexing, and better connection management overall; in practice browsers also require TLS for HTTP2, which adds security
  • HTTP is like the Model T in that Ford didn’t know we would build interstate highways; HTTP2 is focused on what the internet will look like in 30 years, and the question now is how quickly the rest of the web will adopt it

API Design

  • RESTful APIs are designed to be human-readable whereas gRPC is generally for microservices communication since it is much faster, takes up much less memory, etc.
  • One design consideration is human-to-machine communication vs. machine-to-machine or service-to-service communication
  • gRPC is a strong contract between the data provider and data consumer about how to talk to each other; it’s very firm and specific, but it opens up flexibility for a wide range of business use cases
  • gRPC is used widely in microservices e.g. shopping cart management services that talk to inventory management services, etc.
  • gRPC allows for pre-processing or “pre-code” to be run before services can talk to each other, which means strict communication protocols e.g. the service has to conform and meet certain requirements before it can talk to another service
  • gRPC is useful for advanced services such as moving around large amounts of data between services
  • RESTful APIs are suited for external programs since they allow for flexibility, whereas gRPC could be better for internal services because it can tightly control design architecture (e.g. Google). gRPC is more efficient for asynchronous communication vs REST, which is better for sequential communication
  • Generally, gRPC is well-suited for intelligent distributed systems because of its dependence on HTTP2 and its point-to-point communication (initiated by one point, not either)
  • There are other lower-level libraries like ZeroMQ that could be useful for distributed systems

Protocol Buffers in The Wild

  • Probably many of us have used protobufs because that’s how TensorFlow works, e.g. passing messages between the neural network layers
  • TensorFlow has several protobuf and gRPC dependencies; for example, TensorFlow Serving, where models are hosted, uses gRPC
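
Part of why protobufs are so compact on the wire is varint encoding: small integers take one byte, and the high bit of each byte signals that more bytes follow. A self-contained sketch of the unsigned case:

```python
# Varint encoding as used by the protobuf wire format (unsigned only here;
# signed fields add zigzag encoding on top of this).

def encode_varint(n: int) -> bytes:
    out = bytearray()
    while True:
        byte = n & 0x7F          # low 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes) -> int:
    n = shift = 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:      # continuation bit clear: last byte
            return n

assert encode_varint(1) == b"\x01"
assert encode_varint(300) == b"\xac\x02"   # the classic example from the protobuf docs
assert decode_varint(encode_varint(2**40)) == 2**40
```

This is one concrete reason a protobuf message is so much smaller than the equivalent JSON: integers cost bytes proportional to their magnitude, not their decimal digit count plus quotes and field names.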

Programming Languages & Distributed Systems

  • Languages that get closer to the core of how the computer/machine communicates seem to be better for distributed systems
  • In contrast, a language like Python is designed to be more human-readable and is less suitable for distributed systems
  • While protocol buffers are meant to be language agnostic, gRPC is especially popular with C++ and Golang developers, which makes sense when you think about the built-in coroutines (or goroutines) and channels that are features of those languages. This is similar to the reason Python developers like JSON so much: it can be seamlessly converted into dictionary objects.


Session 7: Takeaways from HotStorage 2021

Date: July 28 8:30PM EDT/July 29 8:30AM HKT

Reading: This week, instead of a shared reading, please register and attend some portion of HotStorage 2021, so that we can discuss highlights and takeaways during our usual meeting time.

Questions and Discussion Points:


Terms of Interest

  • “NDP” - near-data processing; a big focus of the conference. Pushing computation closer to the hundreds of storage nodes that hold the data, rather than moving the data to the compute
  • “CRDTs” - conflict-free replicated datatype
  • “LSM tree” - log-structured merge tree
  • “SSD” - solid state drive
  • “RPO” - recovery point objective
  • Blocks vs Rocks - Blocks are uniformly sized vs. rocks, which are flexible and variable-sized and support more granularity and more/different file sizes; PebbleDB is “smaller rocks”
  • Shards vs Slices vs Stripes
  • “rowhammering”
  • PFS: why no global IDs?
  • Two-phase backup scheduler

Types of Consistency

  • We mostly have been talking about eventual consistency so far. Consistency is about what users can expect from a system.
  • As you move up the spectrum toward strong consistency, more coordination is needed, i.e. there is a lot of message traffic, which often leads to degraded performance and runs into the CAP problem.
  • There is a spectrum from weak to strong consistency:
    • “weak consistency”: No guarantee that systems will converge
    • “eventual consistency” (and “strong eventual consistency”): If you stop writing, the system will eventually become consistent.
    • “causal consistency”: guarantees all processes observe causally-related operations in a common order.
    • “sequential consistency”: all processes observe operations in a single total order that respects each process’s program order, e.g. the total ordering construction in the Lamport paper
    • “strong consistency” (aka linearizability): For every person on earth there is a single order of operations. Spanner is probably the closest example we’ve looked at so far.
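
The weaker end of the spectrum can be made concrete with a CRDT (another term from the conference): a grow-only counter achieves strong eventual consistency because merging is an elementwise maximum, so replicas converge no matter how or in what order anti-entropy sessions deliver updates. A minimal sketch:

```python
# G-Counter CRDT: each replica increments only its own slot; merge takes
# elementwise maxima, so merges commute, associate, and are idempotent.

class GCounter:
    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.slots = [0] * n_replicas

    def increment(self):
        self.slots[self.id] += 1

    def merge(self, other):
        self.slots = [max(a, b) for a, b in zip(self.slots, other.slots)]

    def value(self):
        return sum(self.slots)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()
b.increment()
a.merge(b); b.merge(a)          # anti-entropy in either order
assert a.value() == b.value() == 3   # replicas converge without coordination
```

This is the trade on display across the whole spectrum: the G-Counter needs no message rounds at write time, but it can only express operations whose merges commute.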

New Technologies

  • Databases, File Systems, and Object Storage

    • CockroachDB
    • YugaByte
    • RocksDB
    • PebbleDB
    • ElmerFS
    • Lustre and BeeGFS (Parallel file systems)
    • Minio
    • Ceph
    • Storj.io
    • etcd
  • Consensus Algorithms

    • Raft
    • Multi-Paxos
    • EPaxos
    • PigPaxos
  • Libraries

    • DeepLog

General Questions

  • Why is it called “Hot Storage”?

    • Originally called “Hot Topics in Storage Systems”; follows from a series of “Hot” topics in computer science such as security, networking, etc.
    • The name Hot Storage is also connected to electrical engineering and its relation to storage systems, i.e. the heat generated by thousands of spinning disks
    • Attracts a lot of interesting domains e.g. bioinformatics, DNA storage systems, ML and optimization - very interdisciplinary
  • What’s with the emphasis on flash storage?

    • Might have to do with the hardware available to the participants; best way to experiment and try out new ideas and configurations.
  • Is it common for there to be so many theoretical papers (compared to ones with experimental results)?

    • Hot Storage has become a primary publishing venue for grad students; it gives researchers the opportunity to share/ discuss research even if they don’t have complete results yet
  • Not a lot of representation from the big cloud folks - Google, Amazon, that other one. Why is that?

    • FAST is the larger conference counterpart to Hot Storage; a lot of big tech companies (Twitter, Google, FB) sponsor or attend FAST

Session 8: What is a “File” Anyway?

Date: August 4 8:30PM EDT/August 5 8:30AM HKT

Reading: Rosenblum & Ousterhout (1991) The Design and Implementation of a Log-Structured File System

Questions and Discussion Points:

The “Philosophy” of Files

  • In our everyday lives, files seem mundane, but this paper raises philosophical questions about their existence that prompt a re-think. A file can be anything: any piece of data that you want to write to disk, along with some metadata. Files are made of even more basic units, i.e. bits; a file is an amalgamation or collection of bits much like a person is a collection of molecules; the bits come together to do things and then eventually disband and go on to do other things
  • If you look back at many of the Hot Storage papers from last week, the knowledge and tenets of this paper are in many ways embedded in those papers
  • Another interesting question is the definition of “live” vs “dead” data; is “dead” data just deleted data? How do we define it?
    • Many file systems delete by removing just the metadata, so it’s somewhat easy to restore data; truly erasing a disk means scrambling the bits so the patterns can’t be reconstructed, i.e. the pointers are gone but the bits are still there.
    • A lot has to happen to make data “dead”: the inode that points to the chunk has to be re-written, and then the inode map has to be re-written.
    • “Dead” data can also be any data that will be consumed by the reclamation process, i.e. the process that reclaims chunks that no longer have an inode pointer.
    • So there are many ways data can “die”, e.g. when the inode is re-written, or when memory is re-written

The Physics of Files

  • The block is the physical unit of storage on the disk, and each block holds a fixed number of bits. Files can be decomposed into blocks (e.g. a big file might require several or many blocks). Blocks can also contain multiple files (e.g. if the files are smaller than the block size).
  • The file system in the paper is highly dependent on the physics of spinning disks; there is rotational latency and seek latency; the access speed was constrained by the time it takes to physically travel; it makes us appreciate the mechanics of the hardware
  • Today we have SSDs and Flash, so seek latency is reduced but still exists, just in a different way; storage systems jump around to different bits (e.g. NVM); there’s still overhead, so you still want to engage in as much sequential writing as you can
  • The log structure is still relevant even though SSDs and Flash have relaxed the physical constraints of spinning disks
  • It’s interesting to note that this paper was published around the time SSD technology was under research
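
A toy log-structured store ties several of these threads together: every write is a sequential append, an in-memory inode map points at the live version of each file, and superseded records become “dead” data awaiting reclamation. (The class and method names are ours; this is a sketch of the idea, not the paper’s design.)

```python
# Minimal log-structured storage: append-only log plus an inode map.

class LogFS:
    def __init__(self):
        self.log = []        # the append-only log of (name, data) records
        self.inode_map = {}  # file name -> index of its live log record

    def write(self, name, data):
        self.log.append((name, data))           # sequential write, never in place
        self.inode_map[name] = len(self.log) - 1

    def read(self, name):
        return self.log[self.inode_map[name]][1]

    def dead_segments(self):
        # Records no inode points at: candidates for the cleaner to reclaim.
        live = set(self.inode_map.values())
        return [i for i in range(len(self.log)) if i not in live]

fs = LogFS()
fs.write("a.txt", "v1")
fs.write("a.txt", "v2")     # no overwrite: append and move the pointer
assert fs.read("a.txt") == "v2"
assert fs.dead_segments() == [0]   # the old version is now "dead" data
```

Note how data “dies” here exactly as described above: not by being erased, but by losing its inode pointer when the map is rewritten.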

Distributed File Systems

  • The way that people initially solved the shared file system was using the NFS model; Dave makes a change and then information about the change goes through the network and everyone else gets to see the update, which became increasingly less performant as computers got faster and people started to create more content. Our speed expectations changed a lot!
  • When editing a shared file, if the file is stored in blocks, then we can ask which blocks are different: diff!
  • Git has a good diff tool; it has to track many changes from many sources. Git doesn’t think of your Python program as a single thing; it looks at it in chunks, does some math, and sees what’s different; it doesn’t matter what has changed, but if there’s a change, it commits the change. Git uses something like a Merkle tree to break things up into chunks and decide what has changed vs what has not
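
The chunk-and-hash idea can be sketched directly (fixed-size chunks for simplicity; git’s actual object model content-addresses whole objects rather than fixed chunks):

```python
# Split content into chunks, hash each chunk, and compare hash lists to
# find which chunks changed -- the essence of content-addressed diffing.
import hashlib

def chunk_hashes(data: bytes, size: int = 4):
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    return [hashlib.sha256(c).hexdigest() for c in chunks]

def changed_chunks(old: bytes, new: bytes, size: int = 4):
    old_h, new_h = chunk_hashes(old, size), chunk_hashes(new, size)
    return [i for i, (a, b) in enumerate(zip(old_h, new_h)) if a != b]

old = b"hello world!"
new = b"hello World!"
assert changed_chunks(old, new) == [1]  # only the middle chunk differs
```

Comparing short hashes instead of full contents is what makes this cheap over a network: two replicas only need to exchange the chunks whose hashes disagree.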

The File System vs the Application

  • The file system doesn’t have much control over what happens; it controls the inode but the application has much more control over actions; write/seek/chunk/flush operations are what the file system understands but the application has a say; e.g. Vim will do things on a line-by-line basis vs emacs, which will write things in chunks
  • Git is different as it’s saving multiple files from multiple sources so it has much more control
  • It’s interesting to note the hardware will “lie” to you about what’s really happening; with encryption especially, you lose some visibility; RAID will also “lie”, as it’s just putting things into a buffer
  • A buffer is another element of a file system: it intentionally delays the writes to accumulate them and then write a chunk together; buffers may seem like an annoyance since they intentionally slow things down for sequential write efficiency
  • But there’s also a danger to buffers: what happens if we buffer for too long? You could lose data, e.g. the buffer is full, the electricity goes out, and then everything unwritten is gone
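
A minimal sketch of that buffering trade-off (an invented class, not a real file system API): writes accumulate in memory and are flushed as one sequential chunk, and anything still buffered when the power fails is lost.

```python
# Buffered writes: batch small writes into one sequential "disk" write.

class BufferedLog:
    def __init__(self, flush_at=4):
        self.buffer = []          # volatile: lost on a crash
        self.disk = []            # stands in for durable storage
        self.flush_at = flush_at

    def write(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.flush_at:
            self.flush()

    def flush(self):
        self.disk.extend(self.buffer)  # one big sequential write
        self.buffer.clear()

log = BufferedLog(flush_at=4)
for r in ["a", "b", "c", "d", "e"]:
    log.write(r)
# "e" is still only in the buffer; a power failure here would lose it.
assert log.disk == ["a", "b", "c", "d"]
assert log.buffer == ["e"]
```

Tuning flush_at is the whole game: bigger batches mean more sequential-write efficiency, but a wider window of unwritten data at risk.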

Building Your Own File System

  • It’s feasible (yet not trivial) to create your own.
  • Creating a file system means putting a fence around a chunk of storage on your machine and saying “this is mine” for the file system, and then using FUSE to write functions to manage it.

Thinking About New Architectures

  • A significant amount of work is done just moving data around; many systems today rely on technology that emerged in the 1960s
  • Today we have new specialized chips (ASICs), new storage options, etc
  • What if we built a system from the ground up to take advantage of these new technologies to build intelligent agents that match compute and storage with location; what could we free up computers to do? what could we free up people to do?
  • One possible result is this could change the nature of a program.
  • There’s interesting research emerging around these topics. Check out this keynote address by Peter Alvaro at this year’s PaPoC conference.


Session 9: Computing on Distributed Data

Date: August 11 8:30PM EDT/August 12 8:30AM HKT


Reading: Zaharia, et al. (2012) Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Questions and Discussion Points:

Distributed Computation vs Distributed Storage

  • Although the Spark paper is not strictly related to intelligent distributed systems, the model that Spark uses to distribute computation across sections of a dataset is very reminiscent of the shared log abstraction from the distsys literature.

Which Models Distribute?

  • Models that are convenient to sparkify: when it’s straightforward to express the algorithm in terms of a series of transformations
    • Logistic regression & other GLMs
    • Tree-based models
    • kMeans clustering
    • topic modeling
  • Models that don’t work well with the spark model: when partitioning the data interferes with the model’s ability to learn or generalize the decision space
    • kNN
    • non-linear SVM
    • time series data models
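
The partition-then-aggregate pattern that makes GLMs easy to “sparkify” can be shown without Spark itself: each partition computes a partial gradient independently (the map), and the partials are summed before the weight update (the reduce). A pure-Python sketch with invented helper names:

```python
# Data-parallel logistic regression in miniature: gradients decompose as a
# sum over examples, so partitions can be processed independently.
import math

def partition(data, n):
    return [data[i::n] for i in range(n)]   # round-robin split into n parts

def gradient_step(parts, w, lr=0.1):
    # "map": per-partition partial gradients (could run on separate workers)
    def partial(part):
        g = [0.0] * len(w)
        for x, y in part:
            p = 1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for j, xj in enumerate(x):
                g[j] += (p - y) * xj
        return g
    grads = [partial(p) for p in parts]
    # "reduce": sum the partials, then update the (broadcast) weights
    total = [sum(gs) for gs in zip(*grads)]
    return [wi - lr * gi for wi, gi in zip(w, total)]

# Toy dataset: features [bias, x], label 1 when x > 0.
data = [([1.0, x], 1 if x > 0 else 0) for x in [-2, -1, -0.5, 0.5, 1, 2]]
w = [0.0, 0.0]
for _ in range(200):
    w = gradient_step(partition(data, 3), w)
assert w[1] > 0   # the model learned that larger x means class 1
```

Models like kNN don’t fit this shape precisely because their predictions depend on global neighborhoods, not on a per-example sum that partitions can compute locally.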

Other Ways to Scale Computation

  • For the Python/NumPy/Pandas/Sklearn Community
    • Dask: advanced parallelism for analytics
    • Koalas: pandas API on Apache Spark
    • MLlib: Apache Spark’s scalable machine learning library
  • For Time Series Analytics
    • btrdb: rapid storage and analytics for scalar-valued timeseries data
    • TimescaleDB: PostgreSQL for time-series
  • More!
    • Ray: Fast and Simple Distributed Computing
    • Julia: a new programming language for high performance analytics

Related Resources

Bengfort & Kim (2016) Data Analytics with Hadoop

Session 10: The Importance of Understandability

Date: August 18 8:30PM EDT/August 19 8:30AM HKT


Reading: Ongaro & Ousterhout (2014) In Search of an Understandable Consensus Algorithm

Questions and Discussion Points:

Raft vs Paxos

  • This paper is presented in a different way vs Lamport’s Paxos paper: it’s mostly about implementation, with primitives like RPCs, whereas Paxos was more about theory (and even storytelling)
  • There are similar concepts in the two papers like the logical clock, turns, etc
  • Raft “feels” of a different, newer era; Paxos assumes compute time is expensive so presents a theory based on those constraints vs. Raft which was written in a time when compute is much more abundant
  • Raft is a more understandable algorithm compared to Paxos; with Paxos, the algorithm is revealed piecemeal and is not so much presented as a system; Raft is presented as a complete system/implementation
  • The main thrust of Raft was about understandability; it offers similar performance as Paxos but is much simpler to reason about
  • As engineers, we work to marginally optimize performance whereas understandability is often undervalued and understated; Diego Ongaro’s (the paper’s first author) hook was understandability, and he was surprised that it was so compelling
  • Another angle of the Raft paper is its application and adaptability to situations in the real world
  • Raft also used creative ways to express functionality e.g. the “heartbeat” as the mechanism for the system to check the state of the leader; maybe this makes it more relatable
  • Also what makes it relatable is that two Stanford PhDs said out loud that “Paxos is hard to understand”, which was probably novel
  • However, not everyone thinks Raft is easier; it depends on your mental model; the lineage that produced Raft believed that Paxos was intentionally obfuscated because of how the paper was written vs another lineage who inherited from the Paxos tradition
  • For example Michael Whittaker’s paper on “compartmentalization” whereby the leader delegates/ communicates to a lower level set of proxy leaders to take pressure off the leader, might be seen as a Paxos “revival”
  • The Paxos paper included a formal proof whereas Raft was about an interesting/unexpected edge case and Raft was the simplest/best solution
  • It is interesting to perceive different cultures and lineages in the distributed systems domain

Paxos vs Raft in Use

  • Spanner: Paxos; this is a Google implementation of Multi-Paxos with long-lived leaders
  • Chubby: Paxos; a distributed lock server that was built to bail the GFS out of trouble
  • Zookeeper: Paxos-like, but technically ZAB, not Paxos
  • Kubernetes: etcd, which uses Raft
  • MongoDB: Raft for replication
  • Yugabyte: Raft
  • CockroachDB: Raft
  • Interesting to apply the “90/10” rule to Paxos and Raft
    • Paxos is hard to implement for 90% of use cases, but once you get to 90%, the remaining 10% is “easy”
    • Raft covers 90% of use cases, but the “dragons” live in the 10%; those high-throughput edge cases are when things get difficult or impossible in Raft
  • Because of its understandability, Raft became a class project in many comp sci grad classes; many grad students implemented Raft and, in doing so, found a lot of edge cases where Raft struggles or doesn’t work
  • For example, many found that Raft can get into “thrashing”, where multiple competing candidates keep incrementing the term number and keep failing because none can get enough votes, which makes Raft unavailable; basically, everyone fails and no one has availability
  • In contrast, Paxos is designed to prioritize availability (though no consensus protocol can guarantee liveness in a fully asynchronous network)
  • In Raft, time is not a “lie”; we can use an actual clock and call each other to figure things out
  • Raft uses actual clocks vs Paxos, which uses message time and the ordering of when messages are received, which is not really time; one is more intuitive whereas the other is more abstract
  • From an engineering perspective, these are two different philosophical approaches to the application of time to systems
  • One issue in Paxos is “Eventual Availability”: when a node becomes a leader, it’s not guaranteed to be up to date; there can be a slow period while it catches up if some of the machines are older or use outdated technology, which can’t happen in Raft
  • Technically, Paxos and Raft are actually not that different; the algorithms are similar and both satisfy the two main conditions of (1) state machine safety (if one server applies a log entry at a given index, no other server will apply a different log entry to that index) and (2) leader completeness (if an operation is committed at some index by the leader, all subsequent leaders will have that same operation at the same index); yet they are in many ways defined by their “origin stories”
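
The “thrashing” failure mode above, and Raft’s fix for it (randomized election timeouts), can be illustrated with a toy simulation (all timing constants are invented):

```python
# If candidates time out in lockstep they split the vote every term; Raft's
# randomized election timeouts let one candidate get a head start and win.
import random

def elect(n_nodes, jitter, rounds=50, seed=0):
    rng = random.Random(seed)
    for term in range(1, rounds + 1):
        # Each node draws an election timeout; the earliest becomes a candidate.
        timeouts = [150 + rng.uniform(0, jitter) for _ in range(n_nodes)]
        first = min(timeouts)
        # Rivals whose timeout fires within one message delay also stand.
        candidates = [t for t in timeouts if t - first < 5]
        if len(candidates) == 1:
            return term        # a lone candidate collects a majority
    return None                # vote split every term: no leader elected

assert elect(5, jitter=0) is None        # lockstep timeouts thrash forever
assert elect(5, jitter=150) is not None  # randomization breaks the tie
```

The same tension shows up in the boot-time discussion below: too little jitter and candidates collide; too long a timeout and clients wait noticeably for a leader.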

Scalability of Raft & Paxos

  • How much applied research has there been about the scalability of Raft and Paxos?
  • They are both single leader distributed log systems; it seems the leader will get overwhelmed with messages and communication as it scales
  • A paper called “Raft Refloated: Do We Have Consensus?” explored the scalability of Raft
  • It focuses on the importance of what happens at boot time in Raft, where there’s a balancing act over the heartbeat interval: if all nodes send votes at the same time, the messages land together and the election stalls; if the interval between heartbeats is too long, the wait can exceed what modern systems and user expectations allow; so what you do at boot is important
  • Raft does not scale beyond 5-7 nodes; you’ll quickly run into the “thrashing” problem
  • How is Raft so popular if it will thrash at 5-7 nodes? Two possible explanations: (1) in its design, the most performant node will “win,” i.e. the node with more connectivity/CPU/uptime will outperform the other nodes; (2) it’s much easier to implement than Paxos
  • The scalability of Paxos depends on how it’s implemented e.g. compartmentalization is an approach where the leader delegates some communications to proxy leaders and the acceptors scale in a round-robin process
  • One analogy for scaling both algorithms: imagine building a row of Legos where you want to place the same bricks with the same colors on top of each other quickly; we can go as fast as we can, but that will most likely result in the row being out of order
  • All of these options trade raw speed against slowing down to get entries committed in the same order
  • In reality, it’s hard to compare Raft vs Paxos at scale; there’s no easy way to benchmark them, which is why a consensus API may be useful!
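The boot-time balancing act above corresponds to the timing requirement stated in the Raft paper: broadcastTime ≪ electionTimeout ≪ MTBF. A hedged sketch of a configuration check; the order-of-magnitude margin is our assumption, not part of the paper:

```python
def timing_ok(broadcast_ms, election_timeout_ms, mtbf_ms, margin=10):
    """Check the Raft timing requirement: broadcastTime << electionTimeout << MTBF.
    'margin' encodes '<<' as an order-of-magnitude gap (an assumption here).
    Too short an election timeout -> spurious elections (stalls/thrashing);
    too long -> slow recovery after a leader failure."""
    return (broadcast_ms * margin <= election_timeout_ms
            and election_timeout_ms * margin <= mtbf_ms)

# A typical datacenter setup: ~1 ms broadcast, 200 ms timeout, ~1 day MTBF.
assert timing_ok(1, 200, 24 * 60 * 60 * 1000)
# A heartbeat/broadcast time too close to the election timeout fails the check.
assert not timing_ok(50, 200, 24 * 60 * 60 * 1000)
```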

Related Resources

Whittaker (2020) Scaling Replicated State Machines with Compartmentalization

Whittaker, et al. (2020) Scaling Replicated State Machines with Compartmentalization (Technical Report)

Farewell and Onto New Things…

Date: August 25 8:30PM EDT/August 26 8:30AM HKT

Reading: No reading this week! We’ll be talking about two open source opportunities…

Open Source Projects:


  • Anti-entropy replication is common in systems but not generally used by application developers; the goal of Honu is to make anti-entropy more accessible
  • The idea is a library for embedded database replication with eventual consistency; we want to show that replication adds no significant overhead
  • It benefits developers with geo-distributed systems spanning the world, with different latencies and different network conditions
  • Honu provides eventual consistency and allows for different types of bandits (e.g. for choosing replication peers during anti-entropy) and different types of possibilities
  • This project has a code base, which is the bare minimum setup for anti-entropy
  • We have an embedded database that we want Honu to wrap; all it does now is wrap a LevelDB database, though we could have an embedded Badger, RocksDB, or SQLite database
  • Embedded databases are often used for local development; Honu wraps that key-value store, gives you the benefits of speed, and gives you replication, which you can put on a server
  • Honu versions the data instead of overwriting it; it preserves the version history
  • It has 4 basic actions: get/put/delete/iter
  • The database is actually saving objects that are defined by default to have geo-replicated properties
  • It keeps info on parents so you can follow version history backwards; it’s not real time but sequenced
  • What’s not implemented yet is replication; there’s a sketch but there are different ways we can do this
  • There’s also a nil variable so we can add options and features
  • We have some benchmark data too; Honu is comparable to bare-metal LevelDB, which is both promising and means there’s a lot of room for improvement
  • We want to investigate the consistency “gotchas”
  • There are opportunities to code, write, and experiment
  • We can move in several directions:
    • Build it to wrap or bolt on replication to an existing database;
    • Make it general purpose i.e. “bring your own database”;
    • Focus only on LevelDB since it’s available in many different languages so we could implement Go/Python/C++ Honu
  • We could also think about how to implement partial replication
  • You don’t need to be a skilled Go developer because the advanced features in Go are not used heavily in Honu; it’s a single process, not using channels.
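The versioned get/put/delete/iter behavior described above can be sketched as a toy in Python, with a dict standing in for the wrapped LevelDB store; all names here are illustrative, not Honu’s actual API:

```python
class VersionedStore:
    """Toy sketch of Honu's idea: wrap a key-value store (a dict here, standing
    in for the embedded LevelDB) and record a new version on every write
    instead of overwriting, preserving the version history."""

    def __init__(self):
        self._data = {}  # key -> list of (version, parent_version, value)

    def put(self, key, value):
        # Append a new version linked to its parent rather than overwriting.
        history = self._data.setdefault(key, [])
        parent = history[-1][0] if history else None
        history.append((len(history) + 1, parent, value))

    def get(self, key):
        history = self._data.get(key)
        if not history or history[-1][2] is None:
            raise KeyError(key)
        return history[-1][2]

    def delete(self, key):
        # A delete is recorded as a tombstone version, so history survives.
        self.put(key, None)

    def iter(self):
        # Yield only live (non-tombstoned) keys with their latest values.
        for key, history in self._data.items():
            if history[-1][2] is not None:
                yield key, history[-1][2]

    def versions(self, key):
        """Follow-back version history: (version, parent, value) tuples."""
        return list(self._data.get(key, []))
```

A real implementation would persist these version records in the underlying store and use them to reconcile replicas during anti-entropy exchanges.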

More resources about anti-entropy & Honu

Concur


  • Organized around the idea that there is no one best consensus algorithm
  • We don’t have a repo yet; it’s just an idea!
  • The paper we read, “Scalable but Wasteful,” calls into question whether consensus algorithms really help people in industry
  • It would be an API for consensus similar to the scikit-learn API for ML
  • With scikit-learn, they came up with a common interface for ML (fit + predict; fit + transform)
  • It was based on the premise that there’s no such thing as a “best” ML model
  • There’s some interest in this project from academia.
  • This would be a big undertaking since we have no code base and we’d have to explain its use cases and applications
  • Up for the challenge? Sign up here
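By analogy with scikit-learn’s fit/predict contract, a consensus API might expose a small common interface that any backend (Raft, Paxos, and so on) implements. A hypothetical Python sketch; Concur has no code base yet, so every name below is invented:

```python
from abc import ABC, abstractmethod

class Consensus(ABC):
    """Hypothetical common interface for consensus backends, analogous to
    scikit-learn's fit/predict contract. Method names are illustrative."""

    @abstractmethod
    def propose(self, value):
        """Submit a value; return the log index it was committed at."""

    @abstractmethod
    def committed(self):
        """Return the committed log in order."""

class SingleNodeLog(Consensus):
    """Trivial backend: one node that 'agrees' with itself. It exists only to
    show callers being written against the interface, not the algorithm."""

    def __init__(self):
        self._log = []

    def propose(self, value):
        self._log.append(value)
        return len(self._log) - 1

    def committed(self):
        return list(self._log)
```

A benchmark harness written against Consensus could then swap in Raft- or Paxos-backed implementations and compare them under identical workloads.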

More resources about Concur