In Go, on cache key eviction, memory is not immediately freed. Instead, the garbage collector runs every so often to find any memory that has no references and then frees it. In other words, instead of freeing immediately after the memory is out of use, memory hangs out for a bit until the garbage collector can determine if it’s truly out of use. During garbage collection, Go has to do a lot of work to determine what memory is free, which can slow the program down.
These latency spikes definitely smelled like garbage collection performance impact, but we had written the Go code very efficiently and had very few allocations. We were not creating a lot of garbage.
After digging through the Go source code, we learned that Go will force a garbage collection run every 2 minutes at minimum. In other words, if garbage collection has not run for 2 minutes, regardless of heap growth, Go will still force a garbage collection.
We figured we could tune the garbage collector to run more often in order to prevent large spikes, so we implemented an endpoint on the service to change the garbage collector GC Percent on the fly. Unfortunately, no matter how we configured the GC Percent, nothing changed. How could that be? It turns out we were not allocating memory quickly enough to force garbage collection to happen more often.
We kept digging and learned the spikes were huge not because of a massive amount of ready-to-free memory, but because the garbage collector needed to scan the entire LRU cache in order to determine if the memory was truly free from references. Thus, we figured a smaller LRU cache would be faster because the garbage collector would have less to scan. So we added another setting to the service to change the size of the LRU cache and changed the architecture to have many partitioned LRU caches per server.
We were right. With the LRU cache smaller, garbage collection resulted in smaller spikes.
Unfortunately, the trade-off of making the LRU cache smaller was higher latency. This is because the smaller the cache, the less likely it is for a user’s Read State to be in it. If it’s not in the cache, then we have to do a database load.
After a significant amount of load testing different cache capacities, we found a setting that seemed okay. Not completely satisfied, but satisfied enough and with bigger fish to fry, we left the service running like this for quite some time.
During that time we were seeing more and more success with Rust in other parts of Discord and we collectively decided we wanted to create the frameworks and libraries needed to build new services fully in Rust. This service was a great candidate to port to Rust since it was small and self-contained, but we also hoped that Rust would fix these latency spikes. So we took on the task of porting Read States to Rust, hoping to prove out Rust as a service language and improve the user experience.²
Rust is blazingly fast and memory-efficient: with no runtime or garbage collector, it can power performance-critical services, run on embedded devices, and easily integrate with other languages.³ Rust does not have garbage collection, so we figured it would not have the same latency spikes Go had.
Rust uses a relatively unique memory management approach that incorporates the idea of memory “ownership”. Basically, Rust keeps track of who can read and write to memory. It knows when the program is using memory and immediately frees the memory once it is no longer needed. It enforces memory rules at compile time, making it virtually impossible to have runtime memory bugs.⁴ You do not need to manually keep track of memory. The compiler takes care of it.
So in the Rust version of the Read States service, when a user’s Read State is evicted from the LRU cache it is immediately freed from memory. The read state memory does not sit around waiting for the garbage collector to collect it. Rust knows it’s no longer in use and frees it immediately. There is no runtime process to determine if it should be freed.
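To make the eviction behavior concrete, here is a minimal sketch. It assumes a toy FIFO cache rather than a real LRU, and the names `ReadState`, `TinyCache`, and `drops_after_inserts` are hypothetical, not taken from the actual service. A shared counter is incremented in `Drop` only so the moment of freeing can be observed.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a cached Read State.
pub struct ReadState {
    pub user_id: u64,
    // Shared counter, present only to make each drop observable.
    dropped: Arc<AtomicUsize>,
}

impl Drop for ReadState {
    fn drop(&mut self) {
        // Runs deterministically when the value goes out of ownership.
        self.dropped.fetch_add(1, Ordering::SeqCst);
    }
}

/// Toy bounded cache that evicts its oldest entry (FIFO, not a real LRU).
pub struct TinyCache {
    capacity: usize,
    entries: VecDeque<ReadState>,
}

impl TinyCache {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        TinyCache { capacity, entries: VecDeque::new() }
    }

    pub fn insert(&mut self, state: ReadState) {
        if self.entries.len() == self.capacity {
            // The evicted value is not bound to a name, so it is dropped
            // (and its memory released) at the end of this statement.
            let _ = self.entries.pop_front();
        }
        self.entries.push_back(state);
    }
}

/// Inserts `n` entries into a cache of size `capacity` and returns how many
/// entries have already been dropped while the cache is still alive.
pub fn drops_after_inserts(capacity: usize, n: u64) -> usize {
    let dropped = Arc::new(AtomicUsize::new(0));
    let mut cache = TinyCache::new(capacity);
    for user_id in 0..n {
        cache.insert(ReadState { user_id, dropped: Arc::clone(&dropped) });
    }
    dropped.load(Ordering::SeqCst)
}
```

Calling `drops_after_inserts(2, 5)` inserts five entries into a two-slot cache. Each of the three evictions runs `Drop` at the point of eviction, with no separate collection pass involved.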
But there was a problem with the Rust ecosystem. At the time this service was reimplemented, Rust stable did not have a very good story for asynchronous Rust. For a networked service, asynchronous programming is a requirement. There were a few community libraries that enabled asynchronous Rust, but they required a significant amount of ceremony and the error messages were extremely obtuse. Fortunately, the Rust team was hard at work on making asynchronous programming easy, and it was available in the unstable nightly channel of Rust.
Discord has never been afraid of embracing new technologies that look promising. For example, we were early adopters of Elixir, React, React Native, and Scylla. If a piece of technology is promising and gives us an advantage, we do not mind dealing with the inherent difficulties and instability of the bleeding edge. This is one of the ways we’ve quickly reached 430 million users with less than
Embracing the new async features in Rust nightly is another example of our willingness to embrace new, promising technology. As an engineering team, we decided it was worth using nightly Rust and we committed to running on nightly until async was fully supported on stable. Together we dealt with any problems that arose and at this point Rust stable supports asynchronous Rust.⁵ The bet paid off.
The actual rewrite was fairly straightforward. It started as a rough translation, then we slimmed it down where it made sense. For instance, Rust has a great type system with extensive support for generics, so we could throw out Go code that existed simply due to lack of generics. Also, Rust’s memory model is able to reason about memory safety across threads, so we were able to throw away some of the manual cross-goroutine memory protection that was required in Go.
When we started load testing, we were instantly pleased with the results. The latency of the Rust version was just as good as Go’s and had no latency spikes!
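As an illustration of the point about memory safety across threads, the following is a rough sketch using only the standard library. The function `count_acks_by_channel` and its data layout are assumptions made for this example, not code from the service. The map can only be shared between threads because it is wrapped in `Arc<Mutex<...>>`. Handing a plain `&mut HashMap` to several spawned threads would be rejected by the compiler.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical example: several worker threads record acks into one shared
/// map keyed by a made-up channel ID (here simply `worker % 2`).
pub fn count_acks_by_channel(workers: u64, acks_per_worker: u64) -> HashMap<u64, u64> {
    // Arc provides shared ownership across threads; Mutex provides exclusive
    // access to the map. The type system requires both before the map may be
    // mutated from more than one thread.
    let shared: Arc<Mutex<HashMap<u64, u64>>> = Arc::new(Mutex::new(HashMap::new()));

    let handles: Vec<_> = (0..workers)
        .map(|worker| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                for _ in 0..acks_per_worker {
                    // The guard unlocks automatically when it goes out of scope.
                    let mut map = shared.lock().unwrap();
                    *map.entry(worker % 2).or_insert(0) += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    // Copy the final state out from under the lock.
    let result = shared.lock().unwrap().clone();
    result
}
```

With four workers and 100 acks each, both channel keys end up at exactly 200. There is no lock to forget here: the data cannot be reached except through the `Mutex`.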
Remarkably, we had only put very basic thought into optimization as the Rust version was written.
Even with just basic optimization, Rust was able to outperform the hyper hand-tuned Go version. This is a huge testament to how easy it is to write efficient programs with Rust compared to the deep dive we had to do with Go. But we weren’t satisfied with simply matching Go’s performance. After a bit of profiling and performance optimizations, we were able to beat Go on every single performance metric. Latency, CPU, and memory were all better in the Rust version.
The Rust performance optimizations included:
- Changing to a BTreeMap instead of a HashMap in the LRU cache to optimize memory usage.
- Swapping out the initial metrics library for one that used modern Rust concurrency.
- Reducing the number of memory copies we were doing.
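For readers unfamiliar with the two map types, here is a small sketch of what the first optimization looks like at the code level. The alias `ReadIndex` and the helpers `build_index` and `last_read` are hypothetical. The sketch only shows that the standard library's `BTreeMap` offers the same `insert` and `get` calls as `HashMap`, so the swap can be a small change. It does not measure the memory saving itself.

```rust
use std::collections::BTreeMap;

/// Hypothetical index from user ID to last-read message ID.
/// Swapping the map type is confined to this alias; an earlier
/// version might have read `HashMap<u64, u64>`.
pub type ReadIndex = BTreeMap<u64, u64>;

/// Builds an index from (user ID, message ID) pairs.
pub fn build_index(pairs: &[(u64, u64)]) -> ReadIndex {
    let mut index = ReadIndex::new();
    for &(user_id, message_id) in pairs {
        index.insert(user_id, message_id);
    }
    index
}

/// Looks up the last-read message ID for a user, if present.
pub fn last_read(index: &ReadIndex, user_id: u64) -> Option<u64> {
    index.get(&user_id).copied()
}
```

One behavioral difference to keep in mind: `BTreeMap` requires keys that implement `Ord` and iterates them in sorted order, whereas `HashMap` requires `Hash + Eq` and has no defined iteration order.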
Satisfied, we decided to roll out the service.
The launch was fairly seamless because we load tested. We put it out to a single canary node, found a few edge cases that were missing, and fixed them. Soon after that we rolled it out to the entire fleet.
Below are the results. Go is purple, Rust is blue.