Erlang Concurrency: Evolving for Performance
- Nelson Vides
- 26th Sep 2024
- 10 min of reading time
Some languages are born performant and only later tackle concurrency. Others are born concurrent and only later build up performance. Systems programming languages like C and Rust are examples of the former; Erlang's concurrency is an example of the latter.
A mistake in concurrency can let all hell break loose, incurring incredibly hard-to-track bugs and even security vulnerabilities, while a mistake in performance can leave a product trailing behind the competition, or even make it entirely unusable to begin with.
It's all risky trade-offs. But what if we could have a concurrent framework with competitive performance?
Let’s see how Erlang answers that question.
C and Rust are traditionally considered performant languages: they compile to native machine instructions, use manual memory management, and expose semantics that map closely to those of the underlying hardware. Erlang, meanwhile, is famous for its top-notch concurrency and fault tolerance, and was built to be that way: lightweight processes, cheap message passing, and share-nothing semantics are some of the key concepts behind that strength. Like everything in IT, there is a wide range of trade-offs between these two worlds.
An advantage of having a performance problem rather than a concurrency problem is that the former is much easier to track down, thanks to automated tooling such as benchmarking and profiling. There is a common path for Erlang here: when you need performance, it offers a very easy-to-use foreign function interface. This is especially common for single-threaded algorithms, where the challenges of concurrency or parallelism don't apply and the risks of a memory-unsafe language like C are minimal. The most common examples are hashing operations and protocol encoding and decoding, such as JSON or XML.
Many years ago, one of my first important contributions to MongooseIM, our extensible and scalable Instant Messaging server, was a set of optimisations around JIDs, short for Jabber IDentifiers: the identifiers XMPP uses to identify users and their sessions. After an important cleanup of JID usage, I noticed in profiles that calls to encoding and decoding operations were not only extremely common but also suspiciously slow. So I wrote my first NIF, implementing these basic and very simple operations in C, and later moved the code to its own repository.
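To give an idea of how simple these operations are, here is a minimal, hypothetical sketch of JID decoding in pure Erlang (not MongooseIM's actual implementation, which also performs validation and normalisation): a JID has the shape `localpart@domainpart/resourcepart`, where localpart and resourcepart are optional.

```erlang
-module(jid_sketch).
-export([decode/1]).

%% Split "local@domain/resource" into its three parts.
%% The resourcepart may itself contain '@', so split on '/' first.
-spec decode(binary()) -> {binary(), binary(), binary()}.
decode(Bin) ->
    {Bare, Resource} =
        case binary:split(Bin, <<"/">>) of
            [B, Res] -> {B, Res};
            [B]      -> {B, <<>>}
        end,
    {Local, Domain} =
        case binary:split(Bare, <<"@">>) of
            [L, D] -> {L, D};
            [D]    -> {<<>>, D}
        end,
    {Local, Domain, Resource}.
```

Code like this is exactly the kind of tight, allocation-light binary matching that the JIT has since learned to compile very well.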
Fast forward a few years, and I had the chance to revert to the original Erlang code, first for the encoding and later for the decoding. The same very straightforward Erlang code had become a lot faster than the carefully optimised C one. But this is a very simple algorithm, so let's push the limits in more interesting places.
OTP releases have been delivering ever more powerful performance improvements, especially since the JIT compiler was introduced. To put things into context, when I wrote those NIFs, no JIT compiler was available. Ever since, the quest to beat C code has become feasible in a myriad of scenarios.
For example, one place where the community has put emphasis on building ever more performant code is JSON parsers. Starting with Elixir's Jason and Poison, we also have Thoas, a pure-Erlang alternative. They all ran their benchmarks competing fiercely against jiffy, the archetypal C solution. Then, recently, a JSON module was incorporated directly into Erlang/OTP. See the benchmarks yourself, for encoding and decoding: it beats jiffy in virtually all scenarios, sometimes by a wide margin.
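The new `json` module ships with the standard library from OTP 27 onwards, so encoding and decoding need no external dependency at all. A minimal usage sketch, assuming an OTP 27+ shell (note that decoded object keys come back as binaries):

```erlang
%% Encode an Erlang map to JSON; json:encode/1 returns iodata.
Encoded = iolist_to_binary(
            json:encode(#{<<"user">> => <<"alice">>, <<"score">> => 42})),

%% Decode it back; object keys are binaries, numbers are Erlang numbers.
#{<<"user">> := <<"alice">>, <<"score">> := 42} = json:decode(Encoded).
```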
Here's a more interesting example. I've been working on a MongooseIM extension for FaceIT, the leading independent competitive gaming platform for online multiplayer PvP gamers. They provide a platform where you can find the right peers to match with, a platform that keeps the community healthy and thriving by making sure matches are fair and no cheating is allowed, and ultimately a platform where you can build relationships with your peers.
These communities are enabled by MongooseIM and need presence lists.
The challenge of managing presence for such a large community is two-fold. First, we want to reduce network operations by introducing pagination and real-time updates of deltas only, so that, without any network redundancy, your device can update the view on display. Second, we want to manage such large data structures with maximal performance.
This is archetypal: a very large data structure requiring fast updates is not where an immutable runtime would shine, as "mutations" are in reality copies. So you consider a solution in a classical performant language such as C, or better still, a memory-safe one like Rust. We quickly found a ready-made solution: a Rust implementation of an indexed sorted set. A very good option, probably the best around. Let's take a measurement for reference: inserting a new element at any point in a set of 250K elements takes ~4μs. The question then is: can we give pure Erlang code a chance to compete against ~4μs?
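To see why the naive pure-Erlang approach loses before it starts, here is a rough sketch you can paste into an Erlang shell: inserting into the middle of a plain 250K-element list forces the runtime to rebuild every cell in front of the insertion point, and `timer:tc/1` makes the cost visible (exact numbers will of course vary by machine):

```erlang
%% A 250K-element sorted list, the naive immutable representation.
List = lists:seq(1, 250000),

%% Inserting in the middle copies the entire front half of the list.
Insert = fun(X, L) ->
             {Front, Back} = lists:split(125000, L),
             Front ++ [X | Back]   % rebuilds the first 125K cells
         end,

{MicroSecs, _NewList} = timer:tc(fun() -> Insert(125000, List) end),
io:format("naive middle insert took ~pμs~n", [MicroSecs]).
```

This copying cost is the "point of disadvantage" discussed below, and the reason a cleverer data structure is needed.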
You can find more details about this endeavour in a guest blog post I wrote for FaceIT. In a nutshell, the point of disadvantage is the enormous amount of copying that any mutation implies. So the first idea is to use a list of lists instead: this is pretty much a skip list. A list of lists of lists is just a skip list with more lanes, and all these lanes require constant rebalancing.
Another data structure famous for requiring rebalancing is the balanced binary tree, and Erlang ships with an implementation of general balanced trees, based on a research paper by Prof. Arne Andersson. This will be our candidate. The only thing missing is operations on indexes, as we want to know the position at which modifications took place. Wikipedia has a hint here: order statistic trees. Extending `gb_sets` to support indexes took no more than a couple of hours, and we were ready to go.
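To make the order-statistic idea concrete, here is a toy, hypothetical sketch (not the actual `gb_sets` extension): each node carries the size of its subtree, which is enough to compute an element's sorted position while descending. The rebalancing that `gb_sets` performs is deliberately omitted, so this toy tree can degenerate.

```erlang
-module(ost_sketch).
-export([insert/2, index_of/2]).

%% Tree = nil | {Key, SubtreeSize, Left, Right}
insert(Key, nil) ->
    {Key, 1, nil, nil};
insert(Key, {K, _, _, _} = Node) when Key =:= K ->
    Node;                                   % ignore duplicates
insert(Key, {K, S, L, R}) when Key < K ->
    {K, S + 1, insert(Key, L), R};
insert(Key, {K, S, L, R}) when Key > K ->
    {K, S + 1, L, insert(Key, R)}.

tsize(nil) -> 0;
tsize({_, S, _, _}) -> S.

%% 1-based position of Key in sorted order; assumes Key is present.
index_of(Key, {K, _, L, _R}) when Key < K ->
    index_of(Key, L);
index_of(Key, {K, _, L, R}) when Key > K ->
    tsize(L) + 1 + index_of(Key, R);
index_of(_Key, {_, _, L, _}) ->
    tsize(L) + 1.
```

The real extension applies the same size augmentation to the rebalancing tree that `gb_sets` already maintains, so both the insert and the index lookup stay logarithmic.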
5μs!
Adding an element at any point of a 250K-element set takes on average 5μs! In pure Erlang!
The algorithm has an amortised runtime of `O(1.46*log2(n))`, and the logarithm base two of 250K is ~18, so on average any operation takes about 26 (1.46 × 17.93 ≈ 26.2) steps. And including the most unfortunate operations that required all of those steps plus some rebalancing, the worst case is still 17μs.
What about getting pages? Fetching the first 128 elements of the set takes 92μs. For a bit more detail, the same test takes 100μs on OTP 26; that is, simply by upgrading the Erlang version we already get an 8% performance improvement.
We've seen a scenario with MongooseIM JIDs where code once rewritten in C for performance reasons could simply be reverted to the original Erlang and beat C's performance; a case with JSON parsing where well-crafted Erlang code beat the supposedly unbeatable C alternative; and finally, a problem where a very large data structure did not need to be written in Rust, as a simple pure-Erlang implementation was just as competitive.
We have a runtime that was born concurrent and evolved to be performant. We have a runtime that is memory-safe and makes the enormously complex problems of concurrency and distribution easy, allowing the developer to focus on the actual business case while massively reducing the costs of support and the consequences of failure. A runtime where performance does not need to be sacrificed to achieve these things. A runtime that enables you to write messaging systems (WhatsApp, Discord, MongooseIM), websites (Phoenix and its insanely performant JSON parsing), and community platforms (FaceIT).
We know how to make this all work, and how to keep the ecosystem evolving. If you want to see the power of Erlang and Elixir in your business case, contact us!