Why do systems fail? Tandem NonStop system and fault tolerance
- Carlo Gilmar
- 3rd Oct 2024
If you’re an Elixir, Gleam, or Erlang developer, you’ve probably heard about the capabilities of the BEAM virtual machine, such as concurrency, distribution, and fault tolerance. Fault tolerance was one of the biggest concerns of Tandem Computers. They created the Tandem NonStop architecture to give their systems, which included ATMs and mainframes, high availability.
In this post, I’ll be sharing the fundamentals of the NonStop architecture design with you. Their approach to achieving high availability in the presence of failures is similar to some implementations in the Erlang Virtual Machine, as both rely on concepts of processes and modularity.
Why do systems fail? This question should probably be asked more often, considering all the factors it involves. It was central to the NonStop architecture because achieving high availability depends on understanding system failures.
For Tandem, any system has critical components that could potentially cause failures. How often do you ask yourself how long your system can operate before it fails? There is a metric known as MTBF (mean time between failures), which is calculated by dividing the total operating hours of the system by the number of failures. The result is the average number of hours of uninterrupted operation between failures.
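As a small worked example of that division, here is a minimal sketch in Elixir. The `Mtbf` module and the figures (one year of operation with four failures) are made up for illustration; they do not come from Tandem.

```elixir
defmodule Mtbf do
  # Mean time between failures, in hours.
  # With zero observed failures the ratio is undefined, so we return :infinity
  # (an assumption of this sketch, not a standard convention).
  def hours(_operating_hours, 0), do: :infinity

  def hours(operating_hours, failures)
      when is_number(operating_hours) and operating_hours >= 0 and
             is_integer(failures) and failures > 0 do
    operating_hours / failures
  end
end

# Hypothetical figures: 8,760 operating hours (one year) and 4 failures.
IO.inspect(Mtbf.hours(8_760, 4))
# prints 2190.0
```

In this made-up case the system runs, on average, 2,190 hours (about three months) between failures.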
Many factors can affect the MTBF, including administration, configuration, maintenance, power outages, hardware failures, and more. So, how can your systems survive these events and still achieve, at least virtually, high availability?
High availability in hardware has taught us important lessons about continuous operation. Some hardware designs decompose the system into modules, so that a failure is contained within one module and a backup module keeps the system running, instead of the whole system breaking and needing a restart. The main idea, from this point of view, is to use modules as units of failure and replacement.
But what about high availability in software? Just as with hardware, we can find important lessons from operating system designers, who decompose systems into modules as units of service. This approach provides a mechanism for having a unit of protection and fault containment.
To achieve fault tolerance in software, it’s important to address similar insights from the NonStop design:
Can you recognise some similarities so far?
The NonStop architecture essentially relies on these concepts. The key to high availability, as I mentioned before, is modularity: the module is the unit of service, failure, and protection.
A process should have a fail-fast mechanism, meaning it should be able to detect a failure during its operation, send a failure signal and then stop its operation. In this way, a system can achieve fault detection through fault containment and by sharing no state.
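To make the fail-fast idea concrete, here is a minimal sketch on the BEAM. The `FailFast` module, the `{:withdraw, amount}` message, and the 100 ms timeout are hypothetical choices for this example. The worker stops as soon as it sees a request it cannot handle, and the exit reason acts as the failure signal for whoever is monitoring it.

```elixir
defmodule FailFast do
  # Hypothetical worker: it accepts only positive integer withdrawals and
  # crashes immediately on anything else instead of carrying on.
  def worker do
    receive do
      {:withdraw, amount} when is_integer(amount) and amount > 0 ->
        worker()

      bad_request ->
        # Fail fast: stop now and let the exit reason describe the failure.
        exit({:invalid_request, bad_request})
    end
  end

  # Starts a worker, sends it one message, and reports whether it failed.
  def run(message) do
    # spawn_monitor/1 starts an isolated process and subscribes the caller
    # to a :DOWN message that arrives if that process stops.
    {pid, ref} = spawn_monitor(&worker/0)
    send(pid, message)

    receive do
      {:DOWN, ^ref, :process, ^pid, reason} ->
        {:failed, reason}
    after
      100 ->
        # No failure signal arrived: clean up the healthy worker and move on.
        Process.demonitor(ref, [:flush])
        Process.exit(pid, :kill)
        :still_running
    end
  end
end

IO.inspect(FailFast.run({:withdraw, 50}))
# prints :still_running
IO.inspect(FailFast.run({:withdraw, -1}))
# prints {:failed, {:invalid_request, {:withdraw, -1}}}
```

The caller and the worker share no state, so the failure stays inside the worker, and the caller learns about it only through the `:DOWN` message.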
Another important consideration for your system is how long it takes to recover from a failure. Jim Gray, software designer and researcher at Tandem Computers, in his paper “Why Do Computers Stop and What Can Be Done About It?”, proposed a model of failure involving two kinds of bugs: Bohrbugs, which are solid and reproducible and fail the same way on every retry, and Heisenbugs, which are transient (“soft”), often disappear on retry, and can elude detection in a system for years.
The previous categorisation helps us to better understand the strategies for implementing a process-pair design, based on a primary process and a backup process.
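As a rough picture of the primary/backup idea, here is a small sketch written with BEAM processes. It is not code from Gray's paper or from Tandem. The `ProcessPair` module, the counter state, and the message names are assumptions made for this example. The primary sends a checkpoint of its state to the backup after every change, and the backup monitors the primary so it knows when to take over.

```elixir
defmodule ProcessPair do
  # Starts the backup, which in turn starts and monitors the primary.
  # `owner` receives {:primary, pid} and, after a failure, a :takeover message.
  def start(owner) do
    spawn(fn ->
      backup_pid = self()
      {primary_pid, ref} = spawn_monitor(fn -> primary(backup_pid, 0) end)
      send(owner, {:primary, primary_pid})
      backup(owner, ref, 0)
    end)
  end

  # Primary: does the work and checkpoints every new state to the backup.
  def primary(backup_pid, count) do
    receive do
      :increment ->
        new_count = count + 1
        send(backup_pid, {:checkpoint, new_count})
        primary(backup_pid, new_count)

      :crash ->
        # Simulated fault: the primary fails fast.
        exit(:simulated_failure)
    end
  end

  # Backup: remembers the last checkpoint and takes over when the primary stops.
  defp backup(owner, ref, last_checkpoint) do
    receive do
      {:checkpoint, count} ->
        backup(owner, ref, count)

      {:DOWN, ^ref, :process, _pid, reason} ->
        # A real backup would resume the service from last_checkpoint;
        # this sketch only reports the state it would resume from.
        send(owner, {:takeover, reason, last_checkpoint})
    end
  end
end

_backup = ProcessPair.start(self())

primary =
  receive do
    {:primary, pid} -> pid
  end

send(primary, :increment)
send(primary, :increment)
send(primary, :crash)

result =
  receive do
    {:takeover, reason, count} -> {reason, count}
  after
    1_000 -> :timeout
  end

IO.inspect(result)
# prints {:simulated_failure, 2}
```

Because the backup already holds the last checkpoint, recovery does not depend on repairing the failed primary first, which is what keeps the recovery time short.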
All of these insights are drawn from Jim Gray’s paper, written in 1985 and referenced in Joe Armstrong’s 2003 thesis, “Making Reliable Distributed Systems in the presence of software errors”. Joe emphasised the importance of the Tandem NonStop system design as an inspiration for the OTP design principles.
So if you’re a software developer learning Elixir, you’ll probably be amazed by all the capabilities and great tooling available to build software systems. By leveraging frameworks like Phoenix and toolkits such as Ecto, you can build full-stack systems in Elixir. However, to fully harness the power of the Erlang virtual machine (BEAM) you must understand processes.
Just as the Tandem computer system relied on transactions, fault containment and a fail-fast mechanism, Erlang achieves high availability through processes. Both systems consider it important to modularise systems into units of service and failure: processes.
A process is the basic unit of abstraction in Erlang, and a crucial concept because the Erlang virtual machine (BEAM) is built around it. Elixir and Gleam share the same virtual machine, which is why this concept is important for the entire ecosystem.
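A process is a lightweight, isolated unit of execution that shares no memory with other processes and communicates with them only by passing messages.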
Just remember, these are the fundamentals of Erlang, which is considered a message-oriented language, and its virtual machine (BEAM), on which Elixir runs.
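If you have never written a raw process, the following minimal sketch shows those fundamentals with nothing but `spawn/1`, `send/2`, and `receive`. The `Counter` module and its messages are hypothetical names for this example. The process keeps its own running total, and the only way to change or read that total is to send it a message.

```elixir
defmodule Counter do
  # Starts a new process whose private state is a running total.
  def start, do: spawn(fn -> loop(0) end)

  defp loop(total) do
    receive do
      {:add, n} ->
        loop(total + n)

      {:get, caller} ->
        # State never leaves the process except as a copy inside a message.
        send(caller, {:total, total})
        loop(total)
    end
  end
end

counter = Counter.start()
send(counter, {:add, 2})
send(counter, {:add, 3})
send(counter, {:get, self()})

total =
  receive do
    {:total, value} -> value
  after
    1_000 -> :timeout
  end

IO.inspect(total)
# prints 5
```

In everyday Elixir code you would usually reach for OTP abstractions such as GenServer instead of a hand-written loop, but they are built on these same primitives.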
If you want to read more about processes in Elixir I recommend reading this article I wrote: Understanding Processes for Elixir Developers.
I consider it important to read papers like Jim Gray’s article because they teach us the history behind implementations that attempt to solve problems. I find it interesting to read and share these insights with the community because it’s crucial to understand the context behind the tools we use. Recognising that implementations exist for a reason and have stories behind them is essential.
You can find many similarities between Tandem and Erlang design principles:
Take some time to read about the Tandem computer design. It’s interesting because these features share significant similarities with OTP design principles for achieving high availability. Failure is something we need to deal with in any kind of system, and it’s important to be aware of the reasons and know what you can do to manage it and continue your operation. This is crucial for any software developer, but if you’re an Elixir developer, you’ll probably dive deeper into how processes work and how to start designing components with them and OTP.
Thanks for reading about the Tandem NonStop system. If you like this kind of content, I’d appreciate it if you shared it with your community or teammates. You can visit this public repository on GitHub where I’m adding my graphic recordings and insights related to the Erlang ecosystem or contact the Erlang Solutions team to chat more about Erlang and Elixir.
Illustrations by Visual Partner-Ship @visual_partner
Jaguares, ESL Americas Office
@carlogilmar