Using Elixir and WhatsApp to Fight COVID19
- Erlang Solutions Team
- 7th Apr 2022
- 30 min of reading time
Discover the inside story of how the World Health Organisation’s WhatsApp COVID-19 hotline service was launched in 5 days using Elixir.
My name is Simon de Haan. I’m based in the Netherlands. I am the CTO and Co-Founder of turn.io, a software-as-a-service company. We spun out of a nonprofit in South Africa called praekelt.org. We have a decade-long history of using mostly SMS and similar texting platforms for social good initiatives in areas such as health, education, employment, civic engagement, and things like that.
Since 2017, we’ve been working with WhatsApp specifically. Turn is a software-as-a-service tool, like I said earlier, for teams to have personal guided conversations that improve lives at scale.
Now, practically what that means is that we help social impact teams scale their work significantly, while not being overwhelmed. The strategy here is quite simple. We connect teams to their audiences over WhatsApp. We help prioritize the key conversations that need the most urgent attention, and we help guide those conversations towards outcomes. Then we track whether or not that’s happening.
If you’re thinking about social impact teams, what type of teams is that? That’s NGOs, nonprofits, social enterprises. In the U.S., they’re called, for example, B Corps, but also very large humanitarian organizations like, for example, the WHO, which I’ll be talking about shortly.
An example of an organisation or initiative that uses turn.io is MomConnect in South Africa. That’s the South African National Department of Health’s Maternal Health Program, which we launched as part of a WhatsApp pilot in 2017.
The Department of Health in South Africa needed to be in regular contact with pregnant women. They needed to be able to triage questions coming in and give guidance according to national policy and keep track of the progress being made with regards to clinic visits, inoculations, nutrition, and later on early childhood development. Now, this started in South Africa. The nonprofit that we’ve spun out of is also based in South Africa so that’s why some of our roots are there.
To give you an idea of what these kinds of conversations look like: for example, we can send people reminders that their HIV medicine, the ARVs, is available at a clinic and will help prevent the transmission of HIV to the baby during childbirth. From there, all sorts of questions come back.
For example, this is a real example, but it’s not a real profile picture. Questions that you get are things like, “If I am HIV positive, is it possible to breastfeed my child?” What we do is apply natural language understanding to automatically triage the questions coming in. Here we’ve identified it as a question, and then matched it with an appropriate answer, which is immediately sent back to the mother as the relevant answer to her question.
Some other examples are things like mixed media. For example, a mother has received some medicine from a clinic, has bought some medicine from a store as well, and isn’t quite sure how to go about it. We help identify what the question is about; in this case, it’s vitamin adherence.
We help the triage process figure out what is important, what needs to be attended to, and who’s the best person to answer these questions, and then, in the case of MomConnect, connect the mother to a real human.
There are also very specific things around behaviors. For someone 10 weeks pregnant, there’s some very clear guidance on what you can do. This example is about washing your hands, and it relates to maternal health, but I think all of us have become fairly experienced over the last six months with regards to the importance of washing hands. The software is able to track specific messages that relate to specific behaviors. Again, this uses machine learning models and natural language understanding to do that matching.
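The talk later credits Rasa for the natural language understanding, so here is a hedged sketch of what this triage could look like from the Elixir side: a module posting an inbound message to a Rasa server’s /model/parse endpoint and deferring to a human when confidence is low. The URL, threshold, and module names are illustrative, not Turn’s actual code.

```elixir
defmodule Triage do
  @rasa_url "http://localhost:5005/model/parse"
  @threshold 0.7

  # Classify an inbound WhatsApp message by intent via a Rasa NLU server.
  def classify(text) do
    body = Jason.encode!(%{text: text})
    headers = [{"content-type", "application/json"}]

    with {:ok, %HTTPoison.Response{status_code: 200, body: resp}} <-
           HTTPoison.post(@rasa_url, body, headers),
         {:ok, %{"intent" => %{"name" => intent, "confidence" => conf}}} <-
           Jason.decode(resp) do
      # Confident classifications get an automated answer; the rest
      # are routed to the human triage queue.
      if conf >= @threshold, do: {:ok, intent}, else: {:needs_human, text}
    else
      _ -> {:needs_human, text}
    end
  end
end
```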
Another example is when we send a reminder to a mother where we tell her, “Hey, your child is so-and-so old, you need to schedule your inoculations or vaccinations. This is why it’s important, and this is where you can do it.”
Then we get a message back: “Hey, we already went to the clinic on the 1st of August, and things are fine.” Then we can track that as, “Okay, this is the six-week immunization, that step’s been completed, and everything is on track; nothing else is needed.”
This is what we started with originally in South Africa with MomConnect. It has its roots in SMS, and we introduced WhatsApp as a pilot. Just as an anecdote: as soon as we started allowing WhatsApp, we saw that user behavior on SMS and WhatsApp is entirely different. This is hard to comprehend for people in the U.S., for example.
As soon as we started allowing messaging over WhatsApp, the volume ratio of SMS to WhatsApp immediately went to 1 to 10, which was quite incredible. That also informed our decisions moving forward: “Okay, this is a different thing than what we’ve done before; we need different infrastructure to address this kind of volume and these usage patterns.”
That worked, and building on those experiences, we launched COVID Connect in South Africa, the world’s first government COVID-19 hotline on the WhatsApp Business API. It reached 500,000 unique users before the official launch and has since grown to 8 million subscribers.
This was the world’s first government COVID-19 hotline where people could receive accurate information on the state of COVID-19 in their country, what the guidance was, etc. That led to our work with the WHO. We had a very long-standing relationship with the WHO from previous initiatives. So, things came together with Facebook, the WHO, and ourselves, and we were asked to help launch the service in pretty much the space of a week. It was the world’s first launch of this size.
What made it pressing was that in the area of COVID-19, misinformation can spread faster than the virus itself. So, when there’s an entirely new situation like, for example, COVID-19, what we learn, what science knows about it, what the best guidance is can change on a day-by-day basis. For instance, I’m based in the Netherlands. The guidance of this past week is different again from the last two weeks, and is different again from three months ago.
Having a system out there that’s accurate, that people can access to know what the latest stats are, what the latest guidance is, and what the latest understanding of the virus is, is just extremely important in a global pandemic like the one we still find ourselves in.
The use case is quite simple. For those of you who do have WhatsApp, you can scan the QR code with your WhatsApp client or with your normal iOS camera, and it’ll launch the service for you. You can interact with the service, and you can get live, updated case numbers per country, which integrates with the dashboards from Johns Hopkins and retrieves that information from there.
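As a hedged sketch of what such an integration might look like, here’s a module that fetches per-country numbers and formats a WhatsApp reply. The endpoint and field names are hypothetical, standing in for the Johns Hopkins data feed:

```elixir
defmodule CaseNumbers do
  # Hypothetical JSON endpoint standing in for the Johns Hopkins feed.
  @endpoint "https://covid-data.example.com/v1/countries"

  # Fetch the latest numbers for a country and format a WhatsApp reply.
  def reply_for(country_code) do
    with {:ok, %HTTPoison.Response{status_code: 200, body: body}} <-
           HTTPoison.get("#{@endpoint}/#{country_code}"),
         {:ok, %{"confirmed" => confirmed, "deaths" => deaths, "updated_at" => at}} <-
           Jason.decode(body) do
      {:ok,
       """
       *COVID-19 cases, #{country_code}* (updated #{at})
       Confirmed: #{confirmed}
       Deaths: #{deaths}
       """}
    else
      _ -> {:error, :unavailable}
    end
  end
end
```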
You have the latest news relating to COVID-19 coming from the WHO. You’ve got information on basically how to combat misinformation around COVID-19. There’s been plenty of that around, like, what works, what doesn’t work, playing on people’s fears, and things like that. The bot fulfills a substantial role there as well. While I’m talking, feel free to interact with the service. It should be responsive for you. If not, then well, you all know what a technical demo is like.
On launch day, this is largely what it looked like. First, there was an announcement by Director-General Tedros, which you can see as the first spike there. Then there was a second announcement a few hours later from Mark Zuckerberg, who posted it on his Facebook page. Apparently, Mark has a lot of followers, because that immediately caused quite an increase in the messages per second being processed through the system.
Now, this graph is from a single cluster. We had multiple clusters in various zones around the world to deal with the traffic. So, the story here is about what it looked like to build a service like this in such a short amount of time. What tools worked? What helped? What are our learnings? Hopefully, you’ll take something away from this talk that helps you in your day-to-day work.
I’m hoping we don’t have many more of these global pandemics where these learnings would apply, but I’m hoping that there are more learnings for you that you can take away in your day-to-day work as well.
This is the timeline of things. COVID-19 work started on the 9th of March. We had a soft launch around the 15th of March, which was, to be honest, quite accidental. I’ll talk a little bit more about that later. And then the official launch was on the 18th of March. Now, that was the South African version, the one that scaled to now about eight million.
The work for the WHO started on the 11th. The WHO infrastructure was ready on the 17th. The soft launch was around the 18th, somewhere around there, and the public launch was on the 20th of March. The infrastructure for the WHO was on AWS. So, part of the work there was not necessarily building the application, but just making sure the infrastructure was up and running and ready to go.
The amazing thing was that 10 million people used the service in just over 48 hours, which is a testament both to the reach of WhatsApp and to the tools and infrastructure, Elixir and the Phoenix framework, that made this possible for us. We were proud and still are proud of the service we’ve built. And this is definitely the first time for us that a service has scaled to these numbers in such a relatively short amount of time.
So, I have an announcement as well. Again, this is me playing on one of our learnings around soft launches that I’ll elaborate on a little bit later. World Mental Health Day is coming up and the WHO is launching a service specifically for that.
Given past experiences of emergencies, everyone knows the mental health and psychosocial support that’s needed in a time of emergency. Especially during COVID-19, with everyone either being in quarantine or isolated in some way, the expectation is that the need for psychosocial and mental health support is only going to increase.
For the WHO, as an extension of the original COVID-19 work we’ve done, we’re also launching a digital guide to stress management. Again, you can scan the QR code, or if you’re already using the service, just type the word breath and it’ll launch the service for you.
This takes you through several exercises for stress management. You can type Start, and it’ll start your stress management journey and take you through a few days of grounding and stress management tools to help you deal with anything stress-related, built in part on the Elixir stack.
This is the first time we’re building things that are more stateful than what we’ve done before using this infrastructure, so this is a new thing. This hasn’t been public; there have been a few press releases, but it hasn’t received a big press push yet. So many of you will be seeing this for the first time. I’m hoping it will be of use to many people.
The way this is built, as I said earlier, is a bit of a departure from earlier designs: it’s far more stateful and more complex. This is our first run with this approach. It allows Turn to build more threaded, sequential, text-based interactions with the service. You can see on the left there’s a bunch of actions, and on the right there’s the conversation being modelled. I’ll touch on that a little bit later as well.
All the other stuff we’ve done was very stateless, and that also made it quite a bit easier to scale. This is the first time we’re doing more stateful things. We’re confident it’ll hold up, but it’s a new piece of technology and that’s always exciting.
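To give a rough idea of what “more stateful” can mean in BEAM terms, here’s a minimal sketch, not Turn’s actual implementation, of a per-user journey process that steps through a fixed sequence of exercises. It assumes a Registry named Journey.Registry (started with keys: :unique) is supervised elsewhere:

```elixir
defmodule Journey.Session do
  use GenServer

  @steps [:welcome, :grounding, :slow_breathing, :body_scan, :wrap_up]

  # One process per user, registered under the user's id.
  def start_link(user_id),
    do: GenServer.start_link(__MODULE__, user_id, name: via(user_id))

  def handle_message(user_id, text),
    do: GenServer.call(via(user_id), {:message, text})

  defp via(user_id), do: {:via, Registry, {Journey.Registry, user_id}}

  @impl true
  def init(user_id), do: {:ok, %{user: user_id, steps: @steps}}

  # Each inbound message advances the journey by one step.
  @impl true
  def handle_call({:message, _text}, _from, %{steps: [step | rest]} = state),
    do: {:reply, {:send, prompt(step)}, %{state | steps: rest}}

  def handle_call({:message, _text}, _from, %{steps: []} = state),
    do: {:reply, {:send, "You've completed the journey. Type Start to begin again."}, state}

  defp prompt(:welcome), do: "Welcome! Reply with anything to begin your first exercise."
  defp prompt(step), do: "Today's exercise: #{step |> to_string() |> String.replace("_", " ")}."
end
```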
The stack is actually quite simple. For the WHO, it’s just Kubernetes on Amazon Web Services, provisioned with Terraform. We don’t have any specific allegiance or preference for any of the large hosting providers. Kubernetes still feels very, very academic to us, but it provides a useful base platform for deploying, almost treating it as an operating system. It’s worked well for us, so we don’t have any complaints there.
Other than that, it’s Postgres 9.6, Elasticsearch, Faktory, which is a queue worker, and then Elixir 1.10.4 on OTP 22. I don’t know if Elixir is boring, but the other ones, certainly Postgres, are a bit of a boring technology. We love Postgres. It’s extremely solid, a very reliable workhorse for the work we’ve done. It’s performed incredibly well.
Faktory: for those that don’t know it, people from the Ruby community may be familiar with Sidekiq, the job worker. Faktory is from the same author, and it’s a language-agnostic job worker that, certainly for us, has worked very well.
Digging a little bit further into the stack, we’re using the Phoenix web framework, and we’re using React on the frontend. Phoenix is working extremely well for us. We’re not doing anything particularly new or fancy there with regards to the new releases functionality, mostly because a bunch of the code we’ve written for this predates it. These are just deployed as Docker containers running the Phoenix app. Then there’s a React frontend which is managed and deployed via Netlify.
Then we use Absinthe for GraphQL on the backend and Apollo on the frontend. All of that communication happens via WebSockets. Data loading for GraphQL is handled with Dataloader, and we use Arc for storage to either S3 or Google Cloud Storage, depending on the hosting environment we’re working in.
We’re using the combination of Quantum and Highlander to schedule cron-style recurring jobs. We had an issue at one point where we almost launched a distributed denial-of-service attack on ourselves, because it was quite difficult at times to limit the processes that Quantum ran on a schedule to just a single node in a cluster. But Highlander solved that beautifully for us.
So, if you’re looking for a way to run scheduled jobs in a clustered environment while only running them on one node at a time, Highlander will help you there. We’re also using a library called ExRated for rate-limiting all API endpoints, and FaktoryWorkerEx to talk to the Faktory job server.
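As a concrete illustration of that pattern, here’s a minimal sketch, with illustrative module names, of a Quantum scheduler wrapped in Highlander so it runs on exactly one node of the cluster at a time:

```elixir
# A Quantum scheduler for cron-style recurring jobs.
defmodule MyApp.Scheduler do
  use Quantum, otp_app: :my_app
end

# config/config.exs — an illustrative job running every five minutes.
config :my_app, MyApp.Scheduler,
  jobs: [
    {"*/5 * * * *", {MyApp.Stats, :refresh, []}}
  ]

# In the application's supervision tree: Highlander registers the child
# globally, so only one instance of the scheduler is alive cluster-wide.
children = [
  {Highlander, Supervisor.child_spec(MyApp.Scheduler, [])}
]
```

And rate-limiting with ExRated can be as simple as a plug; the bucket key and limits below are made up for the example:

```elixir
defmodule MyAppWeb.RateLimitPlug do
  import Plug.Conn

  def init(opts), do: opts

  # Allow at most 100 requests per client IP per minute.
  def call(conn, _opts) do
    bucket = "api:" <> to_string(:inet.ntoa(conn.remote_ip))

    case ExRated.check_rate(bucket, 60_000, 100) do
      {:ok, _count} -> conn
      {:error, _limit} -> conn |> send_resp(429, "Too Many Requests") |> halt()
    end
  end
end
```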
The stack itself looks like this: we have the load balancer, which is generally provided by the hosting environment; SSL and all that stuff is terminated there. Then the Phoenix app, which is just a straight-up, normal Phoenix app; there’s nothing really special about it. Those are all auto-scaled within limits based on CPU thresholds using Kubernetes, and then automatically clustered with libcluster using the Kubernetes strategy. That works extremely well for us. Both the Faktory workers and the Phoenix app are all joined in one big cluster.
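For those who haven’t used it, the libcluster side of that is mostly configuration. A minimal sketch of the Kubernetes strategy, with the app and label names being illustrative:

```elixir
# config/config.exs — nodes discover each other via a Kubernetes selector.
config :libcluster,
  topologies: [
    k8s: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        mode: :dns,
        kubernetes_node_basename: "turn",
        kubernetes_selector: "app=turn"
      ]
    ]
  ]

# application.ex — start the cluster supervisor with that topology.
topologies = Application.get_env(:libcluster, :topologies)

children = [
  {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]}
]
```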
Then there’s the WhatsApp Business API which, for those that aren’t familiar with it, is several Docker containers that you need to run that take care of all of the end-to-end encryption and things like that. For the WHO, we run this with 32 shards on the WhatsApp Business API. And there’s a bunch of stateful services: the Faktory server, Elasticsearch, and Postgres.
This stack was replicated in multiple zones around the world for load balancing. As many of you know or may have seen, the QR code just points to a URL, and WhatsApp conversations can be started with a URL. What we did was use a Bitly link to round-robin between the different clusters to spread the load.
So, when you opened the link, you first went to essentially a serverless cloud function, which then picked one of the various clusters around the world and assigned you to it. That helped us manage the load across these various installations for the launch of the service.
I think what’s impressed us is just the ease of clustering of BEAM nodes. Many of you are working with Elixir or have been working with Elixir for, I don’t know how many years already. This is probably old news.
We come from a Python background, and turn.io is our team’s first production Elixir environment. Some of the things that were really hard problems in Python just don’t exist in Elixir. Simple things like publishing to WebSockets via GraphQL subscriptions from any BEAM node are just so easy. It almost feels unfair if you’re coming from an environment that doesn’t have that clustering idea built in.
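To make that concrete: with Absinthe, a subscription field and a publish call are all it takes, and Phoenix PubSub fans the event out to connected WebSockets regardless of which node publishes. The schema, field, and topic names below are illustrative:

```elixir
# In the GraphQL schema: clients subscribe to messages for a chat.
subscription do
  field :message_received, :message do
    arg :chat_id, non_null(:id)

    config fn args, _info ->
      {:ok, topic: "chat:#{args.chat_id}"}
    end
  end
end

# From anywhere in the cluster, on any BEAM node:
Absinthe.Subscription.publish(
  MyAppWeb.Endpoint,
  message,
  message_received: "chat:#{message.chat_id}"
)
```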
So subconsciously, there’s a whole set of problems that you’re almost inclined not to even approach, simply because the language or the underlying infrastructure doesn’t allow it.
For us, working with Elixir in many ways feels magical; not in a bad, code-magic way, but more like, “Wow, there’s this whole new world of opportunities that we previously weren’t thinking about that is now available to us.” Which is quite incredible.
The other thing that worked well is network control; we’ve stopped worrying about processes. turn.io is a network-heavy application, and in some of our earlier Python applications, things like long-running network connections were often problematic and forced us to make things asynchronous.
Once you’re running things asynchronously, you introduce a whole bunch of other problems, like backpressure or rate-limiting, that become difficult because you still need to communicate between these various systems. Elixir as a language made those problems much, much simpler for us to reason about. I feel like it gives us a really good set of tools to manage those problems.
Another thing that worked well is monitoring. Now, this is not specifically Elixir, I guess, but there are some great libraries, and the BEAM VM allows you to introspect your processes very well. If you’re going to run things at scale, or at high volume, really invest in your monitoring and observability.
Prometheus and Grafana are immensely valuable and will highlight upcoming problems. We use Zipkin to just get insights into delays when they happen.
Parts of Turn are pretty distributed, as I was showing earlier, and Zipkin highlighted to us which code paths were slow. On top of that, with Prometheus and Grafana, escalations through PagerDuty were very straightforward and worked extremely well.
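As one small example of the kind of instrumentation involved, here’s a hedged sketch using Erlang’s :telemetry to watch Phoenix request durations; in practice you would feed these measurements into a Prometheus histogram rather than a logger, and the handler name and threshold are made up:

```elixir
defmodule MyApp.RequestTimer do
  require Logger

  # Attach to the standard Phoenix endpoint telemetry event.
  def setup do
    :telemetry.attach(
      "request-timer",
      [:phoenix, :endpoint, :stop],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # Duration arrives in native time units; convert before comparing.
  def handle_event(_event, %{duration: duration}, _metadata, _config) do
    ms = System.convert_time_unit(duration, :native, :millisecond)
    if ms > 100, do: Logger.warn("slow request: #{ms}ms")
  end
end
```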
Automation worked well. Again, many of these things sound very simple when you mention them, but these are still the kinds of learnings you get when you build a system like this.
So, if you’re running a small team, really invest in automating as much as possible. The value of a good CI/CD setup compounds over time. Our team size at launch was tiny; right now, we’re about seven developers, I think. But automation felt like it added another team member, or several team members: we didn’t have to worry about whether our deploys were going through, and we didn’t have to worry about versioning things.
So many of the things that, historically, I would have needed a dedicated team for, the tools that are available now just take care of.
Right now, production releases are built and deployed within three minutes of a tagged commit, and QA releases are built and deployed on each commit. As a result of automation, our deploys are smaller and less stressful.
Feature flags: identify the things that you can live without, and make it easy to turn them off. For launch, we disabled live Elasticsearch indexing; this is both a thing that worked well and a thing that didn’t work well. We also disabled media support, and we kept everything within the service as stateless as possible.
For feature flags, Elixir’s pattern matching made this very easy in the codebase. If there’s a specific thing we don’t want to happen, we set a flag, skip that code path entirely, and just continue. That’s what allowed us to disable Elasticsearch very easily. I’ll touch a little bit more on Elasticsearch later.
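A minimal sketch of the pattern, with made-up flag and module names: a function head matches on the flag and skips the code path entirely.

```elixir
defmodule MyApp.Search do
  # Public entry point: look up the flag once, dispatch on it.
  def maybe_index(doc), do: maybe_index(doc, enabled?())

  # Flag off: skip Elasticsearch entirely and carry on.
  defp maybe_index(doc, false), do: {:ok, doc}

  # Flag on: index live. MyApp.Search.Indexer is hypothetical.
  defp maybe_index(doc, true) do
    :ok = MyApp.Search.Indexer.index(doc)
    {:ok, doc}
  end

  defp enabled?, do: Application.get_env(:my_app, :live_indexing, false)
end
```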
Load test all the critical paths extensively, and make it easy to do so repeatedly so that you can track the effectiveness of the changes you’re making. Again, these things are very logical, but if you’re under stress and needing to deploy something within a couple of days for a global audience, these are the things you’re likely to forget but do need to pay attention to.
We load tested the application to 1,800 requests per second on a single cluster, which is more than double our expected maximum. With that, we ensured that response times remained below 100 milliseconds. We used loader.io to run those load tests.
Faktory, the job server I touched on earlier, has been extremely reliable for us; for one of our clusters, it’s processed 1.7 billion events. Historically, we would have defaulted to something like RabbitMQ, but Faktory gives us retries with exponential backoff out of the box, which is extremely convenient. I know you can build these things on RabbitMQ and it’ll work well; it’s just one of those things that we now didn’t need to build. We’re very grateful for Faktory and the team behind it.
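For a sense of what that looks like from Elixir, here’s a sketch following faktory_worker_ex’s documented job pattern; the module name, queue, and the WhatsApp.send_message/2 call are illustrative, and details may differ by library version:

```elixir
defmodule DeliverMessage do
  use Faktory.Job

  # The Faktory server retries failed jobs with exponential backoff.
  faktory_options queue: "outbound", retry: 25

  # Raising here marks the job as failed; Faktory schedules the retry.
  def perform(recipient, body) do
    {:ok, _} = WhatsApp.send_message(recipient, body)
  end
end

# Enqueue from anywhere in the app; arguments travel as a list.
DeliverMessage.perform_async(["+27820000000", "Hello!"])
```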
Now, some things that did not work. Again, we touched on this a little earlier with the feature flags. Elasticsearch is a great piece of software, but in our experience, it’s difficult to run from an operational perspective. We’re confident we could have done it, but we didn’t want to have to focus on that, so we disabled it for launch.
It’s a combination of what worked and what didn’t work here. It’s just one of those things: if you don’t need to worry about it, don’t worry about it for launch, and make sure you can turn it back on when you do need it.
Other things that didn’t work were the half-automated parts. I say this, and I’ve heard other people say it as well: broken gets fixed, but shitty lasts forever. Some operational things were not automated the way they should have been in Terraform.
There is no such thing as a squeaky-clean production environment; it just does not exist, certainly not if you’re under massive time pressure to deliver something because there’s a global pandemic. That said, one can certainly work towards making sure things are not messy.
The reality is that some things will always be messy in a production system and it’s really hard to detangle those things. So right now, six months in, we’re still trying to detangle some of the shortcuts that we took with regards to deployments that aren’t working in the way that we would like them to work.
If you can avoid half-automating things, then do so. Sometimes it’s better not to automate at all, and then later bite the bullet and do it well, than to do it halfway. Because if you do it halfway, it’s always going to last longer than you would expect.
Soft launches are vital. This one was accidental, to be honest. I’m saying here that we always made sure we were seeing high volumes before any major public launch. In a way, that’s true for every single launch except the first one. The first one was an accident, where a Department of Health representative tweeted about the service pre-launch.
What that did was give us about 24 hours to stress test the system, see that it was working, and observe it in production before any media attention was focused on it. You will help your team by doing this, because it relieves the stress of big bang launches.
Soft launches are vital. So, everyone who’s now testing our stress guidance on the WHO service, thank you very much. If there are bugs, you’re helping us catch them before actually going live globally.
The other thing we’ve learned, certainly for first launches, is to keep things simple. Everyone knows this, everyone repeats it, and it’s still hard to do in production, in real-life use cases. The reality is that simple applications are just way easier to scale. So, we launched with a single language only, zero stateful interactions (it was just keyword-response stuff), no media support, and no search index, because we turned that off.
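To illustrate just how simple that launch design was, a stateless keyword router can be little more than pattern matching. The menu below is a made-up approximation, not the WHO bot’s actual copy:

```elixir
defmodule Bot.Router do
  # Every reply is derived from the inbound text alone: no per-user state.
  def reply(text) do
    text |> String.trim() |> String.downcase() |> route()
  end

  defp route("hi"), do: menu()
  defp route("1"), do: "Latest case numbers: ..."
  defp route("2"), do: "Latest WHO news: ..."
  defp route("3"), do: "Myth-busters: ..."
  defp route(_), do: "Sorry, I didn't understand that.\n\n" <> menu()

  defp menu do
    """
    Reply with a number:
    1. Latest numbers
    2. Latest news
    3. Myth-busters
    """
  end
end
```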
Quite a few other services launched during the same time: national services, government services, and regional services. Pretty much all of them suffered a significant amount of downtime, simply because they couldn’t keep up with the load or weren’t prepared. The WHO service was the largest of them all at launch and was also the only one that stayed up the whole time.
For a large part, that’s because the stack served us extremely well; Phoenix worked extremely well. But part of it is also just strategy: keeping your application simple.
The other thing is to plan for surges. We defaulted to running at 5% of capacity at all times. Traffic tends to be quite bursty as a result of television coverage or social media activity, and sitting at 5% gave us the headroom to scale up as needed to deal with the surges in demand.
In a way, this spare capacity could feel like a waste. On the other hand, it made sure we were able to scale up and provide a valuable service to people in a, hopefully, once-in-a-century global pandemic.
The other thing is to ask for help. We’re a very small team and we needed help to pull this off. A huge amount of credit goes to the team of experts at both Amazon Web Services and WhatsApp, who worked alongside us for the biggest installation of this WhatsApp Business API at launch.
Practically, what this looked like was multiple WhatsApp groups open, constantly open Zoom calls, and exchanging insights while we were all observing the system and how it worked. So ask for help. Even if you’re an expert in a field, still ask for help; don’t go at it solo.
Now, the other thing is that there’s a really big thank you we constantly want to express as a team. When this was going on, and you all saw the timeline, it was just a couple of days; we didn’t have a lot of time to prepare.
We sent out an email to several smaller teams asking them to be on standby should help be necessary. Mostly just saying, “Hey, you’re delivering this piece of software, or you’re responsible for this piece of software, or you’re part of the team that manages it. It’s a critical piece of infrastructure for the world’s first global response to the COVID pandemic on WhatsApp. Would you please at least be aware of that fact and make yourself available should anything break or need attention?”
This was part of us reaching out for help in the community, and it was incredible: everyone showed up within 24 hours. This is just a testament to the Elixir community, and to many of you, I suppose, in many ways as well.
For this launch, we just want to say a really big thank you to Dashbit for the Elixir advice. José jumped on a call, and the team helped out, did some reviews, and gave advice on how we could optimize things.
Sentry for the error reporting tools; we couldn’t have done it without those. Contribsys for the Faktory job server; 1.7 billion events on a single cluster is not a small number. And the team at Rasa for the natural language understanding.
Beyond that, the Elixir community as a whole for all the tooling that made this possible: the various clients we’re using for rate-limiting and everything else this relies on, Phoenix, Ecto, and Elixir itself, which just released 1.11, which we’re very excited about.
If you’d like to learn more about turn.io, you can visit their website. If you’d like to find out more about how Elixir can empower you to have massively scalable solutions ready for market at rapid speed, talk to us. And if you’d like to hear the latest Elixir case studies, features, and frameworks, join us at ElixirConf EU 2022.