Balancing Innovation and Technical Debt
- Nelson Vides
- 23rd May 2024
- 13 min of reading time
Let’s explore the delicate balance between innovation and technical debt.
We will look into actionable strategies for managing debt effectively while optimising our infrastructure for resilience and agility.
I was having this conversation with a close acquaintance not long ago. He’s setting up his new startup to fill a market gap he’s found, and he’s rushing to do it before the gap closes. It’s a common starting point for many entrepreneurs. You have an idea you need to implement, and until it is implemented and (hopefully) sold, there is no revenue, all while someone else can close the gap before you do. Time-to-market is key.
While there’s no revenue, you acquire debt. But while you are reasonably careful to keep it under control, you pay the Financial Debt off with a different kind of debt: Technical Debt. You make a trade-off here, one that is all too often made without awareness. This trade-off between debts requires careful thinking too: just as financial debt is an obvious risk, so is technical debt.
Let’s define these debts. Technical debt is the accumulated cost of shortcuts or deferred maintenance in software development and IT infrastructure. Financial debt is the borrowing of funds to finance business operations or investments. They share a common thread: the trade-off between short-term gains and long-term sustainability.
Financial debt can provide immediate capital for growth, but it can also drag the business into financial inflexibility and burdensome interest rates. Likewise, technical debt expedites product development or reduces time-to-market, at the expense of increased maintenance, reduced scalability, and decreased agility. It is an often overlooked aspect of a technological investment, and addressing it promptly can have a huge impact on the lifespan of the business. Just as an enterprise must manage its financial leverage to maintain solvency and liquidity, it must also manage its technical debt to ensure the reliability, scalability, and maintainability of its systems and software.
Consider the example of a rapidly growing e-commerce platform: appeal attracts demand, demand requires resources, and resources mean increased vulnerability. The growing volume of user data and resources attracts threats that aim to disrupt services, steal sensitive data, or cause reputational harm. In this environment, the platform’s success is determined by its ability to strike a delicate balance between serving legitimate customers and thwarting malicious actors, both of which keep growing in number.
Early on, the platform prioritised rapid development and deployment of new features; however, in its haste to innovate, the technical team accumulated debt by taking shortcuts and deferring critical maintenance tasks. The result is a platform that is increasingly fragile and inflexible, leaving it vulnerable to disruptive attacks and more agile competitors. Meanwhile, and reasonably enough, the platform’s financial team kept allocating capital to marketing campaigns, product launches, and strategic acquisitions, under pressure to maximise profitability and shareholder value; however, they neglected to allocate sufficient resources to cybersecurity initiatives, viewing them as discretionary expenses rather than critical investments in risk mitigation and resilience.
If we’re talking about debt, and drawing a parallel with financial terms, let’s complete the parallel. By establishing the concept of currencies, we can build quantifiable metrics of value that reflect the health and resilience of digital assets. Code coverage, for instance, measures the proportion of the codebase exercised by automated tests, providing insights into the potential presence of untested or under-tested code paths. In this vein, tests and documentation are the two assets that pay off the most technical debt.
See for example how coverage for MongooseIM has been continuously trending higher.
Similarly, Continuous Integration and Continuous Deployment (CI/CD) pipelines automate the process of integrating code changes, running automated tests, verifying engineering work, and deploying applications to diverse environments, enabling teams to deliver software updates frequently and with confidence. By streamlining the development workflow and reducing manual intervention, CI/CD pipelines enhance productivity, accelerate time-to-market, and minimise the risk of human error. Humans have bad days and sleepless nights; well-developed automation doesn’t.
Additionally, valuations on code quality that are diligently tracked on the organisation’s ticketing system provide valuable insights into the evolution of software assets and the effectiveness of ongoing efforts to address technical debt and improve code maintainability. These valuations enable organisations to prioritise repayment efforts, allocating resources effectively.
The longer any debt remains unpaid, the greater its impact on the organisation — (technical) debt accrues “interest” over time. But, much like in finances, a debt is paid with available capital, and choosing a payment strategy can make a difference in whether capital is wasted or successfully (re)invested:
Repayment assets are resources or strategies that can be leveraged to make debt repayment financially viable. Here are some key repayment assets to consider:
While debt is a quintessential aspect of entrepreneurship, acquiring it unwisely is obviously shooting oneself in the foot. You’ll have to make many decisions and choose among many trade-offs, so you had better be well-informed before putting your finger on the red buttons.
Whether you choose one vendor over another or decide to go self-hosted, use containerised technologies, so that a future move to better infrastructure remains possible. Containers also provide a consistent environment for development, testing and production. Choose technologies that are good citizens in containerised environments.
Whichever hardware architecture or amount of memory you choose, use runtimes that can efficiently use and adapt to any given hardware, so that future changes to better hardware are fruitful. For example, Erlang’s concurrency model is famous for automatically taking advantage of any number of cores, and with technologies like Elixir’s Nx you can take advantage of more specialised GPU and TPU hardware for your machine learning tasks.
The market will push your offerings to their limit, in a never-ending stream of requests for new functionality and changes to your service. Your code will need to change, and respond to changes. From Elixir’s metaprogramming and language extensibility to Gleam’s strong type-safety, prioritise tools that help your developers change things safely and powerfully.
There are two philosophies in the culture of error handling: either it is mathematically proven that errors cannot happen – Haskell’s approach – or it is assumed they can’t always be avoided and we need to learn to handle them – Erlang’s approach. Wise technologies take one starting point as an a-priori foundation of the technology and, a-posteriori, deal with the other end. Choose your point on the scale wisely, and be wary of technologies that don’t take a safe stance. Errors can happen: electricity goes down, cables are cut, and attackers attack. Programmers have bad days and sleepless nights, or get sick. Take a stance before errors bite your service.
No fancy unique idea will sell if it can’t be bought, and no service will be used if it is not there to begin with. Unavailability takes a steep toll on your revenue, so prioritise availability. Choose technologies that can handle not just failure, but even upgrades (!), without downtime. And to have real availability, you always need at least two computers, in case one dies: choose technologies that make many independent computers cooperate easily and let one take over another’s work transparently.
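To put a rough number on the "at least two computers" point, here is a minimal back-of-the-envelope sketch. It assumes that nodes fail independently of each other and that failover is instantaneous, which real deployments only approximate; the function names and the 99% figure are illustrative, not taken from any measurement.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(node_availability: float, nodes: int) -> float:
    """Probability that at least one of `nodes` nodes is up.

    Assumption: failures are independent and any single node can
    serve all traffic on its own.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # The service is down only when every node is down at the same time.
    return 1.0 - (1.0 - node_availability) ** nodes


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for n in (1, 2, 3):
        a = combined_availability(0.99, n)
        print(f"{n} node(s): {expected_downtime_hours(a):.3f} h/year of downtime")
```

Under these assumptions, a single node that is up 99% of the time is down for roughly 87.6 hours a year, while two such nodes are both down for under one hour a year. Correlated failures (a shared power supply, a bad deploy) erode that gain, which is why the independence of the computers matters as much as their number.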
A chat system, like many web services, handles a practically unbounded number of independent users. It is a heavily network-based application that needs to respond to requests that are independent of each other in a timely and fair manner. It is an embarrassingly parallel problem, since messages can be processed independently of each other, but it is also a challenge of soft real-time properties, where messages should be processed soon enough for a human to have a good user experience. It also faces the challenge of bad actors, which makes request blacklisting and throttling necessary.
MongooseIM is one such system. It is written in Erlang, and in its architecture, every user is handled by one actor.
It is containerised, and it uses all available resources efficiently and smoothly, adapting to any change of hardware, from small embedded systems to massive mainframes. Its architecture uses the Publish-Subscribe programming pattern heavily. Because Erlang is a functional language, functions are first-class citizens, so functions are installed as handlers for all sorts of events, because we never know what new functionality we will need to implement in the future.
One important event is a new session starting. Mechanisms for blacklisting are plenty, whether they’re based on specific identifiers, IP regions, or even modern AI-based behaviour analysis. We can’t predict the future, so we simply publish the “session opened” event and leave it to our future selves to install the right handler when it is needed.
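The idea of publishing an event now and installing handlers later can be sketched in a few lines. This is a Python illustration of the pattern only, not MongooseIM’s actual Erlang API: the `EventBus` class, the `session_opened` event name, the `stop` flag and the blocklist are all hypothetical names chosen for the example.

```python
from collections import defaultdict
from typing import Any, Callable

# A handler receives an accumulator (a plain dict here) and returns a new one.
Handler = Callable[[dict[str, Any]], dict[str, Any]]


class EventBus:
    """Minimal publish-subscribe registry with ordered handlers per event."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event: str, handler: Handler) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, acc: dict[str, Any]) -> dict[str, Any]:
        # Run handlers in subscription order, threading the accumulator through.
        for handler in self._handlers[event]:
            acc = handler(acc)
            if acc.get("stop"):  # a handler may end processing early
                break
        return acc


# Hypothetical blocklist, using an address from a documentation-only range.
BLOCKED_IPS = {"203.0.113.7"}


def reject_blocked_ip(acc: dict[str, Any]) -> dict[str, Any]:
    """Example handler, installed only once the need for it arises."""
    if acc["ip"] in BLOCKED_IPS:
        return {**acc, "allowed": False, "stop": True}
    return acc


if __name__ == "__main__":
    bus = EventBus()
    session = {"user": "alice", "ip": "203.0.113.7", "allowed": True}
    print(bus.publish("session_opened", session))  # no handlers yet
    bus.subscribe("session_opened", reject_blocked_ip)
    print(bus.publish("session_opened", session))  # handler now rejects it
```

The publishing side never changes: before any handler exists, the event passes through untouched, and a new blocking policy is added by subscribing one more function rather than by editing the session code.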
Another important event is that of a simple message being sent. What if bad actors have successfully opened sessions and start flooding the system, consuming CPU and database resources unnecessarily? Again, changing requirements might dictate that the system should give some users preferential treatment. One default option is to slow down all message processing to some reasonable rate, for which we use a traffic shaping mechanism called the Token Bucket algorithm, implemented in our library Opuntia – named that way because if you touch it too fast, it stings you.
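For readers unfamiliar with the Token Bucket algorithm, here is a minimal Python sketch of the shaping variant, where traffic is delayed rather than dropped. It is not Opuntia’s API (Opuntia is an Erlang library); the class name, the `delay_for` method and the injectable `clock` parameter are assumptions made for this illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if capacity <= 0 or rate <= 0:
            raise ValueError("capacity and rate must be positive")
        self.capacity = capacity
        self.rate = rate
        self._clock = clock
        self._tokens = float(capacity)  # start full so short bursts pass freely
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        elapsed = now - self._last
        # Add the tokens earned since the last call, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now

    def delay_for(self, cost: float) -> float:
        """Consume `cost` tokens and return how many seconds the caller should wait.

        The bucket may go into "debt" (negative tokens); the returned delay
        is the time needed for the refill rate to pay that debt back.
        """
        self._refill()
        self._tokens -= cost
        if self._tokens >= 0:
            return 0.0
        return -self._tokens / self.rate


if __name__ == "__main__":
    bucket = TokenBucket(capacity=10, rate=5)  # bursts of 10, then 5 units/second
    for size in (4, 6, 5, 5):
        print(f"message of size {size}: wait {bucket.delay_for(size):.2f}s")
```

A caller would sleep (or schedule the next read) for the returned delay before processing more input from that user. A burst that fits in the bucket goes through immediately, while a sustained flood is slowed to the refill rate, and giving some users preferential treatment is then a matter of handing them a bucket with a larger capacity or rate.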
You can read more about how scalable MongooseIM is in this article, where we pushed it to its limit. And while we continuously load-test our server, we haven’t done another round of limit-pushing since then. Stay tuned for a future blog post when we do just this!
Technical Debt has an inherent value akin to Financial Debt. Choosing the right tool for the job means acquiring the right Technical Debt when needed – leveraging strategies, partnerships, and solutions that prioritise resilience, agility, and long-term sustainability.