How to Ensure Continuous Uptime for Your Critical Applications

You may remember the tech issues British Airways had between 2017 and 2019, ranging from a data breach to the failure of multiple IT systems and applications shortly after. Cue USD $230 million in fines and a lot of angry customers, and that was only the tip of the proverbial iceberg. While it would be hard to fault BA on any specific point given the sheer size and complexity of its infrastructure, the episode nevertheless serves as a cautionary tale of how things can go wrong when the tools and systems people rely on fail when they are needed most.

Just in case you need a quick recap on the dangers associated with downtime of any kind, here’s a list of the most common ones:

Financial impact

The price tag associated with downtime is well-known – anything from a few thousand to hundreds of thousands of dollars per hour, depending on your industry and the size of your business.

Reputational damage

The long-term impact of downtime is better reflected in the cost of reputational damage, which can outweigh the more immediate financial hit. I’ve gone into more detail about reputational damage in this post.

Indirect costs

Application downtime can also have indirect costs, with employee productivity, lost opportunities, and damage to existing customer relationships among the most notable.

While individual applications and tech stacks can have unique requirements for continuous uptime, they’re all typically hosted in high availability (HA) environments designed to guarantee an agreed level of operational performance, usually uptime, for longer than would normally be expected. This is often achieved with practices that include the following:

Eliminate single points of failure

A single point of failure (SPOF) is a part of a system that, if it fails, brings the entire system down. Eliminating SPOFs, in its simplest terms, means deploying duplicate components, such as servers, power supplies, and even network paths, so that a backup is always available if one fails. It can also mean running multiple data centres or cloud regions to protect against large-scale events (like natural disasters or power outages) that could affect an entire geographical area.
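
As a rough illustration of the principle, the sketch below tries a primary endpoint and falls over to a duplicate if it is unreachable. The URLs are hypothetical, and in practice this failover is usually handled by DNS, a load balancer, or the hosting platform rather than by application code.

```python
import urllib.request
import urllib.error

# Hypothetical endpoints: in practice these would be redundant servers,
# data centres, or cloud regions, each behind its own health checks.
ENDPOINTS = [
    "https://primary.example.com/health",
    "https://backup.example.com/health",
]

def fetch_with_failover(endpoints, timeout=3):
    """Try each redundant endpoint in turn and return the first healthy response."""
    last_error = None
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url, resp.read()
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc  # remember the failure and fall through to the next endpoint
    raise RuntimeError(f"All redundant endpoints failed: {last_error}")

if __name__ == "__main__":
    url, body = fetch_with_failover(ENDPOINTS)
    print(f"Served from {url}")
```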

Employ continuous monitoring

Eliminating all single points of failure isn’t a cure-all for downtime – many other things can still go wrong, which is why 24/7 monitoring of the application layer is vital. Monitoring here doesn’t only refer to the infrastructure the application is hosted on, but also to the application itself, with an emphasis on critical processes and performance. Identifying and addressing issues proactively prevents many potential problems before they impact end-users.
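
A minimal sketch of application-layer monitoring might look like the following: it probes a health endpoint on a schedule, measures response time, and flags anything slow or failing. The URL, interval, and latency budget are illustrative assumptions; a production setup would feed these alerts into an on-call or incident tool.

```python
import time
import urllib.request
import urllib.error

# Illustrative values; the endpoint, interval, and latency budget would come
# from your own application's requirements.
CHECK_URL = "https://app.example.com/health"
INTERVAL_SECONDS = 30
LATENCY_BUDGET_SECONDS = 2.0

def check_once(url):
    """Probe the application itself (not just its host) and report status and latency."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=LATENCY_BUDGET_SECONDS) as resp:
            return resp.status == 200, time.monotonic() - start
    except (urllib.error.URLError, OSError):
        return False, time.monotonic() - start

def monitor(url):
    while True:
        healthy, latency = check_once(url)
        if not healthy or latency > LATENCY_BUDGET_SECONDS:
            # In a real setup this would page an on-call engineer or open an incident.
            print(f"ALERT: {url} unhealthy or slow ({latency:.2f}s)")
        else:
            print(f"OK: {url} responded in {latency:.2f}s")
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    monitor(CHECK_URL)
```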

Ensure scalability

Scalability refers to the ability of a system to handle a growing amount of work, or the system’s potential to be enlarged to accommodate higher workloads. This typically means adding more resources such as processing power (CPU), memory (RAM), storage space, and network bandwidth. If those resources fall short when they’re needed (e.g. during peak times), the application will either struggle to perform its functions or go down. Application stability, therefore, relies on the availability of resources and the potential to add more as they are needed.
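
To illustrate the idea, here is a minimal sketch of the kind of scale-out/scale-in decision rule a managed autoscaler applies. The thresholds, instance limits, and the choice of CPU as the metric are assumptions for the example only.

```python
def desired_instance_count(current_count, avg_cpu_percent,
                           scale_up_at=75, scale_down_at=30,
                           min_count=2, max_count=20):
    """Return how many instances the application should run, given average CPU load."""
    if avg_cpu_percent > scale_up_at and current_count < max_count:
        return current_count + 1   # add capacity before the application starts to struggle
    if avg_cpu_percent < scale_down_at and current_count > min_count:
        return current_count - 1   # release unneeded resources to manage cost
    return current_count           # load is within the comfortable band

# Example: a peak-time reading of 82% CPU across 4 instances suggests adding a fifth.
print(desired_instance_count(4, 82))  # -> 5
```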

Employ load balancing

Load balancing should be mentioned in the same breath as scalability since it is also focused on managing the application’s workload. The difference is one of approach: scaling in this context means adding resources to a single system, while load balancing spreads the workload across multiple systems (see the sketch after this list). This can be done to:

  • manage costs (scaling a single system can become expensive)
  • add redundancy to eliminate a single point of failure
  • enhance performance by eliminating single-system bottlenecks
  • allow maintenance or updates on individual servers without taking the entire application offline
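
As a rough sketch of the idea, the snippet below distributes incoming requests across several servers in round-robin fashion, skipping any that are unhealthy. The server names and the health check are placeholders; in practice a dedicated load balancer (hardware, software such as HAProxy or NGINX, or a cloud service) does this work.

```python
import itertools

# Placeholder backend names; a real load balancer health-checks each backend itself.
SERVERS = ["app-server-1", "app-server-2", "app-server-3"]

def round_robin(servers, healthy):
    """Yield the next healthy server for each incoming request."""
    pool = itertools.cycle(servers)
    while True:
        for _ in range(len(servers)):
            server = next(pool)
            if healthy(server):
                yield server
                break
        else:
            raise RuntimeError("No healthy servers available")

if __name__ == "__main__":
    # Pretend app-server-2 is down for maintenance; requests flow to the others.
    choose = round_robin(SERVERS, healthy=lambda s: s != "app-server-2")
    for _ in range(4):
        print(next(choose))  # app-server-1, app-server-3, app-server-1, app-server-3
```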

Develop a disaster recovery plan

Even the best-laid plans don’t always work out as expected. Redundancies might fail during broad-scale natural disasters; cyber attacks, fires, and power outages can topple data centres; and the list goes on. A disaster recovery plan determines what should be backed up against unforeseen circumstances, who is responsible for the backup, its verification, and its storage, and what processes to follow to restore data quickly. The goal, of course, is as little downtime and data loss as possible.
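
As a small illustration of the “back up and verify” step, the sketch below copies a critical file to a backup location and confirms the copy is intact with a checksum. The paths are placeholders, and a real plan would also cover off-site or cross-region storage and regular restore rehearsals.

```python
import hashlib
import shutil
from pathlib import Path

def backup_and_verify(source: Path, destination: Path) -> str:
    """Copy a critical file to a backup location and confirm the copy is intact."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)

    def checksum(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()

    original, copy = checksum(source), checksum(destination)
    if original != copy:
        raise RuntimeError(f"Backup verification failed for {source}")
    return original  # keep the checksum so a future restore can be verified too

if __name__ == "__main__":
    # Placeholder paths for illustration only.
    digest = backup_and_verify(Path("data/customers.db"), Path("backups/customers.db"))
    print(f"Backup verified, sha256={digest}")
```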

In a Nutshell

Continuous uptime for critical applications is a multifaceted challenge that requires a comprehensive strategy which, at the very least, should include redundancy, real-time monitoring, scalability, load balancing, and disaster recovery planning. These practices allow organisations to build resilient systems capable of responding dynamically to changing demands. This approach not only mitigates the risks associated with downtime but also supports business continuity and service quality.
