11) Fault tolerance and high availability

Introduction to computer resilience

An unexpected interruption of service (downtime) can cause serious financial and reputational damage.

Although they are often confused, fault tolerance and high availability are distinct engineering approaches to achieving reliability.

Fault tolerance aims for continuous operation with no visible interruption when a component fails, typically by running duplicated hardware in parallel in real time.

High availability aims to minimize downtime, accepting a brief interruption while an automated failover takes place.
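To make "minimizing downtime" concrete, an availability target can be converted into a yearly downtime budget. The sketch below is only an illustration: the function name and the 365-day year are assumptions, not something these notes prescribe.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the downtime per year allowed by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return HOURS_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {annual_downtime_hours(target):.2f} h/year")
```

Under these assumptions, a 99.9% target allows roughly 8.76 hours of downtime per year, and 99.99% allows roughly 53 minutes.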

The key rule for eliminating a single point of failure (SPOF) is to make every critical element of the infrastructure redundant.
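The effect of duplicating a component can be estimated with a simple probability model. The sketch below assumes that copies fail independently of each other, which real systems only approximate (shared power, shared software bugs). The function names are illustrative.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of a redundant group that works while at least one copy is up.

    Assumes independent failures: the group is down only if every copy is down.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one copy is required")
    return 1.0 - (1.0 - component_availability) ** copies


def series_availability(availabilities: list[float]) -> float:
    """Availability of a chain in which every element must be up."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


if __name__ == "__main__":
    single = 0.99
    print(parallel_availability(single, 1))  # one power supply
    print(parallel_availability(single, 2))  # two redundant power supplies
```

With a component that is up 99% of the time, a second independent copy raises the group to about 99.99%. Chaining two non-redundant 99% components in series lowers the total to about 98.01%, which is why every critical element in the path matters.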

At the hardware level, redundant power supplies connected to separate power lines and uninterruptible power supplies (UPS) are used.

Disk fault tolerance relies on RAID technology to protect data in the event of a drive failure.

RAID 1 mirrors data across two separate disks, offering simple protection at the cost of half the raw capacity.

RAID 5 stripes data and parity blocks across a minimum of three disks and survives the failure of any one disk, giving a good balance between usable capacity and protection.
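The parity used by RAID 5 is a byte-wise XOR of the data blocks in a stripe, which is what allows one missing block to be recomputed from the others. The following is a minimal sketch of that idea on in-memory byte strings. It does not model a real controller, rotating parity placement, or disk I/O, and the function names are illustrative.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Return the byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per position in the block.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks: list[bytes]) -> bytes:
    """Rebuild the one lost block from all surviving blocks (data and parity)."""
    return xor_blocks(surviving_blocks)


if __name__ == "__main__":
    stripe = [b"AAAA", b"BBBB"]      # data blocks on two disks
    parity = xor_blocks(stripe)      # parity block on a third disk
    # Simulate losing the first disk and rebuilding its block.
    recovered = rebuild_missing([stripe[1], parity])
    print(recovered == stripe[0])
```

Because XOR is its own inverse, combining the surviving data blocks with the parity block yields exactly the lost block. If two blocks of the same stripe are lost, there is not enough information left to rebuild either of them.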

RAID 10 combines mirroring and striping across a minimum of four disks, giving high performance and tolerating one failed disk in each mirrored pair.
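The capacity trade-off between these RAID levels can be worked out with simple arithmetic. The sketch below assumes identical disks and uses illustrative level labels ("RAID1", "RAID5", "RAID10"). It ignores filesystem and metadata overhead.

```python
def usable_capacity_tb(level: str, disks: int, disk_size_tb: float) -> float:
    """Return usable capacity in TB for an array of identical disks."""
    if level == "RAID1":
        if disks < 2:
            raise ValueError("RAID 1 needs at least two disks")
        return disk_size_tb                  # every disk holds the same data
    if level == "RAID5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        return (disks - 1) * disk_size_tb    # one disk's worth holds parity
    if level == "RAID10":
        if disks < 4 or disks % 2 != 0:
            raise ValueError("RAID 10 needs an even number of disks, at least four")
        return (disks // 2) * disk_size_tb   # half the disks are mirrors
    raise ValueError(f"unsupported level: {level}")


if __name__ == "__main__":
    for level, count in (("RAID1", 2), ("RAID5", 4), ("RAID10", 4)):
        print(level, count, "disks ->", usable_capacity_tb(level, count, 4.0), "TB")
```

With four 4 TB disks, RAID 5 leaves 12 TB usable while RAID 10 leaves 8 TB, which illustrates why RAID 5 is often chosen for capacity and RAID 10 for performance and resilience.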

To scale web services at the software level, load balancers distribute user requests across multiple servers.
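One of the simplest distribution strategies is round robin, where each new request goes to the next server in a fixed rotation. The sketch below is an illustration only: the class name and server names are hypothetical, and production load balancers usually add health checks and weighting.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backend servers in a fixed rotation."""

    def __init__(self, servers: list[str]) -> None:
        if not servers:
            raise ValueError("at least one server is required")
        self._rotation = cycle(list(servers))

    def next_server(self) -> str:
        """Return the server that should handle the next request."""
        return next(self._rotation)


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])  # hypothetical names
    for request_id in range(5):
        print(f"request {request_id} -> {balancer.next_server()}")
```

Round robin spreads requests evenly only when requests cost roughly the same and all servers are healthy. It is shown here because it is the easiest strategy to reason about.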

In an Active-Active configuration, all servers in the cluster handle traffic simultaneously and share the overall computational load.

The Active-Passive configuration keeps a standby server ready to take over, after a short failover, if the primary server stops responding.
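A common way to detect that the primary has stopped responding is a heartbeat with a timeout. The sketch below models that logic in memory, with timestamps passed in explicitly so the behaviour is easy to follow. The class, node names, and timeout value are hypothetical, and real cluster managers also deal with fencing and split-brain, which are left out here.

```python
class ActivePassivePair:
    """Track which of two nodes is active, based on primary heartbeats."""

    def __init__(self, primary: str, standby: str, timeout_s: float) -> None:
        self.primary = primary
        self.standby = standby
        self.timeout_s = timeout_s
        self.active = primary
        self._last_heartbeat = 0.0

    def heartbeat(self, now: float) -> None:
        """Record that the primary reported healthy at time `now` (seconds)."""
        self._last_heartbeat = now

    def check(self, now: float) -> str:
        """Promote the standby if the primary has been silent for too long."""
        silent_for = now - self._last_heartbeat
        if self.active == self.primary and silent_for > self.timeout_s:
            self.active = self.standby  # failover; no automatic failback
        return self.active


if __name__ == "__main__":
    pair = ActivePassivePair("node-a", "node-b", timeout_s=3.0)
    pair.heartbeat(now=10.0)
    print(pair.check(now=12.0))  # primary silent for 2 s, still within timeout
    print(pair.check(now=13.5))  # primary silent for 3.5 s, standby promoted
```

The timeout is the main design trade-off: a short value shortens the outage after a real failure but raises the risk of failing over because of a brief network delay.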

Investing in highly available and fault-tolerant systems is the most reliable way to keep services running when individual components fail.