
High Availability & Fault Tolerance Design

Understanding HA & Fault Tolerance

High availability and fault tolerance are two different approaches to ensuring the continuous operation and reliability of a system, application, or service. Both aim to minimize downtime and keep the service working, but they achieve this in distinct ways.

  1. High Availability (HA): High availability focuses on keeping a system or service running with as little interruption as possible. The primary goal is to maximize uptime and keep the service accessible to users. HA achieves this by setting up redundant components, such as servers, networks, or databases, and distributing the workload across them. If one component fails, the load is automatically transferred to a redundant one, so end users see little or no impact.

Key points of High Availability:

  • Quick detection of failures and automatic failover to the redundant components.

  • Minimal or no downtime during failover processes.

  • Typically suitable for critical systems and applications that require continuous operation.
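The failover behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not a production pattern: the component names (`primary`, `standby-1`, `standby-2`), the `pick_healthy` helper, and the simulated health check are all assumptions made for the example. A real setup would probe the components over the network and usually rely on a load balancer or cluster manager to do the switching.

```python
from typing import Callable, Sequence


class AllBackendsDown(Exception):
    """Raised when no redundant component passes its health check."""


def pick_healthy(backends: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first backend, in priority order, that passes the health check."""
    for backend in backends:
        if is_healthy(backend):
            return backend
    raise AllBackendsDown("no healthy backend available")


# Hypothetical component names, listed in priority order.
backends = ["primary", "standby-1", "standby-2"]

# Simulate a failed primary; a real check would be an HTTP or TCP probe.
down = {"primary"}

active = pick_healthy(backends, lambda b: b not in down)
print(active)  # standby-1
```

The time between the primary failing and the health check noticing it is the short window of downtime that an HA design accepts.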

  2. Fault Tolerance: Fault tolerance, on the other hand, focuses on the ability of a system to continue functioning correctly even in the presence of faults or failures. The emphasis here is on maintaining data integrity and service availability despite hardware or software failures. Fault tolerance relies on redundant components that run at the same time, together with mechanisms that detect errors and mask or correct them, so the service keeps operating through the fault.

Key points of Fault Tolerance:

  • Faults are detected immediately and are masked or corrected as they occur.

  • Continues operating even when a fault occurs, without interrupting service.

  • It's generally associated with systems that cannot afford even the briefest downtime, such as real-time applications or critical infrastructure.
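One common way to get this behaviour is to run several replicas of the same computation and take a majority vote on their results, so a single faulty or crashed replica is masked without any interruption. The sketch below assumes this voting technique for illustration only; the page does not prescribe a specific mechanism, and the `vote` helper and the three toy replicas are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def vote(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every replica on x and return the result a strict majority agrees on."""
    results = []
    for replica in replicas:
        try:
            results.append(replica(x))
        except Exception:
            continue  # a crashed replica simply contributes no vote
    if not results:
        raise RuntimeError("all replicas failed")
    value, count = Counter(results).most_common(1)[0]
    if count <= len(replicas) // 2:
        raise RuntimeError("no majority among replicas")
    return value


def square(x: int) -> int:
    return x * x


def faulty(x: int) -> int:
    return x * x + 1  # simulated silent data corruption


def crashing(x: int) -> int:
    raise OSError("simulated hardware failure")


print(vote([square, square, faulty], 4))    # 16, the wrong answer is outvoted
print(vote([square, crashing, square], 4))  # 16, the crash is masked
```

Unlike failover, nothing is switched here: all replicas are always active, which is why fault tolerance usually costs more than high availability.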

In summary, high availability aims to minimize downtime by quickly failing over to redundant components, which may involve a brief interruption, while fault tolerance aims to keep the service running through a fault with no interruption at all. Both approaches are important for reliable and continuous operations in different scenarios, and they are often used together to build a robust system that can handle unexpected failures and maintain the desired service levels.
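To make "service levels" concrete, the short calculation below converts an availability percentage into expected downtime per year and shows how redundancy raises availability. It is a rough sketch under one strong assumption: the redundant components fail independently of each other, which real systems only approximate. The function names are chosen for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime for an availability given as a fraction (0.999 = 99.9%)."""
    return (1 - availability) * HOURS_PER_YEAR


def parallel_availability(component_availability: float, n: int) -> float:
    """Availability of n redundant components when any single one can serve the load.

    Assumes failures are independent, so the system is down only if all n are down.
    """
    return 1 - (1 - component_availability) ** n


print(round(downtime_hours_per_year(0.999), 2))  # 8.76
print(round(parallel_availability(0.99, 2), 6))  # 0.9999
```

Under that independence assumption, two components that are each 99% available give roughly 99.99% combined.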


Last updated 7 months ago