Azkaban Active-Passive Failover Approach

NOTE: This approach covers the Azkaban Web Server only. High availability for the metadata database(s) should be handled separately.

Azkaban supports only one running Web Server instance at a time. To run it in an active-passive failover configuration, deploy two separate servers and use an external component to manage failover. One server is designated the active (primary) node and the other the passive (standby) node. This setup is common for stateful applications such as Azkaban, which keep user data or session information on the server itself and therefore need careful data synchronization for a smooth transition during a failover.

Key components

Step-by-step implementation

Set up the servers

Install and configure the Azkaban Web Server and all of its dependencies identically on two separate machines. Only one Azkaban Web Server service may run at any time, so during setup and testing, stop the service on whichever server is not being tested.

Install a load balancer or DNS service 

Use a load balancer or a DNS failover service to manage incoming traffic. 

Configure health checks and failover rules 

Set up the monitoring logic for the failover. 
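
The monitoring logic can be sketched as a probe plus a counter of consecutive failures. This is a minimal illustration, not Azkaban functionality: the names HealthTracker and probe, the threshold of 3, and the 5-second timeout are all assumptions, and most load balancers or DNS failover services provide an equivalent setting of their own.

```python
from dataclasses import dataclass
import urllib.error
import urllib.request


@dataclass
class HealthTracker:
    """Counts consecutive failed probes and signals when to fail over."""

    failure_threshold: int = 3  # assumed value; tune for your environment
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe result; return True once failover should trigger."""
        if healthy:
            # Any successful probe resets the count.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Requiring several consecutive failures before triggering avoids failing over on a single dropped request, which matters here because a failover starts a second Web Server.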

Automate the failover process 

When a failure occurs, the following steps are automated:

  1. The load balancer or DNS service detects that the primary node is no longer responding to health checks.
  2. The traffic is automatically redirected to the standby node.
  3. The standby node’s Azkaban Web Server service is started.
  4. The standby node assumes the role of the primary and begins serving requests.
  5. Because Azkaban can run only one Web Server at a time, the original primary must not come back online on its own. A controlled failback is required: stop the Web Server on the standby before the primary returns to service.
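
Steps 1 through 4 above can be sketched as a small orchestration function. This is an illustration under stated assumptions: run_failover and its four callback parameters are hypothetical names, and each callback stands in for whatever your load balancer, DNS service, or service manager actually provides.

```python
from typing import Callable, List


def run_failover(
    primary_is_healthy: Callable[[], bool],
    redirect_traffic: Callable[[], None],
    start_standby: Callable[[], None],
    standby_is_healthy: Callable[[], bool],
) -> List[str]:
    """Run the failover steps in order and return a log of what happened.

    All four callbacks are hypothetical hooks supplied by the caller.
    """
    log: List[str] = []

    # Step 1: detect that the primary no longer passes its health check.
    if primary_is_healthy():
        log.append("primary healthy; no action")
        return log
    log.append("primary failed health check")

    # Step 2: redirect traffic to the standby node.
    redirect_traffic()
    log.append("traffic redirected to standby")

    # Step 3: start the Azkaban Web Server service on the standby.
    start_standby()
    log.append("standby web server started")

    # Step 4: confirm the standby is serving before treating it as primary.
    if not standby_is_healthy():
        raise RuntimeError("standby did not become healthy after start")
    log.append("standby is now primary")
    return log
```

Passing the actions in as callbacks keeps the sequence testable without touching real infrastructure; in practice each would wrap a command or API call specific to your environment.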

Plan for failback

Once the original active server is repaired, you must plan how to restore it.
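
One way to structure a controlled failback is to refuse to start the primary until the standby is confirmed stopped, so that two Web Servers never run at once. The sketch below assumes hypothetical callbacks (stop_standby, standby_is_running, start_primary, redirect_to_primary) that you would implement for your own environment.

```python
from typing import Callable


def controlled_failback(
    stop_standby: Callable[[], None],
    standby_is_running: Callable[[], bool],
    start_primary: Callable[[], None],
    redirect_to_primary: Callable[[], None],
) -> None:
    """Return service to the repaired primary without overlapping Web Servers.

    All four callbacks are hypothetical hooks supplied by the caller.
    """
    # Stop the Web Server on the standby first.
    stop_standby()

    # Verify the stop took effect before touching the primary.
    if standby_is_running():
        raise RuntimeError("standby still running; refusing to start primary")

    # Only now is it safe to start the primary and send traffic to it.
    start_primary()
    redirect_to_primary()
```

This ordering implies a short outage between stopping the standby and starting the primary, so a failback is best scheduled for a quiet period.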