Azkaban Active-Passive Failover Approach

NOTE: This approach covers the Azkaban Web Server only. High availability for the metadata database(s) should be handled separately.

To use an active-passive failover configuration for a web server that can only have one instance running at a time, such as Azkaban, you must deploy two separate servers and use an external component to manage failover. One server is designated as the active (primary) node, and the other as the passive (standby) node. This setup is often used for stateful applications such as Azkaban. Stateful applications store user data or session information on the server itself, so they require careful data synchronization to ensure a smooth transition during a failover.

Key components

Active server (Primary node): Actively handles all incoming web traffic and client requests.
Passive server (Standby node): Has the Azkaban Web Server installed but stopped, ready to start. A process on this server monitors the active server's health using a "heartbeat" signal. If the heartbeat is lost, it initiates the failover process, which starts the Azkaban Web Server on the passive server.
Heartbeat: A regular, low-level signal sent from the active server to the passive server to confirm it is still operational.
Load balancer or DNS controller: Manages the routing of traffic to the active server. It detects a failure and reroutes traffic to the passive server during a failover.
Shared storage or data synchronization: For stateful applications, a consistent data source is crucial. For the Azkaban Web Server, access to the Azkaban database is sufficient.

Step-by-step implementation

Set up the servers

Install and configure your web application and all dependencies identically on two separate server machines. Only one Azkaban Web Server service should be running at any time, so during setup and testing, stop the service on whichever server is not being tested.
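
One way to verify the single-instance rule during setup and testing is to probe the web server port on both machines. The sketch below is illustrative only: the host names are hypothetical, and port 8081 is an assumption that should be replaced with whatever port your Azkaban Web Server is configured to listen on. The demo uses a fake probe so that it runs without network access.

```python
import socket


def port_open(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def running_web_servers(hosts, port, is_up=port_open):
    """Return the hosts on which the web server port is reachable."""
    return [host for host in hosts if is_up(host, port)]


def check_single_instance(hosts, port, is_up=port_open):
    """Raise if more than one host is serving; otherwise return the list."""
    up = running_web_servers(hosts, port, is_up)
    if len(up) > 1:
        raise RuntimeError(f"more than one web server is running: {up}")
    return up


# Demo with a fake probe (hypothetical hosts, assumed port 8081).
fake_status = {"azkaban-a.example": True, "azkaban-b.example": False}
active = check_single_instance(
    list(fake_status), 8081, is_up=lambda host, port: fake_status[host]
)
print(active)  # ['azkaban-a.example']
```

Leaving out the is_up argument makes the functions use the real TCP probe. An open port only shows that something is listening, so treat this as a quick sanity check rather than proof that the service is healthy.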

Install a load balancer or DNS service

Use a load balancer or a DNS failover service to manage incoming traffic.

Load balancer: Place a hardware or software-based load balancer in front of the two web servers. Configure it to send all traffic to the active server. The load balancer will perform health checks on both servers to detect if the active one fails.
DNS failover service: Use a service like Amazon Route 53 or another managed DNS provider. You can configure a health check that, upon failure, automatically changes the DNS record to point to the passive server's IP address. This method often has a longer failover time, because resolvers and clients can keep using the cached old record until its TTL expires.
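
To compare the two options, it can help to estimate a rough upper bound on how long clients may be unable to reach Azkaban. The sketch below adds up detection time, DNS caching time, and service startup time. All numbers are illustrative assumptions, not measurements or provider defaults, and real clients may cache DNS answers for longer than the TTL.

```python
def worst_case_failover_seconds(check_interval_s, failure_threshold,
                                dns_ttl_s=0, startup_s=0):
    """Estimate an upper bound on the time until clients reach the standby.

    check_interval_s  : seconds between health checks
    failure_threshold : consecutive failed checks before failover triggers
    dns_ttl_s         : TTL of the DNS record (0 when a load balancer is used)
    startup_s         : assumed start time of the standby Azkaban Web Server
    """
    detection_s = check_interval_s * failure_threshold
    return detection_s + dns_ttl_s + startup_s


# Illustrative values only: 10 s checks, 3 failures, 60 s service startup.
via_load_balancer = worst_case_failover_seconds(10, 3, dns_ttl_s=0, startup_s=60)
via_dns = worst_case_failover_seconds(10, 3, dns_ttl_s=300, startup_s=60)
print(via_load_balancer, via_dns)  # 90 390
```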

Configure health checks and failover rules

Set up the monitoring logic for the failover.

Heartbeat monitoring: The passive server (or the load balancer) continuously monitors the active server's health. The easiest way to do this is with a simple health check endpoint on the web server (e.g., https://your-app/healthz).
Failover trigger: The failover system is triggered when the health check or heartbeat from the active server fails. The system then directs all traffic to the passive server, which becomes the new active server. The Azkaban Web Server service on the passive server should be started as part of this failover process.
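
The monitoring logic above can be sketched as a small heartbeat monitor that triggers failover only after several consecutive failed checks, so that a single dropped request does not cause a switch. This is a minimal sketch, not a definitive implementation: the class name, the threshold of 3, and the probe helper are assumptions, and the health URL is whatever endpoint your deployment actually exposes (the /healthz path in the text is only a placeholder).

```python
import http.client
import urllib.error
import urllib.request


def http_probe(url, timeout_s=3.0):
    """Return True if the URL answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        return False


class HeartbeatMonitor:
    """Counts consecutive failed probes and reports when to fail over."""

    def __init__(self, probe, failure_threshold=3):
        self.probe = probe                      # callable returning True/False
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def check(self):
        """Run one probe; return True when failover should be triggered."""
        if self.probe():
            self.consecutive_failures = 0       # any success resets the count
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


# Demo with scripted probe results instead of real HTTP requests.
scripted = iter([True, False, False, True, False, False, False])
monitor = HeartbeatMonitor(lambda: next(scripted), failure_threshold=3)
decisions = [monitor.check() for _ in range(7)]
print(decisions)  # [False, False, False, False, False, False, True]
```

In a real deployment the probe would be something like lambda: http_probe("https://your-app/healthz"), called on a fixed interval from the passive server or from an external monitor.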

Automate the failover process

When a failure occurs, the following steps are automated:

The load balancer or DNS service detects that the primary node is no longer responding to health checks.
The traffic is automatically redirected to the standby node.
The standby node’s Azkaban Web Server service is started.
The standby node assumes the role of the primary and begins serving requests.
Because Azkaban can only have one Web Server running at a time, the failed primary must not be allowed to come back online on its own. A controlled failback is required to stop the Web Server on the passive node once the primary is ready to come online again.
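
The automated sequence above can be sketched as an ordered list of actions that stops at the first failure. The function and parameter names below are hypothetical, and each action is passed in as a callable because the real commands depend on your load balancer, DNS provider, and service manager. The extra first step, which prevents the old primary from restarting, is an assumption added to reflect the single-instance constraint.

```python
def run_failover(fence_primary, redirect_traffic, start_standby,
                 standby_healthy, log=print):
    """Run the failover steps in order and return the names of completed steps.

    Any exception raised by a step aborts the sequence, so a later step never
    runs after an earlier one has failed.
    """
    steps = [
        ("fence primary", fence_primary),            # keep old primary offline
        ("redirect traffic", redirect_traffic),      # point LB/DNS at standby
        ("start standby web server", start_standby), # start Azkaban Web Server
    ]
    completed = []
    for name, action in steps:
        log(f"failover: {name}")
        action()
        completed.append(name)
    if not standby_healthy():
        raise RuntimeError("standby did not become healthy after start")
    return completed


# Demo with stand-in actions that only record the order in which they ran.
calls = []
done = run_failover(
    fence_primary=lambda: calls.append("fence"),
    redirect_traffic=lambda: calls.append("redirect"),
    start_standby=lambda: calls.append("start"),
    standby_healthy=lambda: True,
    log=lambda message: None,
)
print(calls)  # ['fence', 'redirect', 'start']
```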

Plan for failback

Once the original active server is repaired, you must plan how to restore it.

Manual failback: Schedule a maintenance window to manually fail back to the original primary server. This avoids disruption during high-traffic periods.
Synchronize data: Before failing back, ensure that any data changes made on the new active server are replicated back to the original server. For Azkaban, as long as the PostgreSQL DB is still up to date (i.e., there hasn't also been a failover of the database), no synchronization is required.
Make passive: After synchronization, the original standby server is returned to passive mode. Stop its Azkaban Web Server service so that it resumes its standby role.
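
The failback order matters because two Azkaban Web Servers must never run at the same time: the standby is stopped first, and only then is the original primary started. The sketch below models this with in-memory objects. The Node class and the fail_back function are hypothetical. In practice, start and stop would wrap your service manager, and database_current would check that the database has not itself failed over.

```python
class Node:
    """In-memory stand-in for a server running (or not) the web server."""

    def __init__(self, name, running=False):
        self.name = name
        self.running = running

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


def fail_back(primary, standby, database_current, route_to):
    """Move the active role from the standby back to the original primary."""
    if primary.running:
        raise RuntimeError("primary web server is already running")
    if not database_current():
        raise RuntimeError("database is not confirmed current; aborting")
    standby.stop()                      # stop the standby first ...
    if standby.running:
        raise RuntimeError("standby web server did not stop; aborting")
    primary.start()                     # ... then start the original primary
    route_to(primary.name)              # finally send traffic back
    return primary, standby


# Demo: the standby is currently active after an earlier failover.
primary = Node("primary", running=False)
standby = Node("standby", running=True)
routed = []
fail_back(primary, standby, database_current=lambda: True,
          route_to=routed.append)
print(primary.running, standby.running, routed)  # True False ['primary']
```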