Azkaban Architecture
- Each Azkaban Cluster is made up of a single web server, one or more executor servers, and a PostgreSQL database that tracks history and state.
- Each Executor server can orchestrate a few tens of flows simultaneously, depending on the resource requirements of each flow.
- Each RED Job equates to a single Azkaban Flow.
- The Executor server frequently polls a PostgreSQL-based queue for flows that are ready to run ("dispatch logic"); a typical configuration polls once per second.
Azkaban Executor Server Responsibilities
Dispatch
- The Executor Server polls the database-backed queue for flows frequently (typically once per second), contingent on resource availability.
- Sets up the flow (parses all properties, de-serializes the workflow to build an in-memory graph, downloads binaries to the project cache if missing, and allocates resources: thread pool, execution directory, etc.) and finally:
- Kicks off the orchestration of the flow.
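The dispatch steps above can be sketched as a polling routine. This is a simplified illustration, not Azkaban's actual implementation; the `db` and `executor` objects and their methods are hypothetical stand-ins for the PostgreSQL queue and executor internals:

```python
import time

POLL_INTERVAL_SECONDS = 1  # typical config: poll the queue once per second


def dispatch_once(db, executor):
    """One poll of the queue: claim and launch a ready flow if resources allow.

    Returns the flow that was dispatched, or None. `db` and `executor` are
    hypothetical stand-ins for the PostgreSQL-backed queue and the executor.
    """
    if not executor.has_capacity():        # contingent on resource availability
        return None
    flow = db.claim_next_ready_flow()      # PostgreSQL-based queue ("dispatch logic")
    if flow is None:
        return None
    flow.parse_properties()                # parse all properties
    flow.build_graph()                     # de-serialize into an in-memory graph
    flow.ensure_binaries_cached()          # download binaries to the project cache if missing
    flow.allocate_resources()              # thread pool, execution directory, etc.
    executor.run(flow)                     # kick off orchestration of the flow
    return flow


def dispatch_loop(db, executor):
    """Run the dispatch poll forever at the configured interval."""
    while True:
        dispatch_once(db, executor)
        time.sleep(POLL_INTERVAL_SECONDS)
```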
Orchestration
- Executor Server manages a thread pool per flow, which allows multiple jobs to run in parallel.
- Each job is launched in its own separate process under the flow admin account.
- During the orchestration process, the Executor server manages the state machine of the flow, keeps the database up to date with flow/job state and finally flushes flow logs to the database.
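The orchestration model above can be sketched as a per-flow thread pool that launches each job in its own process. This is illustrative only; the job list, pool size, and `on_state_change` callback are assumptions standing in for the real graph walker and database updates:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_flow(jobs, num_job_threads=10, on_state_change=print):
    """Run a flow's jobs on a per-flow thread pool, one process per job.

    `jobs` is a list of (name, argv) pairs; each job runs in its own separate
    process, mirroring how the executor launches jobs under the flow admin
    account. `on_state_change` stands in for keeping the database up to date
    with job state. Returns a dict of job name -> exit code.
    """
    results = {}

    def run_job(name, argv):
        on_state_change((name, "RUNNING"))
        proc = subprocess.run(argv, capture_output=True, text=True)
        state = "SUCCEEDED" if proc.returncode == 0 else "FAILED"
        on_state_change((name, state))
        return proc.returncode

    # One thread pool per flow allows multiple jobs to run in parallel.
    with ThreadPoolExecutor(max_workers=num_job_threads) as pool:
        futures = {name: pool.submit(run_job, name, argv) for name, argv in jobs}
        for name, fut in futures.items():
            results[name] = fut.result()
    return results
```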
Flow Management
- The Executor server exposes AJAX API endpoints to respond to requests such as Pause, Kill, and Resume. Flows are killed when they reach the SLA limit of 10 days.
Log Management
- The Executor server's AJAX API endpoint supports streaming logs for live flows/jobs. When a flow/job finishes, the completed logs are pushed to the Azkaban database in 15MB chunks (configurable).
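The chunked upload on completion can be sketched as follows. This is a simplified illustration; `upload_chunk` is a hypothetical stand-in for the database write, and only the 15MB default chunk size comes from the text:

```python
CHUNK_SIZE = 15 * 1024 * 1024  # 15MB default, configurable


def push_completed_log(log_path, upload_chunk, chunk_size=CHUNK_SIZE):
    """Push a finished flow/job log to the database in fixed-size chunks.

    `upload_chunk(index, data)` is a hypothetical stand-in for the db writer.
    Returns the number of chunks uploaded.
    """
    index = 0
    with open(log_path, "rb") as log:
        while True:
            data = log.read(chunk_size)
            if not data:          # end of log reached
                break
            upload_chunk(index, data)
            index += 1
    return index
```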
Performance and Resource Tuning
The amount of CPU and memory to assign to each Executor Server depends on a number of factors:
- Concurrent Jobs
- Concurrent Tasks within Jobs
- The type of job
- Python vs PowerShell scripts
- Loads vs Downstream Processing
- Loads might be IO and CPU intensive during the source extraction process.
- Downstream processing workloads usually occur at the database server, so should not require as much CPU or memory to perform.
Run some sample jobs with only a few concurrent jobs/tasks and monitor CPU and memory. This will give you an idea of how to scale out.
If some job types require more memory and/or CPU, consider creating dedicated Executors for those job types, and assign an appropriate tag to the executor(s) and jobs to ensure they run on the appropriate Executor.
As a general rule, each Task in a RED Job occupies a thread, and each RED task consumes between 100 and 300MB on average:
- 10 Jobs running in parallel with single-threaded task execution might require up to 3GB of memory: 10 threads * 300MB = 3GB.
- 2 Jobs running in parallel, each with 5 tasks running in parallel, would also require up to 3GB of memory: 2 Job threads * 5 Task threads * 300MB = 3GB.
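The rule of thumb above can be captured in a small estimator. Illustrative only; 300MB is the upper end of the 100-300MB per-task range, so this is a worst-case figure:

```python
MB_PER_TASK = 300  # upper end of the typical 100-300MB per RED task


def estimated_memory_mb(concurrent_jobs, tasks_per_job, mb_per_task=MB_PER_TASK):
    """Worst-case memory estimate: every parallel task occupies a thread."""
    return concurrent_jobs * tasks_per_job * mb_per_task


# 10 jobs, single-threaded tasks: 10 * 1 * 300MB = 3000MB (~3GB)
# 2 jobs, 5 parallel tasks each:   2 * 5 * 300MB = 3000MB (~3GB)
```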
If you try to run more jobs (or tasks) in parallel than your Executors can safely handle, you are likely to get out-of-memory errors on one or more of your Executor Servers. To mitigate this, consider these solutions:
- Reduce task parallelism within Jobs
- Reduce the number of Jobs that are scheduled to start in the same window, i.e. spread the Job start times out.
- Add more Executor Servers.
- Increase each Executor Server's resources.
- Create dedicated high-spec Executors for certain jobs and manage job distribution with tags.
- Add the property executor.flow.threads to your Executors and set it to a low value to limit job concurrency on each Executor.
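For example, in azkaban.local.properties (the value of 5 here is only an illustration; choose a limit that fits your server's resources):

```properties
# Limit this Executor to 5 simultaneous flows (RED Jobs)
executor.flow.threads=5
```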
Azkaban Executor Server Configuration
Note
Each RED Job is a single Flow in Azkaban.
These settings can be found and/or applied in azkaban.local.properties for each Executor Server installation.
| Parameter | Description | Default |
|---|---|---|
| executor.flow.threads | The number of simultaneous flows that can be run. Since each Job in RED equates to a Flow in Azkaban, depending on server resources and task parallelism, leaving this setting at its default could lead to out-of-memory failures on the Executor. | 30 |
| job.log.chunk.size | For rolling job logs. The chunk size for each roll-over. | 5MB |
| job.log.backup.index | The number of log chunks. The max size of each log is then the index * chunk size. | 4 |
| flow.num.job.threads | The number of concurrently running jobs in each flow. Each Task in a RED Job occupies a thread, and each RED task consumes between 100 and 300MB on average. Control this value for each job via the RED UI and keep it as low as possible to avoid exhausting executor resources. | 50 (to match the theoretical max threads in the RED New/Edit Job Screen) |
| job.max.Xms | The maximum initial amount of memory each job can request. If a job requests more than this, the Azkaban server will not launch it. | 1GB |
| job.max.Xmx | The maximum amount of memory each job can request. If a job requests more than this, the Azkaban server will not launch it. | 2GB |
| azkaban.server.flow.max.running.minutes | The maximum time in minutes a flow can live inside Azkaban after being executed. If a flow runs longer than this, it will be killed. If smaller than or equal to 0, there is no restriction on running time. | -1 |
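Taken together, a tuned azkaban.local.properties might look like the following. The values are illustrative only and should be sized to your workload:

```properties
# Roll job logs in 5MB chunks, keeping 4 chunks (max 20MB per log)
job.log.chunk.size=5MB
job.log.backup.index=4

# Reject jobs requesting more than these heap sizes
job.max.Xms=1GB
job.max.Xmx=2GB

# Kill any flow still running after 10 days (14400 minutes)
azkaban.server.flow.max.running.minutes=14400
```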
Flow Property Override
When running WhereScape Jobs (Flows) directly from Azkaban, you can override global properties for a job.
When running Jobs directly from the Azkaban Dashboard, the Maximum Threads setting in RED is not used; instead, tasks that can run in parallel will be started until the max thread count specified for the Executor (property flow.num.job.threads in azkaban.local.properties) is reached. This could overload your scheduler if the job contains many tasks that can run concurrently.
The workaround for this situation is to add a Flow Property Override in the Flow Parameters for the property flow.num.job.threads when running the Job directly, restricting the thread count down from the default maximum.
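For example, when launching the Job from the Azkaban Dashboard, add a Flow Parameter such as the following (the value of 5 is illustrative; pick a limit your Executor can absorb):

```properties
flow.num.job.threads=5
```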

