Tuesday, January 29, 2013

Anatomy of a Scalable Task Scheduler

On 1/18 we quietly released a version of our scalable task scheduler (creatively named 'Scheduler' for right now) to the Windows Azure Store.  If you missed it, you can see it in this post by Scott Guthrie.  The service allows you to schedule recurring tasks using the well-known cron syntax.  Today, we support a simple GET webhook that will notify you each time your cron expression fires.  However, you can be sure that we are expanding support to more choices, including (authenticated) POST hooks, Windows Azure Queues, and Service Bus Queues, to name a few.
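To make the cron syntax concrete, here is a minimal Python sketch (purely illustrative - the service itself is not implemented this way) that uses the open-source croniter library to compute the next few fire times for an expression that runs at the top of every second hour:

```python
from datetime import datetime

from croniter import croniter  # third-party library: pip install croniter

# "0 */2 * * *" means: at minute 0 of every second hour.
schedule = croniter("0 */2 * * *", datetime(2013, 1, 29, 0, 0))

# Print the next three fire times after the start date.
for _ in range(3):
    print(schedule.get_next(datetime))
# 2013-01-29 02:00:00, 2013-01-29 04:00:00, 2013-01-29 06:00:00
```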

In this post, I want to share a bit about how we designed the service to support many tenants and potentially millions of tasks.  Let's start with a simplified, but accurate overall picture:

[Diagram: the REST API façade, CRON Engine, and Task Engine subsystems communicating asynchronously over queues]

We have several main subsystems in our service (REST API façade, CRON Engine, and Task Engine), along with several shared subsystems used across other services (not pictured), such as Monitoring/Auditing and Billing/Usage.  Each one can be scaled independently depending on its load and overall system demand.  We knew we needed to decouple our subsystems so that they did not depend on each other and could scale independently.  We also wanted to be able to develop each subsystem in isolation without affecting the others in use.  As such, our subsystems do not communicate with each other directly; they share only a common messaging schema.  All communication is done over queues and asynchronously.
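Since the only contract between subsystems is the message schema itself, a rough sketch of what flows over the queues helps. The field names below are hypothetical, invented for illustration rather than taken from our actual schema:

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical command the CRON Engine drops on the Task Engine's queue.
run_task_command = {
    "messageType": "RunTask",
    "messageId": str(uuid.uuid4()),
    "tenantId": "contoso",
    "jobId": "nightly-cleanup",
    "action": {"type": "httpGet", "url": "https://example.com/hooks/cleanup"},
    "scheduledTime": "2013-01-29T02:00:00Z",
}

# Hypothetical event the Task Engine emits when the work completes; Auditing,
# Billing, and the job-history views can all consume the same message.
task_completed_event = {
    "messageType": "TaskCompleted",
    "correlationId": run_task_command["messageId"],
    "tenantId": "contoso",
    "jobId": "nightly-cleanup",
    "status": "Succeeded",
    "httpStatusCode": 200,
    "completedTime": datetime.now(timezone.utc).isoformat(),
}

# On the wire, each message is just serialized text on a queue.
print(json.dumps(run_task_command, indent=2))
```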

REST API

This is the layer that end users communicate with and the only way to interact with the system (even our portal acts as a client).  We use a shared secret key authentication mechanism: you sign your requests and we validate them as they enter our pipeline.  We implemented this REST API using Web API.  When you interact with the REST API, you are viewing fast, lightweight views of your scheduled task setup that reflect what is stored in our Job Repository.  However, we never query the Job Repository directly, in order to keep it responsive to its real job - providing the source data for the CRON Engine.
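The exact protocol is beyond the scope of this post, but shared secret request signing generally looks something like the following Python sketch. The canonical string and header names here are assumptions for illustration, not our actual scheme:

```python
import base64
import hashlib
import hmac
from datetime import datetime, timezone

def sign_request(secret_key: str, method: str, path: str, date_header: str) -> str:
    """Compute an HMAC-SHA256 signature over a canonical form of the request."""
    canonical = f"{method}\n{path}\n{date_header}"
    digest = hmac.new(secret_key.encode(), canonical.encode(), hashlib.sha256).digest()
    return base64.b64encode(digest).decode()

# Client side: sign the request and send the signature in a header.
date_header = datetime.now(timezone.utc).strftime("%a, %d %b %Y %H:%M:%S GMT")
signature = sign_request("my-shared-secret", "GET", "/api/jobs", date_header)
headers = {"x-ms-date": date_header, "Authorization": f"SharedKey myaccount:{signature}"}

# Server side: recompute the signature from the same inputs and compare safely.
expected = sign_request("my-shared-secret", "GET", "/api/jobs", date_header)
assert hmac.compare_digest(signature, expected)
```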

CRON Engine

This subsystem was designed to do as little as possible and farm the work out to the Task Engine.  An engine that evaluates cron expressions and fire times cannot afford to get bogged down actually doing the work.  This is a potentially IO-intensive role that is constantly evaluating when to fire a particular cron job, and in order to support many tenants it must be able to run continuously without stalling on execution.  As such, this role only evaluates when a particular cron job must run and then fires a command to the Task Engine to actually execute the potentially long-running job.
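Conceptually, the loop is just "find what is due, enqueue a command, move on."  Here is a hedged Python sketch of that idea; the job repository and queue APIs are invented for illustration, and the real engine is of course more involved:

```python
import time
from datetime import datetime, timezone

from croniter import croniter  # third-party library: pip install croniter

def cron_engine_loop(job_repository, task_queue, poll_interval_seconds=15):
    """Figure out which jobs are due and hand the actual work to the Task Engine."""
    while True:
        now = datetime.now(timezone.utc)
        for job in job_repository.get_enabled_jobs():            # hypothetical API
            # Evaluate the next fire time from the last time this job fired
            # (timestamps assumed to be UTC throughout).
            next_fire = croniter(job.cron_expression, job.last_fired).get_next(datetime)
            if next_fire <= now:
                # Only enqueue a command - never execute the (possibly slow) task here.
                task_queue.enqueue({"messageType": "RunTask",
                                    "jobId": job.id,
                                    "scheduledTime": next_fire.isoformat()})
                job_repository.mark_fired(job.id, next_fire)     # hypothetical API
        time.sleep(poll_interval_seconds)
```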

Task Engine

The Task Engine is the grunt of the service; it performs the actual work.  It is the layer that will be scaled most dramatically depending on system load.  Commands for work from the CRON Engine are accepted and performed at this layer.  When the work is done, it emits an event that other interested subsystems (like Audit and Billing) can subscribe to downstream.  The emitted event contains details about the outcome of the task and is denormalized into views that the REST API can query to report back to a tenant.  This is how we can tell you your job history and report any errors in execution.  The beauty of the Task Engine emitting events (instead of acting directly) is that we can subscribe any number of listeners to a particular event at any time in the future.  In fact, we can orchestrate very complex workflows throughout the system as we communicate with unrelated, but vital, subsystems.  This keeps our system decoupled and allows us to develop those other subsystems in isolation.
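A rough Python sketch of that consume-execute-emit cycle might look like this; the queue and topic abstractions and the message fields are again hypothetical:

```python
import urllib.request
from datetime import datetime, timezone

def task_engine_worker(task_queue, event_topic):
    """Dequeue one RunTask command, do the work, and emit a completion event."""
    command = task_queue.dequeue()                    # hypothetical blocking dequeue
    try:
        # Today the only supported action is a simple GET webhook.
        response = urllib.request.urlopen(command["action"]["url"], timeout=30)
        status, code = "Succeeded", response.status
    except Exception as exc:
        # Failures are reported, not swallowed - the event carries the outcome.
        status, code = "Failed", getattr(exc, "code", None)

    # Emit an event rather than calling Audit/Billing directly; any number of
    # downstream subscribers can react to it now or in the future.
    event_topic.publish({                             # hypothetical pub/sub API
        "messageType": "TaskCompleted",
        "jobId": command["jobId"],
        "status": status,
        "httpStatusCode": code,
        "completedTime": datetime.now(timezone.utc).isoformat(),
    })
```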

Future Enhancements

Today we are in a beta mode, intended to give us feedback about the types of jobs, frequency of execution, and what our system's baseline performance should look like.  In the future, we know we will support additional types of scheduled tasks, more views into your tasks, and more complex orchestrations.  Additionally, we have set up our infrastructure so that we can deploy to multiple datacenters for resiliency (and even to multiple clouds).  Give us a try today and let us know about your experience.