Wednesday, June 4, 2014

Brewmaster Template Definition

In order to author a Brewmaster template, you will need to have a basic understanding of how the template is structured.  In this blog post, I will cover all the basic terminology.

Methodology Explained

A central tenet of Brewmaster templates is that a template represents the goal state of how you want your deployment to be configured.  This has some interesting implications.  Namely, it means that if the state of your deployment is already as described, we skip it.  For example, if you describe 3 VMs and we only find 2 VMs in an existing deployment, we will deploy 1 VM.  We won't try to deploy 3 additional VMs in this example.  If you describe a Storage Account, we will make sure it exists.  Same thing for the Affinity Group.

Since we also view the template as your goal state, it means you can have the reverse scenario with a slight caveat:  my template describes 2 VMs and Brewmaster finds 3 VMs already deployed.  In this case and with your explicit configuration, we will remove the unreferenced VM.  The caveat here is that you need to explicitly tell Brewmaster to be destructive and remove VMs.

There is one more caveat to this 'goal state' idea and that is around Network settings.  Given that Network settings are global in nature to the subscription, we didn't want you to have to describe every possible Virtual Network and Subnet for every deployment (even the ones not done in Brewmaster).  If we were strict about a 'goal state', we would have to do so.  Instead, because of the global nature of these things and the monolithic description and configuration, we have taken an additive-only approach here.  We will only attempt to add additional Virtual Networks, Subnets, DNS Servers, etc., and we will never delete them if they become unreferenced or are not fully described.  One consequence of this is that if you describe Virtual Network A with a single subnet called Subnet AA and Brewmaster finds Virtual Network A with existing Subnet BB, we will not remove Subnet BB, and instead you will be left with both Subnet AA and Subnet BB.

Now that you have the general methodology we follow, let's look at the possible configuration options.

Template Schema

At the highest level, you have 9 basic sections inside a template:

  • Parameters
  • Network
  • AffinityGroup
  • StorageAccounts
  • CloudServices
  • DeploymentGroups
  • Credentials
  • ConfigSets
  • Configurations

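Putting those sections together, a template skeleton looks something like this (the section names are the real top-level keys; everything inside them is elided, and whether a given section is an object or an array may differ in the actual schema):

```json
{
  "Parameters": { },
  "Network": { },
  "AffinityGroup": { },
  "StorageAccounts": [ ],
  "CloudServices": [ ],
  "DeploymentGroups": [ ],
  "Credentials": [ ],
  "ConfigSets": [ ],
  "Configurations": [ ]
}
```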

Parameters

This section contains replaceable parameters used throughout the template.  Parameter values themselves are either supplied at deployment time or rely upon a described default.  Template authors can include optional metadata in the parameter definition for things like description, type hints, and even validation logic.  A parameter can be of type String, DateTime, Boolean, or Number.  Brewmaster's website will generate a dynamic UI and perform validation based on the values declared in this section.

We recommend that template authors use the type hints and validation capabilities to prevent easily avoided misconfigurations due to data entry mistakes.
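As a sketch, a parameter with a default, a type hint, and simple validation might look like the following.  The property names here are illustrative rather than the exact schema - check the Template SDK documentation for the real shape:

```json
{
  "Parameters": [
    {
      "Name": "VmCount",
      "Type": "Number",
      "Description": "Number of VMs to deploy",
      "Default": 2,
      "Validation": { "Min": 1, "Max": 10 }
    }
  ]
}
```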


Network

The network settings described here allow you to declare things like Virtual Networks, subnets, DNS servers, and Local Sites.  Keeping in mind the caveat that Network settings are additive-only, you need only describe the portions of the Network settings in Azure that directly apply to your intended deployment.
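For example, a minimal network description that ensures one Virtual Network with a single subnet exists could look roughly like this (field names are illustrative):

```json
{
  "Network": {
    "VirtualNetworks": [
      {
        "Name": "VNetA",
        "AddressSpace": "10.0.0.0/16",
        "Subnets": [
          { "Name": "SubnetAA", "AddressPrefix": "10.0.1.0/24" }
        ]
      }
    ]
  }
}
```

Remember that this is additive: if VNetA already exists with a Subnet BB, deploying this template leaves Subnet BB in place and simply ensures Subnet AA exists alongside it.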


AffinityGroup

A Brewmaster template is scoped to a single Affinity Group (AG).  Early on, we supported multiple AGs, but soon realized that things became far too complicated for a user to keep the definitions straight inside a single template.  In the IaaS world, it turns out that you really need an AG to do most anything interesting.  Virtual Networks must exist in an AG, for instance.  If you want your IaaS VMs to be able to communicate, you have to have one.  The implication of having a required AG definition is two-fold:  1.) we enforce the best practice that all your resources are in an AG, and 2.) it means you can only deploy to a single datacenter per deployment right now.  If you need a multi-datacenter (region) deployment, you would just deploy the same template multiple times.


CloudServices

Each described Cloud Service (CS) can also describe a single VmDeployment within it.  That deployment can in turn describe multiple VMs (up to Azure's limit).  All the supported VM settings are described within a VM (Disks, Endpoints, etc.).  Each VM can be associated with one or more ConfigSets.  I will describe this in more detail below, but ultimately, this is how you control what configuration is done on each VM.  Each VM can also be assigned a single Deployment Group (more below).  If one is not assigned, it belongs to the 'default' Deployment Group.

If the CS does not exist, Brewmaster will create it.  If the CS already exists and there is a deployment in the Production slot, Brewmaster will attempt to reconcile what is in the template with what has been found in the existing deployments.  Each machine name is unique within a CS, so if we find a matching VM name in the template configuration, Brewmaster will attempt to configure the machine as described.
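A rough sketch of a Cloud Service with a deployment of two VMs (again, property names are illustrative rather than the exact schema):

```json
{
  "CloudServices": [
    {
      "Name": "myservice",
      "Deployment": {
        "VirtualMachines": [
          { "Name": "web1", "ConfigSets": [ "WebServer" ] },
          { "Name": "web2", "ConfigSets": [ "WebServer" ] }
        ]
      }
    }
  ]
}
```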


DeploymentGroups

Deployment Groups (DGs) are a mechanism by which you can control the dependencies between machines.  By default, all VMs are in a single DG aptly called 'default'.  If you specify additional DGs and VMs have been assigned to them, those machines will be configured completely (end to end) before any other VMs are configured.  The order of the described DGs is also the order in which everything is deployed and configured.  Most templates will not use this capability - the default DG is just fine.  However, a classic example of needing this capability is something like Active Directory.  You would want your domain controllers to fully deploy and be configured and ready before additional VMs that will join the domain are spun up and configured.
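Using the Active Directory example, a sketch might look like this (names and shape are illustrative): because the 'ad' group is declared, the domain controller is fully configured before the default group's VMs start configuring.

```json
{
  "DeploymentGroups": [ { "Name": "ad" } ],
  "CloudServices": [
    {
      "Name": "myservice",
      "Deployment": {
        "VirtualMachines": [
          { "Name": "dc1", "DeploymentGroup": "ad" },
          { "Name": "app1" }
        ]
      }
    }
  ]
}
```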


Credentials

Instead of littering the template with a bunch of usernames and passwords, the template allows you to describe each credential once and then refer to it everywhere.  We strongly suggest parameterizing the credentials here and not hardcoding them.  This will maximize the re-usability of your template.
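A sketch of a parameterized credential that can then be referenced by name elsewhere in the template (the property names and the Liquid-style parameter references are illustrative):

```json
{
  "Credentials": [
    {
      "Name": "DomainAdmin",
      "UserName": "{{AdminUser}}",
      "Password": "{{AdminPassword}}"
    }
  ]
}
```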


ConfigSets

ConfigSets provide a convenient way to group VMs together for common configuration.  Inside of a ConfigSet, you can configure Endpoints that will be applied to all VMs within the ConfigSet.  Additionally, this is where you associate one or more particular Configurations with a set of VMs.

At deployment time, the common Endpoints will be expanded and added into the Endpoint configuration for each VM found to reference that ConfigSet.  In general, you will have a single ConfigSet per type of VM within your template definition.  So, if I have a SQL Server Always On template, I might have 1 ConfigSet for my SQL nodes, 1 for my Quorum nodes, and 1 for my Active Directory nodes.  If you find that you have common configuration between ConfigSets (e.g. each ConfigSet needs to format data disks), that means you can factor those operations out and apply it as a reusable Configuration in each ConfigSet.
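Continuing the SQL Server Always On example, the ConfigSets might be sketched like this (names and shape are illustrative): each ConfigSet applies its own Configuration plus a shared 'FormatDisks' Configuration factored out for reuse.

```json
{
  "ConfigSets": [
    {
      "Name": "SqlNode",
      "Endpoints": [ { "Name": "SQL", "LocalPort": 1433 } ],
      "Configurations": [ "FormatDisks", "ConfigureSql" ]
    },
    {
      "Name": "AdNode",
      "Configurations": [ "FormatDisks", "ConfigureAd" ]
    }
  ]
}
```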


Configurations

A Configuration is a set of DSC resources (or Powershell scripts) that will be applied to a VM.  You can factor your DSC resources into smaller, more re-usable sets and reference them from ConfigSets.  For instance, suppose I build out Active Directory and SQL Server in two ConfigSets.  I could have 2 different Configurations - one for configuring Active Directory and one for configuring SQL Server.  However, if I also had some common configuration (e.g. formatting attached disks), I could create a 3rd Configuration and simply include it as a reference in each ConfigSet.  This way, we keep to the DRY principle and you only need to reference the reusable Configurations amongst one or more ConfigSets.

As a best practice, we recommend building out any custom actions as a DSC resource.  Long term, this ends up being a much more manageable practice than attempting to manage configuration as simple Powershell scripts.
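As a rough sketch, a reusable Configuration wrapping a built-in DSC resource might be described like this (the shape is illustrative; the fluent C# builders in the SDK generate this JSON for you):

```json
{
  "Configurations": [
    {
      "Name": "ConfigureWeb",
      "Resources": [
        {
          "Type": "WindowsFeature",
          "Name": "IIS",
          "Args": { "Ensure": "Present", "Name": "Web-Server" }
        }
      ]
    }
  ]
}
```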

Wrap Up

Hopefully this gives you a quick tour around the Brewmaster template definition and philosophy.  As we update things to support new features in Azure, you will see the schema change slightly.  Make sure to check out our documentation to keep up.

Wednesday, May 28, 2014

How Does Brewmaster Work?

In this post, I will cover at a high level what Brewmaster does behind the scenes to enable your templates to deploy in Microsoft Azure.  There are around 8 major steps that we have to orchestrate the template deployment.

First, let's start with an architectural overview:


We built Brewmaster as an API-first service.  Other than registration itself, everything you can do in the portal can also be done via the API.  That is, once you are registered with us, you have a subscription and a secret key that can be used to deploy.  In some subsequent posts, I will detail how to use the API.  The key components of Brewmaster are the API layer that everyone interacts with, our Template Validation logic that helps ensure you don't have common deployment errors, and our Workflow Engine that handles failures and retries and allows us to efficiently scale our resources.
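As an illustration of what scripting against a shared-secret API generally looks like, here is a Python sketch of signing a request body with HMAC-SHA256.  The signing scheme and any header names are assumptions for illustration - Brewmaster's actual wire format may differ, so consult the API documentation:

```python
import hashlib
import hmac

def sign_request(secret_key: str, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature over the request body.
    NOTE: the exact canonical string that gets signed is an assumption here."""
    return hmac.new(secret_key.encode("utf-8"), body, hashlib.sha256).hexdigest()

# A client would send this value in an auth header alongside its subscription id.
signature = sign_request("my-secret-key", b'{"template": "sql-alwayson"}')
print(len(signature))  # 64 hex characters
```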

All of the actions that we take on your behalf in Azure are recorded and presented back to the user in our View repositories.  You could drive your own dashboard with this data or simply use ours.

Template Registration


Brewmaster templates are stored in Git repositories.  Today, we support both Bitbucket and Github public repositories.  There is a known layout and convention for creating templates.  When you first come to Brewmaster, you will register your Git repository for your template, or choose one of our pre-existing templates (hosted at Github).  We chose Git because we wanted to be able to version the templates and always have a repeatable deployment.  This has the effect that you can have a single template with different branches for each deployment type (Production, Development, etc.)

Once you register your Git repository with us, it becomes available as a deployment option:


When you click the Deploy button, we will pull that template from Git, parse it for any parameters you have defined and generate a UI to accept those params.  Once you submit the parameters, we will combine the template and params together, expand the template into a full deployment schema and then run our Validation logic.  As a template author, you have control over validation logic as well for any parameters coming into the template.  Once the template is validated, it is passed to our Workflow engine to start the execution.

Azure Provisioning

Brewmaster templates are scoped to a single Azure Affinity Group (AG).  We require that any Azure assets you want to either use or create reside in the same AG within a single template.  While that sounds restrictive, it really isn't.  There is not much you can do that is interesting in the IaaS world without your resources in a single AG.  This also means that if you want to deploy to multiple AGs, you must create multiple templates.

Today, we support the provisioning of IaaS VMs, Cloud Services, Storage Accounts, and Virtual Networks through Brewmaster.  We may expand that to other Azure resources (e.g. Hadoop, Service Bus, CDN, etc.) as demand dictates.



In the simple (80%) cases, Brewmaster will deploy your entire set of Azure resources in a single go.  However, for more complicated scenarios where there are dependencies between Azure resources, you can stage them using the concept of Deployment Groups.  This allows some sets of Azure resources to be provisioned and configured before other sets.  That is a more advanced topic that I will delve into later.

Bootstrap the VM

In order to communicate with the Windows machines and allow them to be configured, we must enable a few things on each node.  For instance, Powershell Remoting is required, along with Powershell 4.  The node might be rebooted during this time if required.

Brewmaster is dependent on Desired State Configuration (DSC).  In a nutshell, DSC is a declarative Powershell syntax where you define 'what to do' and leave the 'how to do' up to DSC.  This fits perfectly with our own declarative template model.  As such, we support DSC completely.  That means you can use not only any of the built-in DSC resources, but also any other DSC resources - community released, or subsequently released "Wave" resources.  Don't see a DSC resource you need?  No worries, we also give you access to the 'Script' DSC resource, which will allow you to call into arbitrary Powershell code (or your existing scripts).
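For readers new to DSC, the 'Script' resource mentioned above looks roughly like this inside a DSC configuration (a minimal sketch; TestScript decides whether SetScript needs to run at all, which is what makes re-runs cheap):

```powershell
Configuration SampleConfig
{
    Node "localhost"
    {
        # Built-in resource: declare that IIS should be present
        WindowsFeature IIS
        {
            Ensure = "Present"
            Name   = "Web-Server"
        }

        # Escape hatch: run arbitrary Powershell when no resource exists
        Script CreateDataFolder
        {
            GetScript  = { @{ Result = (Test-Path "C:\Data") } }
            TestScript = { Test-Path "C:\Data" }
            SetScript  = { New-Item -ItemType Directory -Path "C:\Data" }
        }
    }
}
```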

Once we have ensured that DSC is available, Brewmaster dynamically creates a DSC configuration script from the template and pushes it onto each node.

During this process, we also push down to each node a number of scripts and modules that the node will later execute locally.  Once your nodes have been configured with DSC and PS Remoting, we are ready to download your template package.

Pulling the Template Package

The Brewmaster template itself is a JSON file that describes your Azure and VM topology.  The template package, however, refers to both the template JSON file and any other supporting assets required for configuring the VMs.  Inside the package you can include things like custom scripts or DSC resources, installers, certificates, or other file assets.  The package has a known layout that will be used by Brewmaster.

Brewmaster instructs the node to pull the template package from the Git repository.  The node pulls the package from Git as opposed to Brewmaster pushing it to the node; the main reasons are quality of service and avoiding unintentional (or intentional) denial of service.  Theoretically, the template package could be very big, and we would not want to tie up our Brewmaster workers trying to push a large package to each node.

Running the DSC Configuration

At this point, your node has all the assets from the template package and the DSC configuration that describes how to use them all.  Brewmaster now instructs the local node to compile the DSC into the appropriate MOF file and start a staged execution.

Brewmaster periodically connects to the node to discover information about the running DSC configuration (error, status, etc.).  It pulls what it finds back into Brewmaster and exposes that data to the user at the website (or via API).  This is how you can track status of a running configuration and know when something is done.

Wrap up

Once a deployment runs to its conclusion (either failure or success), Brewmaster records the details and presents them back to the user.  In this way the user has a complete picture of what was deployed and any logs or errors that occurred.  Should something go wrong during a deployment, the user can simply fix the template definition (or an asset in the template package) and hit the 'Deploy' button again.  Brewmaster will re-execute the deployment, but this time it will skip the things it has already done (like provisioning in Azure or full bootstrapping) and will re-execute the DSC configurations.  Given the efficient nature of DSC, it will also only re-execute resources that are not in the 'desired state'.  As such, subsequent re-deploys can be magnitudes faster than the initial one.

Next up

In the coming installments, I will talk more about the key Brewmaster concepts and what each part of the schema does and about how to troubleshoot your templates and debug them.

Thursday, May 22, 2014

Introducing Brewmaster Template SDK

I am excited to announce the release of the Brewmaster Template SDK.  With our new SDK, you can author your own Brewmaster templates that can deploy almost anything into Microsoft Azure.  Over the next few weeks, I will be updating my blog with more background on how Brewmaster templates work and how to easily author them.

First, a bit of backstory:  when we first built Brewmaster 6 months ago, we released with 5 supported templates.  The templates were for popular workloads that users had difficulty deploying easily (e.g.  SQL Server Always On).  The idea was that we would allow the user to get something bootstrapped into Azure quickly and then allow the user to simply RDP into the machine to update settings that were not exactly to their liking.  Even then, we knew that it would simply not be possible to offer enough combinations of options to satisfy every possible user and desired configuration.

The situation with static templates was that they were close(ish) for everyone, but perfect for no one.  We wanted to change that and to support arbitrarily complex deployment topologies.  With the release of our Template SDK, we have done just that.  With our new Template SDK you can:

  • Define and deploy any Windows IaaS topology in Azure.  You tell us what you want and we manage the creation in Azure.  We support any number of cloud services, VMs, networking, or storage account configurations.
  • Configure any set(s) of VMs that you deploy.  Install any software, configure any roles, manage firewalls, etc.  You name it.  We support Microsoft's Desired State Configuration (DSC) technology and we support it completely.  This means you can use any of Microsoft's many DSC configurations or author your own.  Don't have DSC resources for what you want yet?  We also support arbitrary Powershell scripts, so don't worry.
  • Clone, fork, or build any template.  We are open-sourcing all of our templates that were previously released as static templates.  We support public Git deployment from both GitHub and Bitbucket, so this means you can version any template you choose or deploy any revision.  This also means you can branch a template (e.g. DevTest, Production, Feature-AB) and deploy any or all of them at runtime.
  • Easily author your templates.  While the template itself is JSON-based, we are also shipping C# fluent syntax builders that give you the Intellisense you need to build out any configuration.  For more advanced configurations, we also support the Liquid template syntax.  This means you can put complex if/then, looping, filtering, and other control logic directly in your templates should you desire.

Brewmaster itself was built API first.  What this means is that you can script out any interaction with Brewmaster.  Want to deploy daily/weekly/hourly as part of your CI build?  With Brewmaster, you can do this as well.

One other thing:  Did I mention this was free?  Everything we are talking about here is being released for free.  Deploy for free, build for free,  and share your templates for free.

Visit our site to register and get started today!

Tuesday, January 29, 2013

Anatomy of a Scalable Task Scheduler

On 1/18 we quietly released a version of our scalable task scheduler (creatively named 'Scheduler' for right now) to the Windows Azure Store.  If you missed it, you can see it in this post by Scott Guthrie.  The service allows you to schedule recurring tasks using the well-known cron syntax.  Today, we support a simple GET webhook that will notify you each time your cron fires.  However, you can be sure that we are expanding support to more choices, including (authenticated) POST hooks, Windows Azure Queues, and Service Bus Queues, to name a few.
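For reference, a cron expression is five space-separated fields (minute, hour, day of month, month, day of week).  A few examples of schedules you could register:

```
*/5 * * * *    # every 5 minutes
0 2 * * *      # daily at 2:00 AM
0 9 * * 1      # Mondays at 9:00 AM
```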

In this post, I want to share a bit about how we designed the service to support many tenants and potentially millions of tasks.  Let's start with a simplified, but accurate overall picture:


We have several main subsystems in our service (REST API façade, CRON Engine, and Task Engine) and additionally several shared subsystems across additional services (not pictured), such as Monitoring/Auditing and Billing/Usage.  Each one can be scaled independently depending on our load and overall system demand.  We knew that we needed to decouple our subsystems such that they did not depend on each other and could scale independently.  We also wanted to be able to develop each subsystem in isolation without affecting the other subsystems in use.  As such, our systems do not communicate with each other directly, but only share a common messaging schema.  All communication is done over queues and asynchronously.


REST API

This is the layer that end users communicate with and the only way to interact with the system (even our portal acts as a client).  We use a shared secret key authentication mechanism where you sign your requests and we validate them as they enter our pipeline.  We implemented this REST API using Web API.  When you interact with the REST API, you are viewing fast, lightweight views of your scheduled task setup that reflect what is stored in our Job Repository.  However, we never query the Job Repository directly, to keep it responsive to its real job - providing the source data for the CRON Engine.

CRON Engine

This subsystem was designed to do as little as possible and farm the work out to the Task Engine.  When you have an engine that evaluates cron expressions and fire times, it cannot get bogged down trying to actually do the work.  This is a potentially IO-intensive role in the subsystem that is constantly evaluating when to fire a particular cron job.  In order to support many tenants, it must be able to run continuously without bogging down in execution.  As such, this role only evaluates when a particular cron job must run and then fires a command to the Task Engine to actually execute the potentially long running job.
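Conceptually, the command dropped on the queue for the Task Engine is just a small message.  A hedged sketch (field names are illustrative, not our actual schema):

```json
{
  "JobId": "1234",
  "TenantId": "acme",
  "Action": "HttpGet",
  "Url": "http://example.com/webhook",
  "FireTime": "2013-01-29T10:00:00Z"
}
```

Because the CRON Engine only enqueues this command and moves on, a slow webhook never delays the evaluation of other tenants' schedules.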

Task Engine

The Task Engine is the grunt of the service and it performs the actual work.  It is the layer that will be scaled most dramatically depending on system load.  Commands from the CRON Engine for work are accepted and performed at this layer.  Subsequently, when the work is done it emits an event that other interested subsystems (like Audit and Billing) can subscribe to downstream.  The emitted event contains details about the outcome of the task performed and is subsequently denormalized into views that the REST API can query to provide back to a tenant.  This is how we can tell you your job history and report back any errors in execution.  The beauty of the Task Engine emitting events (instead of directly acting) is that we can subscribe many different listeners for a particular event at any time in the future.  In fact, we can orchestrate very complex workflows throughout the system as we communicate to unrelated, but vital subsystems.  This keeps our system decoupled and allows us to develop those other subsystems in isolation.

Future Enhancements

Today we are in a beta mode, intended to give us feedback about the type of jobs, frequency of execution, and what our system baseline performance should look like.  In the future, we know we will support additional types of scheduled tasks, more views into your tasks, and more complex orchestrations.  Additionally, we have set up our infrastructure such that we can deploy to multiple datacenters for resiliency (and even multiple clouds).  Give us a try today and let us know about your experience.

Thursday, December 20, 2012

Setting ClaimsAuthenticationManager Programmatically in .NET 4.5

This is a quick post today that might save folks the same trouble I had to go through when upgrading my Windows Identity Foundation (WIF) enabled MVC website to the latest version of .NET.  The scenario is that you might want to enrich the claims coming from your STS with additional claims of your choosing.  To do this, there is a common technique of creating a class that derives from ClaimsAuthenticationManager and overrides the Authenticate method.  Consider this sample ClaimsAuthenticationManager:
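A minimal sketch of such a class (ITenantRepository, the claim type, and the lookup are placeholders for your own implementation):

```csharp
using System.Security.Claims;

public class CustomClaimsAuthenticationManager : ClaimsAuthenticationManager
{
    private readonly ITenantRepository _tenantRepository;

    public CustomClaimsAuthenticationManager(ITenantRepository tenantRepository)
    {
        _tenantRepository = tenantRepository;
    }

    public override ClaimsPrincipal Authenticate(string resourceName, ClaimsPrincipal incomingPrincipal)
    {
        if (incomingPrincipal != null && incomingPrincipal.Identity.IsAuthenticated)
        {
            // enrich the incoming principal with claims looked up from our own store
            var identity = (ClaimsIdentity)incomingPrincipal.Identity;
            var tenant = _tenantRepository.GetByUser(identity.Name);
            identity.AddClaim(new Claim("http://myapp/claims/tenantid", tenant.Id));
        }

        return incomingPrincipal;
    }
}
```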

The issue we have is that we need to provide an implementation of ITenantRepository here in order to lookup the data for the additional claims we are adding.  If you are lucky enough to find the article on MSDN, it will show you how to wire in a custom ClaimsAuthenticationManager using the web.config.  I don't want to hardcode references to an implementation of my TenantRepository, so using config is not a great option for me.

In the older WIF model (Microsoft.IdentityModel) for .NET <= 4.0, you hooked the ServiceConfigurationCreated event:
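That hook looked roughly like this (a sketch; types are from Microsoft.IdentityModel, and CustomClaimsAuthenticationManager/TenantRepository are my own):

```csharp
// .NET <= 4.0, Microsoft.IdentityModel (WIF 1.0)
FederatedAuthentication.ServiceConfigurationCreated += (sender, e) =>
{
    e.ServiceConfiguration.ClaimsAuthenticationManager =
        new CustomClaimsAuthenticationManager(new TenantRepository());
};
```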

But, in .NET 4.5, all of the namespaces and a lot of the classes are updated (System.IdentityModel).  It took me a long time in Reflector to figure out how to hook the configuration being created again.  Turns out you need to reference System.IdentityModel.Services and find the FederatedAuthentication class.  Here you go:
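A sketch of the .NET 4.5 equivalent (wire this up early, e.g. in Application_Start; CustomClaimsAuthenticationManager and TenantRepository are my own types):

```csharp
// .NET 4.5, System.IdentityModel.Services
FederatedAuthentication.FederationConfigurationCreated += (sender, e) =>
{
    e.FederationConfiguration.IdentityConfiguration.ClaimsAuthenticationManager =
        new CustomClaimsAuthenticationManager(new TenantRepository());
};
```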

Happy WIF-ing.

Wednesday, October 10, 2012

Next Stop: Aditi Technologies

I am excited to announce that I have officially joined Aditi Technologies as Director of Product Services.  Taking what I have learned building large scale solutions in Windows Azure, I will be responsible for building Aditi's own portfolio of SaaS services and IP/frameworks.  We have a number of exciting projects underway and I hope to blog more about what we are building soon.

Along with this move, I get to rejoin Wade (now my boss!) and Steve as well as some of my former Cumulux colleagues.  I took this role because I see a great opportunity to build software and services in the 'cloud' and I am convinced that Aditi has been making the right investments.  It doesn't hurt at all that I get to work with top-notch technologists either.

Along the way, I plan to build a team to deliver on these cloud services.  If you think you have what it takes to build great software, send me a note and your resume.  Thanks!



Friday, May 25, 2012

Interpreting Diagnostics Data and Making Adjustments

At this point in our diagnostics saga, we have our instances busily pumping out the data we need to manage and monitor our services.  However, it is simply putting the raw data in our storage account(s).  What we really want to do is query and analyze that data to figure out what is happening.

The Basics

Here I am going to show you the basic code for querying your data.  For this, I am going to be using LINQPad.  It is a tool that is invaluable for ad hoc querying and prototyping.  You can cut & paste the following script (hit F4 and add references and namespaces for Microsoft.WindowsAzure.StorageClient.dll and System.Data.Services.Client.dll as well).

void Main()
{
    var connectionString = "DefaultEndpointsProtocol=https;AccountName=youraccount;AccountKey=yourkey";
    var account = CloudStorageAccount.Parse(connectionString);
    var client = account.CreateCloudTableClient();
    var ctx = client.GetDataServiceContext();
    var deploymentId = new Guid("25d676fb-f031-42b4-aae1-039191156d1a").ToString("N").Dump();
    var q = ctx.CreateQuery<PerfCounter>("WADPerformanceCountersTable")
        .Where(f => f.RowKey.CompareTo(deploymentId) > 0 && f.RowKey.CompareTo(deploymentId + "__|") < 0)
        .Where(f => f.PartitionKey.CompareTo(DateTime.Now.AddHours(-2).GetTicks()) > 0)
        .AsTableServiceQuery();

    //(q as DataServiceQuery<PerfCounter>).RequestUri.AbsoluteUri.Dump();
    //(q as CloudTableQuery<PerfCounter>).Expression.Dump();

    q.Dump();
}

static class Funcs
{
    public static string GetTicks(this DateTime dt)
    {
        return dt.Ticks.ToString("d19");
    }
}

[System.Data.Services.Common.DataServiceKey("PartitionKey", "RowKey")]
class PerfCounter
{
    public string PartitionKey { get; set; }
    public string RowKey { get; set; }
    public DateTime Timestamp { get; set; }
    public long EventTickCount { get; set; }
    public string Role { get; set; }
    public string DeploymentId { get; set; }
    public string RoleInstance { get; set; }
    public string CounterName { get; set; }
    public string CounterValue { get; set; }
    public int Level { get; set; }
    public int EventId { get; set; }
    public string Message { get; set; }
}

What I have done here is setup a simple script that allows me to query the table storage location for performance counters.  There are two big (and 1 little) things to note here:

  1. Notice how I am filtering down to the deployment ID (also called Private ID) of the deployment I am interested in seeing.  If you use the same storage account for multiple deployments, this is critical.
  2. Also, see how I have properly formatted the DateTime such that I can select a time range from the Partition Key appropriately.  In this example, I am retrieving the last 2 hours of data for all roles in the selected deployment.
  3. I have also commented out some useful checks you can use to test your filters.  If you uncomment the DataServiceQuery<T> line, you also should comment out the .AsTableServiceQuery() line.

Using the Data

If you haven't set absurd sample rates, you might actually get this data back in a reasonable time.  If you have lots of performance counters to monitor and/or you have high sample rates, be prepared to sit and wait for awhile.  Each tick is a single row in table storage.  You can return 1000 rows in a single IO operation.  It can take a very long time if you ask for large time ranges or have lots of data.

Once you have the query returned, you can actually export it into Excel using LINQPad and go about setting up graphs and pivot tables, etc.  This is all very doable, but also tedious.  I would not recommend this for long term management, but rather some simple point in time reporting perhaps.

For our own monitoring, we went a bit further.  We collect the raw data, then compress and index it for highly efficient searches by time.  We also scale the data for the time range; otherwise you can have a very hard time graphing 20,000 data points.  This makes it very easy to view both recent data (e.g. the last few hours) as well as data over months.  The value of the longer term data cannot be overstated.

Anyone that really wants to know what their service has been doing will likely need to invest in monitoring tools or services.  It is simply impractical to pull more than a few hours of data by querying the WADPerformanceCountersTable directly.  It is way too slow and way too much data for longer term analysis.

The Importance of Long Running Data

For lots of operations, you can just look at the last 2 hours of your data and see how your service has been doing.  We put that view as the default view you see when charting your performance counters in our service.  However, you really should back out the data from time to time and observe larger trends.  Here is an example:


This is actual data we had last year during our early development phase of the backend engine that processes all the data.  This is the Average CPU over 8 hours and it doesn't look too bad.  We really can't infer anything from this graph other than we are using about 15-35% of our CPU most of the time.

However, if we back that data out a bit:


This picture tells a whole different story.  We realized that we were slowly doing more and more work with our CPU that did not correlate with the load.  This was not a sudden shift that happened in a few hours.  This was manifesting itself over weeks.  Very slowly, for the same amount of operations, we were using more CPU.  A quick check on memory told us that we were also chewing up more memory:


We eventually figured out the issue and fixed it (serialization issue, btw) - can you tell where?


Eventually, we determined what our threshold CPU usage should be under certain loads by observing long term trends.  Now, we know that if our CPU spikes above 45% for more than 10 mins, it means something is amiss.  We now alert ourselves when we detect high CPU usage:


Similarly, we do this for many other counters as well.  There is no magic threshold to choose, but if you have enough data you will be able to easily pick out the threshold values for counters in your own application.

In the next post, I will talk about how we pull this data together with analyzers and notifications, and automatically scale to meet demand.

Shameless plug:  Interested in getting your own data from Windows Azure for monitoring, alerting, and scaling?  Try AzureOps for free!

Monday, April 16, 2012

Getting Diagnostics Data From Windows Azure

Assuming you know what to monitor and you have configured your deployments to start monitoring, now you need to actually get the data and do something with it.

First, let's briefly recap how the Diagnostics Manager (DM) stores data.  Once it has been configured, the DM will start to buffer data to disk locally on the VM using the temporary scratch disk*.  It will buffer it using the quota policy found in its configuration.  By default, this allocates 4GB of local disk space to hold diagnostics data.  You can change the quota with a little more work if you need to hold more, but most folks should be served just fine by the default.  Data is buffered FIFO (first in, first out) in order to age out the oldest data first.
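The FIFO aging policy can be pictured with a toy simulation.  This is an illustration of the policy only, not the agent's actual implementation, and the chunk sizes are made up:

```python
from collections import deque

QUOTA_MB = 4096  # default local buffer quota

def buffer_chunk(buffer, used_mb, chunk_mb):
    """Add a chunk of diagnostics data, evicting the oldest chunks
    until the new one fits inside the quota."""
    while buffer and used_mb + chunk_mb > QUOTA_MB:
        used_mb -= buffer.popleft()  # oldest data is aged out first
    buffer.append(chunk_mb)
    return used_mb + chunk_mb

buf, used = deque(), 0
for _ in range(5):
    used = buffer_chunk(buf, used, 1024)  # five 1GB chunks into a 4GB quota
print(len(buf), used)  # 4 1024MB chunks survive; the first was aged out
```

The takeaway: once the quota fills, anything not yet transferred to storage is lost silently, which is why transfer frequency (below) matters.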

Scheduled versus OnDemand

Once the data is buffering locally on the VM, you need to somehow transfer the data from the VM to your cloud storage account.  You can do this by either setting a Scheduled or OnDemand transfer.  In practice, I tend to recommend always using Scheduled transfers and ignoring the OnDemand option (it ends up being a lot easier). 

But, for completeness, here is an example of setting an OnDemand transfer:

void Main()
{
    var account = new CloudStorageAccount(
        new StorageCredentialsAccountAndKey("dunnry", "yourkey"), true);
    var mgr = new DeploymentDiagnosticManager(account, "6468a8b749a54c3...");
    foreach (string role in mgr.GetRoleNames())
    {
        var ridm = mgr.GetRoleInstanceDiagnosticManagersForRole(role);
        var options = new OnDemandTransferOptions()
        {
            From = DateTime.UtcNow - TimeSpan.FromMinutes(10),
            To = DateTime.UtcNow,
            NotificationQueueName = "pollme"
        };
        var qc = account.CreateCloudQueueClient();
        var q = qc.GetQueueReference("pollme");
        q.CreateIfNotExist();
        foreach (var i in ridm)
        {
            //end any outstanding transfers first or the new one will fail
            foreach (var pt in i.GetActiveTransfers())
            {
                i.EndOnDemandTransfer(pt.Key);
            }
            var key = i.BeginOnDemandTransfer(DataBufferName.Logs, options);
            //poll the 'pollme' queue for the completion notification, then
            //remember to call EndOnDemandTransfer(key) here as well
        }
    }
}

It's not exactly straightforward, but essentially you specify the time range to transfer and, optionally, a queue to notify when the transfer completes.  You must ensure that all outstanding OnDemand transfers are ended before you can begin a new one, and ideally you should also end the transfer once it completes.  In theory, this gives you some flexibility on what you want transferred.

As with most things in life, there are some gotchas to using this code.  Most of the time, folks forget to cancel the transfer after it completes.  When that happens, it prevents any updates to the affected data source.  This can bite you when you try to set new performance counters and see an error about a pending OnDemand transfer, for instance.  As such, you end up writing a lot of code to detect and cancel pending transfers before doing anything else in the API.

Using Scheduled transfers ends up being easier in the long run because you end up getting the same amount of data, but without having the pain of remembering to cancel pending transfers and all that.  Here is similar code (you should adapt for each data source you need to transfer):

void Main()
{
    var account = new CloudStorageAccount(
        new StorageCredentialsAccountAndKey("dunnry", "yourkey"), true);
    var mgr = new DeploymentDiagnosticManager(account, "6468a8b749a54c3...");
    foreach (string role in mgr.GetRoleNames())
    {
        var ridm = mgr.GetRoleInstanceDiagnosticManagersForRole(role);
        foreach (var idm in ridm)
        {
            var config = idm.GetCurrentConfiguration()
                ?? DiagnosticMonitor.GetDefaultInitialConfiguration();
            config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);
            //set other scheduled intervals here...

            //apply the updated configuration or nothing will change
            idm.SetCurrentConfiguration(config);
        }
    }
}


This ends up being the technique we use for AzureOps.  When you set up your subscription with us, we detect the diagnostics connection string and allow you to change your data source settings.  For Performance Counters, we force the transfer period to 5 minutes (a good compromise) and allow you to choose the interval for other sources (i.e. Traces, Windows Event Logs).  When you use a provider like AzureOps, it is usually best to stream the data in relatively small chunks as opposed to, say, transferring once an hour.  Firstly, we can't do anything with your data until we see it, and you probably want to be notified sooner than once an hour.  Secondly, when you set long transfer periods, there is a risk that you exceed the buffer quota and start to lose data that was never transferred.  In practice, we have not observed any noticeable overhead from transferring more often.  When in doubt, pick 5 mins.
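To see why long transfer periods risk hitting the quota, a little arithmetic helps.  The 10 MB/min logging rate below is an assumed figure for illustration, not a measurement:

```python
QUOTA_MB = 4096  # default local diagnostics buffer

def hours_until_overflow(mb_per_minute):
    """How long the local buffer lasts before old data starts aging out."""
    return QUOTA_MB / mb_per_minute / 60

# A chatty deployment logging 10 MB/min fills the buffer in under 7 hours,
# so a 5-minute transfer is perfectly safe, but a daily transfer would
# silently lose most of the data before it ever reached storage.
print(round(hours_until_overflow(10), 1))  # 6.8
```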

Whew!  If you have made it this far, you now have a reasonable set of performance counters and trace information that is both being collected on your VMs in Windows Azure as well as being persisted to your storage account.  So, essentially, you need to now figure out what to do with that data.  That will be the subject of the next post in this series.


*If you are interested, RDP into an instance and check the resource drive (usually C:) under /Resources/Directory/<roleuniquename>/Monitor to see the buffered data.

Monday, February 27, 2012

Setting Up Diagnostics Monitoring In Windows Azure

In order to actually monitor anything in Windows Azure, you need to use the Diagnostics Manager (DM) that ships out of the box.  SaaS providers like AzureOps rely on this data in order to tell you how your system is behaving.  The DM supports a few data sources that it can collect and transfer:

  • Performance Counters
  • Trace logs
  • IIS Logs
  • Event Logs
  • Infrastructure Logs
  • Arbitrary logs

One of the most common issues I hear from customers is that they don't know how to get started using the DM or they think they are using it and just cannot find the data where they think they should.  Hopefully, this post will clear up a bit about how the DM actually works and how to configure it.  The next post will talk about how to get the data once you are setup.

Setting Up The Diagnostics Manager

Everything starts by checking a box.  When you check the little box in Visual Studio that says "Enable Diagnostics", it actually modifies your Service Definition to include a role plugin.  Role plugins are little artifacts that can add to your service definition and configuration, similar to a macro.  If you have ever used Diagnostics or the RDP capability in Windows Azure, you have used a role plugin.  For most of their history, these plugins have been built exclusively by Microsoft, but there is really nothing stopping you from using the mechanism yourself (that is another topic).



If you check the 'diagnostics' folder under the plugins directory in your SDK folder, you will find the magic that is used to launch the DM.

<?xml version="1.0" ?>
<RoleModule xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition"
            namespace="Microsoft.WindowsAzure.Plugins.Diagnostics">
  <Startup priority="-2">
    <Task commandLine="DiagnosticsAgent.exe" executionContext="limited" taskType="background" />
    <Task commandLine="DiagnosticsAgent.exe /blockStartup" executionContext="limited" taskType="simple" />
  </Startup>
  <ConfigurationSettings>
    <Setting name="ConnectionString" />
  </ConfigurationSettings>
</RoleModule>

Here, we can see that the DM is implemented as a pair of startup tasks.  Notice that one uses a task type of background (the other is simple and blocks startup until the agent gets going).  This means the DM exists outside the code you write and should be impervious to your code crashing and taking it down.  You can also see that the startup tasks will run with a default priority of -2, which tries to ensure that they run before any other startup tasks.  The idea is that you want the DM to start before other stuff so it can collect data for you.

You can also see in the definition the declaration of a new ConfigurationSettings element with a single Setting called 'ConnectionString'.  If you are using Visual Studio when you import the Diagnostics plugin, the tooling automatically combines the plugin namespace with the setting name and creates a new setting called Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString.  This setting will not exist if you are building your csdef or cscfg files by hand.  You must remember to include it.
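If you are authoring the files by hand, the setting looks roughly like this (the fragment below is a sketch of the relevant elements, not copied from a generated project):

```xml
<!-- ServiceConfiguration.cscfg, inside your <Role> element -->
<ConfigurationSettings>
  <Setting name="Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString"
           value="UseDevelopmentStorage=true" />
</ConfigurationSettings>
```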

Once you have the plugin actually enabled in your solution, you will need to specify a valid connection string in order for the DM to operate.  Here you have two choices:

  1. When running in emulation, it is valid to use "UseDevelopmentStorage=true" as the ConnectionString.
  2. Before deploying to the cloud, you must remember to update that to a valid storage account (i.e. "DefaultEndpointsProtocol=https;AccountName=youraccount;AccountKey=nrIXB.")

Common Pitfalls

It seems simple enough, but here come the first set of common pitfalls I see:

  1. Forgetting to set the ConnectionString to a valid storage account and deploying with 'UseDevelopmentStorage=true'.  This has become less of a factor in 1.6+ SDK tooling because you will notice the checkbox that says, "Use publish storage account as connection string when you publish to Windows Azure".  However, the tooling will not help you with automated deploys or if you forget to check that box.
  2. Using "DefaultEndpointsProtocol=http" in the connection string (note the missing 's' from 'https').  While it is technically possible to use the DM with an http connection, it is not worth it.  Just use https and save yourself the hassle of troubleshooting this later.
  3. Setting an invalid connection string.  Hard to believe, but I see it all the time now on AzureOps.  This usually falls into two categories: deleting a storage account, and regenerating a storage key.  If you delete a storage account, but forget to remove that as the ConnectionString, things won't work (shocking, I know).  Further, if you decide to regenerate the primary or secondary storage keys and you were using them, stuff won't work here either.  Seems obvious, but you won't actually get any warning on this.  Stuff won't work and you will have to figure that out yourself.  A good 3rd party provider (like AzureOps) will let you know however.
  4. Forgetting to co-locate the diagnostics storage account with the hosted service.  This one might not show itself until you see the bill.  The diagnostics agent can be pretty chatty.  I have seen GBs of data logged in a single minute.  Forgetting to co-locate that would run you a pretty hefty bandwidth bill in addition to slowing you down.

Best Practices

Setting up the Diagnostics Manager is not terribly hard, but it is easy to get wrong if you are not familiar with it.  There are also some subtle missteps here that can shoot you in the foot.  Here are some things you can do that will make your life easier:

  1. Always separate your diagnostics storage account from other storage accounts.  This is especially important for production systems.  You do not want diagnostics competing with your primary storage account for resources.  There is an account-wide limit of 5,000 transactions per second across tables, queues, and blobs.  When you use a single storage account for both, you could unintentionally throttle your production account.
  2. If possible, use a different diagnostics storage account per hosted service.  If that is not practical, at least try to separate storage accounts for production versus non-production systems.  It turns out that querying diagnostics data can be difficult when many different systems log to the same diagnostics tables.  What I have seen many times is someone using the same diagnostics account for load testing against non-production systems as for their production system.  The amount of data from the non-production systems can greatly exceed that of the production system, and querying for the production data then becomes akin to finding a needle in a haystack.  It can take a very long time in some cases to query even simple things.
  3. Don't use the Anywhere location.  This applies to all storage accounts and all hosted services.  This might seem obvious, but I see it all the time.  It is possible to use the Anywhere location with affinity groups and avoid pitfall #4, but it is not worth the hassle.  Additionally, if you have a 3rd party (like AzureOps) monitoring your data, we cannot geo-locate a worker next to you to pull your data.  We won't know where you are located and it could mean big bandwidth bills for you.
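The arithmetic behind best practice #1 is worth spelling out.  The function and numbers below are hypothetical, for illustration only; they estimate the storage transactions per second a diagnostics account absorbs versus the 5,000 transactions/sec account-wide limit:

```python
ACCOUNT_LIMIT_TPS = 5000  # account-wide limit across tables, queues, blobs

def diag_tps(instances, writes_per_transfer, transfer_period_secs):
    """Average storage transactions/sec generated by scheduled transfers."""
    return instances * writes_per_transfer / transfer_period_secs

# 20 instances each writing 3,000 entities per 5-minute transfer:
load = diag_tps(20, 3000, 300)
print(load)  # 200.0 tps

# Harmless in a dedicated account, but that is 4% of the limit stolen
# from a production account if you share one - headroom you want back.
print(round(load / ACCOUNT_LIMIT_TPS * 100))  # 4
```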

At this point, if you have enabled the DM, and remembered to set a valid connection string, you are almost home.  The last thing to do is actually get the data and avoid common pitfalls there.  That is the topic for the next post.

Sunday, February 19, 2012

Choosing What To Monitor In Windows Azure

One of the first questions I often get when onboarding a customer is "What should I be monitoring?".  There is no definitive list, but there are certainly some things that tend to be more useful.  I recommend the following Performance Counters in Windows Azure for all role types at bare minimum:

  • \Processor(_Total)\% Processor Time
  • \Memory\Available Bytes
  • \Memory\Committed Bytes
  • \.NET CLR Memory(_Global_)\% Time in GC

You would be surprised how much these 4 counters tell someone without any other input.  You can see trends very clearly when monitoring over weeks, which will tell you the 'normal' range your application should be in.  If you start to see any of these counters spike (or spike down, in the case of Available Bytes), this should be an indicator that something is going on that you should care about.

Web Roles

For ASP.NET applications, there are some additional counters that tend to be pretty useful:

  • \ASP.NET Applications(__Total__)\Requests Total
  • \ASP.NET Applications(__Total__)\Requests/Sec
  • \ASP.NET Applications(__Total__)\Requests Not Authorized
  • \ASP.NET Applications(__Total__)\Requests Timed Out
  • \ASP.NET Applications(__Total__)\Requests Not Found
  • \ASP.NET Applications(__Total__)\Request Error Events Raised
  • \Network Interface(*)\Bytes Sent/sec

If you are using something other than the latest version of .NET, you might need to choose the version-specific instances of these counters.  By default, these are only going to work for .NET 4 ASP.NET apps.  If you are using .NET 2 CLR apps (including .NET 3.5), you will want to choose the version-specific counters (e.g. \ASP.NET Apps v2.0.50727(__Total__)\Requests/Sec).

The last counter you see in this list is somewhat special as it includes a wildcard instance (*).  This is important to use in Windows Azure as the names of the actual network adapter instances can (and tend to) change over time and across deployments.  Sometimes it is "Local Area Connection* 12", sometimes it is "Microsoft Virtual Machine Bus Network Adapter".  The latter tends to be the one you most often see with data, but just to be sure, I would include them all.  Note, this is not an exhaustive list - if you have custom counters or additional system counters that are meaningful, by all means include them.  In AzureOps, we can set these remotely on your instances using the property page for your deployment.


Choosing a Sample Rate

You should not need to sample any counter faster than 30 seconds.  Period.  In fact, in 99% of all cases, I would actually recommend 120 seconds (the default we recommend in AzureOps).  This might seem like you are losing too much data or that you are going to miss something.  However, experience has shown that this sample rate is more than sufficient to monitor the system over days, weeks, and months with enough resolution to know what is happening in your application.  The difference between 30 seconds and 120 seconds is 4 times as much data.  When you sample at 1 and 5 second rates, you are talking about 120x and 24x the amount of data.  That is per instance, by the way.  If you have more than 1 instance, multiply by the number of instances.  It quickly approaches absurd quantities of data that cost you money in transactions and storage, add no analytical value, and are a lot more painful to keep.  Resist the urge to use 1, 5, or even 10 seconds - try 120 seconds to start and tune down if you really need to.
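Those multipliers are simple arithmetic; here is the same math spelled out, relative to the 120-second baseline:

```python
base = 120  # recommended sample rate, in seconds

# Data volume per counter, per instance, relative to the baseline:
multipliers = {rate: base // rate for rate in (30, 5, 1)}
print(multipliers)  # {30: 4, 5: 24, 1: 120}

# And it scales linearly with instance count: four instances sampling
# every second produce 480x the baseline data volume.
print(4 * multipliers[1])  # 480
```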


The other thing I recommend to our customers is to use tracing in their applications.  If you only use the built-in Trace.TraceInformation (and similar), you are ahead of the game.  There is an excellent article on MSDN about how to set up more advanced tracing with TraceSources that I recommend as well.

I recommend using tracing for a variety of reasons.  First, it will definitely help you when your app is running in the cloud and you want to gain insight into issues you see.  If you have logged exceptions or critical code paths to Trace, then you have potential insight into your system.  Additionally, you can use this as a type of metric in the system to be mined later.  For instance, you can log the length of time a particular request or operation takes.  Later, you can pull those logs and analyze what the bottleneck was in your running application.  Within AzureOps, we can parse trace messages in a variety of ways (including semantically).  We use this functionality to alert ourselves when something strange is happening (more on this in a later post).

The biggest obstacle I see with new customers is remembering to turn on transfer for their trace messages.  Luckily, within AzureOps, this is again easy to do.  Simply set a Filter Level and a Transfer Interval (I recommend 5 mins).


The Filter Level will depend a bit on how you use filtering in your own traces.  I have seen folks that trace rarely, so lower filters are fine.  However, I have also seen customers trace upwards of 500 traces/sec.  As a point of reference, at that level of tracing, you are talking about 2GB of data on the wire each minute if you transfer at that verbosity.  Heavy tracers, beware!  I usually recommend Verbose for light tracers and Warning for tracers that instrument each method, for instance.  You can always change this setting later, so don't worry too much right now.
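As a quick sanity check on that 2GB/minute figure, the per-record size is easy to back out from the quoted numbers (the record size below is inferred from those figures, not measured):

```python
traces_per_sec = 500
records_per_min = traces_per_sec * 60  # 30,000 records per minute

gb_per_min = 2
kb_per_record = gb_per_min * 1024 * 1024 / records_per_min

# The quoted rate implies roughly 70 KB per formatted trace record on
# the wire - formatted trace entities carry substantial overhead beyond
# the message text itself.
print(records_per_min, round(kb_per_record))  # 30000 70
```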

Coming Up

In the next post, I will walk you through how to setup your diagnostics in Windows Azure and point out some common pitfalls that I see.