Monday, February 27, 2012

Setting Up Diagnostics Monitoring In Windows Azure

To monitor anything in Windows Azure, you need to use the Diagnostics Manager (DM) that ships out of the box.  SaaS providers like AzureOps rely on this data to tell you how your system is behaving.  The DM supports several data sources that it can collect and transfer:

  • Performance Counters
  • Trace logs
  • IIS Logs
  • Event Logs
  • Infrastructure Logs
  • Arbitrary logs

One of the most common issues I hear from customers is that they don't know how to get started with the DM, or they think they are using it and just cannot find the data where they expect it to be.  Hopefully, this post will clear up how the DM actually works and how to configure it.  The next post will talk about how to get the data once you are set up.

Setting Up The Diagnostics Manager

Everything starts by checking a box.  When you check the little box in Visual Studio that says "Enable Diagnostics", it actually modifies your Service Definition to include a role plugin.  Role plugins are small packages that add to your definition and configuration, similar to a macro.  If you have ever used Diagnostics or the RDP capability in Windows Azure, you have used a role plugin.  For most of their history, these plugins have been built exclusively by Microsoft, but there is really nothing stopping you from building one yourself (that is another topic).



If you look in the plugins directory of your SDK folder, inside the 'diagnostics' folder, you will find the magic that is used to launch the DM.

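<?xml version="1.0" ?>
<RoleModule xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition" namespace="Microsoft.WindowsAzure.Plugins.Diagnostics">
  <Startup priority="-2">
    <Task commandLine="DiagnosticsAgent.exe" executionContext="limited" taskType="background" />
    <Task commandLine="DiagnosticsAgent.exe /blockStartup" executionContext="limited" taskType="simple" />
  </Startup>
  <ConfigurationSettings>
    <Setting name="ConnectionString" />
  </ConfigurationSettings>
</RoleModule>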

Here, we can see that the DM is implemented as a pair of startup tasks.  Notice that the first uses a task type of background, so the agent keeps running alongside your role, while the second is a simple task that blocks startup until the agent gets going.  This means that the DM exists outside the code you write, so your code crashing should not take it down as well.  You can also see that the startup tasks listed here run with a priority of -2.  This just tries to ensure that they run before any other startup tasks.  The idea is that you want the DM to start before everything else so it can collect data for you.

You can also see in the definition the declaration of a new ConfigurationSettings element with a single Setting called 'ConnectionString'.  If you are using Visual Studio when you import the Diagnostics plugin, you will see that the tooling automatically combines the namespace with the setting name and creates a new Setting called Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString.  This setting will not exist if you are building your csdef or cscfg files by hand.  You must remember to include it.
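If you do build your configuration by hand, a small check can catch a missing setting before you deploy.  The following is only a minimal sketch: the function name, the sample role names, and SAMPLE_CSCFG are hypothetical, and it assumes the setting carries the full name the tooling generates, Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString.

```python
import xml.etree.ElementTree as ET

# XML namespace used by .cscfg (service configuration) files.
CSCFG_NS = "{http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration}"
DIAG_SETTING = "Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString"


def roles_missing_diagnostics_setting(cscfg_text):
    """Return the names of roles with no non-empty diagnostics connection string."""
    root = ET.fromstring(cscfg_text)
    missing = []
    for role in root.iter(CSCFG_NS + "Role"):
        values = [
            setting.get("value", "")
            for setting in role.iter(CSCFG_NS + "Setting")
            if setting.get("name") == DIAG_SETTING
        ]
        if not any(value.strip() for value in values):
            missing.append(role.get("name"))
    return missing


# Hypothetical hand-written configuration: WebRole1 has the setting,
# WorkerRole1 does not.
SAMPLE_CSCFG = """<?xml version="1.0"?>
<ServiceConfiguration serviceName="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="WebRole1">
    <Instances count="1" />
    <ConfigurationSettings>
      <Setting name="Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString"
               value="UseDevelopmentStorage=true" />
    </ConfigurationSettings>
  </Role>
  <Role name="WorkerRole1">
    <Instances count="1" />
    <ConfigurationSettings />
  </Role>
</ServiceConfiguration>
"""

if __name__ == "__main__":
    print(roles_missing_diagnostics_setting(SAMPLE_CSCFG))
```

A check like this only confirms that the setting is present and non-empty.  It says nothing about whether the value points at a usable storage account.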

Once you have the plugin actually enabled in your solution, you will need to specify a valid connection string in order for the DM to operate.  Here you have two choices:

  1. When running in emulation, it is valid to use "UseDevelopmentStorage=true" as the ConnectionString.
  2. Before deploying to the cloud, you must remember to update that to a valid storage account (e.g. "DefaultEndpointsProtocol=https;AccountName=youraccount;AccountKey=nrIXB.")

Common Pitfalls

It seems simple enough, but here comes the first set of common pitfalls I see:

  1. Forgetting to set the ConnectionString to a valid storage account and deploying with 'UseDevelopmentStorage=true'.  This has become less of a factor in 1.6+ SDK tooling because you will notice the checkbox that says, "Use publish storage account as connection string when you publish to Windows Azure".  However, tooling will not help you here for automated deploys or when you forget to check that box.
  2. Using "DefaultEndpointsProtocol=http" in the connection string (note the missing 's' from 'https').  While it is technically possible to use the DM with an http connection, it is not worth the hassle.  Just use https and save yourself the troubleshooting later.
  3. Setting an invalid connection string.  Hard to believe, but I see it all the time now on AzureOps.  This usually falls into two categories: deleting a storage account, and regenerating a storage key.  If you delete a storage account but forget to remove it from the ConnectionString, things won't work (shocking, I know).  Further, if you regenerate the primary or secondary storage key and you were using it, stuff won't work here either.  Seems obvious, but you won't actually get any warning about this.  Stuff won't work and you will have to figure that out yourself.  A good third-party provider (like AzureOps) will let you know, however.
  4. Forgetting to co-locate the diagnostics storage account with the hosted service.  This one might not show itself until you see the bill.  The diagnostics agent can be pretty chatty.  I have seen GBs of data logged in a single minute.  Forgetting to co-locate that would run you a pretty hefty bandwidth bill in addition to slowing you down.
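The first two pitfalls can be caught offline with a simple pre-deployment check.  Here is a minimal sketch, assuming the connection string is a plain list of Key=Value pairs separated by semicolons; the function names and warning messages are hypothetical.  It cannot detect a deleted account, a regenerated key, or a storage account in the wrong location, since those require talking to the service.

```python
def parse_connection_string(conn):
    """Split 'Key=Value;Key=Value' into a dict.

    Only the first '=' in each pair separates key from value, because
    account keys are base64 and often end in '='.
    """
    parts = {}
    for chunk in conn.split(";"):
        if not chunk.strip():
            continue
        key, _, value = chunk.partition("=")
        parts[key.strip()] = value.strip()
    return parts


def lint_diagnostics_connection_string(conn):
    """Return a list of warnings for a connection string about to be deployed."""
    parts = parse_connection_string(conn)
    warnings = []
    # Pitfall 1: development storage left in place for a cloud deployment.
    if parts.get("UseDevelopmentStorage", "").lower() == "true":
        warnings.append("development storage is only valid in the emulator")
        return warnings
    # Pitfall 2: http instead of https.
    if parts.get("DefaultEndpointsProtocol", "").lower() != "https":
        warnings.append("DefaultEndpointsProtocol should be https")
    # Obviously malformed strings: missing account name or key.
    for required in ("AccountName", "AccountKey"):
        if not parts.get(required):
            warnings.append("missing " + required)
    return warnings


if __name__ == "__main__":
    for candidate in (
        "UseDevelopmentStorage=true",
        "DefaultEndpointsProtocol=http;AccountName=youraccount;AccountKey=abc==",
    ):
        print(candidate, "->", lint_diagnostics_connection_string(candidate))
```

Running something like this as part of an automated deploy covers the case where the Visual Studio checkbox is not there to help you.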

Best Practices

Setting up the Diagnostics Manager is not terribly hard, but it is easy to get wrong if you are not familiar with it.  There are also some subtle ways to shoot yourself in the foot here.  Here are some things you can do that will make your life easier:

  1. Always separate your diagnostics storage account from other storage accounts.  This is especially important for production systems.  You do not want diagnostics competing with your primary storage account for resources.  There is an account-wide limit of 5,000 transactions per second across tables, queues, and blobs.  When you use a single storage account for both, you could unintentionally throttle your production account.
  2. If possible, use a different diagnostics storage account per hosted service.  If that is not practical, at least try to separate storage accounts for production versus non-production systems.  It turns out that querying diagnostics data can be difficult if there are many different systems logging to the same diagnostics tables.  What I have seen many times is someone using the same diagnostics account for load testing against non-production systems as for their production system.  What happens is that the amount of data from the non-production system can greatly exceed that of the production system.  Querying for production data then becomes akin to finding a needle in a haystack.  It can take a very long time in some cases to query even for simple things.
  3. Don't use the Anywhere location.  This applies to all storage accounts and all hosted services.  This might seem obvious, but I see it all the time.  It is possible to use the Anywhere location with affinity groups and avoid pitfall #4, but it is not worth the hassle.  Additionally, if you have a third party (like AzureOps) monitoring your data, we cannot geo-locate a worker next to you to pull your data.  We won't know where you are located and it could mean big bandwidth bills for you.
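The first two practices are easy to audit if you keep an inventory of which service uses which storage account.  The sketch below assumes such an inventory exists as a simple list of dictionaries; the field names, service names, and account names are all hypothetical.

```python
from collections import defaultdict


def find_shared_diagnostics_accounts(services):
    """Flag diagnostics accounts that break practices 1 and 2.

    Each service is a dict with the keys 'name', 'environment',
    'diagnostics_account', and 'data_accounts' (a list of account names).
    """
    warnings = []
    all_data_accounts = {acct for svc in services for acct in svc["data_accounts"]}
    users = defaultdict(list)
    for svc in services:
        diag = svc["diagnostics_account"]
        users[diag].append(svc)
        # Practice 1: diagnostics should not compete with application data.
        if diag in all_data_accounts:
            warnings.append(
                f"{svc['name']}: diagnostics account '{diag}' is also used for application data"
            )
    # Practice 2: do not mix production and non-production diagnostics.
    for account in sorted(users):
        environments = {svc["environment"] for svc in users[account]}
        if len(environments) > 1:
            names = ", ".join(sorted(svc["name"] for svc in users[account]))
            warnings.append(f"'{account}' is shared across environments by {names}")
    return warnings


# Hypothetical inventory for illustration only.
SERVICES = [
    {"name": "shop-prod", "environment": "production",
     "diagnostics_account": "shopdiag", "data_accounts": ["shopdata"]},
    {"name": "shop-loadtest", "environment": "test",
     "diagnostics_account": "shopdiag", "data_accounts": ["shoptestdata"]},
    {"name": "blog-prod", "environment": "production",
     "diagnostics_account": "blogdata", "data_accounts": ["blogdata"]},
]

if __name__ == "__main__":
    for warning in find_shared_diagnostics_accounts(SERVICES):
        print(warning)
```

In this made-up inventory, blog-prod logs diagnostics into its own data account, and the load-test deployment shares a diagnostics account with production, which is exactly the needle-in-a-haystack situation described above.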

At this point, if you have enabled the DM, and remembered to set a valid connection string, you are almost home.  The last thing to do is actually get the data and avoid common pitfalls there.  That is the topic for the next post.