Friday, May 25, 2012

Interpreting Diagnostics Data and Making Adjustments

At this point in our diagnostics saga, we have our instances busily pumping out the data we need to manage and monitor our services.  However, the diagnostics agent is simply depositing the raw data in our storage account(s).  What we really want to do is query and analyze that data to figure out what is happening.

The Basics

Here I am going to show you the basic code for querying your data.  For this, I am going to be using LINQPad.  It is a tool that is invaluable for ad hoc querying and prototyping.  You can cut & paste the following script (hit F4 and add references and namespaces for Microsoft.WindowsAzure.StorageClient.dll and System.Data.Services.Client.dll as well).

void Main()
{
    var connectionString = "DefaultEndpointsProtocol=https;AccountName=youraccount;AccountKey=yourkey";
    var account = CloudStorageAccount.Parse(connectionString);
    var client = account.CreateCloudTableClient();
    var ctx = client.GetDataServiceContext();
    var deploymentId = new Guid("25d676fb-f031-42b4-aae1-039191156d1a").ToString("N").Dump();
    var q = ctx.CreateQuery<PerfCounter>("WADPerformanceCountersTable")
        .Where(f => f.RowKey.CompareTo(deploymentId) > 0 && f.RowKey.CompareTo(deploymentId + "__|") < 0)
        .Where(f => f.PartitionKey.CompareTo(DateTime.Now.AddHours(-2).GetTicks()) > 0)
        .AsTableServiceQuery()
        .Dump();

    //(q as DataServiceQuery<PerfCounter>).RequestUri.AbsoluteUri.Dump();
    //(q as CloudTableQuery<PerfCounter>).Expression.Dump();
}

static class Funcs
{
    public static string GetTicks(this DateTime dt)
    {
        return dt.Ticks.ToString("d19");
    }
}

[System.Data.Services.Common.DataServiceKey("PartitionKey", "RowKey")]
class PerfCounter
{
    public string PartitionKey { get; set; }
    public string RowKey { get; set; }
    public DateTime Timestamp { get; set; }
    public long EventTickCount { get; set; }
    public string Role { get; set; }
    public string DeploymentId { get; set; }
    public string RoleInstance { get; set; }
    public string CounterName { get; set; }
    public string CounterValue { get; set; }
    public int Level { get; set; }
    public int EventId { get; set; }
    public string Message { get; set; }
}

What I have done here is set up a simple script that allows me to query the table storage location for performance counters.  There are two big (and one little) things to note here:

  1. Notice how I am filtering down to the deployment ID (also called Private ID) of the deployment I am interested in seeing.  If you use the same storage account for multiple deployments, this is critical.
  2. Also, see how I have properly formatted the DateTime so that I can select a time range from the PartitionKey appropriately.  In this example, I am retrieving the last 2 hours of data for all roles in the selected deployment.
  3. I have also commented out some useful checks you can use to test your filters.  If you uncomment the DataServiceQuery<T> line, you should also comment out the .AsTableServiceQuery() line.
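
To make the PartitionKey formatting in point 2 concrete, here is a small Python sketch (Python used purely for illustration) of why the "d19" tick format works: zero-padding the tick count to 19 digits makes lexicographic string comparison match chronological order, which is what the range filter relies on.

```python
from datetime import datetime

def to_partition_key(dt):
    """Format a timestamp as .NET-style ticks (100 ns intervals since
    0001-01-01), zero-padded to 19 digits like the 'd19' format above."""
    delta = dt - datetime(1, 1, 1)
    ticks = (delta.days * 86_400 + delta.seconds) * 10_000_000 \
        + delta.microseconds * 10
    return str(ticks).zfill(19)

# Zero-padding keeps string order == time order, so a simple
# 'PartitionKey > cutoff' range filter selects the last N hours.
earlier = to_partition_key(datetime(2012, 5, 25, 10, 0))
later = to_partition_key(datetime(2012, 5, 25, 12, 0))
assert earlier < later and len(earlier) == 19
```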

Using the Data

If you haven't set absurd sample rates, you might actually get this data back in a reasonable time.  If you have lots of performance counters to monitor and/or high sample rates, be prepared to sit and wait for a while.  Each sample is a single row in table storage, and a query returns at most 1,000 rows per round trip.  It can take a very long time if you ask for large time ranges or have lots of data.
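
A back-of-the-envelope calculation shows why these queries crawl.  This Python sketch uses made-up instance and counter counts; plug in your own deployment's numbers:

```python
import math

def query_cost(instances, counters, hours, sample_secs, page_size=1000):
    """Estimate rows scanned and sequential round trips for a time-range
    query: one row per sample per counter per instance, at most
    `page_size` rows returned per continuation."""
    rows = instances * counters * (hours * 3600 // sample_secs)
    return rows, math.ceil(rows / page_size)

# 4 instances x 10 counters, last 24 hours at a 5-second sample rate:
rows, trips = query_cost(4, 10, 24, 5)
# 691,200 rows and 692 sequential round trips -- a long wait
```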

Once you have the query returned, you can actually export it into Excel using LINQPad and go about setting up graphs, pivot tables, etc.  This is all very doable, but also tedious.  I would not recommend this for long-term management, but rather for simple point-in-time reporting.

For AzureOps, we went a bit further.  We collect the raw data, then compress and index it for highly efficient searches by time.  We also scale the data to the time range; otherwise, you can have a very hard time graphing 20,000 data points.  This makes it very easy to view both recent data (e.g. the last few hours) as well as data over months.  The value of the longer-term data cannot be overstated.

Anyone that really wants to know what their service has been doing will likely need to invest in monitoring tools or services (e.g. AzureOps).  It is simply impractical to pull more than a few hours of data by querying the WADPerformanceCountersTable directly.  It is way too slow and way too much data for longer-term analysis.

The Importance of Long Running Data

For lots of operations, you can just look at the last 2 hours of your data and see how your service has been doing.  We put that view as the default view you see when charting your performance counters in AzureOps.  However, you really should back out the data from time to time and observe larger trends.  Here is an example:


This is actual data we had last year during the early development phase of the backend engine that processes all the data.  This is the average CPU over 8 hours, and it doesn't look too bad.  We really can't infer anything from this graph other than that we are using about 15-35% of our CPU most of the time.

However, if we back that data out a bit:


This picture tells a whole different story.  We realized that we were slowly doing more and more work with our CPU that did not correlate with the load.  This was not a sudden shift that happened in a few hours; it was manifesting itself over weeks.  Very slowly, for the same number of operations, we were using more CPU.  A quick check on memory told us that we were also chewing up more memory:


We eventually figured out the issue and fixed it (serialization issue, btw) - can you tell where?


Eventually, we determined what our threshold CPU usage should be under certain loads by observing long-term trends.  Now we know that if our CPU spikes above 45% for more than 10 minutes, something is amiss.  We now alert ourselves when we detect high CPU usage:


Similarly, we do this for many other counters as well.  There is no magic threshold to choose, but if you have enough data you will be able to easily pick out the threshold values for counters in your own application.

In the next post, I will talk about how we pull this data together with analyzers and notifications, and automatically scale to meet demand.

Shameless plug:  Interested in getting your own data from Windows Azure with monitoring, alerting, and scaling?  Try AzureOps for free!

Monday, April 16, 2012

Getting Diagnostics Data From Windows Azure

Assuming you know what to monitor and you have configured your deployments to start monitoring, now you need to actually get the data and do something with it.

First, let's briefly recap how the Diagnostics Manager (DM) stores data.  Once it has been configured, the DM will start to buffer data to disk locally on the VM using the temporary scratch disk*.  It will buffer it using the quota policy found in configuration.  By default, this allocates 4GB of local disk space to hold diagnostics data.  You can change the quota with a little more work if you need to hold more, but most folks should be served just fine with the default.  Data is buffered as FIFO (first in, first out) in order to age out the oldest data first.
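
The age-out behavior can be pictured with a toy FIFO buffer.  This Python sketch illustrates the policy only; it is not the DM's actual implementation:

```python
from collections import deque

class DiagnosticsBuffer:
    """Toy FIFO buffer: once the quota is exceeded, the oldest records
    are discarded first (a sketch of the age-out policy, not the DM)."""
    def __init__(self, quota_bytes):
        self.quota = quota_bytes
        self.used = 0
        self.records = deque()

    def append(self, record: bytes):
        self.records.append(record)
        self.used += len(record)
        while self.used > self.quota:      # age out oldest data first
            old = self.records.popleft()
            self.used -= len(old)

buf = DiagnosticsBuffer(quota_bytes=10)
for r in [b"aaaa", b"bbbb", b"cccc"]:
    buf.append(r)
# the third record pushed usage past the quota, evicting the oldest
```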

Scheduled versus OnDemand

Once the data is buffering locally on the VM, you need to somehow transfer the data from the VM to your cloud storage account.  You can do this by either setting a Scheduled or OnDemand transfer.  In practice, I tend to recommend always using Scheduled transfers and ignoring the OnDemand option (it ends up being a lot easier). 

But, for completeness, here is an example of setting an OnDemand transfer:

void Main()
{
    var account = new CloudStorageAccount(
        new StorageCredentialsAccountAndKey("dunnry", "yourkey"), true);
    var mgr = new DeploymentDiagnosticManager(account, "6468a8b749a54c3...");
    foreach (string role in mgr.GetRoleNames())
    {
        var ridm = mgr.GetRoleInstanceDiagnosticManagersForRole(role);
        var options = new OnDemandTransferOptions()
        {
            From = DateTime.UtcNow - TimeSpan.FromMinutes(10),
            To = DateTime.UtcNow,
            NotificationQueueName = "pollme"
        };
        var qc = account.CreateCloudQueueClient();
        var q = qc.GetQueueReference("pollme");
        q.CreateIfNotExist();
        foreach (var i in ridm)
        {
            //cancel all pending transfers first
            foreach (var pt in i.GetActiveTransfers())
            {
                i.EndOnDemandTransfer(pt.Value.RequestId);
            }
            var key = i.BeginOnDemandTransfer(DataBufferName.Logs, options);
            //poll here... why bother...
        }
    }
}
It's not exactly straightforward, but essentially, you need to specify the time range to transfer and optionally a queue to notify when completed.  You must ensure that all outstanding OnDemand transfers are canceled and then you can begin the transfer and ideally you should also cancel the transfer when it is completed.  In theory, this gives you some flexibility on what you want transferred.

As with most things in life, there are some gotchas to using this code.  Most of the time, folks forget to cancel the transfer after it completes.  When that happens, it prevents any updates to the affected data source.  This can bite you when, for instance, you try to set new performance counters and instead see an error about a pending OnDemand transfer.  As such, you end up writing a lot of code to detect and cancel pending transfers first before doing anything else in the API.

Using Scheduled transfers ends up being easier in the long run because you end up getting the same amount of data, but without having the pain of remembering to cancel pending transfers and all that.  Here is similar code (you should adapt for each data source you need to transfer):

void Main()
{
    var account = new CloudStorageAccount(
        new StorageCredentialsAccountAndKey("dunnry", "yourkey"), true);
    var mgr = new DeploymentDiagnosticManager(account, "6468a8b749a54c3...");
    foreach (string role in mgr.GetRoleNames())
    {
        var ridm = mgr.GetRoleInstanceDiagnosticManagersForRole(role);
        foreach (var idm in ridm)
        {
            var config = idm.GetCurrentConfiguration()
                ?? DiagnosticMonitor.GetDefaultInitialConfiguration();
            config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);
            //set other scheduled intervals here...
            idm.SetCurrentConfiguration(config);
        }
    }
}

This ends up being the technique we use for AzureOps.  When you set up your subscription with us, we detect the diagnostics connection string and allow you to change your data source settings.  For Performance Counters, we force the transfer interval to 5 minutes (a good compromise) and allow you to choose the interval for other sources (i.e. Traces, Windows Event Logs).  When you use a provider like AzureOps, it is usually best to stream the data in relatively small chunks as opposed to, say, transferring once an hour.  Firstly, we won't be able to do anything with your data until we see it, and you probably want to be notified sooner than once an hour.  Secondly, when you set long transfer periods, there is a risk that you exceed the buffer quota and start to lose data that was never transferred.  In practice, we have not observed any noticeable overhead from transferring more often.  When in doubt, pick 5 minutes.
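
The buffer-quota risk is easy to sanity-check.  This Python sketch compares what a transfer period lets accumulate against the 4 GB default quota; the data rates here are assumptions, so measure your own:

```python
def risks_data_loss(bytes_per_minute, transfer_minutes,
                    quota_bytes=4 * 1024**3):
    """True if the local buffer could fill (and start aging out data)
    before the next scheduled transfer.  Illustrative check only."""
    return bytes_per_minute * transfer_minutes > quota_bytes

# A heavy tracer producing ~100 MB/minute:
safe = risks_data_loss(100 * 1024**2, 5)     # ~500 MB per cycle: fits
risky = risks_data_loss(100 * 1024**2, 60)   # ~5.9 GB per cycle: loses data
```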

Whew!  If you have made it this far, you now have a reasonable set of performance counters and trace information that is both being collected on your VMs in Windows Azure as well as being persisted to your storage account.  So, essentially, you need to now figure out what to do with that data.  That will be the subject of the next post in this series.


*if you are interested, RDP into an instance and check the resource drive (usually C:) under /Resources/Directory/<roleuniquename>/Monitor to see buffered data.

Sunday, February 19, 2012

Choosing What To Monitor In Windows Azure

One of the first questions I often get when onboarding a customer is "What should I be monitoring?".  There is no definitive list, but there are certainly some things that tend to be more useful.  I recommend the following Performance Counters in Windows Azure for all role types at bare minimum:

  • \Processor(_Total)\% Processor Time
  • \Memory\Available Bytes
  • \Memory\Committed Bytes
  • \.NET CLR Memory(_Global_)\% Time in GC

You would be surprised how much these 4 counters tell someone without any other input.  You can see trends over time very clearly when monitoring over weeks that will tell you what is a 'normal' range that your application should be in.  If you start to see any of these counters spike (or spike down in the case of Available Bytes), this should be an indicator to you that something is going on that you should care about.

Web Roles

For ASP.NET applications, there are some additional counters that tend to be pretty useful:

  • \ASP.NET Applications(__Total__)\Requests Total
  • \ASP.NET Applications(__Total__)\Requests/Sec
  • \ASP.NET Applications(__Total__)\Requests Not Authorized
  • \ASP.NET Applications(__Total__)\Requests Timed Out
  • \ASP.NET Applications(__Total__)\Requests Not Found
  • \ASP.NET Applications(__Total__)\Request Error Events Raised
  • \Network Interface(*)\Bytes Sent/sec

If you are using something other than the latest version of .NET, you might need to choose the version specific instances of these counters.  By default, these are going to only work for .NET 4 ASP.NET apps.  If you are using .NET 2 CLR apps (including .NET 3.5), you will want to choose the version specific counters.

The last counter you see in this list is somewhat special as it includes a wildcard instance (*).  This is important to choose in Windows Azure as the name of the actual network adapter instance can (and tends to) change over time and across deployments.  Sometimes it is "Local Area Connection* 12", sometimes it is "Microsoft Virtual Machine Bus Network Adapter".  The latter one tends to be the one that you see most often with data, but just to be sure, I would include them all.  Note, this is not an exhaustive list - if you have custom counters or additional system counters that are meaningful, by all means, include them.  In AzureOps, we can set these remotely on your instances using the property page for your deployment.


Choosing a Sample Rate

You should not need to sample any counter faster than 30 seconds.  Period.  In fact, in 99% of all cases, I would actually recommend 120 seconds (that is the default we recommend in AzureOps).  This might seem like you are losing too much data or that you are going to miss something.  However, experience has shown that this sample rate is more than sufficient to monitor the system over days, weeks, and months with enough resolution to know what is happening in your application.  The difference between 30 seconds and 120 seconds is 4 times as much data.  When you sample at 1 and 5 second sample rates, you are talking about 120x and 24x the amount of data.  That is per instance, by the way.  If you have more than one instance, now multiply that by the number of instances.  It will quickly approach absurd quantities of data that cost you money in transactions and storage, add no analytical value, and are a lot more painful to keep.  Resist the urge to put 1, 5, or even 10 seconds - try 120 seconds to start and tune down if you really need to.
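
The multipliers above fall straight out of the arithmetic.  A quick Python check (the instance and counter counts are made up for illustration):

```python
def samples_per_day(sample_secs):
    """One table row is written per sample, per counter, per instance."""
    return 86_400 // sample_secs

# Relative volume versus the recommended 120-second rate:
assert samples_per_day(30) // samples_per_day(120) == 4    # 4x the data
assert samples_per_day(5) // samples_per_day(120) == 24    # 24x
assert samples_per_day(1) // samples_per_day(120) == 120   # 120x

# Absolute volume: 4 instances x 10 counters at a 1-second rate writes
# 3,456,000 rows per day; the same setup at 120 seconds, only 28,800.
rows_fast = 4 * 10 * samples_per_day(1)
rows_slow = 4 * 10 * samples_per_day(120)
```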


Tracing

The other thing I recommend for our customers is to use tracing in their application.  If you only use the built-in Trace.TraceInformation (and similar), you are ahead of the game.  There is an excellent article on MSDN about how to set up more advanced tracing with TraceSources that I recommend as well.

I recommend using tracing for a variety of reasons.  First, it will definitely help you when your app is running in the cloud and you want to gain insight into issues you see.  If you had logged exceptions to Trace or critical code paths to Trace, then you now have potential insight into your system.  Additionally, you can use this as a type of metric in the system to be mined later.  For instance, you can log the length of time a particular request or operation is taking.  Later, you can pull those logs and analyze what was the bottleneck in your running application.  Within AzureOps, we can parse trace messages in a variety of ways (including semantically).  We use this functionality to alert ourselves when something strange is happening (more on this in a later post).
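
Logging operation durations for later mining can be as simple as a timing wrapper.  Here is a Python sketch of the idea; the post's actual mechanism is .NET's Trace API, and the logger and operation names below are hypothetical:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("myservice")  # hypothetical service logger

def timed(operation):
    """Log how long an operation takes so the traces can be mined
    later for bottlenecks (sketch; the post uses .NET Trace instead)."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                log.info("%s took %.1f ms", operation, elapsed_ms)
        return inner
    return wrap

@timed("process-order")   # hypothetical operation name
def process_order():
    time.sleep(0.01)      # stand-in for real work

process_order()
```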

The biggest obstacle I see with new customers is remembering to turn on transfer for their trace messages.  Luckily, within AzureOps, this is again easy to do.  Simply set a Filter Level and a Transfer Interval (I recommend 5 mins).


The Filter Level will depend a bit on how you use the filtering in your own traces.  I have seen folks that trace rarely, so lower filters are fine.  However, I have also seen customers trace upwards of 500 traces/sec.  As a point of reference, at that level of tracing, you are talking about 2GB of data on the wire each minute if you transfer at that verbosity.  Heavy tracers, beware!  I usually recommend Verbose for light tracers and Warning for heavy tracers that are instrumenting each method.  You can always change this setting later, so don't worry too much right now.

Coming Up

In the next post, I will walk you through how to setup your diagnostics in Windows Azure and point out some common pitfalls that I see.



Monitoring in Windows Azure

For the last year since leaving Microsoft, I have been deeply involved in building a world class SaaS monitoring service called AzureOps.  During this time, it was inevitable that I would see not only how best to monitor services running in Windows Azure, but also the common pitfalls amongst our beta users.  It is one thing to be a Technical Evangelist like I was and occasionally use a service for a demo or two, and quite another to attempt to build a business on it.

Monitoring a running service in Windows Azure can actually be daunting if you have not worked with it before.  In this coming series, I will attempt to share the knowledge we have gained building AzureOps and from our customers.  The series will be grounded in these areas:

  1. Choosing what to monitor in Windows Azure
  2. Getting the diagnostics data from Windows Azure
  3. Interpreting diagnostics data and making adjustments
  4. Maintaining your service in Windows Azure

Each one of these areas will be a post in the series and I will update this post to keep a link to the latest.  I will use AzureOps as an example in some cases to highlight both what we learned as well as the approach we take now due to this experience.

If you are interested in monitoring your own services in Windows Azure, grab an invite and get started today!