Friday, May 25, 2012

Interpreting Diagnostics Data and Making Adjustments

At this point in our diagnostics saga, we have our instances busily pumping out the data we need to manage and monitor our services.  However, it is simply putting the raw data in our storage account(s).  What we really want to do is query and analyze that data to figure out what is happening.

The Basics

Here I am going to show you the basic code for querying your data.  For this, I am going to be using LINQPad.  It is a tool that is invaluable for ad hoc querying and prototyping.  You can cut & paste the following script (hit F4 and add references and namespaces for Microsoft.WindowsAzure.StorageClient.dll and System.Data.Service.Client.dll as well).

void Main()

    var connectionString = "DefaultEndpointsProtocol=https;AccountName=youraccount;AccountKey=yourkey";
    var account = CloudStorageAccount.Parse(connectionString);
    var client = account.CreateCloudTableClient();
    var ctx = client.GetDataServiceContext();
    var deploymentId = new Guid("25d676fb-f031-42b4-aae1-039191156d1a").ToString("N").Dump();
    var q = ctx.CreateQuery<PerfCounter>("WADPerformanceCountersTable")
        .Where(f => f.RowKey.CompareTo(deploymentId) > 0 && f.RowKey.CompareTo(deploymentId + "__|") < 0)
        .Where(f => f.PartitionKey.CompareTo(DateTime.Now.AddHours(-2).GetTicks()) > 0)

    //(q as DataServiceQuery<Foo>).RequestUri.AbsoluteUri.Dump(); 
    //(q as CloudTableQuery<Foo>).Expression.Dump();

static class Funcs
    public static string GetTicks(this DateTime dt)
        return dt.Ticks.ToString("d19");

[System.Data.Services.Common.DataServiceKey("PartitionKey", "RowKey")]
class PerfCounter
    public string PartitionKey { get; set; }
    public string RowKey { get; set; }
    public DateTime Timestamp { get; set; }
    public long EventTickCount { get; set; }
    public string Role { get; set; }
    public string DeploymentId { get; set; }
    public string RoleInstance { get; set; }
    public string CounterName { get; set; }
    public string CounterValue { get; set; }
    public int Level { get; set; }
    public int EventId { get; set; }
    public string Message { get; set; }

What I have done here is setup a simple script that allows me to query the table storage location for performance counters.  There are two big (and 1 little) things to note here:

  1. Notice how I am filtering down to the deployment ID (also called Private ID) of the deployment I am interested in seeing.  If you use same storage account for multiple deployments, this is critical.
  2. Also, see how I have properly formatted the DateTime such that I can select a time range from the Partition Key appropriated.  In this example, I am retrieving the last 2 hours of data for all roles in the selected deployment.
  3. I have also commented out some useful checks you can use to test your filters.  If you uncomment the DataServiceQuery<T> line, you also should comment out the .AsTableServiceQuery() line.

Using the Data

If you haven't set absurd sample rates, you might actually get this data back in a reasonable time.  If you have lots of performance counters to monitor and/or you have high sample rates, be prepared to sit and wait for awhile.  Each tick is a single row in table storage.  You can return 1000 rows in a single IO operation.  It can take a very long time if you ask for large time ranges or have lots of data.

Once you have the query returned, you can actually export it into Excel using LINQPad and go about setting up graphs and pivot tables, etc.  This is all very doable, but also tedious.  I would not recommend this for long term management, but rather some simple point in time reporting perhaps.

For, we went a bit further.  We collect the raw data, compress, and index it for highly efficient searches by time.  We also scale the data for the time range, otherwise you can have a very hard time graphing 20,000 data points.  This makes it very easy to view both recent data (e.g. last few hours) as well as data over months.  The value of the longer term data cannot be overstated.

Anyone that really wants to know what their service has been doing will likely need to invest in monitoring tools or services (e.g.  It is simply impractical to pull more than a few hours of data by querying the WADPeformanceCountersTable directly.  It is way too slow and way too much data for longer term analysis.

The Importance of Long Running Data

For lots of operations, you can just look at the last 2 hours of your data and see how your service has been doing.  We put that view as the default view you see when charting your performance counters in  However, you really should back out the data from time to time and observe larger trends.  Here is an example:


This is actual data we had last year during our early development phase of the backend engine that processes all the data.  This is the Average CPU over 8 hours and it doesn't look too bad.  We really can't infer anything from this graph other than we are using about 15-35% of our CPU most of the time.

However, if we back that data out a bit.:


This picture tells a whole different story.  We realized that we were slowly doing more and more work with our CPU that did not correlate with the load.  This was not a sudden shift that happened in a few hours.  This was manifesting itself over weeks.  Very slow, for the same amount of operations, we were using more CPU.  A quick check on memory told us that we were also chewing up more memory:


We eventually figured out the issue and fixed it (serialization issue, btw) - can you tell where?


Eventually, we determined what our threshold CPU usage should be under certain loads by observing long term trends.  Now, we know that if our CPU spikes above 45% for more than 10 mins, it means something is amiss.  We now alert ourselves when we detect high CPU usage:


Similarly, we do this for many other counters as well.  There is no magic threshold to choose, but if you have enough data you will be able to easily pick out the threshold values for counters in your own application.

In the next post, I will talk about how we pull this data together analyzers, notifications, and automatically scale to meet demand.

Shameless plug:  Interesting in getting your own data from Windows Azure and monitoring, alerting, and scaling?  Try for free!