Monday, 16 April 2012

Getting Diagnostics Data From Windows Azure

Assuming you know what to monitor and have configured your deployments to start monitoring, the next step is to actually get the data and do something with it.

First, let's briefly recap how the Diagnostics Manager (DM) stores data.  Once it has been configured, the DM starts to buffer data to disk locally on the VM using the temporary scratch disk*.  It buffers according to the quota policy found in configuration, which by default allocates 4GB of local disk space to hold diagnostics data.  You can raise the quota with a little more work if you need to hold more (a sketch follows below), but the default serves most folks just fine.  Data is buffered FIFO (first in, first out), so the oldest data is aged out first.
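
Raising the quota takes two coordinated changes: a larger local store declared in ServiceDefinition.csdef and a matching overall quota in the initial diagnostics configuration.  A minimal sketch of what that might look like (the 8GB figure and the OnStart placement are illustrative, not prescriptive):

//sketch: raising the local diagnostics buffer above the 4GB default
//assumes ServiceDefinition.csdef declares a bigger store, e.g.:
//  <LocalResources>
//    <LocalStorage name="DiagnosticStore" sizeInMB="8192" cleanOnRoleRecycle="false" />
//  </LocalResources>
public override bool OnStart()
{
    var config = DiagnosticMonitor.GetDefaultInitialConfiguration();
    config.OverallQuotaInMB = 8000;  //keep a little below the DiagnosticStore size

    DiagnosticMonitor.Start(
        "Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);

    return base.OnStart();
}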

Scheduled versus OnDemand

Once the data is buffering locally on the VM, you need to somehow transfer the data from the VM to your cloud storage account.  You can do this by either setting a Scheduled or OnDemand transfer.  In practice, I tend to recommend always using Scheduled transfers and ignoring the OnDemand option (it ends up being a lot easier). 

But, for completeness, here is an example of setting an OnDemand transfer:

// Requires the Microsoft.WindowsAzure.StorageClient,
// Microsoft.WindowsAzure.Diagnostics, and
// Microsoft.WindowsAzure.Diagnostics.Management assemblies (and their namespaces)

void Main()
{
    var account = new CloudStorageAccount(
        new StorageCredentialsAccountAndKey("dunnry", "yourkey"),
        true
        );
        
    var mgr = new DeploymentDiagnosticManager(account, "6468a8b749a54c3...");
    
    //create the notification queue up front (no need to do it once per role)
    var qc = account.CreateCloudQueueClient();
    var q = qc.GetQueueReference("pollme");
    q.CreateIfNotExist();
    
    foreach (string role in mgr.GetRoleNames())
    {
        var ridm = mgr.GetRoleInstanceDiagnosticManagersForRole(role);
        
        //transfer the last 10 minutes of data and notify the queue on completion
        var options = new OnDemandTransferOptions()
        {
            From = DateTime.UtcNow - TimeSpan.FromMinutes(10),
            To = DateTime.UtcNow,
            NotificationQueueName = "pollme"
        };
        
        foreach (var i in ridm)
        {
            //cancel all pending transfers first, or the new transfer will fail
            foreach (var pt in i.GetActiveTransfers())
            {
                i.CancelOnDemandTransfers(pt.Key);
            }
            
            var key = i.BeginOnDemandTransfer(DataBufferName.Logs, options);
            //poll the notification queue here (sketch below)... why bother...
        }
    }
}

It's not exactly straightforward, but essentially, you specify the time range to transfer and, optionally, a queue to be notified on completion.  You must ensure that all outstanding OnDemand transfers are canceled before you begin a new one, and ideally you should also cancel the transfer once it completes.  In theory, this gives you some flexibility over what gets transferred.
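
If you do wire up the notification queue, you can rehydrate the completion message with OnDemandTransferInfo.FromQueueMessage to see which transfer finished.  A rough sketch of the polling loop (assuming the same "pollme" queue as above):

//sketch: wait for the transfer-complete notification
var queue = account.CreateCloudQueueClient().GetQueueReference("pollme");

CloudQueueMessage msg;
while ((msg = queue.GetMessage()) == null)
{
    Thread.Sleep(TimeSpan.FromSeconds(10));  //crude poll interval
}

var info = OnDemandTransferInfo.FromQueueMessage(msg);
Console.WriteLine("Transfer {0} done for {1}", info.RequestId, info.RoleInstanceId);
queue.DeleteMessage(msg);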

As with most things in life, there are some gotchas with OnDemand transfers.  Most of the time, folks forget to cancel the transfer after it completes.  When that happens, it blocks any updates to the affected data source; you will hit this when, for instance, you try to set new performance counters and get an error about an existing OnDemand transfer.  As such, you end up writing a lot of code to detect and cancel pending transfers before doing anything else with the API.
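
In practice, that boilerplate tends to get factored into a small helper along these lines (a sketch; CancelPendingTransfers is just an illustrative name):

//sketch: clear any active or pending transfers before touching a data source
static void CancelPendingTransfers(RoleInstanceDiagnosticManager ridm)
{
    foreach (var transfer in ridm.GetActiveTransfers())
    {
        ridm.CancelOnDemandTransfers(transfer.Key);
    }
}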

Using Scheduled transfers ends up being easier in the long run: you get the same data, but without the pain of remembering to cancel pending transfers.  Here is similar code (adapt it for each data source you need to transfer):

void Main()
{
    var account = new CloudStorageAccount(
        new StorageCredentialsAccountAndKey("dunnry", "yourkey"),
        true
        );
        
    var mgr = new DeploymentDiagnosticManager(account, "6468a8b749a54c3...");
    
    foreach (string role in mgr.GetRoleNames())
    {
        var ridm = mgr.GetRoleInstanceDiagnosticManagersForRole(role);
        
        foreach (var idm in ridm)
        {
            //use the existing configuration if present, else start from defaults
            var config = idm.GetCurrentConfiguration()
                ?? DiagnosticMonitor.GetDefaultInitialConfiguration();
            config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);
            
            //set other scheduled intervals here...

            idm.SetCurrentConfiguration(config);
        }
    }
}

This ends up being the technique we use for AzureOps.com.  When you set up your subscription with us, we detect the diagnostics connection string and allow you to change your data source settings.  For Performance Counters, we currently force the transfer period to 5 minutes (a good compromise) and allow you to choose the interval for other sources (i.e. Traces, Windows Event Logs).  When you use a provider like AzureOps, it is usually best to stream the data in relatively small chunks as opposed to, say, transferring once an hour.  Firstly, we cannot do anything with your data until we see it, and you probably want to be notified sooner than once an hour.  Secondly, when you set long transfer periods, there is a risk that you exceed the buffer quota and start to lose data that was never transferred.  In practice, we have not observed any noticeable overhead from transferring more often.  When in doubt, pick 5 minutes.
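
For reference, the other data sources hang off the same configuration object, so the "//set other scheduled intervals here..." placeholder above might be filled in roughly like this (the intervals and filters are just examples, not recommendations):

//trace logs: ship every 5 minutes, warnings and above only
config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);
config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Warning;

//windows event logs: subscribe to a channel first, then schedule the transfer
config.WindowsEventLog.DataSources.Add("Application!*");
config.WindowsEventLog.ScheduledTransferPeriod = TimeSpan.FromMinutes(15);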

Whew!  If you have made it this far, you now have a reasonable set of performance counters and trace information being collected on your VMs in Windows Azure and persisted to your storage account.  The next step is to figure out what to do with that data, and that will be the subject of the next post in this series.


*If you are interested, RDP into an instance and check the resource drive (usually C:) under /Resources/Directory/<roleuniquename>/Monitor to see the buffered data.

Thursday, 10 May 2012 09:49:26 (Eastern Daylight Time, UTC-04:00)
Hi Ryan,
Good set of posts, thanks.

I have a question for you - do you have a best practice for cleaning up logs (whether they are performance/IIS/trace logs) from table storage? There seems to be no way in Azure to natively tell it to age out or purge old logs.

Mark.
Thursday, 10 May 2012 10:42:07 (Eastern Daylight Time, UTC-04:00)
Hey Mark,

Thanks. The easiest method I would recommend uses two storage accounts for diagnostics. Just swap from one to the other every X weeks/months/etc. and then delete all the tables in the unused one. That will be cheaper and easier than trying to delete rows and blobs individually. When you get to hundreds of millions of rows, it is actually more expensive to delete them than to keep them, so dropping the entire table in one transaction is the only approach that makes sense. The downside is that the table is deleted asynchronously and can take a long time to disappear entirely, which is why you swap to the other account.
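
In case it helps, the cleanup on the idle account then amounts to enumerating and dropping its tables; a sketch (the connection string is a placeholder for your old diagnostics account):

//sketch: drop every table in the now-idle diagnostics account
var oldAccount = CloudStorageAccount.Parse("<old diagnostics connection string>");
var tableClient = oldAccount.CreateCloudTableClient();

foreach (string table in tableClient.ListTables())
{
    tableClient.DeleteTableIfExist(table);  //the delete completes asynchronously
}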