Sunday, 19 February 2012

Choosing What To Monitor In Windows Azure

One of the first questions I often get when onboarding a customer is "What should I be monitoring?".  There is no definitive list, but there are certainly some things that tend to be more useful.  I recommend the following Performance Counters in Windows Azure for all role types at bare minimum:

  • \Processor(_Total)\% Processor Time
  • \Memory\Available Bytes
  • \Memory\Committed Bytes
  • \.NET CLR Memory(_Global_)\% Time in GC

You would be surprised how much these four counters tell you without any other input.  Monitored over weeks, they show clear trends that tell you what 'normal' range your application should be in.  If any of these counters spikes (or, in the case of Available Bytes, drops sharply), that is an indicator that something is going on that you should care about.
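One way to make a 'normal' range concrete is to derive a band from past samples and flag anything outside it.  The following is only a rough sketch in Python: the function names, the sample values, and the three-standard-deviation threshold are all assumptions chosen for illustration, not something AzureOps prescribes.

```python
from statistics import mean, stdev

def normal_range(history, k=3.0):
    """Return (low, high) as mean +/- k standard deviations of past samples."""
    centre = mean(history)
    spread = stdev(history)
    return centre - k * spread, centre + k * spread

def is_outlier(value, history, k=3.0):
    """True when a new sample falls outside the band derived from history."""
    low, high = normal_range(history, k)
    return value < low or value > high

# Hypothetical % Processor Time samples collected during a quiet period.
cpu_history = [22.0, 25.0, 24.0, 23.0, 26.0, 24.0, 25.0, 23.0]
```

The same check works in both directions, so it would also catch a sharp drop in Available Bytes.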

Web Roles

For ASP.NET applications, there are some additional counters that tend to be pretty useful:

  • \ASP.NET Applications(__Total__)\Requests Total
  • \ASP.NET Applications(__Total__)\Requests/Sec
  • \ASP.NET Applications(__Total__)\Requests Not Authorized
  • \ASP.NET Applications(__Total__)\Requests Timed Out
  • \ASP.NET Applications(__Total__)\Requests Not Found
  • \ASP.NET Applications(__Total__)\Request Error Events Raised
  • \Network Interface(*)\Bytes Sent/sec

By default, these counters only work for ASP.NET apps running on .NET 4.  If you are running on the .NET 2 CLR (which includes .NET 3.5), you will need to choose the version-specific instances of these counters instead.

The last counter you see in this list is somewhat special as it includes a wildcard instance (*).  This is important to choose in Windows Azure as the actual name of the adapter instance can (and tends to) change over time and across deployments.  Sometimes it is "Local Area Connection* 12", sometimes it is "Microsoft Virtual Machine Bus Network Adapter".  The latter tends to be the one that you see most often with data, but just to be sure, I would include them all.

Note, this is not an exhaustive list - if you have custom counters or additional system counters that are meaningful, by all means, include them.  In AzureOps, we can set these remotely on your instances using the property page for your deployment.
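To see why the wildcard matters, here is a small Python sketch that treats the counter specifier as a glob pattern.  This is only an analogy: Windows expands the wildcard itself, and the helper name covered_by is made up for this example.  The two adapter names are the ones quoted above.

```python
from fnmatch import fnmatchcase

WILDCARD_COUNTER = r"\Network Interface(*)\Bytes Sent/sec"

# Adapter names quoted in the post; which one is present varies by deployment.
observed = [
    r"\Network Interface(Local Area Connection* 12)\Bytes Sent/sec",
    r"\Network Interface(Microsoft Virtual Machine Bus Network Adapter)\Bytes Sent/sec",
]

def covered_by(pattern, counter_path):
    """Glob-style check: does the specifier cover this concrete counter path?"""
    return fnmatchcase(counter_path, pattern)

for path in observed:
    print(path, covered_by(WILDCARD_COUNTER, path))
```

A specifier that hard-codes one adapter name would stop matching as soon as a deployment came up with the other name, which is the situation the wildcard avoids.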


Choosing a Sample Rate

You should not need to sample any counter more often than every 30 seconds.  Period.  In fact, in 99% of all cases, I would actually recommend 120 seconds (that is the default we recommend in AzureOps).  This might seem like you are losing too much data or that you are going to miss something.  However, experience has shown that this sample rate is more than sufficient to monitor the system over days, weeks, and months with enough resolution to know what is happening in your application.  Sampling every 30 seconds instead of every 120 seconds produces 4 times as much data.  At 1 and 5 second sample rates, you are talking about 120x and 24x the amount of data you would collect at 120 seconds.  That is per instance, by the way.  If you have more than 1 instance, now multiply that by the number of instances.  It will quickly approach absurd quantities of data that cost you money in transactions and storage, that add no value when you parse them, and that are a lot more pain to keep.  Resist the urge to put 1, 5, or even 10 seconds - try 120 seconds to start and tune down if you really need to.
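The arithmetic is easy to check for your own deployment.  Here is a minimal Python sketch; the counter count of 11 (the four base counters plus the seven web role counters above, ignoring that the wildcard may expand to several instances) and the instance count of 4 are assumptions picked for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def samples_per_day(interval_seconds, counters=1, instances=1):
    """Number of counter samples produced per day at a given sample interval."""
    return (SECONDS_PER_DAY // interval_seconds) * counters * instances

def relative_volume(interval_seconds, baseline_seconds=120):
    """How many times more data an interval produces than the baseline."""
    return baseline_seconds / interval_seconds

# Compare the intervals discussed above for a hypothetical 4-instance role.
for interval in (1, 5, 30, 120):
    total = samples_per_day(interval, counters=11, instances=4)
    print(interval, total, relative_volume(interval))
```

Every one of those samples has to be transferred and stored, which is why the multiplier matters more than it first appears.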


The other thing I recommend for our customers is to use tracing in their application.  Even if you only use the built-in Trace.TraceInformation (and similar), you are ahead of the game.  There is an excellent article on MSDN about how to set up more advanced tracing with TraceSources that I recommend as well.

I recommend using tracing for a variety of reasons.  First, it will definitely help you when your app is running in the cloud and you want to gain insight into issues you see.  If you have logged exceptions or critical code paths to Trace, then you have potential insight into your system.  Additionally, you can use this as a type of metric in the system to be mined later.  For instance, you can log the length of time a particular request or operation is taking.  Later, you can pull those logs and analyze where the bottleneck was in your running application.  Within AzureOps, we can parse trace messages in a variety of ways (including semantically).  We use this functionality to alert ourselves when something strange is happening (more on this in a later post).
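As an illustration of tracing the duration of an operation, here is a small sketch.  It is written in Python with the standard logging module rather than .NET's Trace class, purely to show the pattern; the logger name, the traced helper, and the message format are all made up for this example.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("app.timing")  # hypothetical logger name

@contextmanager
def traced(operation):
    """Log how long the wrapped block took, and log any exception it raises."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        log.exception("operation=%s failed", operation)
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        log.info("operation=%s elapsed_ms=%.1f", operation, elapsed_ms)

# Usage: wrap a code path you may want to analyze later.
# with traced("load_orders"):
#     load_orders()
```

Keeping the message in a fixed key=value shape makes it much easier to mine the logs afterwards.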

The biggest obstacle I see with new customers is remembering to turn on transfer for their trace messages.  Luckily, within AzureOps, this is again easy to do.  Simply set a Filter Level and a Transfer Interval (I recommend 5 mins).


The Filter Level will depend a bit on how you use the filtering in your own traces.  I have seen folks that trace rarely, so lower filters are fine.  However, I have also seen customers trace upwards of 500 traces/sec.  As a point of reference, at that level of tracing, you are talking about 2 GB of data on the wire each minute if you transfer at that verbosity.  Heavy tracers, beware!  I usually recommend Verbose for light tracers and Warning for heavy tracers (those that are instrumenting each method, for instance).  You can always change this setting later, so don't worry too much right now.
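To make the effect of the Filter Level concrete, here is a rough Python sketch of severity filtering.  The level names and their ordering are assumed for illustration, and the transferred function and sample events are hypothetical; the real filtering happens in the diagnostics transfer, not in your code.

```python
from enum import IntEnum

class Level(IntEnum):
    # Lower value = more severe (assumed ordering, for illustration only).
    CRITICAL = 1
    ERROR = 2
    WARNING = 3
    INFORMATION = 4
    VERBOSE = 5

def transferred(events, filter_level):
    """Keep only the events at least as severe as the filter level."""
    return [event for event in events if event[0] <= filter_level]

# Hypothetical trace events from an app that instruments each method.
events = [
    (Level.VERBOSE, "entering Foo()"),
    (Level.VERBOSE, "leaving Foo()"),
    (Level.INFORMATION, "order processed"),
    (Level.WARNING, "retrying storage call"),
    (Level.ERROR, "storage call failed"),
]

print(len(transferred(events, Level.VERBOSE)), len(transferred(events, Level.WARNING)))
```

With a Verbose filter everything goes over the wire; with Warning only the last two events in this sample would be transferred.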

Coming Up

In the next post, I will walk you through how to set up your diagnostics in Windows Azure and point out some common pitfalls that I see.