Tuesday, January 29, 2013

Anatomy of a Scalable Task Scheduler

On 1/18 we quietly released a version of our scalable task scheduler (creatively named 'Scheduler' for right now) to the Windows Azure Store.  If you missed it, you can see it in this post by Scott Guthrie.  The service allows you to schedule recurring tasks using the well-known cron syntax.  Today, we support a simple GET webhook that will notify you each time your cron fires.  However, you can be sure that we are expanding support to more choices, including (authenticated) POST hooks, Windows Azure Queues, and Service Bus Queues, to name a few.

In this post, I want to share a bit about how we designed the service to support many tenants and potentially millions of tasks.  Let's start with a simplified, but accurate overall picture:

[image: high-level architecture diagram of the Scheduler service]

We have several main subsystems in our service (REST API façade, CRON Engine, and Task Engine), as well as several shared subsystems used across additional services (not pictured) such as Monitoring/Auditing and Billing/Usage.  Each one can be scaled independently depending on our load and overall system demand.  We knew that we needed to decouple our subsystems such that they did not depend on each other and could scale independently.  We also wanted to be able to develop each subsystem in isolation without affecting the other subsystems in use.  As such, our subsystems do not communicate with each other directly; they only share a common messaging schema.  All communication is done over queues and asynchronously.

REST API

This is the layer that end users communicate with and the only way to interact with the system (even our portal acts as a client).  We use a shared secret key authentication mechanism where you sign your requests and we validate them as they enter our pipeline.  We implemented this REST API using Web API.  When you interact with the REST API, you are viewing fast, lightweight views of your scheduled task setup that reflect what is stored in our Job Repository.  However, we never query the Job Repository directly, in order to keep it responsive to its real job - providing the source data for the CRON Engine.
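To make that concrete, here is a rough sketch of what signing a request with a shared secret might look like from a client.  This is illustrative only - the header names, canonical string, and algorithm are stand-ins, not our actual wire format:

using System;
using System.Net.Http;
using System.Security.Cryptography;
using System.Text;

static class RequestSigner
{
    // Illustrative only: compute an HMAC over a canonical representation of the
    // request and attach it as a header for the service to validate.
    public static void Sign(HttpRequestMessage request, string keyId, byte[] secretKey)
    {
        string timestamp = DateTimeOffset.UtcNow.ToString("o");
        string canonical = string.Join("\n",
            request.Method.Method,
            request.RequestUri.AbsolutePath,
            timestamp);

        using (var hmac = new HMACSHA256(secretKey))
        {
            string signature = Convert.ToBase64String(
                hmac.ComputeHash(Encoding.UTF8.GetBytes(canonical)));

            request.Headers.Add("x-scheduler-date", timestamp);                      // hypothetical header
            request.Headers.Add("x-scheduler-signature", keyId + ":" + signature);   // hypothetical header
        }
    }
}

The service does the same computation with the tenant's stored secret and rejects the request if the signatures do not match.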

CRON Engine

This subsystem was designed to do as little as possible and farm out the work to the Task Engine.  When you have an engine that evaluates cron expressions and fire times, it cannot get bogged down trying to actually do the work.  This is a potentially IO-intensive role in the system that is constantly evaluating when to fire a particular cron job.  In order to support many tenants, it must be able to run continuously without stalling on execution.  As such, this role only evaluates when a particular cron job must run and then fires a command to the Task Engine to actually execute the potentially long-running job.
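As a rough sketch of that evaluate-and-dispatch loop (assuming a cron parser such as NCrontab and a Windows Azure queue carrying the command - the message shape here is made up, not our actual schema):

using System;
using Microsoft.WindowsAzure.StorageClient;
using NCrontab;   // assumption: any cron parser would do

class CronEvaluator
{
    // Decide whether a job is due and, if so, hand it off to the Task Engine.
    // No actual work happens here - just evaluation and a queued command.
    public void EvaluateAndDispatch(string jobId, string cronExpression,
                                    DateTime lastFiredUtc, CloudQueue taskEngineCommands)
    {
        var schedule = CrontabSchedule.Parse(cronExpression);
        DateTime nextFireUtc = schedule.GetNextOccurrence(lastFiredUtc);

        if (nextFireUtc <= DateTime.UtcNow)
        {
            // Hypothetical command payload; the real messaging schema is shared, not shown here.
            string command = string.Format(
                "{{\"jobId\":\"{0}\",\"fireTime\":\"{1:o}\"}}", jobId, nextFireUtc);
            taskEngineCommands.AddMessage(new CloudQueueMessage(command));
        }
    }
}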

Task Engine

The Task Engine is the grunt of the service and performs the actual work.  It is the layer that will be scaled most dramatically depending on system load.  Commands from the CRON Engine for work are accepted and performed at this layer.  When the work is done, it emits an event that other interested subsystems (like Audit and Billing) can subscribe to downstream.  The emitted event contains details about the outcome of the task and is subsequently denormalized into views that the REST API can query to provide back to a tenant.  This is how we can tell you your job history and report back any errors in execution.  The beauty of the Task Engine emitting events (instead of directly acting) is that we can subscribe many different listeners for a particular event at any time in the future.  In fact, we can orchestrate very complex workflows throughout the system as we communicate with unrelated, but vital, subsystems.  This keeps our system decoupled and allows us to develop those other subsystems in isolation.
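A hypothetical shape for such an event might look like this (the field names are stand-ins; the actual schema is not described here):

using System;

// Hypothetical event emitted by the Task Engine when a job finishes.  Audit,
// Billing, and the job history views could all subscribe to messages like this.
class TaskCompletedEvent
{
    public string JobId { get; set; }
    public string TenantId { get; set; }
    public DateTime StartedUtc { get; set; }
    public DateTime CompletedUtc { get; set; }
    public bool Succeeded { get; set; }
    public int HttpStatusCode { get; set; }   // outcome of the GET webhook call
    public string ErrorDetail { get; set; }   // surfaced later in job history
}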

Future Enhancements

Today we are in beta, intended to give us feedback about the types of jobs, frequency of execution, and what our system baseline performance should look like.  In the future, we know we will support additional types of scheduled tasks, more views into your tasks, and more complex orchestrations.  Additionally, we have set up our infrastructure such that we can deploy to multiple datacenters (and even multiple clouds) for resiliency.  Give us a try today and let us know about your experience.

Thursday, December 20, 2012

Setting ClaimsAuthenticationManager Programmatically in .NET 4.5

This is a quick post today that might save folks the same trouble I had to go through when upgrading my Windows Identity Foundation (WIF) enabled MVC website to the latest version of .NET.  The scenario is that you might want to enrich the claims coming from your STS with additional claims of your choosing.  To do this, there is a common technique of creating a class that derives from ClaimsAuthenticationManager and overrides the Authenticate method.  Consider this sample ClaimsAuthenticationManager:
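Something along these lines (a minimal sketch - the claim type and the exact shape of ITenantRepository are illustrative):

using System.Security.Claims;

// Hypothetical repository used to look up the tenant data behind the extra claims.
public interface ITenantRepository
{
    string GetTenantIdForUser(string userName);
}

public class TenantClaimsAuthenticationManager : ClaimsAuthenticationManager
{
    private readonly ITenantRepository _tenants;

    public TenantClaimsAuthenticationManager(ITenantRepository tenants)
    {
        _tenants = tenants;
    }

    public override ClaimsPrincipal Authenticate(string resourceName, ClaimsPrincipal incomingPrincipal)
    {
        if (incomingPrincipal != null && incomingPrincipal.Identity.IsAuthenticated)
        {
            // Enrich the incoming principal with an additional claim from our own store.
            var identity = (ClaimsIdentity)incomingPrincipal.Identity;
            string tenantId = _tenants.GetTenantIdForUser(identity.Name);
            identity.AddClaim(new Claim("http://schemas.example.com/claims/tenantid", tenantId));
        }

        return base.Authenticate(resourceName, incomingPrincipal);
    }
}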

The issue we have is that we need to provide an implementation of ITenantRepository here in order to look up the data for the additional claims we are adding.  If you are lucky enough to find the article on MSDN, it will show you how to wire in a custom ClaimsAuthenticationManager using the web.config.  I don't want to hardcode references to an implementation of my TenantRepository, so using config is not a great option for me.

In the older WIF model (Microsoft.IdentityModel) for .NET <= 4.0, you hooked the ServiceConfigurationCreated event:
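It looked roughly like this (a sketch; TenantRepository stands in for whatever concrete implementation you inject):

// Global.asax.cs - WIF for .NET <= 4.0 (Microsoft.IdentityModel) sketch
using Microsoft.IdentityModel.Web;

public class MvcApplication : System.Web.HttpApplication
{
    protected void Application_Start()
    {
        FederatedAuthentication.ServiceConfigurationCreated += (sender, e) =>
        {
            // Swap in the custom manager with its dependency instead of using config.
            e.ServiceConfiguration.ClaimsAuthenticationManager =
                new TenantClaimsAuthenticationManager(new TenantRepository());
        };
    }
}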

But, in .NET 4.5, all of the namespaces and a lot of the classes are updated (System.IdentityModel).  It took me a long time in Reflector to figure out how to hook the configuration being created again.  Turns out you need to reference System.IdentityModel.Services and find the FederatedAuthentication class.  Here you go:
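The equivalent hook in .NET 4.5 is the FederationConfigurationCreated event (again a sketch; TenantRepository stands in for your own implementation):

// Global.asax.cs - .NET 4.5 (System.IdentityModel.Services) sketch
using System.IdentityModel.Services;

public class MvcApplication : System.Web.HttpApplication
{
    protected void Application_Start()
    {
        FederatedAuthentication.FederationConfigurationCreated += (sender, e) =>
        {
            e.FederationConfiguration.IdentityConfiguration.ClaimsAuthenticationManager =
                new TenantClaimsAuthenticationManager(new TenantRepository());
        };
    }
}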

Happy WIF-ing.

Wednesday, October 10, 2012

Next Stop: Aditi Technologies

I am excited to announce that I have officially joined Aditi Technologies as Director of Product Services.  Taking what I have learned building large scale solutions in Windows Azure, I will be responsible for building Aditi's own portfolio of SaaS services and IP/frameworks.  We have a number of exciting projects underway and I hope to blog more about what we are building soon.

Along with this move, I get to rejoin Wade (now my boss!) and Steve as well as some of my former Cumulux colleagues.  I took this role because I see a great opportunity to build software and services in the 'cloud' and I am convinced that Aditi has been making the right investments.  It doesn't hurt at all that I get to work with top-notch technologists either.

Along the way, I plan to build a team to deliver on these cloud services.  If you think you have what it takes to build great software, send me a note and your resume.  Thanks!

 


Friday, May 25, 2012

Interpreting Diagnostics Data and Making Adjustments

At this point in our diagnostics saga, we have our instances busily pumping out the data we need to manage and monitor our services.  However, it is simply putting the raw data in our storage account(s).  What we really want to do is query and analyze that data to figure out what is happening.

The Basics

Here I am going to show you the basic code for querying your data.  For this, I am going to be using LINQPad.  It is a tool that is invaluable for ad hoc querying and prototyping.  You can cut & paste the following script (hit F4 and add references and namespaces for Microsoft.WindowsAzure.StorageClient.dll and System.Data.Services.Client.dll as well).

void Main()
{

    var connectionString = "DefaultEndpointsProtocol=https;AccountName=youraccount;AccountKey=yourkey";
    var account = CloudStorageAccount.Parse(connectionString);
    var client = account.CreateCloudTableClient();
     
    var ctx = client.GetDataServiceContext();
    
    // Deployment ID (a.k.a. Private ID), formatted to match the RowKey prefix
    var deploymentId = new Guid("25d676fb-f031-42b4-aae1-039191156d1a").ToString("N").Dump();
    
    var q = ctx.CreateQuery<PerfCounter>("WADPerformanceCountersTable")
        // restrict to rows belonging to this deployment only
        .Where(f => f.RowKey.CompareTo(deploymentId) > 0 && f.RowKey.CompareTo(deploymentId + "__|") < 0)
        // restrict to the last 2 hours using the tick-formatted PartitionKey
        .Where(f => f.PartitionKey.CompareTo(DateTime.Now.AddHours(-2).GetTicks()) > 0)
        //.Take(1)
        .AsTableServiceQuery()
        .Dump();

    //(q as DataServiceQuery<PerfCounter>).RequestUri.AbsoluteUri.Dump(); 
    //(q as CloudTableQuery<PerfCounter>).Expression.Dump();
}

static class Funcs
{
    public static string GetTicks(this DateTime dt)
    {
        return dt.Ticks.ToString("d19");
    }
}

[System.Data.Services.Common.DataServiceKey("PartitionKey", "RowKey")]
class PerfCounter
{
    public string PartitionKey { get; set; }
    public string RowKey { get; set; }
    public DateTime Timestamp { get; set; }
    public long EventTickCount { get; set; }
    public string Role { get; set; }
    public string DeploymentId { get; set; }
    public string RoleInstance { get; set; }
    public string CounterName { get; set; }
    public string CounterValue { get; set; }
    public int Level { get; set; }
    public int EventId { get; set; }
    public string Message { get; set; }
}

What I have done here is set up a simple script that allows me to query the table storage location for performance counters.  There are two big (and one little) things to note here:

  1. Notice how I am filtering down to the deployment ID (also called the Private ID) of the deployment I am interested in seeing.  If you use the same storage account for multiple deployments, this is critical.
  2. Also, see how I have properly formatted the DateTime such that I can select a time range from the PartitionKey appropriately.  In this example, I am retrieving the last 2 hours of data for all roles in the selected deployment.
  3. I have also commented out some useful checks you can use to test your filters.  If you uncomment the DataServiceQuery<T> line, you should also comment out the .AsTableServiceQuery() line.

Using the Data

If you haven't set absurd sample rates, you might actually get this data back in a reasonable time.  If you have lots of performance counters to monitor and/or high sample rates, be prepared to sit and wait for a while.  Each sample is a single row in table storage, and you can return at most 1000 rows in a single IO operation.  It can take a very long time if you ask for large time ranges or have lots of data.

Once you have the query returned, you can actually export it into Excel using LINQPad and go about setting up graphs and pivot tables, etc.  This is all very doable, but also tedious.  I would not recommend this for long term management, but rather some simple point in time reporting perhaps.

For AzureOps.com, we went a bit further.  We collect the raw data, compress, and index it for highly efficient searches by time.  We also scale the data for the time range, otherwise you can have a very hard time graphing 20,000 data points.  This makes it very easy to view both recent data (e.g. last few hours) as well as data over months.  The value of the longer term data cannot be overstated.
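The downsampling itself is nothing exotic.  A rough illustration of the idea (not our actual implementation) is to bucket the samples by a fixed interval and average within each bucket:

using System;
using System.Collections.Generic;
using System.Linq;

static class Downsampler
{
    // Illustrative only: collapse raw samples into fixed-width time buckets,
    // averaging within each bucket, so a months-long range charts with a
    // manageable number of points.
    public static IEnumerable<KeyValuePair<DateTime, double>> Downsample(
        IEnumerable<KeyValuePair<DateTime, double>> samples, TimeSpan bucket)
    {
        return samples
            .GroupBy(s => new DateTime((s.Key.Ticks / bucket.Ticks) * bucket.Ticks))
            .OrderBy(g => g.Key)
            .Select(g => new KeyValuePair<DateTime, double>(g.Key, g.Average(s => s.Value)));
    }
}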

Anyone that really wants to know what their service has been doing will likely need to invest in monitoring tools or services (e.g. AzureOps.com).  It is simply impractical to pull more than a few hours of data by querying the WADPerformanceCountersTable directly.  It is way too slow and way too much data for longer term analysis.

The Importance of Long Running Data

For lots of operations, you can just look at the last 2 hours of your data and see how your service has been doing.  We put that view as the default view you see when charting your performance counters in AzureOps.com.  However, you really should back out the data from time to time and observe larger trends.  Here is an example:

[chart: average CPU over 8 hours]

This is actual data we had last year during our early development phase of the backend engine that processes all the data.  This is the Average CPU over 8 hours and it doesn't look too bad.  We really can't infer anything from this graph other than we are using about 15-35% of our CPU most of the time.

However, if we back that data out a bit:

[chart: average CPU over several weeks]

This picture tells a whole different story.  We realized that we were slowly doing more and more work with our CPU that did not correlate with the load.  This was not a sudden shift that happened in a few hours.  This was manifesting itself over weeks.  Very slowly, for the same amount of operations, we were using more CPU.  A quick check on memory told us that we were also chewing up more memory:

[chart: memory usage climbing over the same period]

We eventually figured out the issue and fixed it (serialization issue, btw) - can you tell where?

[chart: CPU usage dropping back down after the fix]

Eventually, we determined what our threshold CPU usage should be under certain loads by observing long term trends.  Now, we know that if our CPU spikes above 45% for more than 10 mins, it means something is amiss.  We now alert ourselves when we detect high CPU usage:

[screenshot: high CPU alert in AzureOps]

Similarly, we do this for many other counters as well.  There is no magic threshold to choose, but if you have enough data you will be able to easily pick out the threshold values for counters in your own application.

In the next post, I will talk about how we pull this data together with analyzers and notifications, and automatically scale to meet demand.

Shameless plug:  Interested in getting your own data from Windows Azure and monitoring, alerting, and scaling?  Try AzureOps.com for free!

Monday, April 16, 2012

Getting Diagnostics Data From Windows Azure

Assuming you know what to monitor and you have configured your deployments to start monitoring, now you need to actually get the data and do something with it.

First, let's briefly recap how the Diagnostics Manager (DM) stores data.  Once it has been configured, the DM will start to buffer data to disk locally on the VM using the temporary scratch disk*.  It will buffer it using the quota policy found in configuration.  By default, this allocates 4GB of local disk space to hold diagnostics data.  You can change the quota with a little more work if you need to hold more, but most folks should be served just fine with the default.  Data is buffered as FIFO (first in, first out) in order to age out the oldest data first.
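If you do decide you need a bigger buffer, the change looks roughly like this (a sketch - the 8GB figure is just an example, and you also have to allocate a correspondingly larger DiagnosticStore local resource in your service definition):

using Microsoft.WindowsAzure.Diagnostics;

public class DiagnosticsBootstrap
{
    public static void Configure()
    {
        // Sketch only: raise the local buffer quota above the 4GB default.
        DiagnosticMonitorConfiguration config = DiagnosticMonitor.GetDefaultInitialConfiguration();
        config.OverallQuotaInMB = 8192;   // example value, not a recommendation

        DiagnosticMonitor.Start(
            "Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);
    }
}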

Scheduled versus OnDemand

Once the data is buffering locally on the VM, you need to somehow transfer the data from the VM to your cloud storage account.  You can do this by either setting a Scheduled or OnDemand transfer.  In practice, I tend to recommend always using Scheduled transfers and ignoring the OnDemand option (it ends up being a lot easier). 

But, for completeness, here is an example of setting an OnDemand transfer:

void Main()
{
    var account = new CloudStorageAccount(
        new StorageCredentialsAccountAndKey("dunnry", "yourkey"),
        true
        );
        
    var mgr = new DeploymentDiagnosticManager(account, "6468a8b749a54c3...");
    
    foreach (string role in mgr.GetRoleNames())
    {
        var ridm = mgr.GetRoleInstanceDiagnosticManagersForRole(role);
        
        var options = new OnDemandTransferOptions()
        {
            From = DateTime.UtcNow - TimeSpan.FromMinutes(10),
            To = DateTime.UtcNow,
            NotificationQueueName = "pollme"
        };
        
        var qc = account.CreateCloudQueueClient();
        var q = qc.GetQueueReference("pollme");
        q.CreateIfNotExist();
        
        foreach (var i in ridm)
        {
            //cancel all pending transfers
            foreach (var pt in i.GetActiveTransfers())
            {
                i.CancelOnDemandTransfers(pt.Key);
            }
            
            var key = i.BeginOnDemandTransfer(DataBufferName.Logs, options);
            //poll here... why bother...
        }
    }
}

It's not exactly straightforward, but essentially, you need to specify the time range to transfer and optionally a queue to notify when completed.  You must ensure that all outstanding OnDemand transfers are canceled and then you can begin the transfer and ideally you should also cancel the transfer when it is completed.  In theory, this gives you some flexibility on what you want transferred.

As with most things in life, there are some gotchas to using this code.  Most of the time, folks forget to cancel the transfer after it completes.  When that happens, it prevents any updates to the affected data source.  This can impact you when, for instance, you try to set new performance counters and see an error about a pending OnDemand transfer.  As such, you end up writing a lot of code to detect and cancel pending transfers before doing anything else in the API.

Using Scheduled transfers ends up being easier in the long run because you end up getting the same amount of data, but without having the pain of remembering to cancel pending transfers and all that.  Here is similar code (you should adapt for each data source you need to transfer):

void Main()
{
    var account = new CloudStorageAccount(
        new StorageCredentialsAccountAndKey("dunnry", "yourkey"),
        true
        );
        
    var mgr = new DeploymentDiagnosticManager(account, "6468a8b749a54c3...");
    
    foreach (string role in mgr.GetRoleNames())
    {
        var ridm = mgr.GetRoleInstanceDiagnosticManagersForRole(role);
        
        foreach (var idm in ridm)
        {
            var config = idm.GetCurrentConfiguration()
                ?? DiagnosticMonitor.GetDefaultInitialConfiguration();
            config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);
            
            //set other scheduled intervals here...

            idm.SetCurrentConfiguration(config);
        }
    }
}

This ends up being the technique we use for AzureOps.com.  When you set up your subscription with us, we detect the diagnostics connection string and allow you to change your data source settings.  For Performance Counters, we force the transfer period to 5 minutes (a good compromise) and allow you to choose the interval for other sources (i.e. Traces, Windows Event Logs).  When you use a provider like AzureOps, it is usually best to stream the data in relatively small chunks as opposed to, say, transferring once an hour.  Firstly, we won't be able to do anything with your data until we see it, and you probably want to be notified sooner than once an hour.  Secondly, when you set long transfer periods, there is a risk that you exceed the buffer quota and start to lose data that was never transferred.  In practice, we have not observed any noticeable overhead from transferring more often.  When in doubt, pick 5 mins.

Whew!  If you have made it this far, you now have a reasonable set of performance counters and trace information that is both being collected on your VMs in Windows Azure as well as being persisted to your storage account.  So, essentially, you need to now figure out what to do with that data.  That will be the subject of the next post in this series.

 

*if you are interested, RDP into an instance and check the resource drive (usually C:) under /Resources/Directory/<roleuniquename>/Monitor to see buffered data.

Monday, February 27, 2012

Setting Up Diagnostics Monitoring In Windows Azure

In order to actually monitor anything in Windows Azure, you need to use the Diagnostics Manager (DM) that ships out of box.  SaaS providers like AzureOps rely on this data in order to tell you how your system is behaving.  The DM actually supports a few data sources that it can collect and transfer:

  • Performance Counters
  • Trace logs
  • IIS Logs
  • Event Logs
  • Infrastructure Logs
  • Arbitrary logs

One of the most common issues I hear from customers is that they don't know how to get started using the DM, or they think they are using it and just cannot find the data where they think it should be.  Hopefully, this post will clear up a bit about how the DM actually works and how to configure it.  The next post will talk about how to get the data once you are set up.

Setting Up the Diagnostics Manager

Everything starts by checking a box.  When you check the little box in Visual Studio that says "Enable Diagnostics", it actually modifies your Service Definition to include a role plugin.  Role plugins are little modules that add to your definition and configuration, similar to a macro.  If you have ever used Diagnostics or the RDP capability in Windows Azure, you have used a role plugin.  For most of their history, these plugins have been exclusively built by Microsoft, but there is really nothing stopping you from using the mechanism yourself (that is another topic).

[screenshots: enabling diagnostics in Visual Studio]

If you check the 'diagnostics' folder in the plugins directory of the SDK folder, you will actually find the magic that is used to launch the DM.

<?xml version="1.0" ?>
<RoleModule 
  xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition"
  namespace="Microsoft.WindowsAzure.Plugins.Diagnostics">
  <Startup priority="-2">
    <Task commandLine="DiagnosticsAgent.exe" executionContext="limited" taskType="background" />
    <Task commandLine="DiagnosticsAgent.exe /blockStartup" executionContext="limited" taskType="simple" />
  </Startup>
  <ConfigurationSettings>
    <Setting name="ConnectionString" />
  </ConfigurationSettings>
</RoleModule>

Here, we can see that the DM is implemented as a pair of startup tasks.  Notice that it uses a task type of background (the other task blocks startup until the agent gets going).  This means that the DM exists outside the code you write and should be impervious to your code crashing and taking it down.  You can also see that the startup tasks listed here will run with a priority of -2.  This just tries to ensure that they run before any other startup tasks.  The idea is that you want the DM to start before other stuff so it can collect data for you.

You can also see in the definition the declaration of a new ConfigurationSettings section with a single Setting called 'ConnectionString'.  If you are using Visual Studio when you import the Diagnostics plugin, the tooling automatically combines the namespace with the setting name and creates a new Setting called Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString.  This setting will not exist if you are building your csdef or cscfg files by hand.  You must remember to include it.

Once you have the plugin actually enabled in your solution, you will need to specify a valid connection string in order for the DM to operate.  Here you have two choices:

  1. Running in emulation, it is valid to use "UseDevelopmentStorage=true" as ConnectionString.
  2. Before deploying to cloud, you must remember to update that to a valid storage account (i.e. "DefaultEndpointsProtocol=https;AccountName=youraccount;AccountKey=nrIXB.")

Common Pitfalls

It seems simple enough, but here come the first set of common pitfalls I see:

  1. Forgetting to set the ConnectionString to a valid storage account and deploying with 'UseDevelopmentStorage=true'.  This has become less of a factor in 1.6+ SDK tooling because you will notice the checkbox that says, "Use publish storage account as connection string when you publish to Windows Azure".  However, tooling will not help you here for automated deploys or when you forget to check that box.
  2. Using "DefaultEndpointsProtocol=http" in the connection string (note the missing 's' from 'https').  While it is technically possible to use the DM with an http connection, it is not worth the hassle.  Just use https and save yourself the hassle of troubleshooting this later.
  3. Setting an invalid connection string.  Hard to believe, but I see it all the time now on AzureOps.  This usually falls into two categories: deleting a storage account, and regenerating a storage key.  If you delete a storage account, but forget to remove it as the ConnectionString, things won't work (shocking, I know).  Further, if you decide to regenerate the primary or secondary storage keys and you were using them, things won't work here either.  Seems obvious, but you won't actually get any warning on this - stuff just won't work and you will have to figure that out yourself.  A good 3rd party provider (like AzureOps) will let you know, however.
  4. Forgetting to co-locate the diagnostics storage account with the hosted service.  This one might not show itself until you see the bill.  The diagnostics agent can be pretty chatty.  I have seen GBs of data logged in a single minute.  Forgetting to co-locate that would run you a pretty hefty bandwidth bill in addition to slowing you down.

Best Practices

Setting up the Diagnostics Manager is not terribly hard, but it is easy to get wrong if you are not familiar with it.  There are some other subtle things you can do here that will shoot you in the foot, however.  Here are some things you can do that will make your life easier:

  1. Always separate your diagnostics storage account from other storage accounts.  This is especially important for production systems.  You do not want diagnostics competing with your primary storage account for resources.  There is an account-wide limit of 5,000 transactions per second across tables, queues, and blobs.  When you use a single storage account for both, you could unintentionally throttle your production account.
  2. If possible, use a different diagnostics storage account per hosted service.  If that is not practical, at least try to separate storage accounts for production versus non-production systems.  It turns out that querying diagnostics data can be difficult if there are many different systems logging to the same diagnostics tables.  What I have seen many times is someone using the same diagnostics account for load testing non-production systems as for their production system.  The amount of data for the non-production system can greatly exceed that of the production system, and querying for production data then becomes akin to finding a needle in a haystack.  It can take a very long time in some cases to query even for simple things.
  3. Don't use the Anywhere location.  This applies to all storage accounts and all hosted services.  This might seem obvious, but I see it all the time.  It is possible to use the Anywhere location with affinity groups and avoid pitfall #4, but it is not worth the hassle.  Additionally, if you have a 3rd party (like AzureOps) that is monitoring your data, we cannot geo-locate a worker next to you to pull your data.  We won't know where you are located and it could mean big bandwidth bills for you.

At this point, if you have enabled the DM, and remembered to set a valid connection string, you are almost home.  The last thing to do is actually get the data and avoid common pitfalls there.  That is the topic for the next post.

Sunday, February 19, 2012

Choosing What To Monitor In Windows Azure

One of the first questions I often get when onboarding a customer is "What should I be monitoring?".  There is no definitive list, but there are certainly some things that tend to be more useful.  I recommend the following Performance Counters in Windows Azure for all role types at a bare minimum:

  • \Processor(_Total)\% Processor Time
  • \Memory\Available Bytes
  • \Memory\Committed Bytes
  • \.NET CLR Memory(_Global_)\% Time in GC

You would be surprised how much these 4 counters tell someone without any other input.  You can see trends over time very clearly when monitoring over weeks, which will tell you what a 'normal' range for your application is.  If you start to see any of these counters spike (or dip, in the case of Available Bytes), this should be an indicator to you that something is going on that you should care about.

Web Roles

For ASP.NET applications, there are some additional counters that tend to be pretty useful:

  • \ASP.NET Applications(__Total__)\Requests Total
  • \ASP.NET Applications(__Total__)\Requests/Sec
  • \ASP.NET Applications(__Total__)\Requests Not Authorized
  • \ASP.NET Applications(__Total__)\Requests Timed Out
  • \ASP.NET Applications(__Total__)\Requests Not Found
  • \ASP.NET Applications(__Total__)\Request Error Events Raised
  • \Network Interface(*)\Bytes Sent/sec

If you are using something other than the latest version of .NET, you might need to choose the version-specific instances of these counters.  By default, these are only going to work for .NET 4 ASP.NET apps.  If you are running .NET 2 CLR apps (including .NET 3.5), you will want to choose the version-specific counters.

The last counter you see in this list is somewhat special as it includes a wildcard instance (*).  This is important to use in Windows Azure as the name of the actual network adapter instance can (and tends to) change over time and across deployments.  Sometimes it is "Local Area Connection* 12", sometimes it is "Microsoft Virtual Machine Bus Network Adapter".  The latter tends to be the one that you see most often with data, but just to be sure, I would include them all.  Note, this is not an exhaustive list - if you have custom counters or additional system counters that are meaningful, by all means, include them.  In AzureOps, we can set these remotely on your instances using the property page for your deployment.

[screenshot: AzureOps deployment property page for setting performance counters]

Choosing a Sample Rate

You should not need to sample any counter faster than 30 seconds.  Period.  In fact, in 99% of all cases, I would actually recommend 120 seconds (that is the default we recommend in AzureOps).  This might seem like you are losing too much data or that you are going to miss something.  However, experience has shown that this sample rate is more than sufficient to monitor the system over days, weeks, and months with enough resolution to know what is happening in your application.  The difference between 30 seconds and 120 seconds is 4 times as much data.  When you sample at 1 and 5 second rates, you are talking about 120x and 24x the amount of data.  That is per instance, by the way.  If you have more than 1 instance, now multiply that by the number of instances.  It will quickly approach absurd quantities of data that cost you money in transactions and storage, add no analytical value, and are a lot more painful to keep.  Resist the urge to use 1, 5, or even 10 seconds - try 120 seconds to start and tune down if you really need to.
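Pulling the counter list and the sample rate together, a sketch of how you might register these with the diagnostics configuration (whether in code at startup or remotely, as we do in AzureOps) could look like this:

using System;
using Microsoft.WindowsAzure.Diagnostics;

public class CounterSetup
{
    // Sketch: register the baseline counters at a 120 second sample rate and
    // transfer them every 5 minutes.  Counter paths come from the lists above.
    public static void Configure(DiagnosticMonitorConfiguration config)
    {
        string[] counters =
        {
            @"\Processor(_Total)\% Processor Time",
            @"\Memory\Available Bytes",
            @"\Memory\Committed Bytes",
            @"\.NET CLR Memory(_Global_)\% Time in GC",
            @"\ASP.NET Applications(__Total__)\Requests/Sec",
            @"\Network Interface(*)\Bytes Sent/sec"
        };

        foreach (string counter in counters)
        {
            config.PerformanceCounters.DataSources.Add(new PerformanceCounterConfiguration
            {
                CounterSpecifier = counter,
                SampleRate = TimeSpan.FromSeconds(120)
            });
        }

        config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);
    }
}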

Tracing

The other thing I recommend for our customers is to use tracing in their application.  If you only use the built-in Trace.TraceInformation (and similar), you are ahead of the game.  There is an excellent article in MSDN about how to setup more advanced tracing with TraceSources that I recommend as well.

I recommend using tracing for a variety of reasons.  First, it will definitely help you when your app is running in the cloud and you want to gain insight into issues you see.  If you had logged exceptions to Trace or critical code paths to Trace, then you now have potential insight into your system.  Additionally, you can use this as a type of metric in the system to be mined later.  For instance, you can log the length of time a particular request or operation takes.  Later, you can pull those logs and analyze what the bottleneck was in your running application.  Within AzureOps, we can parse trace messages in a variety of ways (including semantically).  We use this functionality to alert ourselves when something strange is happening (more on this in a later post).
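For example, a critical code path might be instrumented something like this (a sketch - the message format is arbitrary, as long as you can parse it later):

using System;
using System.Diagnostics;

public class OrderProcessor
{
    // Sketch: time a critical operation and write the result to Trace so the DM
    // can transfer it and it can be mined later.
    public void ProcessOrder(string orderId)
    {
        var timer = Stopwatch.StartNew();
        try
        {
            // ... the actual work ...
            Trace.TraceInformation("ProcessOrder {0} completed in {1} ms",
                orderId, timer.ElapsedMilliseconds);
        }
        catch (Exception ex)
        {
            Trace.TraceError("ProcessOrder {0} failed after {1} ms: {2}",
                orderId, timer.ElapsedMilliseconds, ex);
            throw;
        }
    }
}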

The biggest obstacle I see with new customers is remembering to turn on transfer for their trace messages.  Luckily, within AzureOps, this is again easy to do.  Simply set a Filter Level and a Transfer Interval (I recommend 5 mins).

[screenshot: AzureOps trace settings showing Filter Level and Transfer Interval]

The Filter Level will depend a bit on how you use filtering in your own traces.  I have seen folks that trace rarely, so lower filters are fine.  However, I have also seen customers trace upwards of 500 traces/sec.  As a point of reference, at that level of tracing, you are talking about 2GB of data on the wire each minute if you transfer at that verbosity.  Heavy tracers, beware!  I usually recommend Verbose for light tracers and Warning for tracers that instrument each method, for instance.  You can always change this setting later, so don't worry too much right now.
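In terms of the diagnostics API, the Filter Level and Transfer Interval map to two settings on the log buffer configuration (a sketch):

using System;
using Microsoft.WindowsAzure.Diagnostics;

public class TraceTransferSetup
{
    public static void Configure(DiagnosticMonitorConfiguration config)
    {
        // Warning filter for heavy tracers; use LogLevel.Verbose if you trace lightly.
        config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Warning;
        config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);
    }
}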

Coming Up

In the next post, I will walk you through how to setup your diagnostics in Windows Azure and point out some common pitfalls that I see.

 

 

Monitoring in Windows Azure

For the last year since leaving Microsoft, I have been deeply involved in building a world class SaaS monitoring service called AzureOps.  During this time, it was inevitable that I would see not only how best to monitor services running in Windows Azure, but also the common pitfalls amongst our beta users.  It is one thing to be a Technical Evangelist like I was and occasionally use a service for a demo or two, and quite another to attempt to build a business on it.

Monitoring a running service in Windows Azure can actually be daunting if you have not worked with it before.  In this coming series, I will attempt to share the knowledge we have gained building AzureOps and from our customers.  The series will be grounded in these areas:

  1. Choosing what to monitor in Windows Azure.
  2. Getting the diagnostics data from Windows Azure.
  3. Interpreting diagnostics data and making adjustments.
  4. Maintaining your service in Windows Azure.

Each one of these areas will be a post in the series and I will update this post to keep a link to the latest.  I will use AzureOps as an example in some cases to highlight both what we learned as well as the approach we take now due to this experience.

If you are interested in monitoring your own services in Windows Azure, grab an invite and get started today!

Wednesday, August 24, 2011

Handling Continuation Tokens in Windows Azure - Gotcha

I spent the last few hours debugging an issue where a query in Windows Azure table storage was not returning any results, even though I knew that data was there.  It didn't start that way of course.  Rather, stuff that should have been working and previously was working, just stopped working.  Tracing through the code and debugging showed me it was a case of a method not returning data when it should have.

Now, I have known for quite some time that you must handle continuation tokens and can never assume that a query will always return data (Steve talks about it waaaay back when here).  However, what I did not know was that different methods of enumeration will give you different results.  Let me explain by showing the code.

var q = this.CreateQuery()
    .Where(filter)
    .Where(f => f.PartitionKey.CompareTo(start.GetTicks()) > 0)
    .Take(1)
    .AsTableServiceQuery();
var first = q.FirstOrDefault();
if (first != null)
{
    return new DateTime(long.Parse(first.PartitionKey));
}

In this scenario, you would assume that you have continuation tokens nailed because you have the magical AsTableServiceQuery extension method in use.  It will magically chase the tokens until conclusion for you.  However, this code does not work!  It will actually return null in cases where you do not hit the partition server that holds your query results on the first try.

I could easily reproduce the query in LINQPad:

var q = ctx.CreateQuery<Foo>("WADPerformanceCountersTable")
    .Where(f => f.RowKey.CompareTo("9232a4ca79344adf9b1a942d37deb44a") > 0 && f.RowKey.CompareTo("9232a4ca79344adf9b1a942d37deb44a__|") < 0)
    .Where(f => f.PartitionKey.CompareTo(DateTime.Now.AddDays(-30).GetTicks()) > 0)
    .Take(1)
    .AsTableServiceQuery()
    .Dump();    

Yet, this query worked perfectly.  I got exactly 1 result as I expected.  I was pretty stumped for a bit, then I realized what was happening.  You see FirstOrDefault will not trigger the enumeration required to generate the necessary two round-trips to table storage (first one gets continuation token, second gets results).  It just will not force the continuation token to be chased.  Pretty simple fix it turns out:

var first = q.AsEnumerable().SingleOrDefault();

Hours wasted for that one simple line fix.  Hope this saves someone the pain I just went through.

Thursday, July 14, 2011

How to Diagnose Windows Azure Error Attaching Debugger Errors

I was working on a Windows Azure website solution the other day and suddenly started getting this error when I tried to run the site with a debugger:

[screenshot: error attaching the debugger dialog]

This error is one of the hardest to diagnose.  Typically, it means that there is something crashing in your website before the debugger can attach.  A good candidate to check is your global.asax to see if you have changed anything there.  I knew that the global.asax had not been changed, so it was puzzling.  Naturally, I took the normal course of action:

  1. Run the website without debug inside the emulator.
  2. Run the website with and without debugging outside the emulator.
  3. Try it on another machine.

None of these methods gave me any clue what the issue was, as they all worked perfectly fine.  It was killing me that it only happened when debugging inside the emulator and only on 1 machine (the one I really wanted to work).  I was desperately looking for a solution that did not involve rebuilding the machine.  I turned on Sysinternals' DebugView to see if there were some debug messages telling me what the issue was.  I saw an interesting number of things, but nothing that really stood out as the source of the error.  However, I did notice the process ID of what appeared to be the process reporting errors:

[screenshot: DebugView output showing the process ID]

Looking at Process Explorer, I found this was for DFAgent.exe (the Dev Fabric Agent).  I could see that it was starting with an environment variable, so I took a look at where that was happening:

[screenshot: Process Explorer showing the environment variables for DFAgent.exe]

That gave me a direction to start looking.  I opened the %UserProfile%\AppData\Local\Temp directory and found a conveniently named file there called Visual Studio Web Debugger.log. 

[screenshot: Visual Studio Web Debugger.log in the Temp directory]

A quick look at it showed it to be HTML, so one rename later and voila!

[screenshot: the rendered log showing the error detail]

One of our developers had overridden the <httpErrors> setting in web.config, which was disallowed on my one machine.  I opened my applicationHost.config using an administrative Notepad and sure enough:

[screenshot: the relevant section of applicationHost.config]

So, the moral of the story is next time, just take a look at this log file and you might find the issue.  I suspect the reason that this only happened on debug and not when running without the debugger was that for some reason the debugger is looking for a file called debugattach.aspx.  Since this file does not exist on my machine, it throws a 404, which in turn tries to access the <httpErrors> setting, which culminates in the 500.19 server error.  I hope this saves someone the many hours I spent finding it and I hope it prevents you from rebuilding your machine as I almost did.