Custom Load Balancing Endpoints in an Azure Web/Worker/VM Role

Windows Azure Web, Worker and Virtual Machine roles provide an easy built-in way to customise health monitoring for a load balanced endpoint, allowing you to disable a single endpoint for a role without causing the entire role to recycle. This can be achieved through use of the LoadBalancerProbes schema element, which is available in Azure SDK 1.7+.

Background

The Windows Azure Load Balancer running on the Azure Fabric Service acts as the default controller for determining how to route incoming network traffic to endpoints on your role instances. A default load balancer probe is provided that covers all endpoints for each role instance - this probe is high level and simply returns HTTP 200 OK if the role is in the Ready state (not Busy, Recycling, Stopping etc). If the response is not 200 OK, the load balancer stops all traffic being routed to that instance.

Once the role instance starts returning HTTP 200 again, the load balancer resumes traffic flow. When running a standard web role, your code usually runs in the w3wp.exe process, which isn't actually monitored by the load balancer (so failures like your web application returning Internal Server Error 500 won't cause the instance to be taken out of rotation).

Overriding the default probe

If you override the default probe for an endpoint, you can provide more complex, lower level logic for each individual endpoint in your service. Your probe is checked regularly (every 15 seconds by default) - if your probe responds with an HTTP 200 or TCP ACK within the timeout period (31 seconds by default) then the associated endpoint will have traffic routed to it as normal. If it starts returning any other HTTP codes or TCP messages, the endpoint will be removed from load balancing.
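To make the timing above concrete, here is a minimal Python sketch of a simplified model of one probed endpoint. It is not the load balancer's actual implementation: `ProbeTracker` and its methods are hypothetical names, and the sketch assumes an endpoint stays in rotation for as long as a successful probe has been seen within the timeout window.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeTracker:
    """Toy model of one probed endpoint (hypothetical, for illustration only)."""
    interval_seconds: int = 15   # default probe interval quoted above
    timeout_seconds: int = 31    # default timeout quoted above
    last_success: Optional[float] = None  # time of the last successful probe

    def record(self, now: float, http_status: Optional[int]) -> None:
        """Record one probe result; None stands for 'no response at all'."""
        if http_status == 200:
            self.last_success = now

    def in_rotation(self, now: float) -> bool:
        """Assumption: in rotation while a success falls inside the timeout window."""
        if self.last_success is None:
            return False
        return now - self.last_success <= self.timeout_seconds


tracker = ProbeTracker()
tracker.record(now=0, http_status=200)
print(tracker.in_rotation(now=15))   # True: last success was 15 s ago
tracker.record(now=15, http_status=500)
tracker.record(now=30, http_status=500)
print(tracker.in_rotation(now=30))   # True: still inside the 31 s window
print(tracker.in_rotation(now=45))   # False: window elapsed without a success
```

With the default values, two consecutive failed probes fit inside the timeout window before the endpoint drops out of rotation in this model.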

Usages

You can use this in multiple ways, for example:

  1. Ensuring only one instance of your role provides a selected endpoint at a time.
  2. Disabling an instance if one of your websites starts returning an unusually large number of HTTP errors for a specified URI.
  3. Removing a single endpoint from load balancer rotation if it becomes overloaded - for example, temporarily disabling new requests to port 80 on a web role if that instance becomes overloaded by a small number of unusually heavy requests (this would normally cause problems given the default load balancing is round robin).
  4. Disabling an endpoint when a custom service becomes unavailable, for example stopping requests to a virtual machine role database if the database is encountering issues (while still allowing requests to all other services).

Gotchas

  • Overriding the built-in load balancing probe can mean that your replacement probe still returns 200 OK after a role has its OnStop() method called. You should ensure your probe does the same as the built-in probe and begins returning a non-200 HTTP status code as soon as OnStop() is called.
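The gotcha above boils down to a small piece of state shared between the probe and the shutdown path. The following is a language-agnostic sketch written in Python; `HealthProbe`, `on_stop` and `status_code` are hypothetical names, and in a real role the same logic would sit behind the probe URI and be triggered from the role's OnStop() method.

```python
import threading
from typing import Callable


class HealthProbe:
    """Hypothetical probe that reports unhealthy once shutdown has begun."""

    def __init__(self, check: Callable[[], bool]) -> None:
        self._check = check                   # custom health check for the endpoint
        self._stopping = threading.Event()    # set once the role starts stopping

    def on_stop(self) -> None:
        """Call this first thing when the role is asked to stop."""
        self._stopping.set()

    def status_code(self) -> int:
        """HTTP status the probe endpoint should return right now."""
        if self._stopping.is_set():
            return 503                        # any non-200 takes the endpoint out
        return 200 if self._check() else 503


probe = HealthProbe(check=lambda: True)
print(probe.status_code())  # 200 while running and healthy
probe.on_stop()
print(probe.status_code())  # 503 as soon as stopping has begun
```

Using 503 here is an assumption for illustration; per the text above, any non-200 status has the same effect.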

Example .csdef schema

<ServiceDefinition>
  <LoadBalancerProbes>
    <LoadBalancerProbe name="TestProbe" protocol="{http|tcp}" path="{uri-for-checking-health-status-of-vm}" port="{port-number}" intervalInSeconds="{interval-in-seconds}" timeoutInSeconds="{timeout-in-seconds}" />
  </LoadBalancerProbes>
  <WorkerRole>
  ...
    <Endpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" localPort="80" loadBalancerProbe="TestProbe" />
    </Endpoints>
  ...
  </WorkerRole>
</ServiceDefinition>

For a real world example of when a LoadBalancerProbe might be useful, see this post.

LoadBalancerProbe element attributes

  • name - A unique identifier for this probe. Can be referenced by multiple endpoints.
  • protocol - HTTP or TCP. A 200 OK for HTTP or a TCP ACK for TCP means the endpoint should be kept available. All other responses indicate to the Fabric Controller that it should take this endpoint out of load balancing rotation.
  • path - Required for HTTP protocols to specify the URI used for health checking.
  • port - The port number to be used for checking availability. Defaults to the same port number as the endpoint.
  • intervalInSeconds - How frequently (in seconds) to make availability checking requests.
  • timeoutInSeconds - A number of seconds after which, if no success response is received by the availability checks, the endpoint will be removed from the load balancing rotation. A good recommended value is twice that of intervalInSeconds, allowing two full failed requests before disabling traffic to the associated endpoint.

Microsoft's hidden gem: MSDeploy

MSDeploy will be central to many of the posts in this series. MSDeploy is the tool behind the scenes in Web Deploy, but can do much more than deploying IIS sites through its provider extensions. We've used its powerful command execution and file syncing functionality for:

  • Deployments to Windows Azure Roles
  • Deployments of on-premise Topshelf services
  • Deployments to on-premise .NET web farms and off-premise Windows Azure web farms
  • Automatic creation of IIS websites

Maintainable, large-scale continuous delivery with TeamCity series

This post is part of a blog series jointly written by myself and Rob Moore called Maintainable, large-scale continuous delivery with TeamCity:

This post outlines how using OctopusDeploy for deployments can fit into a TeamCity continuous delivery deployment pipeline.

Maintainable, large-scale continuous delivery with TeamCity series

This post is part of a blog series jointly written by myself and Matt Davies called Maintainable, large-scale continuous delivery with TeamCity:

  1. Intro
  2. TeamCity deployment pipeline
  3. Deploying Web Applications
    • MsDeploy (onprem and Azure Web Sites)
    • OctopusDeploy (nuget)
    • Git push (Windows Azure Web Sites)
  4. Deploying Windows Services
    • MsDeploy
    • OctopusDeploy
    • Git push (Windows Azure Web Sites Web Jobs)
  5. Deploying Windows Azure Cloud Services
    • OctopusDeploy
    • PowerShell
  6. How to choose your deployment technology

Documentation

Some of the official documentation on the more advanced commands is not clear and doesn't warn you about some of the pitfalls that you can run into. Some good introductory links:

http://raquila.com/software/ms-deploy-basics/ (well organised post on the options and commands available)

http://blog.torresdal.net/2010/08/16/no-click-web-deployment-part-2-web-deploy-a-k-a-msdeploy/ (good for an extremely comprehensive listing of the various options available)

I've also linked to several posts below with information more targeted to specific msdeploy options and commands which are somewhat under-documented.

Our frequently used providers/options

We use the runCommand provider for remotely executing powershell scripts:

-presync:runCommand

-postsync:runCommand

-source:runCommand

Other useful commands/options:

-useCheckSum

As the name implies, this uses checksums rather than timestamps to determine which files to sync. In a continuous integration environment where every file is rewritten on each build, this flag speeds up deployments because unchanged files are skipped (and it has helped us overcome locking issues with some unmanaged DLLs which are never updated).
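To illustrate why this matters, here is a small Python sketch comparing the two strategies on a file that was rebuilt with identical content. This is not how msdeploy is implemented: the helper names are hypothetical and SHA-256 is an assumed hash chosen only for the example.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str) -> str:
    """Hash a file's contents in chunks (SHA-256 is an assumption for this sketch)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_sync_by_timestamp(src: str, dst: str) -> bool:
    """Sync when the destination is missing or the modification times differ."""
    return not os.path.exists(dst) or os.path.getmtime(src) != os.path.getmtime(dst)


def needs_sync_by_checksum(src: str, dst: str) -> bool:
    """Sync when the destination is missing or the contents differ."""
    return not os.path.exists(dst) or sha256_of(src) != sha256_of(dst)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "build.dll")
    dst = os.path.join(tmp, "deployed.dll")
    for path in (src, dst):
        with open(path, "wb") as handle:
            handle.write(b"same bytes")
    # Pretend the deployed copy was written long before the fresh build output.
    os.utime(dst, (1_000_000_000, 1_000_000_000))
    print(needs_sync_by_timestamp(src, dst))  # True: the build touched the file
    print(needs_sync_by_checksum(src, dst))   # False: the content is unchanged
```

A file that is never re-copied is also never opened for writing on the target, which is consistent with the locking benefit described above.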

-allowUntrusted

Useful if you trust your network and target msdeploy server (avoids certificate validation issues during deployment).

Powershell scripts and MSDeploy

Example TeamCity parameters list for an msdeploy command line step:

-postSyncOnSuccess:runCommand='"powershell.exe -ExecutionPolicy ByPass -InputFormat None -NonInteractive -File C:\msbuild_scripts\topshelf_deploy_postsync.ps1 -serviceAssemblyName %env.DeployServiceName% -environment %env.Env% -servicesPath %DeployTopshelfPath%"',waitInterval=600000

In the above powershell command:

  • ExecutionPolicy ByPass ensures powershell commands can be executed on systems where this is restricted by default (e.g. Windows Server 2008). See this post and the official documentation to learn more about this option.
  • InputFormat None is a little known option which overcomes an issue where powershell will hang if stdin is redirected (as is the case when it is launched by msdeploy). See this bug report for more information and this blog post for the fix.
  • NonInteractive is another sparsely documented powershell option which ensures msdeploy can display the output from the powershell script.
  • waitInterval is an msdeploy option for runCommand (specified in milliseconds) which ensures msdeploy waits a sufficient amount of time for the script before giving up (the default value is only one second, which is not long enough for some of our more complex powershell scripts).

Error handling in powershell

Our error handling process for powershell scripts executed via msdeploy is best described in the following posts: Powershell error handling and why you should care and Caught in a trap - dealing with errors. TeamCity is set up to fail a deployment build if the following text is detected from the msdeploy output: "exited with code '0x1'". We can ensure powershell scripts exit with that error by organising scripts with the following structure:

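# Treat non-terminating cmdlet errors as terminating so they reach the catch block
$ErrorActionPreference = "Stop"

try
{
    # $websiteName is expected to be set earlier in the script
    eventcreate /ID 1 /L APPLICATION /T INFORMATION /SO $websiteName /D "A deployment of $websiteName started"
}
catch
{
    # Surface the error and fail the script with exit code 1,
    # which msdeploy reports as "exited with code '0x1'"
    write-error $_
    exit 1
}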

If you're interested in reading more about msdeploy, you should check out Richard Szalay's posts on msdeploy - I've learnt a lot from them!