{"id":7316,"date":"2016-12-22T11:21:15","date_gmt":"2016-12-22T10:21:15","guid":{"rendered":"https:\/\/blog.redbaronofazure.com\/?p=7316"},"modified":"2016-12-22T12:19:40","modified_gmt":"2016-12-22T11:19:40","slug":"failover-management-with-azure-traffic-manager","status":"publish","type":"post","link":"https:\/\/blog.redbaronofazure.com\/?p=7316","title":{"rendered":"Failover management with Azure Traffic Manager"},"content":{"rendered":"<p>Achieving datacenter-resilient high availability with Azure App Service usually means you need to deploy your WebApp twice, in two different Azure Regions (datacenters), and put Azure Traffic Manager in front of it. But\u00a0just load balancing your app between two datacenters doesn&#8217;t solve your problem, because you need to handle the failover between the two sites. Failover\u00a0becomes complex for state and persistent storage, like databases. This blog post aims to give you ideas on how you can handle this failover\u00a0management. It will be a bit lengthy, but I want to cover the complete scenario and I hope that reading it will be rewarding.<\/p>\n<p><strong>Deployment architecture<\/strong><\/p>\n<p>To set the stage, the architecture is as depicted in the figure below. 
It&#8217;s a simple WebApp deployed to the Azure Regions North Europe and West Europe.\u00a0The WebApp uses SQL Azure as a backend database and the databases are geo-replicated so that the db in North Europe is the primary, allowing read\/write, and the db in West Europe is a read-only secondary.<\/p>\n<p><a href=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-00.png\"><img loading=\"lazy\" class=\"alignnone wp-image-7323\" src=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-00.png\" alt=\"\" width=\"341\" height=\"314\" srcset=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-00.png 401w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-00-300x276.png 300w\" sizes=\"(max-width: 341px) 100vw, 341px\" \/><\/a>\u00a0<a href=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-00B.png\"><img loading=\"lazy\" class=\"alignnone wp-image-7326\" src=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-00B.png\" alt=\"\" width=\"340\" height=\"234\" srcset=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-00B.png 386w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-00B-300x207.png 300w\" sizes=\"(max-width: 340px) 100vw, 340px\" \/><\/a><\/p>\n<p>Traffic Manager is set up to point to the two WebApp endpoints, with routing in Priority mode. This means that all traffic will be routed to the endpoint with the lowest priority value (that is, the highest priority) as long as it is available. 
This makes North Europe our primary site and West Europe a backup site, which makes the solution resilient to a datacenter failure.<\/p>\n<p><a href=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-01.png\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-7327\" src=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-01.png\" alt=\"\" width=\"807\" height=\"453\" srcset=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-01.png 807w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-01-300x168.png 300w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-01-768x431.png 768w\" sizes=\"(max-width: 807px) 100vw, 807px\" \/><\/a><\/p>\n<p><strong>Traffic Manager Probing<\/strong><\/p>\n<p>How does Traffic Manager know when to fail over? TM probes each endpoint to see that it&#8217;s still there, and as long as it returns an HTTP 200 status TM will keep the endpoint online. You can configure which URL is used for probing. The default value for Path is &#8220;\/&#8221;, which means the default page, but as you will see later, this is not an optimal choice. You want a page that responds as fast as possible and still does the necessary health checks.\u00a0I have a special webpage, probe.aspx, that handles the probe requests. I have also set a very low DNS TTL, which means that clients cache TM&#8217;s answer about which endpoint to use only for a very short time. This is good in a test environment for validating the failover. 
In production you would use a higher value.<\/p>\n<p><a href=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-01B.png\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-7329\" src=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-01B.png\" alt=\"\" width=\"823\" height=\"406\" srcset=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-01B.png 823w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-01B-300x148.png 300w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-01B-768x379.png 768w\" sizes=\"(max-width: 823px) 100vw, 823px\" \/><\/a><\/p>\n<p>The probing process is described in the documentation\u00a0<a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/traffic-manager\/traffic-manager-monitoring\">https:\/\/docs.microsoft.com\/en-us\/azure\/traffic-manager\/traffic-manager-monitoring<\/a>\u00a0and I really encourage you to read it now before you move on in this blog post. There is one thing that the documentation doesn&#8217;t mention, and that is that\u00a0you will be probed by multiple callers, as you will see below, which means there will be a whole lot more requests hitting your probe webpage. 
This is why you don&#8217;t want to use the default page as the probe Path.<\/p>\n<p><a href=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-02.png\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-7318\" src=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-02.png\" alt=\"\" width=\"1103\" height=\"629\" srcset=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-02.png 1103w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-02-300x171.png 300w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-02-768x438.png 768w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-02-1024x584.png 1024w\" sizes=\"(max-width: 1103px) 100vw, 1103px\" \/><\/a><\/p>\n<p>In the above screenshot my WebApp is being probed by IP address 104.215.91.84 at seconds 20, 21, 50, 51 at a steady beat and by 65.52.217.19 at seconds 14, 14, 44, 44. This is 8 times a minute, which is quite a lot. The pattern here is that each prober comes back 30 seconds later, like 20-50, 21-51 and 14-44. You can also see the funny fact that, by accident, we happen to have two probes landing on exactly the same seconds, 14-44.<\/p>\n<p>Since we are being probed 8 times a minute, I&#8217;ve implemented my probing logic to do a database check not every time but just every 30 seconds, as can be seen in the Status column. The rest of the time we cache the db status and assume it hasn&#8217;t changed, since we don&#8217;t want to hit the db too much.<\/p>\n<p>But the important thing is that the probe page actually returns HTTP Status 200 if the database is readable and writable (the code does a dummy update in a table), which means we should be good.<\/p>\n<p><strong>Simulate a failure<\/strong><\/p>\n<p>I&#8217;ve created this WebApp so that I\u00a0can start returning HTTP Status 404 to the TM prober. 
If I pass the query string statusCode=404, the probe logic will save that and return it until I tell it to start returning 200 again.<\/p>\n<p><a href=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-03.png\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-7319\" src=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-03.png\" alt=\"\" width=\"631\" height=\"139\" srcset=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-03.png 631w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-03-300x66.png 300w\" sizes=\"(max-width: 631px) 100vw, 631px\" \/><\/a><\/p>\n<p>Once\u00a0the probe page is set to return 404, you can see it\u00a0starting to return 404 errors.<\/p>\n<p><a href=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-05.png\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-7320\" src=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-05.png\" alt=\"\" width=\"747\" height=\"550\" srcset=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-05.png 747w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-05-300x221.png 300w\" sizes=\"(max-width: 747px) 100vw, 747px\" \/><\/a><\/p>\n<p>What happens now is explained in the documentation:<\/p>\n<p>&#8220;The monitoring system performs a GET request, but does not receive a response within the timeout period of 10 seconds (alternatively, a non-200 response may be received). It then tries three more times, at 30-second intervals. If one of the tries is successful, then the number of tries is reset.&#8221;<\/p>\n<p>This means that it takes roughly 4&#215;30 seconds, about two minutes, before Traffic Manager decides that the endpoint is not online anymore. We can see this by running a PowerShell script in a loop asking for the TM endpoints&#8217; status. 
We have 4 calls where both endpoints have status &#8220;Online&#8221; and on the 5th call the failing endpoint has changed status to &#8220;Degraded&#8221;.<\/p>\n<p><a href=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-06.png\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-7321\" src=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-06.png\" alt=\"\" width=\"732\" height=\"445\" srcset=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-06.png 732w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-06-300x182.png 300w\" sizes=\"(max-width: 732px) 100vw, 732px\" \/><\/a><\/p>\n<p>What happens then is that all subsequent name resolution requests Traffic Manager gets will be answered with the WebApp in West Europe.<\/p>\n<p><a href=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-07.png\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-7322\" src=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-07.png\" alt=\"\" width=\"1101\" height=\"352\" srcset=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-07.png 1101w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-07-300x96.png 300w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-07-768x246.png 768w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-07-1024x327.png 1024w\" sizes=\"(max-width: 1101px) 100vw, 1101px\" \/><\/a><\/p>\n<p>Current browser sessions to the failing WebApp in North Europe have to time out, and there is no redirection from North to West. Traffic Manager is basically a DNS extension to Azure DNS, and when you resolve the name xxxxx.trafficmanager.net, it will\u00a0return the IP address of the endpoint it selects according to priority and status. 
This means that local DNS caches will play a trick on you, as they will keep resolving the xxxxx.trafficmanager.net name to the IP address of the failing WebApp until the cached entry times out.\u00a0You can see this behaviour in the last call in the PowerShell screenshot, where the WebApp responding is still the one that failed (remember &#8211; I just simulate a 404 to the prober. The WebApp is still OK).<\/p>\n<p><strong>Strategies for handling failover<\/strong><\/p>\n<p>OK, now you&#8217;ve seen the TM failover in action and understand how it works, but how should you implement your solution to handle it? My advice is to keep calm and not overreact by starting a full failover right away, since the issue your WebApp is having might be a short glitch. If you implement a monitor like my short PowerShell script to watch the status of the TM endpoints and fire an alarm if any of them becomes &#8220;Degraded&#8221;, chances are that by the time an operator sees this event it has self-healed, meaning the TM endpoint is &#8220;Online&#8221; again.<\/p>\n<pre class=\"theme:powershell-ise lang:ps decode:true\">Param(\r\n   [Parameter(Mandatory=$True)][string]$TrafficManagerName,                  # TM name\r\n   [Parameter(Mandatory=$True)][string]$ResourceGroupName,                   # Azure Resource Group\r\n   [Parameter(Mandatory=$False)][int]$Sleep = 10                             # Sleep time in seconds between monitor calls\r\n)\r\n\r\nDo\r\n{\r\n    # Call the site through the TM name and print which WebApp answered (read from the WebSiteName response header)\r\n    $html = Invoke-WebRequest \"http:\/\/$TrafficManagerName.trafficmanager.net\/whoareyou.aspx\" -Headers @{\"Cache-Control\"=\"no-cache\"}\r\n    write-output \"$(get-date -format 'o')\"\r\n    write-output \"Web site responding - $($html.Headers[\"WebSiteName\"])\"\r\n\r\n    # List the monitor status of each TM endpoint; anything that is not Online is printed in red\r\n    $ep = Get-AzureRmTrafficManagerProfile -Name $TrafficManagerName -resourceGroupName $ResourceGroupName\r\n    foreach( $endpoint in $ep.Endpoints) {\r\n      if ( $endpoint.EndpointMonitorStatus -ne \"Online\") {\r\n        write-host \"$($endpoint.Priority). 
$($endpoint.Target) - $($endpoint.EndpointMonitorStatus) ($($endpoint.Location))\" -ForegroundColor Red\r\n      } else {\r\n        write-host \"$($endpoint.Priority). $($endpoint.Target) - $($endpoint.EndpointMonitorStatus) ($($endpoint.Location))\"\r\n      }\r\n    }\r\n    write-output \"\"\r\n    Start-Sleep -Seconds $Sleep\r\n} while($true)\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>Strategy 1 &#8211; survive the short glitch<\/strong><\/p>\n<p>To survive the short glitch you have two options. If your WebApp can live with a read-only database, the WebApp in West Europe can serve content from the West Europe database. You have to have some logic in your WebApp to know if it runs in\u00a0read-write or read-only mode, because in read-only mode the SQL connection string needs to include &#8220;ApplicationIntent=ReadOnly;&#8221;.<\/p>\n<p>If your app is not built for read-only mode, you have a second option, and that is for the WebApp in West Europe (secondary region) to connect to the database in North Europe (primary region) and make cross-region database calls. This is not optimal from a performance aspect, but if the database is working in the primary region, this will give your WebApp full functionality during the short glitch. In the case of an outage you will not go from bad to worse, so to speak.<\/p>\n<p><strong>Strategy 2 &#8211; survive a longer outage<\/strong><\/p>\n<p>Once you have identified that you have a glitch, and that the solution described above has kicked in, the clock has started to tick for doing a database failover.\u00a0You need to make your own policy for how long you can live with that configuration. 
A human decision is probably involved, but when you decide to fail over, you do two\u00a0things: fail over the database and\u00a0reconfigure the connection strings in the WebApps so that they switch roles.<\/p>\n<p><a href=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-08.png\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-7332\" src=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-08.png\" alt=\"\" width=\"754\" height=\"246\" srcset=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-08.png 754w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/12\/tm-failover-08-300x98.png 300w\" sizes=\"(max-width: 754px) 100vw, 754px\" \/><\/a><\/p>\n<p>Since you have now basically swapped the Primary and Secondary regions, you might consider changing the Priority of the Traffic Manager&#8217;s endpoints. Otherwise it will start to resolve the name to the North Europe WebApp as soon as it comes back online, which means you will be back to cross-region database calls again. The clock has now started ticking for making the decision on when it&#8217;s time to do the failback.<\/p>\n<p><strong>Summary<\/strong><\/p>\n<p>My primary objective in this post was to show you how Traffic Manager probing and failover work, how you can implement the logic of a probe webpage and how you can monitor the status of the TM endpoints. My secondary objective was to give you ideas on how to handle failover situations, since these situations are often associated with heated discussions and may even be subject to scrutiny over whether your service has fulfilled its SLA and whether your end-customer has the right to any kind of reimbursement. To minimise heated discussions and economic damage, you need to handle the failover as smoothly as possible. 
This is what I mean by &#8220;keep calm, don&#8217;t overreact&#8221;.<\/p>\n<p><strong>References<\/strong><\/p>\n<p>Documentation &#8211; Traffic Manager Endpoint Monitoring and Failover<br \/>\n<a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/traffic-manager\/traffic-manager-monitoring\">https:\/\/docs.microsoft.com\/en-us\/azure\/traffic-manager\/traffic-manager-monitoring<\/a><\/p>\n<p>Documentation &#8211; Traffic Manager Overview<br \/>\n<a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/traffic-manager\/traffic-manager-overview\">https:\/\/docs.microsoft.com\/en-us\/azure\/traffic-manager\/traffic-manager-overview<\/a><\/p>\n<p>Sample WebApp sources<br \/>\n<a href=\"https:\/\/github.com\/cljung\/AzTmFailover\">https:\/\/github.com\/cljung\/AzTmFailover<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Achieving datacenter-resilient high availability with Azure App Service usually means you need to deploy your WebApp twice, in two different Azure Regions (datacenters), and put Azure Traffic Manager in front of it. 
But\u00a0just load balancing your app between two datacenters doesn&#8217;t solve your problem, because you need to handle the failover between the two [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[231,201],"tags":[371],"_links":{"self":[{"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=\/wp\/v2\/posts\/7316"}],"collection":[{"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=7316"}],"version-history":[{"count":8,"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=\/wp\/v2\/posts\/7316\/revisions"}],"predecessor-version":[{"id":7337,"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=\/wp\/v2\/posts\/7316\/revisions\/7337"}],"wp:attachment":[{"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=7316"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=7316"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=7316"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}