January | 2016 | redbaron of azure

This is the third post on how to make sure your application can survive a datacenter outage. The first post just set the basics on how to get your company’s website on to Azure. The second post focused on how to use Traffic Manager infront of a dual datacenter website deployment. This third post in the series will be on the topic on how the SQL Azure database can be geo-replicated across two datacenters.

If you haven’t read the DR2 post, I suggest you do this before reading this post.

Geo-replication is built in functionality in SQL Azure Database

First of all, there is an excellent article by Tony Petrossian in the SQL Azure product group (see references) that discusses the geo-replication features in SQL Azure and the difference between Active and Standard. I encourage you to read that if you haven’t.

In short, how geo-replication works in SQL Azure is that you have a primary database that continously sends commited transactions asynchronously to its secondary databases so that they are up to date. You can have multiple secondary databases but they can only be SQL Azure databases, so you can not user a SQL Server VM as the secondary.

Configuring the Geo-Replication is very easy in the portal. You just go to the primary database and select Geo-Replication in the Settings list. If you can’t find the menu item, you have the wrong Pricing Tier for the database. It needs to be Standard, so S0 is the lowest you can have to set this up.

If the Pricing Tier for the database is Standard, you can only do what is called Standard Geo-Replication, and the portal tells you that twice. In order to have a readable secondary database, the Pricing Tier must be Premium.

What is the Non-readable secondary type, then? It is basically an offline database that receives and applies the asynchronous transactions and keeps the database up to date, but it can not be used by applications in any way until you terminate the continous copy relationship.

Terminating the relationship is a manual procedure in the portal or something you do with Powershell. Either way you do it, it is not something Azure does automatically.

So the question is – should you switch to Premium or stay with Standard? Well, that depends on what you need the secondary site for and also how you plan to use Traffic Managers routing option.

The combinations of Traffic Manager’s Routing options and SQL Azure’s Geo-Repliction options

Traffic Manager has three options for Routing Method between the endpoints (websites) it has in its profile. The Routing Methods are Performance, Weighted(Round Robin) and Priority(Failover). The first two options will mean that all endpoints (websites) will be actively in use and serve web content to clients. This means that an offline, non-readable database is of little use with Performance or Weighted Traffic Manager routing and we must have Active Geo-Replication to get a readable database. But if the Traffic Manager routing is Priority(Failover), this means that we can get by with either the offline or the read-only as the secondary database. It becomes a question on how the failover should happen.

Time and decision to failover the database

If we assume we are using Routing Method Priority (Failover) for the Traffic Manager Profile, the switch from the primary to the secondary endpoints will happen automatically when TM’s monitoring probing fails on http(s). At that point monitoring status will become “Degraded” as I explained in my DR2 post and it will start to direct traffic to the secondary endpoint (ie the website in alternate Azure datacenter).

Since the continous copy relationship needs to be terminated regardless if we use an offline or read-only secondary database, this means that the actual database failover involves a decision point to execute the failover action. The choice of using offline or read-only as a secondary database is a choice of what should happen between the time of the TM failover and until the decision is made to terminate the continous copy relationship (which puts the secondary database in read-write mode).

Using offline as the secondary database means that the now active website will not be able to function correctly until the decision has been made to do database failover. The website responds, but you probably would call this a service disruption.
Using read-only as the secondary database means that the now active website may have a chance to function in read-only mode until the decision for failover has been made. How well the website behaves is really a question on what use read-only data is to your application. If this is a e-Commerce website, read-only is certainly not enough.

Manual decision when to failover

The decision to fail over is manual and should probably be done by a human so that a person can verify the state of the failed website. If you configure Traffic Manager with Priority (Failover) Routing Method, you may have the database connection string of the secondary website pointing to the primary database, since it may just be the website that failed in the primary datacenter and not a complete datacenter outage. In that case, you will have the odd configuration of cross datacenter database queries, but the web application will still be up and running. In this scenario, you may not even know that you had a problem since Traffic Manager will switch back when probing of the primary endpoint succeeds.

Failover action

In the portal, failover is a simple step and you do it via right-clicking on the secondary database and select Failover.

Failing over using Poweshell is even simpler since it’s one command – Set-AzureRmSqlDatabaseSecondary. The command is fast and it just takes seconds for the Geo-Replication role to change from Secondary to Primary.

Summary

This post have tried to show you how you can take advantage of SQL Azure’s built in Geo-Replication feature in order to avoid having a Disaster Recovery situation. Using Traffic Managers Priority (Failover) Routing Method and SQL Azure’s Geo-Replication feature, you can have a secondary website and database. Traffic Manager will switch website based on failed probing on port 80/443, but failing over the database is something you must plan and do as a manual step.

References

Azure SQL Database Standard Geo-Replication – Tony Petrossian
https://azure.microsoft.com/sv-se/blog/azure-sql-database-standard-geo-replication/

Azure SQL Database Geo-Replication Overview
https://azure.microsoft.com/sv-se/documentation/articles/sql-database-geo-replication-overview/

redbaron of azure

philosophical reflections on Azure (without a safety .net)

Month: January 2016

DR part 3 – SQL Azure Geo-Replication