Azure High Availability and Performance Computing

Cloud platforms allow businesses to create highly available and performant computing infrastructure that stands the test of time and growth.

In a global market, sales are being made 24/7 and it is important that systems are able to handle unexpected performance requirements and minimize downtime.

Depending on the business and industry, downtime is estimated to cost a company $336,000 per hour on average.

Before we discuss the high availability and high performance options from Azure, let’s first define the vocabulary:

High availability: The ability to keep functioning, with high uptime, in the event of a hardware fault, server fault, or network outage.

Resiliency: The ability to recover from a failure and continue to function without losing data.

Disaster recovery: The ability to recover if a major incident affects the services that host the application, such as a datacenter outage or complete regional outage.

Failover: In the event of an outage, traffic switches to a secondary region, where the data has been asynchronously replicated.

Eventual consistency: All transactions on the primary region eventually appear in the secondary region.

Availability zone: Unique physical locations within an Azure region that have independent power, cooling, and networking to increase availability. Usually there are 3 availability zones within a region.

Availability is measured in the Service Level Agreement (SLA) guaranteed by Azure. The higher the percentage, the more uptime and availability is guaranteed.

To calculate the combined SLA of two services that both have to be available, multiply the two SLAs together: 99.95% * 99.9% = 99.85% overall.
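The multiplication above can be sketched in a few lines of Python. This is a minimal illustration, assuming every service in the chain must be up for the application to work; the function names are made up for this example, and the downtime conversion assumes a 365-day year.

```python
def composite_sla(*slas_percent: float) -> float:
    """Combine SLAs (as percentages) of services that must all be available."""
    fraction = 1.0
    for sla in slas_percent:
        fraction *= sla / 100.0
    return fraction * 100.0


def max_downtime_hours_per_year(sla_percent: float) -> float:
    """Hours per 365-day year that an SLA allows a service to be down."""
    return (1.0 - sla_percent / 100.0) * 365 * 24


overall = composite_sla(99.95, 99.9)
print(f"Composite SLA: {overall:.2f}%")  # Composite SLA: 99.85%
print(f"Allowed downtime: {max_downtime_hours_per_year(overall):.1f} hours/year")
```

Note that every extra dependency in the chain lowers the overall figure, which is why the composite SLA is always below the weakest individual SLA.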

Resiliency is measured in durability. The higher durability a service has, the less likely you are to experience data loss.

Storage

Azure offers four types of replication to make sure your data is available and consistent:

Locally redundant storage (LRS): data is stored on 3 different racks in the same datacenter, which protects against node hardware failures. Enabled by default.
- 99.9% SLA
- 99.999999999% (11 9’s) durability

Zone-redundant storage (ZRS): data is replicated across 3 storage clusters in different zones of the same region. Cheaper than GRS.
- 99.9% SLA
- 99.9999999999% (12 9’s) durability

Geo-redundant storage (GRS): data is stored 3 times in each of 2 different regions. Most expensive and most durable.
- 99.9% SLA
- 99.99999999999999% (16 9’s) durability

Read-access geo-redundant storage (RA-GRS): GRS with read-only access to the data in the secondary region. Gives you control over when to start a failover.
- 99.99% SLA
- 99.99999999999999% (16 9’s) durability
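To put the durability figures above in perspective, here is a rough sketch that turns the number of nines into an expected number of lost objects. It assumes the durability is the yearly chance that a single stored object survives, and the object count of one billion is an arbitrary example, not a figure from Azure.

```python
from fractions import Fraction

# Number of nines per replication option, as listed above.
DURABILITY_NINES = {"LRS": 11, "ZRS": 12, "GRS": 16, "RA-GRS": 16}


def annual_loss_probability(nines: int) -> Fraction:
    """Yearly chance of losing one object, e.g. 11 nines -> 1 in 10**11."""
    return Fraction(1, 10 ** nines)


def expected_objects_lost(object_count: int, nines: int) -> float:
    """Expected number of objects lost per year (assumed interpretation)."""
    return float(object_count * annual_loss_probability(nines))


for name, nines in DURABILITY_NINES.items():
    lost = expected_objects_lost(1_000_000_000, nines)
    print(f"{name}: about {lost:g} objects lost per year out of 1 billion")
```

Under that assumption, LRS would lose roughly 0.01 objects per year out of a billion, and each additional nine cuts the expected loss by a factor of ten.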

At a minimum, Azure Storage automatically maintains three copies of your data within the datacenter (LRS).

For highly available storage, GRS is generally recommended to mitigate region outages.

Computing

To make a server highly available and performant, you have two options for scaling a virtual machine:

Horizontal scaling: adding or removing VMs to distribute the load (scaling out and in). Best for expansion of work over time.

Vertical scaling: adding resources such as memory, CPU, or storage to existing VMs by changing their plan (scaling up and down). Requires a reboot and can change the outgoing IP address. Best for an increase in complexity.

Virtual Machine Scale Sets

Scale Sets are groups of identically configured VMs that scale horizontally to demand, either automatically or on a customized schedule.

They enable central management, high availability, and application resiliency, and they are useful for distributed computing.

Scheduled scaling: proactively scale the VMs at a certain time

Requires a specific start and end time

Autoscaling: scale VMs automatically in and out when a resource experiences high usage.

Requires conditions, rules, and limits based on the VM’s metrics
- time grain: time span of the metric to measure, usually a minute
- time aggregation: calculated value of the metric during the time grain, such as the average, minimum, maximum, total, last, or count
- duration: how long to measure time grains. Must be at least 5 minutes (=5 time grains)

The default scale condition is always active, but you can change its default instance count.

Scale actions have a cooldown period so they aren’t repeatedly triggered.
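The interplay of time grain, duration, limits, and cooldown can be sketched as a small simulation. This is only an illustration of the idea, not how Azure implements autoscaling: the thresholds, limits, and names (`AutoscaleRule`, `simulate`) are hypothetical, and each input value stands for the average CPU percentage of one one-minute time grain.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AutoscaleRule:
    scale_out_above: float = 70.0  # hypothetical CPU % threshold
    scale_in_below: float = 30.0   # hypothetical CPU % threshold
    duration_grains: int = 5       # look-back window: 5 one-minute grains
    cooldown_grains: int = 5       # minutes to wait after a scale action
    min_instances: int = 2         # limits
    max_instances: int = 10


def simulate(cpu_per_minute: list[float], rule: AutoscaleRule,
             instances: int) -> list[int]:
    """Return the instance count after each one-minute time grain."""
    history = []
    last_scaled_at = None
    for minute in range(len(cpu_per_minute)):
        start = max(0, minute - rule.duration_grains + 1)
        window = cpu_per_minute[start:minute + 1]
        cooling = (last_scaled_at is not None
                   and minute - last_scaled_at < rule.cooldown_grains)
        # Only act on a full window and outside the cooldown period.
        if len(window) == rule.duration_grains and not cooling:
            average = mean(window)
            if average > rule.scale_out_above and instances < rule.max_instances:
                instances += 1
                last_scaled_at = minute
            elif average < rule.scale_in_below and instances > rule.min_instances:
                instances -= 1
                last_scaled_at = minute
        history.append(instances)
    return history


# 12 busy minutes followed by 12 quiet minutes, starting with 2 VMs.
print(simulate([90.0] * 12 + [10.0] * 12, AutoscaleRule(), instances=2))
```

In this run the set grows from 2 to 4 VMs during the busy period, one step per cooldown window, and shrinks back to the minimum of 2 once the average falls below the scale-in threshold.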

Low-priority scale sets use underused compute resources that Azure offers at a lower price.

These are temporary VMs that Azure can remove at any time. What happens on removal depends on the eviction policy:
- Delete: the VM is deleted when Azure needs the resources again.
- Deallocate: the VM is stopped and its data is retained, but you are still charged for storage even though it isn’t running.

Availability Sets

Availability sets are discrete VMs with different configurations that perform similar functions.

While Scale Sets are groups of VMs that are automatically scaled with the same image or configuration, VMs in an availability set are scaled manually with different configurations.

Often a load balancer is used to direct users to the correct VM in an availability set, which further increases availability.

Scale sets can spread VMs across multiple availability zones, and availability sets spread VMs across separate fault domains within a datacenter, so that the VMs have redundant power, cooling, and networking.

High Performance Computing

To vertically scale VMs, Azure offers many highly performant VMs that can respond to intensive memory, CPU, GPU, or storage requirements.

H-series VMs

Highly performant VMs with exceptional CPU, memory, and networking capabilities.

Intel Xeon E5-2667 v3 Haswell 3.2 GHz CPU with DDR4 memory, up to 16 cores and 224 GB RAM (H16m SKU)

Remote Direct Memory Access (RDMA) network interfaces that don’t require involvement of the OS

100 Gb/sec Mellanox EDR InfiniBand for data interconnect between VM storage systems

Message Passing Interface (MPI) support for efficient parallel communication between VMs

HB-series VMs: Highly performant VMs with exceptional memory capabilities.

60 AMD EPYC 7551 processor cores, with 4 GB of RAM per CPU core and 240 GB of memory overall

HC-series VMs: Highly performant VMs with exceptional CPU capabilities.

44 Intel Xeon Platinum 8168 processor cores, with 8 GB of RAM per CPU core and 352 GB of memory overall

N-series VMs

Highly performant VMs with exceptional GPU processing.

NC-series VMs: Lowest cost option of N-series VMs

NVIDIA Tesla K80 GPU card and Intel Xeon E5-2690 v3 processors

ND-series VMs: optimized for AI and deep learning, running single precision floating point operations.

NVIDIA Tesla P40 GPU card and Intel Xeon E5-2690 v4 processors

Recommendations

For highly available applications, you must account for failure within the design:

Region outages: Use geo-replication to mitigate regional outages. Uses an active region and a standby region.

Use a region pair. The regions in a pair are updated separately, and one of them is prioritized for rapid recovery in an outage.

Gateway failure: Use Azure Front Door or Traffic Manager instead of an Application Gateway to handle failover and point to multiple app services.
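The active/standby idea behind this kind of failover can be sketched as priority-based endpoint selection. This is a simplified illustration, not the Front Door or Traffic Manager API: the `Endpoint` type, the region names, and the `healthy` flag (which a real service would derive from health probes) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    name: str       # hypothetical region label
    priority: int   # lower number = preferred
    healthy: bool   # assumed to come from a health probe


def pick_endpoint(endpoints: list[Endpoint]) -> Optional[Endpoint]:
    """Route to the healthy endpoint with the lowest priority number."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None  # nothing left to fail over to
    return min(healthy, key=lambda e: e.priority)


regions = [
    Endpoint("active-region", priority=1, healthy=False),   # simulated outage
    Endpoint("standby-region", priority=2, healthy=True),
]
print(pick_endpoint(regions).name)  # standby-region
```

Once the active region reports healthy again, the same selection sends traffic back to it without any further configuration change.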

Storage outage: Use one of the GRS options to replicate content to multiple regions.

Data: Use geo-replication on databases.

SQL Database (unmanaged) can use active geo-replication to asynchronously replicate a database to a read-only database in another region.
SQL Database (managed) can use auto-failover groups, which are similar to active geo-replication but allow failover automation.
Azure Cosmos DB is inherently multi-regional. All regions can be synchronously writable or you can enable read-only with automatic failover.

Application: Replicate App Services to a secondary region and configure Azure Front Door to handle failover.

Make sure the web app doesn’t store session state in memory, so that the state isn’t lost when an instance fails.

Place read/write messages in a storage queue so that pending tasks are saved. Use Azure Cache for Redis to reduce the number of database requests.

Azure DNS is inherently multi-regional with a 100% SLA.

Azure CDN and Azure AD are inherently multi-regional too with a 99.9% SLA.

When using RA-GRS:

Retry connections when experiencing transient failures.

Handle write failures in the event of downtime. Print an error, buffer write operations, write to another storage account, or disable write operations and let the user know the application is in read-only mode.

Handle stale data, because data is replicated from the primary to the secondary region with a delay.
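The retry and stale-data points can be combined in one small read helper. This is a sketch under stated assumptions rather than Azure SDK code: `read_primary` and `read_secondary` are hypothetical callables supplied by the caller, `TransientError` is a stand-in for whatever retryable error the real client raises, and the backoff values are arbitrary.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as a timeout."""


def read_with_fallback(read_primary, read_secondary, retries=3,
                       base_delay=0.1, sleep=time.sleep):
    """Try the primary region with exponential backoff, then the secondary.

    Returns (value, source) so the caller knows when data may be stale.
    """
    for attempt in range(retries):
        try:
            return read_primary(), "primary"
        except TransientError:
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)  # 0.1 s, then 0.2 s, ...
    # The secondary is read-only and eventually consistent.
    return read_secondary(), "secondary"


def always_failing():
    raise TransientError("primary region unavailable")


value, source = read_with_fallback(always_failing, lambda: "cached copy",
                                   sleep=lambda seconds: None)
print(value, source)  # cached copy secondary
```

Returning the source alongside the value lets the application show a notice or skip write operations whenever it is serving data from the secondary region.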

Circuit Breaker pattern: after detecting a severe failure, the application fails over to the secondary region while continuing to test the primary region. Once the primary region comes back online, the application reconnects to it. Does not work for local or in-memory data structures.
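A minimal sketch of the pattern described above might look like the following. The class name, the failure threshold, and the retry interval are assumptions chosen for illustration; `primary` and `secondary` are hypothetical callables that perform the same operation against each region.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, retry_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.retry_after = retry_after  # seconds between probes of the primary
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means closed: the primary is in use

    def call(self, primary, secondary):
        is_open = self.opened_at is not None
        if is_open and self.clock() - self.opened_at < self.retry_after:
            return secondary()  # open: skip the primary entirely
        try:
            result = primary()  # closed, or a probe after retry_after
        except Exception:
            self.failures += 1
            if is_open or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
            return secondary()
        self.failures = 0
        self.opened_at = None  # primary is back: close the circuit
        return result
```

While the circuit is open the primary is not called at all, which avoids piling requests onto a region that is already failing. After `retry_after` seconds a single call probes the primary, and a successful probe switches traffic back.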

Azure changes its SLAs often, so it’s important to check the official Azure documentation.