The call came at 2:47am. Primary replica was unreachable. The AG had not failed over automatically. The application was down. The secondary was sitting in a Resolving state, waiting — but for what? Forty minutes of troubleshooting later, the answer turned out to be a stale cluster DNS entry that had left the cloud witness unreachable. Quorum was lost. No quorum, no automatic failover. A two-line PowerShell fix, manual failover, application back up.
That scenario repeats across production environments more than it should. Always On Availability Groups are reliable when they're configured correctly and monitored actively. When something goes wrong, the failure modes are specific and diagnosable — but only if you know where to look. This post walks through the exact diagnostic sequence I follow when an AG won't fail over, when automatic failover fires unexpectedly, or when the new primary isn't behaving the way it should.
Step 1: Determine What State Your AG Is Actually In
Before you touch anything, get a clear picture of the current state. These two queries give you everything you need:
-- AG replica states and synchronization health
SELECT
ag.name AS ag_name,
ar.replica_server_name,
ars.role_desc,
ars.operational_state_desc,
ars.connected_state_desc,
ars.synchronization_health_desc,
ars.last_connect_error_description
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id
JOIN sys.availability_groups ag ON ar.group_id = ag.group_id
ORDER BY ag.name, ars.role_desc;
-- Database-level synchronization detail and queue depths
SELECT
ag.name AS ag_name,
ar.replica_server_name,
drs.database_id,
DB_NAME(drs.database_id) AS db_name,
drs.synchronization_state_desc,
drs.synchronization_health_desc,
drs.log_send_queue_size,
drs.redo_queue_size,
drs.last_hardened_lsn,
drs.last_redone_time
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON drs.replica_id = ar.replica_id
JOIN sys.availability_groups ag ON ar.group_id = ag.group_id
ORDER BY ag.name, ar.replica_server_name;
Read these outputs carefully. The states that matter most:
- RESOLVING in role_desc: the replica has lost contact with the primary (or the cluster) and its role is undecided. It's waiting to determine whether it should promote. This is the state you see when automatic failover should have happened but didn't.
- NOT SYNCHRONIZING in synchronization_state_desc (the DMV value has a space, not an underscore): data movement for that database has stopped, so the secondary is no longer receiving or applying log. Often caused by disk pressure on the secondary or a database in an error state.
- last_connect_error_description — read this column carefully on any replica showing DISCONNECTED. It will frequently tell you exactly why connectivity failed.
Step 2: Check Cluster Quorum Before Anything Else
Automatic failover requires quorum. This is the single most common reason automatic failover silently fails — not a SQL Server problem at all, but a WSFC quorum problem. Run this from PowerShell on any cluster node:
Get-ClusterQuorum
Get-ClusterNode | Select-Object Name, State, NodeWeight
If quorum is lost, the surviving node will have the cluster service running but cluster resources will not come online. The AG sits in Resolving state indefinitely. To recover, you need to force quorum — but do this carefully:
# Force quorum on the surviving node — use ONLY when majority of nodes are offline
# and you have confirmed this is the most up-to-date node
Start-ClusterNode -Name "SQLNODE2" -FQ
The -FQ flag starts the cluster with forced quorum. After the cluster comes up, verify your witness is accessible and fix whatever caused the quorum loss before you touch the AG. If the cloud witness was unreachable because of a DNS/firewall change, fix that first. Otherwise you're one node restart away from the same situation.
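If all you have during the incident is a SQL connection, the same quorum picture is exposed through two DMVs. This is a minimal sketch, assuming you can still connect to the local instance. As far as I understand the documentation, these views only return rows when the local node has quorum, so treat an empty result as a warning sign rather than proof of health.

```sql
-- Cluster-level quorum type and state as seen by this instance
SELECT
    cluster_name,
    quorum_type_desc,
    quorum_state_desc
FROM sys.dm_hadr_cluster;

-- Per-member state and vote count (cluster nodes plus any witness)
SELECT
    member_name,
    member_type_desc,
    member_state_desc,
    number_of_quorum_votes
FROM sys.dm_hadr_cluster_members
ORDER BY member_type_desc, member_name;
```

A witness row showing a state other than UP, or a vote count that no longer adds up to a majority, points you back at the cluster rather than at SQL Server.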
Step 3: Why Didn't Automatic Failover Fire?
Automatic failover requires all of the following to be true simultaneously:
- Failover mode on both the primary and target secondary must be set to Automatic
- The secondary must be in synchronous commit mode with the primary
- All databases in the AG must be in SYNCHRONIZED state
- The cluster must have quorum
- The secondary must be able to determine the primary is genuinely unavailable (not just a transient network blip)
Check the failover mode configuration:
SELECT
ar.replica_server_name,
ar.availability_mode_desc,
ar.failover_mode_desc
FROM sys.availability_replicas ar
JOIN sys.availability_groups ag ON ar.group_id = ag.group_id
WHERE ag.name = 'YourAGName';
If failover_mode_desc shows MANUAL on the secondary, automatic failover is disabled — full stop. Someone set it that way, probably during maintenance, and never changed it back. This is more common than it should be.
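Putting the replica back is one statement per setting. The sketch below assumes the secondary is named SQLNODE2 (a placeholder borrowed from the quorum example) and that you run it on the current primary. MODIFY REPLICA accepts a single option per statement, which is why there are two.

```sql
-- Run on the current primary. AG and replica names are placeholders.
-- Automatic failover requires synchronous commit, so set that first.
ALTER AVAILABILITY GROUP [YourAGName]
    MODIFY REPLICA ON N'SQLNODE2'
    WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

-- Then re-enable automatic failover on the same replica.
ALTER AVAILABILITY GROUP [YourAGName]
    MODIFY REPLICA ON N'SQLNODE2'
    WITH (FAILOVER_MODE = AUTOMATIC);
```

Re-run the failover mode query afterwards and confirm that both the primary and the target secondary report SYNCHRONOUS_COMMIT and AUTOMATIC.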
The fifth condition, determining genuine unavailability, is governed by three timers: the WSFC heartbeat threshold, the lease timeout, and the health check timeout. By default the cluster declares a node failed after roughly 10 seconds of missed heartbeats on current Windows Server versions, the lease timeout is 20 seconds, and the health check timeout is 30 seconds. In practice, expect up to about 30 seconds before automatic failover initiates. This is intentional: it prevents flip-flopping on transient network issues. If your application can't tolerate 30 seconds of unavailability, you can tighten these values, but do it carefully and test thoroughly.
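Before changing any timeout, it helps to see what the AG is currently set to. This sketch reads the health check settings from sys.availability_groups. The lease timeout is a property of the cluster resource, not the AG metadata, so it is not visible here. The ALTER statement is only an illustration: 30000 ms is the documented default, and YourAGName is a placeholder.

```sql
-- Current health-check settings for each AG on this instance
SELECT
    name AS ag_name,
    failure_condition_level,   -- 1 (least sensitive) to 5 (most sensitive)
    health_check_timeout       -- milliseconds
FROM sys.availability_groups;

-- Illustration only: set the timeout explicitly (run on the primary).
-- Pick your own value after testing; 30000 simply restates the default.
ALTER AVAILABILITY GROUP [YourAGName] SET (HEALTH_CHECK_TIMEOUT = 30000);
```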
Step 4: Unexpected Automatic Failover — Finding the Root Cause
The AG failed over when you didn't expect it to. Now what? Start with the SQL Server error log on the former primary:
EXEC xp_readerrorlog 0, 1, N'availability', NULL, NULL, NULL, N'desc';
Look for entries around the failover timestamp. Common entries you'll find and what they mean:
- "Lease expired" — the primary's lease with the WSFC expired. Typically caused by a blocking condition on the primary (I/O stall, OS scheduler issue, memory pressure) that prevented the lease renewal thread from running. Check for I/O latency and CPU saturation at the time of the event.
- "Connection timeout" — the replicas lost network connectivity. Check the network team's event timeline.
- "IsAlive check failed" — WSFC sent a health check to the SQL Server resource and didn't get a response in time. Usually correlates with lease expiration scenarios.
- "Automatic failover initiated" followed by a database count — this confirms a clean automatic failover. The primary reported its own failure, WSFC agreed, secondary promoted.
Also check the Windows System event log for cluster events at the same timestamp. Cluster-side events are often more descriptive than the SQL Server error log for quorum and network-related failovers.
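When you already know roughly when the failover happened, narrowing the error log search to that window cuts the noise considerably. This is a sketch built on the fifth and sixth parameters of xp_readerrorlog, which act as start and end times. The procedure is undocumented, so verify the behavior on your build. The timestamps are hypothetical.

```sql
-- Hypothetical incident window; replace with the failover time from your monitoring
DECLARE @start datetime = '2024-01-15T02:30:00';
DECLARE @end   datetime = '2024-01-15T03:15:00';

-- Current error log (0), SQL Server log type (1), filtered by text and time, oldest first
EXEC xp_readerrorlog 0, 1, N'lease', NULL, @start, @end, N'asc';
EXEC xp_readerrorlog 0, 1, N'availability', NULL, @start, @end, N'asc';
```

If the former primary restarted during the incident, the entries you want are probably in the previous log file, so repeat the calls with 1 as the first argument.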
Step 5: After Failover — Validating the New Primary
Whether failover was automatic or manual, run these checks on the new primary before declaring it healthy:
-- Confirm primary role and that every database is online and synchronized
SELECT
ars.role_desc,
DB_NAME(drs.database_id) AS db_name,
drs.database_state_desc,
drs.synchronization_state_desc,
drs.is_local
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.dm_hadr_availability_replica_states ars ON drs.replica_id = ars.replica_id
WHERE ars.is_local = 1;
-- Review listener configuration; then test from a client that it connects to the new primary
SELECT
ag.name AS ag_name,
agl.dns_name AS listener_name,
agl.port,
agl.ip_configuration_string_from_cluster
FROM sys.availability_group_listeners agl
JOIN sys.availability_groups ag ON agl.group_id = ag.group_id;
-- Check for any databases that did not come online cleanly
SELECT name, state_desc, is_in_standby
FROM sys.databases
WHERE replica_id IS NOT NULL OR is_in_standby = 1;
Two things to watch for immediately after failover:
- Databases in RESTORING state on the new primary: this happens when failover was forced with potential data loss. The databases need to be recovered: RESTORE DATABASE [dbname] WITH RECOVERY. This is destructive, because you're accepting the data loss. Confirm with stakeholders before running it.
- The old primary coming back online: when the former primary recovers, it will join the AG as a secondary. Monitor it. If it starts synchronizing normally, you're in good shape. If it shows a data loss gap and you failed over with potential data loss, it will need to be re-seeded.
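Before deciding whether the former primary needs re-seeding, check whether its databases are simply suspended. The sketch below assumes the usual behavior after a forced failover, where secondary databases stay suspended until someone resumes them. The database name is a placeholder.

```sql
-- Run on the replica that has just rejoined as a secondary
SELECT
    DB_NAME(drs.database_id) AS db_name,
    drs.synchronization_state_desc,
    drs.is_suspended,
    drs.suspend_reason_desc
FROM sys.dm_hadr_database_replica_states drs
WHERE drs.is_local = 1
  AND drs.is_suspended = 1;

-- Resume data movement for one database ([dbname] is a placeholder).
-- Changes on this copy that never reached the new primary are discarded on resume.
ALTER DATABASE [dbname] SET HADR RESUME;
```

If anything on the old primary might need to be salvaged, copy it out before resuming. Once data movement restarts, that copy converges on the new primary's version.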
Step 6: Failing Back — Don't Skip This Step
After an unplanned failover, most teams leave the secondary as the primary and forget to fail back. That's a mistake. You want to restore the AG to its intended topology so your DR posture is intact.
Before failing back:
- Confirm the original primary is fully synchronized (synchronization_state_desc = SYNCHRONIZED)
- Confirm queue depths are zero (log_send_queue_size = 0 and redo_queue_size = 0)
- Schedule a maintenance window, since even a manual failover causes a brief connection reset
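The first two checks can be read off in a single query. The sketch below also pulls is_failover_ready from sys.dm_hadr_database_replica_cluster_states, which is the cluster's own view of whether a database can fail over without data loss. SQLNODE1 is a placeholder for the original primary.

```sql
-- Run on the current primary; N'SQLNODE1' is a placeholder for the failback target
DECLARE @target sysname = N'SQLNODE1';

SELECT
    ar.replica_server_name,
    dcs.database_name,
    drs.synchronization_state_desc,
    drs.log_send_queue_size,      -- KB
    drs.redo_queue_size,          -- KB
    dcs.is_failover_ready         -- 1 = can fail over without data loss
FROM sys.dm_hadr_database_replica_cluster_states dcs
JOIN sys.dm_hadr_database_replica_states drs
    ON dcs.replica_id = drs.replica_id
   AND dcs.group_database_id = drs.group_database_id
JOIN sys.availability_replicas ar
    ON dcs.replica_id = ar.replica_id
WHERE ar.replica_server_name = @target;
```

Every row should show SYNCHRONIZED, zero queues, and is_failover_ready = 1 before you issue the failover.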
-- Initiate manual failover back to the original primary
-- Run this on the target server (the one you want to make primary)
ALTER AVAILABILITY GROUP [YourAGName] FAILOVER;
After failover completes, verify the listener, check all databases are synchronized on the new secondary (original primary), and update any monitoring alerts if the expected-primary server name is part of your alerting logic.
Quick Reference: AG Failover Won't Start
If you're in an active incident, run through this list:
- Check WSFC quorum: Get-ClusterQuorum in PowerShell
- Check sys.dm_hadr_availability_replica_states for replica state and last error
- Confirm failover mode is AUTOMATIC on both replicas
- Confirm availability mode is SYNCHRONOUS_COMMIT on the target secondary
- Check that all databases in the AG show SYNCHRONIZED state
- If all of the above are correct and AG still won't auto-fail: force a manual failover with ALTER AVAILABILITY GROUP [name] FAILOVER
- If databases are in RESOLVING state and won't come online: ALTER AVAILABILITY GROUP [name] FORCE_FAILOVER_ALLOW_DATA_LOSS (use only as a last resort; this accepts potential data loss)
Set Up Alerts Before the Next Incident
The best time to add AG monitoring is before you need it. These are the alerts worth configuring:
- Synchronization health != HEALTHY — catches replicas falling behind or disconnecting
- log_send_queue_size > 50MB — secondary is falling behind; RPO is degrading
- redo_queue_size > 50MB — secondary isn't applying log fast enough; often disk pressure
- Failover event — alert on any role change. You should never find out about a failover from an application team.
- Quorum state change — configure Windows cluster alerting for quorum loss events
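SQL Server Agent jobs that poll sys.dm_hadr_availability_replica_states and sys.dm_hadr_database_replica_states work well for most of these, and an Agent alert on error 1480 covers role changes. Extended Events sessions capturing the hadr_* events (the built-in AlwaysOn_health session is a good starting point) give you the full detail when you need to investigate root cause after the fact.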
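One way to wire up the health and queue-depth alerts is a small polling script run as an Agent job step. This is a sketch, not a finished monitoring solution. The thresholds come from the list above, the message text is made up, and it assumes the job runs on the primary, where rows for all replicas are visible.

```sql
-- Queue sizes are reported in KB, so 50 MB = 51200
DECLARE @threshold_kb bigint = 51200;

IF EXISTS (
    SELECT 1
    FROM sys.dm_hadr_database_replica_states
    WHERE log_send_queue_size > @threshold_kb
       OR redo_queue_size > @threshold_kb
)
    -- WITH LOG writes the message to the error log, where an Agent alert can pick it up
    RAISERROR (N'AG queue depth exceeds 50 MB on at least one database replica.', 16, 1) WITH LOG;

IF EXISTS (
    SELECT 1
    FROM sys.dm_hadr_availability_replica_states
    WHERE synchronization_health_desc <> 'HEALTHY'
)
    RAISERROR (N'At least one AG replica is not reporting HEALTHY synchronization.', 16, 1) WITH LOG;
```

Because a severity 16 error fails the job step, the job's own failure notification doubles as the alert if you don't want to define a separate Agent alert on the message.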
When to Call for Help
Some AG failures resolve with the steps above. Others don't — because the root cause is a hardware issue, a network change nobody documented, or a configuration that's been silently wrong for months. If you've worked through the diagnostic sequence and the AG still isn't behaving correctly, the problem is usually one layer deeper than the SQL Server error log.
Our remote DBA team has seen most of the ways Always On AGs fail in production — lease timeout loops caused by antivirus scanning the SQL data files, cloud witness failures from expired storage account keys, split-brain scenarios from misconfigured subnets. We diagnose, fix, and document so the same failure doesn't repeat.
If your AG just failed over and you're not sure why, or if it's refusing to fail over when you need it to, our HA/DR service is built exactly for this. Start with a free database assessment — we'll review your AG configuration and tell you exactly where the risks are.