It's 3:14am. Your phone is ringing. The primary replica just went offline — hardware fault, bad NIC, nobody knows yet. In a well-configured Always On Availability Group, the secondary is promoting itself right now. Failover is automatic, the application reconnects via the listener, and the on-call engineer's job is damage assessment, not emergency recovery. In a poorly configured AG, you're staring at a split-brain scenario with no quorum, a secondary that hasn't received the last 900 transactions, and a synchronization mode that was set to asynchronous because someone thought it was faster.
Always On Availability Groups are the most powerful high-availability feature SQL Server has shipped. They're also responsible for some of the most painful production incidents I've seen — because teams deploy them, check 'HA' off the list, and never think about them again until something breaks. This post is what I wish every client had read before their first real failover.
What Always On AGs Actually Do (and What They Don't)
An Availability Group is a set of user databases that fail over together as a unit. The group runs on two or more SQL Server instances — replicas — and one replica is designated primary at any given time. The primary accepts read/write connections. Secondary replicas receive a continuous stream of transaction log records from the primary and apply them to synchronized copies of the databases.
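To see those roles on a live system, a query along these lines can help. It is a minimal sketch that uses only the built-in AG catalog views and DMVs. Run it on the primary for the full picture, since a secondary only reports state for itself (hence the LEFT JOIN).

```sql
-- Lists each replica of each AG known to this instance, with its current role
-- and its configured availability and failover modes.
SELECT
    ag.name                          AS ag_name,
    ar.replica_server_name,
    ars.role_desc,                   -- PRIMARY or SECONDARY (NULL when state is not visible from here)
    ar.availability_mode_desc,       -- SYNCHRONOUS_COMMIT or ASYNCHRONOUS_COMMIT
    ar.failover_mode_desc,           -- AUTOMATIC or MANUAL
    ars.connected_state_desc,
    ars.synchronization_health_desc
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
    ON ar.group_id = ag.group_id
LEFT JOIN sys.dm_hadr_availability_replica_states AS ars
    ON ars.replica_id = ar.replica_id
ORDER BY ag.name, ar.replica_server_name;
```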
What AGs give you:
- Automatic failover — when configured correctly with a Windows Server Failover Cluster (WSFC) and synchronous replication, the secondary can promote automatically in seconds without data loss
- Readable secondaries — offload read-heavy reporting workloads to secondary replicas, reducing load on the primary
- Listener-based connectivity — applications connect to a virtual network name (the listener), not a specific server. Failover is transparent to the connection string
- Multi-subnet support — replicas in different physical sites or Azure regions for disaster recovery
What AGs don't give you: protection against data corruption, accidental deletes, or bad application logic. The secondary faithfully replicates everything the primary does — including a DROP TABLE run by mistake. That's what backups are for. AGs are not a backup strategy.
Synchronous vs. Asynchronous: The Decision That Matters Most
This is where most teams make their first — and worst — configuration mistake.
Synchronous commit means the primary doesn't confirm a commit until every synchronous secondary that is currently synchronized has hardened the log record to disk and acknowledged it. You get zero data loss on failover to a synchronized secondary. You pay for it in latency: every transaction waits for the round-trip to the secondary.
Asynchronous commit means the primary commits immediately and ships the log in the background. No latency impact. But on failover, the secondary may be behind — sometimes by seconds, sometimes by minutes depending on network throughput and transaction volume. That gap is data you've lost.
The right choice:
- Same datacenter or low-latency LAN: synchronous commit on all replicas. The round-trip is under 1ms. The latency hit is negligible. The zero-data-loss guarantee is worth it.
- Cross-datacenter or WAN replica for DR: asynchronous commit on the remote replica. You accept the RPO risk. Make sure stakeholders know the number.
- Never run synchronous commit to a replica with high network latency. I've seen primary databases grind to near-zero throughput because a synchronous secondary in a remote datacenter had an 80ms round-trip time.
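As a rough sketch of how those choices translate into T-SQL, the statements below set one local replica to synchronous commit with automatic failover and one remote replica to asynchronous commit. The AG name `MyAg` and the replica names `SQLNODE2` and `SQLDR1` are hypothetical placeholders; substitute your own.

```sql
-- Run on the current primary. [MyAg], SQLNODE2 and SQLDR1 are hypothetical names.

-- Local replica: synchronous commit first, then automatic failover
-- (automatic failover is only allowed on a synchronous-commit replica).
ALTER AVAILABILITY GROUP [MyAg]
    MODIFY REPLICA ON N'SQLNODE2'
    WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

ALTER AVAILABILITY GROUP [MyAg]
    MODIFY REPLICA ON N'SQLNODE2'
    WITH (FAILOVER_MODE = AUTOMATIC);

-- Remote DR replica: switch to manual failover first, then to asynchronous commit.
ALTER AVAILABILITY GROUP [MyAg]
    MODIFY REPLICA ON N'SQLDR1'
    WITH (FAILOVER_MODE = MANUAL);

ALTER AVAILABILITY GROUP [MyAg]
    MODIFY REPLICA ON N'SQLDR1'
    WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);
```

Each `MODIFY REPLICA` statement changes one option at a time, which is why the mode changes are split into separate statements.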
Quorum: The Thing That Will Actually Split-Brain You
On Windows, Always On AGs that provide automatic failover require a Windows Server Failover Cluster, in both Standard edition (where you get Basic Availability Groups) and Enterprise edition. The WSFC uses a quorum model: a majority of voting nodes must be online for the cluster to function. Get this wrong and your 'automatic failover' does nothing.
The critical rule: never run a two-node cluster without a quorum witness. Without a witness, each node is one vote. Lose one node and the surviving node has 50% — no majority, no cluster function.
Quorum witness options:
- File Share Witness — a network share on a third server. Simple, works well for on-prem two-node clusters.
- Cloud Witness — an Azure Storage blob acts as the witness. Works across datacenters. My recommendation for any cluster that needs to survive a site failure.
- Disk Witness — a shared disk in the cluster. Legacy option. Use file share or cloud witness instead.
After configuring your witness, test quorum by simulating node failures. Don't discover your witness is misconfigured during an actual outage.
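Before and after each simulated failure, it can help to look at quorum from SQL Server's point of view. The following is a minimal sketch using the built-in cluster DMVs. It only reports state; the failure simulation itself happens at the cluster level.

```sql
-- Cluster-level view: which quorum model is in use and whether quorum is currently held.
SELECT
    cluster_name,
    quorum_type_desc,
    quorum_state_desc
FROM sys.dm_hadr_cluster;

-- Member-level view: every node and witness, whether it is up, and how many votes it carries.
SELECT
    member_name,
    member_type_desc,          -- cluster node, disk witness, file share witness, and so on
    member_state_desc,         -- UP or DOWN
    number_of_quorum_votes
FROM sys.dm_hadr_cluster_members
ORDER BY member_type_desc, member_name;
```

In a two-node cluster with a witness you would expect three rows in the second result, each with one vote. A missing witness row is the misconfiguration to catch before an outage does it for you.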
Endpoint Configuration: The Silent Killer
Replicas communicate over dedicated database mirroring endpoints — TCP listeners on a specific port (default 5022). These endpoints must be:
- Listening on the correct IP and port
- Granted CONNECT permission to the service account of each replica
- Open through all firewalls between replicas
- Using certificate-based authentication if replicas are in different domains or workgroups
The endpoint permission problem is the most common silent failure I see. Check the SQL Server error log on the replica — it's almost always a login failed or permission denied on the mirroring endpoint. The fix is one line:
GRANT CONNECT ON ENDPOINT::Hadr_endpoint TO [domain\sqlagent_account];
Run it on every replica for every service account that needs to connect. Then verify connectivity from each replica to each other with a telnet or Test-NetConnection on port 5022 before you declare the AG healthy.
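To check the SQL Server side of that list, something like the following sketch can be run on each replica. It relies only on built-in catalog views. Note that it shows explicit grants only, so access that comes from endpoint ownership or sysadmin membership will not appear here.

```sql
-- Is the mirroring endpoint started, and on which IP and port is it listening?
SELECT
    e.name                   AS endpoint_name,
    e.state_desc,            -- should be STARTED
    e.role_desc,
    e.connection_auth_desc,  -- Windows (NTLM/Kerberos/Negotiate) or certificate authentication
    t.ip_address,
    t.port
FROM sys.database_mirroring_endpoints AS e
JOIN sys.tcp_endpoints AS t
    ON t.endpoint_id = e.endpoint_id;

-- Which logins have been explicitly granted (or denied) permissions on that endpoint?
SELECT
    e.name                   AS endpoint_name,
    p.name                   AS grantee,
    sp.permission_name,      -- look for CONNECT
    sp.state_desc            -- GRANT or DENY
FROM sys.server_permissions AS sp
JOIN sys.database_mirroring_endpoints AS e
    ON e.endpoint_id = sp.major_id
JOIN sys.server_principals AS p
    ON p.principal_id = sp.grantee_principal_id
WHERE sp.class_desc = N'ENDPOINT';
```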
Seeding Mode: Automatic vs. Manual
When you add a database to an AG, the secondary needs an initial copy of the data. SQL Server 2016 introduced automatic seeding, which streams the database directly from primary to secondary without requiring you to manually restore a backup. It's elegant for small-to-medium databases.
For large databases (multi-terabyte), think carefully. Automatic seeding streams over the mirroring endpoint at whatever speed your network allows. For those, manual seeding is usually the safer route:
- Take a full backup (and log backup) of the primary
- Restore with NORECOVERY on the secondary
- Add the database to the AG using manual seeding
- The AG picks up synchronization from the log backup point
This adds steps but keeps your network traffic predictable and your production workload unaffected.
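The manual seeding steps above can be sketched roughly as follows. The database `SalesDb`, the AG `MyAg`, and the backup share path are hypothetical placeholders. The sketch also assumes the database uses the full recovery model and that the secondary has the same drive layout as the primary (otherwise the restore needs `MOVE` clauses). Each section runs on a different server, as the comments indicate.

```sql
-- 1. On the PRIMARY: full backup plus a log backup (hypothetical share path).
BACKUP DATABASE [SalesDb]
    TO DISK = N'\\backupshare\ag\SalesDb_full.bak'
    WITH COMPRESSION, CHECKSUM;

BACKUP LOG [SalesDb]
    TO DISK = N'\\backupshare\ag\SalesDb_log.trn'
    WITH COMPRESSION, CHECKSUM;

-- 2. On the SECONDARY: restore both WITH NORECOVERY so the database stays in the restoring state.
RESTORE DATABASE [SalesDb]
    FROM DISK = N'\\backupshare\ag\SalesDb_full.bak'
    WITH NORECOVERY;

RESTORE LOG [SalesDb]
    FROM DISK = N'\\backupshare\ag\SalesDb_log.trn'
    WITH NORECOVERY;

-- 3. On the PRIMARY: add the database to the availability group.
ALTER AVAILABILITY GROUP [MyAg] ADD DATABASE [SalesDb];

-- 4. On the SECONDARY: join the restored copy to the group; synchronization starts from here.
ALTER DATABASE [SalesDb] SET HADR AVAILABILITY GROUP = [MyAg];
```

If regular log backups keep running on the primary while the full backup is being copied and restored, every log backup taken in the meantime also has to be restored on the secondary, in order, before step 4.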
Readable Secondaries: What Nobody Tells You About the Version Store
Secondary replicas that accept read-only connections use a snapshot-based isolation mechanism. What teams don't realize: the version store that powers this lives in tempdb on the secondary. Heavy read workloads on the secondary generate version store activity in tempdb. If your secondary's tempdb is undersized or on slow storage, readable secondary performance will surprise you.
Monitor Version Store Size (KB) in sys.dm_os_performance_counters on your readable secondaries.
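A minimal sketch of that check, run on the readable secondary itself:

```sql
-- Current size of the tempdb version store on this instance.
-- counter_name and object_name are fixed-width columns, hence LIKE and RTRIM.
SELECT
    RTRIM(object_name)    AS object_name,
    RTRIM(counter_name)   AS counter_name,
    cntr_value            AS version_store_kb,
    cntr_value / 1024.0   AS version_store_mb
FROM sys.dm_os_performance_counters
WHERE counter_name LIKE N'Version Store Size (KB)%'
  AND object_name  LIKE N'%:Transactions%';
```

Sampling this on a schedule and comparing it with tempdb's free space gives a trend line; a single reading says little on its own.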
Pre-Deployment Checklist
Before you go live with an Always On AG, verify every item on this list:
- Synchronous commit configured for same-datacenter replicas; async documented and RPO accepted for remote replicas
- Quorum witness configured and tested
- Mirroring endpoints created on all replicas, correct port open through firewalls
- CONNECT granted on endpoints for all service accounts on all replicas
- Listener created with correct IP and subnet mask for each network
- Application connection strings updated to target the listener name, not individual server names
- MultiSubnetFailover=True in connection strings if replicas span multiple subnets
- Backup preferences configured (prefer secondary for log backups)
- Initial seeding complete and all databases in Synchronized state
- Failover tested — manually initiated, verified application reconnected via listener, failed back
- Secondary tempdb sized appropriately if readable secondaries are enabled
- Monitoring configured for synchronization health, send/redo queue depths, and failover events
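The last item is the one skipped most often. Alert on both queue depths: a growing send queue means log is piling up on the primary that the secondary hasn't received yet, which is your RPO degrading in real time, and a growing redo queue means the secondary has the log but is slow to apply it, which stretches how long a failover takes.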
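As a starting point for that monitoring, the queue depths can be read from the built-in DMV on the primary. This is a sketch of the query only; thresholds and alert wiring are left to whatever monitoring tool you already use.

```sql
-- Run on the PRIMARY: one row per database per secondary replica.
SELECT
    ar.replica_server_name,
    DB_NAME(drs.database_id)     AS database_name,
    drs.synchronization_state_desc,
    drs.synchronization_health_desc,
    drs.log_send_queue_size,     -- KB of log not yet sent to this secondary
    drs.log_send_rate,           -- KB/sec
    drs.redo_queue_size,         -- KB of log received but not yet redone
    drs.redo_rate,               -- KB/sec
    drs.last_commit_time
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
WHERE drs.is_local = 0           -- skip the primary's own rows
ORDER BY drs.log_send_queue_size DESC, drs.redo_queue_size DESC;
```

Dividing each queue size by its rate gives a rough estimate, in seconds, of how long the secondary would need to catch up at the current pace.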
Common Pitfalls in Production
Not testing failover before go-live. Configure the AG, then immediately initiate a manual failover. Verify the listener redirects correctly. Fail back. Do this in staging before you do it in production.
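A planned manual failover for that test can be sketched as below. `MyAg` is a hypothetical AG name, and the sketch assumes the replica's name in the AG matches `@@SERVERNAME`. Both statements run on the secondary you intend to promote.

```sql
-- 1. On the TARGET SECONDARY: confirm every database in the group is ready to fail over
--    without data loss (is_failover_ready = 1).
SELECT
    ag.name                AS ag_name,
    drcs.database_name,
    drcs.is_failover_ready
FROM sys.dm_hadr_database_replica_cluster_states AS drcs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drcs.replica_id
JOIN sys.availability_groups AS ag
    ON ag.group_id = ar.group_id
WHERE ar.replica_server_name = @@SERVERNAME;

-- 2. Still on the TARGET SECONDARY: initiate the planned failover.
ALTER AVAILABILITY GROUP [MyAg] FAILOVER;
```

To fail back, repeat the same two steps on the original primary once it reports as synchronized.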
Connecting directly to replica names instead of the listener. Always connect through the listener. The listener is specifically designed to handle failover transparently.
Forgetting MultiSubnetFailover=True. In a multi-subnet AG, the listener has multiple IP addresses — one per subnet. Standard TCP connection behavior tries IPs sequentially, with a 21-second timeout between attempts. That's 21 seconds of application unavailability after a failover. MultiSubnetFailover=True makes the driver attempt all IPs in parallel.
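To find out whether a listener actually has more than one IP address, a query like this sketch can be run on any replica. It uses only the built-in listener catalog views.

```sql
-- One row per listener IP address. More than one row for the same dns_name
-- means a multi-subnet listener, so clients need MultiSubnetFailover=True.
SELECT
    ag.name            AS ag_name,
    l.dns_name,
    l.port,
    ip.ip_address,
    ip.ip_subnet_mask,
    ip.state_desc      -- only the IP in the current primary's subnet is ONLINE
FROM sys.availability_group_listeners AS l
JOIN sys.availability_groups AS ag
    ON ag.group_id = l.group_id
JOIN sys.availability_group_listener_ip_addresses AS ip
    ON ip.listener_id = l.listener_id
ORDER BY ag.name, l.dns_name, ip.ip_address;
```

On the client side the setting is a connection string keyword, for example `Server=tcp:aglistener,1433;Database=SalesDb;MultiSubnetFailover=True`, where the listener and database names are hypothetical.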
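Log backup jobs that scatter the log chain across replicas. Log backups taken on any replica belong to a single log chain for the database, so jobs that fire on every replica leave you hunting across servers at restore time. Pass the database name to sys.fn_hadr_backup_is_preferred_replica() in your backup scripts to run backups only on the preferred replica.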
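A minimal sketch of that guard, meant to be deployed as the same job step on every replica. The database name `SalesDb` and the backup path are hypothetical, and a real job would generate a unique file name per run instead of appending to one file.

```sql
-- Same job step on every replica: only the preferred backup replica does the work.
IF sys.fn_hadr_backup_is_preferred_replica(N'SalesDb') = 1
BEGIN
    BACKUP LOG [SalesDb]
        TO DISK = N'\\backupshare\ag\SalesDb_log.trn'   -- hypothetical path
        WITH COMPRESSION, CHECKSUM;
END
ELSE
BEGIN
    PRINT N'Not the preferred backup replica for SalesDb; skipping log backup.';
END;
```

The function returns 1 on exactly one replica according to the AG's backup preference settings, so the job can be identical everywhere and still run in only one place.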
Treating AGs as a substitute for monitoring. An AG in a 'Synchronized' state is healthy right now. It says nothing about what happens when a runaway query generates 50GB of transaction log in 10 minutes and the send queue explodes.
How ServerSide Technology Solutions Helps
Always On AG implementations that go wrong don't usually fail on configuration day. They fail six months later, at 3am, when a specific combination of quorum state, sync mode, and application behavior hits a scenario nobody planned for. The difference between a clean automatic failover and a multi-hour recovery effort is usually a few settings decisions made during initial deployment.
Our remote DBA team has designed and implemented Always On Availability Groups across dozens of SQL Server environments — two-node local clusters, multi-site DR configurations, and hybrid on-prem to Azure setups. We don't just configure the AG; we document the expected behavior for every failure scenario, test it, and make sure your team knows what to watch for.
If you're deploying Always On for the first time, upgrading from database mirroring, or inheriting an existing AG you're not confident in, our HA/DR service covers the full implementation and documentation. Start with a free database assessment to see where your current setup stands.