SQL Server Availability Group Monitoring

I needed to monitor the synchronization status of SQL Server Always On Availability Group databases — but with a constraint: the monitoring agent could only connect to a secondary replica, not the primary. This sounds like it should be simple. After all, the replica knows whether it’s in sync, right? Mostly, yes — but there’s a gotcha that cost me some time.

The goal Link to heading

The objective was straightforward: build a health check that answers two questions from the replica’s perspective:

Is each database healthy? Is it synchronized, synchronizing, or broken?
How far behind is the replica? If replication is lagging, by how much?

All the information lives in one DMV: sys.dm_hadr_database_replica_states. The challenge is knowing which columns return useful data when you’re querying from the secondary.

Step 1: Synchronization state & health Link to heading

The first question is easy to answer. These columns are populated on both the primary and secondary:

SELECT
    DB_NAME(database_id) as [name],
    [synchronization_state_desc],
    [synchronization_health_desc],
    [database_state_desc]
FROM
    sys.dm_hadr_database_replica_states;

name	synchronization_state_desc	synchronization_health_desc	database_state_desc
DB1	SYNCHRONIZED	HEALTHY	ONLINE
DB2	SYNCHRONIZING	PARTIALLY_HEALTHY	RECOVERY
DB3	NOT SYNCHRONIZING	NOT_HEALTHY	EMERGENCY

synchronization_state: Data-movement state — tells you if the replica is actively receiving and applying log records. SYNCHRONIZED means fully caught up; SYNCHRONIZING means actively receiving changes; NOT SYNCHRONIZING indicates data movement has stopped.
synchronization_health: A roll-up state that combines the synchronization state with the availability mode of the replica — useful for alerting. HEALTHY is the target; PARTIALLY_HEALTHY or NOT_HEALTHY warrant investigation.
database_state: The database state itself. Expect ONLINE under normal operation.

This tells you if something is wrong, but not how far behind the replica is. For that, you need timing data.

Step 2: Timing data Link to heading

The same DMV exposes several timestamps that track when log blocks move through the replication pipeline:

SELECT
    DB_NAME(database_id) as [name],
    [last_sent_time],
    [last_received_time],
    [last_commit_time]
FROM
    sys.dm_hadr_database_replica_states;

name	last_sent_time	last_received_time	last_commit_time
DB1	2022-06-15 11:47:34.317	2022-06-15 11:47:34.317	2022-06-15 11:05:26.687
DB2	2022-06-15 11:47:34.180	2022-06-15 11:47:34.180	2022-06-15 11:05:28.533
DB3	2022-06-15 11:47:35.020	2022-06-15 11:47:35.020	2022-06-15 11:47:34.827

last_sent_time: When the last log block was sent
last_received_time: When the log block ID in the last message received was read on the secondary replica
last_commit_time: The time corresponding to the last commit record. On a secondary replica, this trails the primary’s commit time by the amount of replication lag.

The key insight: last_received_time is populated on the secondary and tells you exactly when the replica last heard from the primary. By comparing it to GETDATE(), you can calculate how stale the replica is — without ever talking to the primary.

See the full column reference in the sys.dm_hadr_database_replica_states documentation.

The gotcha: secondary_lag_seconds Link to heading

Question

What about secondary_lag_seconds?

This was my first instinct. The documentation describes it as:

The number of seconds that the secondary replica is behind the primary replica during synchronization.

That sounds like the perfect metric. One column, already computed, measured in seconds. Except — and the documentation does mention this — the value is NULL on a secondary replica. It’s only populated when you query the DMV from the primary. If you can only connect to the replica, secondary_lag_seconds is useless.

Putting it all together Link to heading

With that dead end behind me, I combined the health check from Step 1 with the timing trick from Step 2 into a single query:

SELECT
    DB_NAME(database_id) as [name],
    synchronization_health_desc,
    DATEDIFF(MILLISECOND,last_received_time,GETDATE()) as [last_received_ms],
    [last_commit_time]
FROM
    sys.dm_hadr_database_replica_states;

This gives the monitoring agent everything it needs from a single connection to the replica:

name: Which database
synchronization_health_desc: Is the AG healthy — any value other than HEALTHY should trigger an alert
last_received_ms: How many milliseconds have passed since the replica last received data from the primary — a practical measure of replication lag. In practice, values under a few hundred milliseconds are normal for synchronous-commit mode; a steadily growing value indicates the replica is falling behind.
last_commit_time: When the last transaction was committed on the primary — useful for correlating with application activity

The last_received_ms calculation is the workhorse. Pair it with synchronization_health_desc and you have both a qualitative state and a quantitative lag metric — all from the secondary. This query is lightweight enough to poll on a short interval (e.g., every 30 seconds) and feed into your monitoring system of choice.