Troubleshooting REDO queue or Recovery queue build-up on Always On Readable Secondary Replica due to Blocked REDO thread


-> REDO Queue size has reached 12 GB and estimated data loss seems to be close to 22 Minutes.

-> Executed below query to check the REDO thread sessionid on Secondary replica,

SELECT db_name(database_id) as DBName,
session_id FROM sys.dm_exec_requests
WHERE command = 'DB STARTUP'

-> Checked on the secondary if there were any sessions blocking the redo thread,

select * from sysprocesses where blocked>0

-> It seems like SPID 172 is blocking the redo thread. SPID 172 is a select statement. I tried terminating spid 172, but was able to see another select statement taking up its place.

-> There were several Read-Only queries running on secondary replica and these select queries seems to be blocking the redo thread.

-> On the primary replica, Application team is executing a huge write query on a object called Table that is pushing the REDO thread on secondary replica to work hard and harden the data to database.

-> The redo thread on Secondary replica acquires schema modification (Sch-M) locks on the secondary database to make any DDL changes that are performed on primary replica. On the secondary replica, the read-only queries acquire schema stability (Sch-S) locks to process the query. In our case the redo thread is not able to acquire schema modification (Sch-M) locks as the select queries has already acquired schema stability (Sch-S) locks and this is the reason why we were experiencing blocking.

-> We decided to stop the Read-only queries on secondary replica as the operation on primary was more important and also dont want to get into a data loss situation in an event a failover is required.

-> Always on availability group is configured with “Readable Secondary” set to Yes,

-> We decided to change “Readable Secondary” to NO for time being till the REDO queue catches up.

-> Once above change was completed, as expected there were no select queries on secondary replica and the REDO thread started to move forward and there were no blockings.

-> Downside to this change is that all currently executing select queries on the secondary replica will be terminated and also any new read-only workload will have to utilize the primary replica and this could increase the load on Primary replica. All these risks were communicated with application team and a decision was taken to change “Readable Secondary” to NO.

-> Within 15 minutes Redo queue size was under control and we changed “Readable Secondary” back to YES as the write operation on Primary replica also completed fine.

Thank You,
Vivek Janakiraman

The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.

Leave a Reply