Select Query on Secondary Replica getting blocked with wait type HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING

Environment

-> Application team complained of simple select queries getting blocked. They advised that select queries that return a single row are running indefinitely.

-> These select queries are run against secondary replica. My first guess was that the REDO thread might be blocking these select queries.

-> I tried below query on secondary replica to see if there were any blockings and to my surprise there were no blockings.

select * from sysprocesses where blocked<>0

-> Application team advised that things were working fine before 30 minutes when they were executing the same select queries on primary replica. They are experiencing this issue when they started executing these select queries on secondary replica.

-> This is when I realized that “Readable Secondary” property was set to “Yes” recently as we wanted to move all Read-Only workloads to Secondary replica.

-> I tried executing a simple select that the application team is executing and was able to reproduce the behavior. My session_id was 60. I executed below query to check what this session was doing,

select DB_name(database_id) DBName, session_id,status, wait_type,last_wait_type,blocking_session_id 
from sys.dm_exec_requests where session_id =60

-> Wait type “HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING” happens when a query on object(s) in a readable secondary database of an Always On availability group is blocked on row versioning while waiting for commit or rollback of all transactions that were in-flight when the secondary replica was enabled for read workloads. This wait type guarantees that row versions are available before execution of a query under snapshot isolation. Please check this article for more details.

-> It is clear that the queries on secondary replica will be blocked when all active transactions active on primary replica at the time of enabling readable secondary have been committed and processed on the secondary replica.

-> I communicated this to the application team and understood from them that there are some active updates executing on Primary replica. I check the primary replica and was able to see the update statement active.

-> We waited for the update to complete and made sure the databases as part of Availability group are synchronized. After this we were able to execute the select statement without any issues.

-> From this instance we started enabling “Readable Secondary” during after business hours or during few/NO user activity.

Hope this helps!!!

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.

Troubleshooting REDO queue or Recovery queue build-up on Always On Readable Secondary Replica due to Blocked REDO thread

Environment

-> REDO Queue size has reached 12 GB and estimated data loss seems to be close to 22 Minutes.

-> Executed below query to check the REDO thread sessionid on Secondary replica,


SELECT db_name(database_id) as DBName,
session_id FROM sys.dm_exec_requests
WHERE command = 'DB STARTUP'
GO

-> Checked on the secondary if there were any sessions blocking the redo thread,

select * from sysprocesses where blocked>0

-> It seems like SPID 172 is blocking the redo thread. SPID 172 is a select statement. I tried terminating spid 172, but was able to see another select statement taking up its place.

-> There were several Read-Only queries running on secondary replica and these select queries seems to be blocking the redo thread.

-> On the primary replica, Application team is executing a huge write query on a object called Table that is pushing the REDO thread on secondary replica to work hard and harden the data to database.

-> The redo thread on Secondary replica acquires schema modification (Sch-M) locks on the secondary database to make any DDL changes that are performed on primary replica. On the secondary replica, the read-only queries acquire schema stability (Sch-S) locks to process the query. In our case the redo thread is not able to acquire schema modification (Sch-M) locks as the select queries has already acquired schema stability (Sch-S) locks and this is the reason why we were experiencing blocking.

-> We decided to stop the Read-only queries on secondary replica as the operation on primary was more important and also dont want to get into a data loss situation in an event a failover is required.

-> Always on availability group is configured with “Readable Secondary” set to Yes,

-> We decided to change “Readable Secondary” to NO for time being till the REDO queue catches up.

-> Once above change was completed, as expected there were no select queries on secondary replica and the REDO thread started to move forward and there were no blockings.

-> Downside to this change is that all currently executing select queries on the secondary replica will be terminated and also any new read-only workload will have to utilize the primary replica and this could increase the load on Primary replica. All these risks were communicated with application team and a decision was taken to change “Readable Secondary” to NO.

-> Within 15 minutes Redo queue size was under control and we changed “Readable Secondary” back to YES as the write operation on Primary replica also completed fine.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.