In this blog post, we will discuss an issue encountered with an Always On Availability Group where a database remained stuck in the Initializing / In Recovery state on the secondary replica after a failover. We will explore the steps taken to resolve the problem and provide insights into the underlying causes.
During a failover from the primary to the secondary replica in an Always On Availability Group, one particular database entered the Initializing / In Recovery state on the secondary replica. Despite restarting SQL services and attempting to suspend data movement, the database remained stuck in this state.
- Logging off users and restarting SQL services: Initially, all users were logged off from the secondary replica, and the SQL services on the replica were restarted. However, these steps did not resolve the issue.
- Suspending data movement: The next approach involved suspending the data movement of the problematic database from the SQL Server Management Studio on the secondary replica. However, this operation resulted in a “lock request time out period exceeded” error.
- Identifying the blocking process: To investigate further, a query was executed to identify any blocking processes using the following query:
SELECT * FROM sysprocesses WHERE blocked > 0
It was discovered that the DB Startup thread for the specific database was blocking the SPID attempting to suspend data movement.
- Checking recovery status: The SQL Server error log was examined to determine if the database’s recovery had completed. The log indicated that recovery for the database had finished in 43 seconds, with no user action required.
Recovery completed for database JB_DB (database ID 7) in 43 second(s) (analysis 64 ms, redo 324 ms, undo 0 ms.) This is an informational message only. No user action is required.
- Analyzing DB Startup thread wait type: Monitoring the DB Startup thread revealed that it was in a wait state with the HADR_RECOVERY_WAIT_FOR_UNDO type. This wait type occurs when the secondary database is waiting for the reverting and initializing phases to synchronize with the primary database after failover.
- Monitoring undo progress: The “Log remaining for undo” performance monitor counter was used to track the amount of log in kilobytes remaining for the undo phase. Surprisingly, the counter showed 0, indicating that there was nothing to undo, and the recovery completed message in the SQL Server error log confirmed that the undo took 0 milliseconds.
- Failing over to another synchronous secondary replica: To further troubleshoot the issue, a failover was performed from the primary replica to another synchronous secondary replica. This time, all databases returned to a synchronized state without any issues.
- Restarting SQL Server Instance on Primary Replica if there are no additional synchronous secondary replica: If there are no additional synchronous secondary replica to failover the primary replica, you need to restart the SQL Server Instance on Primary Replica a a workaround for this issue.
The issue of a database stuck in the Initializing / In Recovery state after an Always On Availability Group failover was resolved by failing over to another synchronous secondary replica.
Although the root cause and exact timing of the issue remain unknown, this blog post aims to provide guidance and assistance to others who may encounter similar challenges within the community. This problem has been identified as a bug and is documented here.
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.