In this blog post, we will discuss an issue encountered with an Always On Availability Group where a database remained stuck in the Initializing / In Recovery state on the secondary replica after a failover. We will explore the steps taken to resolve the problem and provide insights into the underlying causes.
Issue
During a failover from the primary to the secondary replica in an Always On Availability Group, one particular database entered the Initializing / In Recovery state on the secondary replica. Despite restarting SQL services and attempting to suspend data movement, the database remained stuck in this state.
Resolution
Logging off users and restarting SQL services: Initially, all users were logged off from the secondary replica, and the SQL services on the replica were restarted. However, these steps did not resolve the issue.
Suspending data movement: The next approach involved suspending the data movement of the problematic database from the SQL Server Management Studio on the secondary replica. However, this operation resulted in a “lock request time out period exceeded” error.
Identifying the blocking process: To investigate further, a query was executed to identify any blocking processes using the following query: SELECT * FROM sysprocesses WHERE blocked > 0 It was discovered that the DB Startup thread for the specific database was blocking the SPID attempting to suspend data movement.
Checking recovery status: The SQL Server error log was examined to determine if the database’s recovery had completed. The log indicated that recovery for the database had finished in 43 seconds, with no user action required. Recovery completed for database JB_DB (database ID 7) in 43 second(s) (analysis 64 ms, redo 324 ms, undo 0 ms.) This is an informational message only. No user action is required.
Analyzing DB Startup thread wait type: Monitoring the DB Startup thread revealed that it was in a wait state with the HADR_RECOVERY_WAIT_FOR_UNDO type. This wait type occurs when the secondary database is waiting for the reverting and initializing phases to synchronize with the primary database after failover.
Monitoring undo progress: The “Log remaining for undo” performance monitor counter was used to track the amount of log in kilobytes remaining for the undo phase. Surprisingly, the counter showed 0, indicating that there was nothing to undo, and the recovery completed message in the SQL Server error log confirmed that the undo took 0 milliseconds.
Failing over to another synchronous secondary replica: To further troubleshoot the issue, a failover was performed from the primary replica to another synchronous secondary replica. This time, all databases returned to a synchronized state without any issues.
Summary
The issue of a database stuck in the Initializing / In Recovery state after an Always On Availability Group failover was resolved by failing over to another synchronous secondary replica. Although the root cause and exact timing of the issue remain unknown, this blog post aims to provide guidance and assistance to others who may encounter similar challenges within the community.
Thank You, Vivek Janakiraman
Disclaimer: The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.
-> Check this article for more details on high Log Send Queue on Asynchronous Replica.
-> Log Send Queue shows the amount of log records in the log files of the primary database that has not yet been sent to the secondary replica. This value is sent to the secondary replica from the primary replica. Queue size does not include FILESTREAM files that are sent to a secondary.
-> I faced this issue recently on a Synchronous replica. We are aware of the reason why the log send queue increased, that was due to an Alter Index job as part of weekly maintenance. But we wanted to understand if we can do to speed up the process. It all depends on the network link and if we have reached the maximum limit, then there is not much we can do here other than waiting.
-> I tried below actions,
Restart Always on Endpoints on Secondary and then on Primary
ALTER ENDPOINT endpoint_name STATE = STOPPED
ALTER ENDPOINT endpoint_name STATE = STARTED
-> After above action, Log send queue gradually decreased and Always On database came to sync in sometime.
Thank You, Vivek Janakiraman
Disclaimer: The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.
-> Availability group secondary database was “Not Synchronizing”.
-> Below is the view of Always On Dashboard,
-> SQL Server Error log
JBSAG1
2022-06-14 05:08:04.130 spid11s Always On: DebugTraceVarArgs AR ‘[HADR] [Primary] operation on replicas [CF81C5D4-0DBF-4573-AB9D-018423FB92B2]->[0E5B3832-3DDA-41CB-9730-0F119639B535], database [JBSAG12], remote endpoint [TCP://JBSAG2.JBSWIKI.com:5022], source operation [708829B5-E20D-4112-9D88-050797FE9D91]: Transitioning from [PENDING] to [CHECK_IF_SEEDING_NEEDED].’ 2022-06-14 05:08:04.130 spid44s A connection for availability group ‘JBAG12’ from availability replica ‘JBSAG1’ with id [CF81C5D4-0DBF-4573-AB9D-018423FB92B2] to ‘JBSAG2’ with id [0E5B3832-3DDA-41CB-9730-0F119639B535] has been successfully established. This is an informational message only. No user action is required. 2022-06-14 05:08:04.130 spid37s Always On Availability Groups connection with secondary database established for primary database ‘JBSAG12’ on the availability replica ‘JBSAG2’ with Replica ID: {0e5b3832-3dda-41cb-9730-0f119639b535}. This is an informational message only. No user action is required. 2022-06-14 05:08:04.130 spid68s Always On: DebugTraceVarArgs AR ‘[HADR] [Primary] operation on replicas [CF81C5D4-0DBF-4573-AB9D-018423FB92B2]->[0E5B3832-3DDA-41CB-9730-0F119639B535], database [JBSAG12], remote endpoint [TCP://JBSAG2.JBSWIKI.com:5022], source operation [708829B5-E20D-4112-9D88-050797FE9D91]: Transitioning from [CHECK_IF_SEEDING_NEEDED] to [CATCHUP].’ 2022-06-14 05:08:04.150 spid68s Always On: DebugTraceVarArgs AR ‘[HADR] [Primary] operation on replicas [CF81C5D4-0DBF-4573-AB9D-018423FB92B2]->[0E5B3832-3DDA-41CB-9730-0F119639B535], database [JBSAG12], remote endpoint [TCP://JBSAG2.JBSWIKI.com:5022], source operation [708829B5-E20D-4112-9D88-050797FE9D91]: Transitioning from [CATCHUP] to [COMPLETED].’ 2022-06-14 05:08:04.170 spid37s [DbMgrPartnerCommitPolicy::SetReplicaInfoInAg] DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint: 0E5B3832-3DDA-41CB-9730-0F119639B535:6:4 2022-06-14 05:08:04.170 spid37s DbMgrPartnerCommitPolicy::SetSyncState: 0E5B3832-3DDA-41CB-9730-0F119639B535:6:4 2022-06-14 05:08:41.350 spid68s A connection timeout has occurred on a previously established connection to availability replica ‘JBSAG2’ with id [0E5B3832-3DDA-41CB-9730-0F119639B535]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role. 2022-06-14 05:08:41.350 spid68s Always On Availability Groups connection with secondary database terminated for primary database ‘JBSAG12’ on the availability replica ‘JBSAG2’ with Replica ID: {0e5b3832-3dda-41cb-9730-0f119639b535}. This is an informational message only. No user action is required. 2022-06-14 05:08:41.360 spid83s [DbMgrPartnerCommitPolicy::SetReplicaInfoInAg] DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint: 0E5B3832-3DDA-41CB-9730-0F119639B535:6:1 2022-06-14 05:08:41.360 spid83s DbMgrPartnerCommitPolicy::SetSyncState: 0E5B3832-3DDA-41CB-9730-0F119639B535:6:1
JBSAG2
2022-06-14 05:07:07.030 spid76s A connection timeout has occurred on a previously established connection to availability replica ‘JBSAG1’ with id [CF81C5D4-0DBF-4573-AB9D-018423FB92B2]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role. 2022-06-14 05:07:07.030 spid76s Always On Availability Groups connection with primary database terminated for secondary database ‘JBSAG12’ on the availability replica ‘JBSAG1’ with Replica ID: {cf81c5d4-0dbf-4573-ab9d-018423fb92b2}. This is an informational message only. No user action is required. 2022-06-14 05:07:17.040 spid76s A connection timeout has occurred while attempting to establish a connection to availability replica ‘JBSAG1’ with id [CF81C5D4-0DBF-4573-AB9D-018423FB92B2]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance. 2022-06-14 05:08:04.130 spid31s A connection for availability group ‘JBAG12’ from availability replica ‘JBSAG2’ with id [0E5B3832-3DDA-41CB-9730-0F119639B535] to ‘JBSAG1’ with id [CF81C5D4-0DBF-4573-AB9D-018423FB92B2] has been successfully established. This is an informational message only. No user action is required. 2022-06-14 05:08:04.130 spid12s Always On: DebugTraceVarArgs AR ‘[HADR] [Secondary] operation on replicas [CF81C5D4-0DBF-4573-AB9D-018423FB92B2]->[0E5B3832-3DDA-41CB-9730-0F119639B535], database [JBSAG12], remote endpoint [TCP://JBSAG1.JBSWIKI.com:5022], source operation [708829B5-E20D-4112-9D88-050797FE9D91]: Transitioning from [PENDING] to [CHECK_IF_SEEDING_NEEDED].’ 2022-06-14 05:08:04.130 spid85s Error: 41145, Severity: 16, State: 1. 2022-06-14 05:08:04.130 spid85s Cannot join database ‘JBSAG12’ to availability group ‘JBAG12’. The database has already joined the availability group. This is an informational message. No user action is required. 2022-06-14 05:08:04.130 spid85s Always On: DebugTraceVarArgs AR ‘[HADR] [Secondary] operation on replicas [CF81C5D4-0DBF-4573-AB9D-018423FB92B2]->[0E5B3832-3DDA-41CB-9730-0F119639B535], database [JBSAG12], remote endpoint [TCP://JBSAG1.JBSWIKI.com:5022], source operation [708829B5-E20D-4112-9D88-050797FE9D91]: Transitioning from [CHECK_IF_SEEDING_NEEDED] to [CATCHUP].’ 2022-06-14 05:08:04.130 spid85s Always On: DebugTraceVarArgs AR ‘Processing BuildReplicaCatchup event with HADR Role: [SECONDARY]’ 2022-06-14 05:08:04.130 spid85s Always On: DebugTraceVarArgs AR ‘[HADR] [Secondary] operation on replicas [CF81C5D4-0DBF-4573-AB9D-018423FB92B2]->[0E5B3832-3DDA-41CB-9730-0F119639B535], database [JBSAG12], remote endpoint [TCP://JBSAG1.JBSWIKI.com:5022], source operation [708829B5-E20D-4112-9D88-050797FE9D91]: Transitioning from [CATCHUP] to [COMPLETED].’ 2022-06-14 05:08:04.130 spid31s Always On Availability Groups connection with primary database established for secondary database ‘JBSAG12’ on the availability replica ‘JBSAG1’ with Replica ID: {cf81c5d4-0dbf-4573-ab9d-018423fb92b2}. This is an informational message only. No user action is required. 2022-06-14 05:08:04.130 spid31s The recovery LSN (37:1200:1) was identified for the database with ID 6. This is an informational message only. No user action is required. 2022-06-14 05:08:32.050 spid52 The Database Mirroring endpoint has stopped listening for connections. 2022-06-14 05:08:34.070 spid52 The Database Mirroring endpoint is in disabled or stopped state. 2022-06-14 05:08:34.070 spid77s The connection between server instances ‘JBSAG2’ with id [0E5B3832-3DDA-41CB-9730-0F119639B535] and ‘JBSAG1’ with id [CF81C5D4-0DBF-4573-AB9D-018423FB92B2] has been disabled because the database mirroring endpoint was either disabled or stopped. Restart the endpoint by using the ALTER ENDPOINT Transact-SQL statement with STATE = STARTED. 2022-06-14 05:08:34.080 spid77s Always On Availability Groups connection with primary database terminated for secondary database ‘JBSAG12’ on the availability replica ‘JBSAG1’ with Replica ID: {cf81c5d4-0dbf-4573-ab9d-018423fb92b2}. This is an informational message only. No user action is required. 2022-06-14 05:09:09.370 spid84 ALTER DB param option: SUSPEND 2022-06-14 05:09:09.370 spid84 Always On Availability Groups data movement for database ‘JBSAG12’ has been suspended for the following reason: “user” (Source ID 0; Source string: ‘SUSPEND_FROM_USER’). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online. 2022-06-14 05:09:26.920 spid81 ALTER DB param option: RESUME 2022-06-14 05:09:26.920 spid81 Always On Availability Groups data movement for database ‘JBSAG12’ has been resumed. This is an informational message only. No user action is required.
-> From the SQL Server error log on secondary JBSAG2, we can see that mirroring endpoint is stopped.
-> Lets check the DMV’s for more details,
select r.replica_server_name, r.endpoint_url,
rs.connected_state_desc, rs.last_connect_error_description,
rs.last_connect_error_number, rs.last_connect_error_timestamp
from sys.dm_hadr_availability_replica_states rs join sys.availability_replicas r
on rs.replica_id=r.replica_id
JBSAG1
Connection attempt failed with error: ‘10060(A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)’.
JBSAG2
The Database Mirroring endpoint is in disabled or stopped state.
-> Lets try starting the mirroring endpoint using below command on Secondary JBSAG2,
ALTER ENDPOINT [Hadr_endpoint] STATE = STARTED
-> Once above command completes, I can see databases on Availability group as synchronized,
Thank You, Vivek Janakiraman
Disclaimer: The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.