Always On Availability Group in RESOLVING state

Environment

-> The application team advised that their application was not able to connect to a database that is part of an Always On availability group.

-> The Always On Availability Group was in the RESOLVING state.

-> The Always On Availability Group roles were in a Pending state in Failover Cluster Manager.
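
-> The group and replica state can also be confirmed from T-SQL. Below is a minimal sketch, assuming the standard availability group DMVs are available on the instance; when the group is in this state, role_desc for the local replica reports RESOLVING.

SELECT ag.name AS ag_name,
       ags.primary_replica,
       ags.synchronization_health_desc,
       ars.role_desc,
       ars.operational_state_desc,
       ars.connected_state_desc
FROM sys.availability_groups AS ag
JOIN sys.dm_hadr_availability_group_states AS ags
     ON ags.group_id = ag.group_id
JOIN sys.dm_hadr_availability_replica_states AS ars
     ON ars.group_id = ag.group_id;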

-> The following messages were found in the Event Viewer:

Event ID: 41144
The local availability replica of availability group ‘JBSAG’ is in a failed state. The replica failed to read or update the persisted configuration data (SQL Server error: 41005). To recover from this failure, either restart the local Windows Server Failover Clustering (WSFC) service or restart the local instance of SQL Server.

Event ID: 1205
The Cluster service failed to bring clustered role ‘JBSAG’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

Event ID: 1069
Cluster resource ‘JBSAG’ of type ‘SQL Server Availability Group’ in clustered role ‘JBSAG’ failed.
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Event ID: 7043
The Cluster Service service did not shut down properly after receiving a preshutdown control.

-> The following errors were present in the SQL Server error log:

Error: 41022, Severity: 16, State: 0.
Failed to create a Windows Server Failover Clustering (WSFC) notification port with notification filter 778567686 and notification key 3 (Error code 5073). If this is a WSFC availability group, the WSFC service may not be running or may not be accessible in its current state, or the specified arguments are invalid. Otherwise, contact your primary support provider. For information about this error code, see “System Error Codes” in the Windows Development documentation.
Always On: The availability replica manager is going offline because the local Windows Server Failover Clustering (WSFC) node has lost quorum. This is an informational message only. No user action is required.
Always On: The local replica of availability group ‘JBSAG’ is stopping. This is an informational message only. No user action is required.

Error: 41066, Severity: 16, State: 0.
Cannot bring the Windows Server Failover Clustering (WSFC) resource (ID ‘ee50bbc1-93ab-4f25-85e5-a7d245555183’) online (Error code 126). If this is a WSFC availability group, the WSFC service may not be running or may not be accessible in its current state, or the WSFC resource may not be in a state that could accept the request. Otherwise, contact your primary support provider. For information about this error code, see “System Error Codes” in the Windows Development documentation.

Error: 41160, Severity: 16, State: 0.
Failed to designate the local availability replica of availability group ‘JBSAG’ as the primary replica. The operation encountered SQL Server error 41066 and has been terminated. Check the preceding error and the SQL Server error log for more details about the error and corrective actions.

Error: 41017, Severity: 16, State: 1.
Failed to add a node to the possible owner list of a Windows Server Failover Clustering (WSFC) resource (Error code 5908). If this is a WSFC availability group, the WSFC service may not be running or may not be accessible in its current state, or the specified cluster resource or node handle is invalid. Otherwise, contact your primary support provider.

-> Cluster.log did not have many details other than the AG role failing.

-> All the messages and errors were generic, and we could not get much of a clue from them.

-> We ran Process Monitor and found that the DBA team had renamed C:\windows\system32\hadrres.dll to hadrres_old.dll on both JBSAG1 and JBSAG2 because of a patching error. They had to rename it because patching was failing with an error that hadrres.dll was in use by another process. The DBA team forgot to rename it back to hadrres.dll, and this caused the issue. We renamed the file back to hadrres.dll, and that solved the issue.
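
-> Once hadrres.dll was back in place and the cluster role came online, a quick sanity check from the SQL Server side is to confirm that the instance can see the WSFC cluster and its quorum again. A minimal T-SQL sketch, assuming the standard HADR cluster DMVs:

-- Cluster members and their quorum votes as seen by SQL Server
SELECT member_name, member_type_desc, member_state_desc, number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;

-- Cluster name and quorum state
SELECT cluster_name, quorum_type_desc, quorum_state_desc
FROM sys.dm_hadr_cluster;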


Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties and confer no rights.

ALWAYS ON – Troubleshooting Availability Group Database Stuck in Initializing / In Recovery Mode

Introduction

In this blog post, we will discuss an issue encountered with an Always On Availability Group where a database remained stuck in the Initializing / In Recovery state on the secondary replica after a failover. We will explore the steps taken to resolve the problem and provide insights into the underlying causes.

Issue

During a failover from the primary to the secondary replica in an Always On Availability Group, one particular database entered the Initializing / In Recovery state on the secondary replica. Despite restarting SQL services and attempting to suspend data movement, the database remained stuck in this state.
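
Before walking through the resolution, the state of the affected database on the secondary replica can be confirmed from T-SQL. A minimal sketch, assuming the database name JB_DB that appears in the error log excerpt later in this post:

-- State of the affected database as seen on the secondary replica
SELECT d.name,
       d.state_desc,                        -- e.g. RECOVERING while the database is in recovery
       drs.synchronization_state_desc       -- state reported by the availability group
FROM sys.databases AS d
LEFT JOIN sys.dm_hadr_database_replica_states AS drs
       ON drs.database_id = d.database_id
      AND drs.is_local = 1
WHERE d.name = 'JB_DB';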

Resolution

  1. Logging off users and restarting SQL services: Initially, all users were logged off from the secondary replica, and the SQL services on the replica were restarted. However, these steps did not resolve the issue.
  2. Suspending data movement: The next approach involved suspending data movement for the problematic database from SQL Server Management Studio on the secondary replica. However, this operation resulted in a “lock request time out period exceeded” error.
  3. Identifying the blocking process: To investigate further, the following query was executed to identify any blocking sessions:
    SELECT * FROM sysprocesses WHERE blocked > 0
    It was discovered that the DB Startup thread for the specific database was blocking the SPID attempting to suspend data movement.
  4. Checking recovery status: The SQL Server error log was examined to determine if the database’s recovery had completed. The log indicated that recovery for the database had finished in 43 seconds, with no user action required.
    Recovery completed for database JB_DB (database ID 7) in 43 second(s) (analysis 64 ms, redo 324 ms, undo 0 ms.) This is an informational message only. No user action is required.
  5. Analyzing DB Startup thread wait type: Monitoring the DB Startup thread revealed that it was in a wait state with the HADR_RECOVERY_WAIT_FOR_UNDO type. This wait type occurs when the secondary database is waiting for the reverting and initializing phases to synchronize with the primary database after failover.
  6. Monitoring undo progress: The “Log remaining for undo” performance monitor counter was used to track the amount of log, in kilobytes, remaining for the undo phase. Surprisingly, the counter showed 0, indicating that there was nothing to undo, and the recovery completed message in the SQL Server error log confirmed that the undo took 0 milliseconds. (A T-SQL sketch for checking the DB STARTUP wait type and this counter is shown after this list.)
  7. Failing over to another synchronous secondary replica: To further troubleshoot the issue, a failover was performed from the primary replica to another synchronous secondary replica. This time, all databases returned to a synchronized state without any issues.
  8. Restarting the SQL Server instance on the primary replica if there is no additional synchronous secondary replica: If there is no additional synchronous secondary replica to fail over to, you need to restart the SQL Server instance on the primary replica as a workaround for this issue.
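
The checks in steps 3, 5, and 6 can be scripted. Below is a minimal T-SQL sketch, assuming the database name JB_DB from the error log excerpt above; the commented ALTER DATABASE statement mirrors the SSMS action in step 2.

-- Suspend data movement for the affected database (equivalent of the SSMS action in step 2)
-- ALTER DATABASE [JB_DB] SET HADR SUSPEND;

-- Find the DB STARTUP session, its wait type, and any blocking relationship (steps 3 and 5)
SELECT session_id, command, wait_type, wait_time, blocking_session_id
FROM sys.dm_exec_requests
WHERE command = 'DB STARTUP'
   OR wait_type = 'HADR_RECOVERY_WAIT_FOR_UNDO';

-- Track the log remaining for the undo phase, in KB (step 6)
SELECT RTRIM(counter_name) AS counter_name, instance_name, cntr_value AS kb_remaining
FROM sys.dm_os_performance_counters
WHERE counter_name LIKE 'Log remaining for undo%'
  AND instance_name = 'JB_DB';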

Summary

The issue of a database stuck in the Initializing / In Recovery state after an Always On Availability Group failover was resolved by failing over to another synchronous secondary replica. Although the root cause and exact timing of the issue remain unknown, this blog post aims to provide guidance and assistance to others who may encounter similar challenges within the community. This problem has been identified as a bug and is documented here.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties and confer no rights.

ALWAYS ON – GROWING LOG SEND QUEUE ON SYNCHRONOUS REPLICA

-> Check this article for more details on high Log Send Queue on Asynchronous Replica.

-> The Log Send Queue shows the amount of log records in the log file of the primary database that have not yet been sent to the secondary replica. This value is sent to the secondary replica from the primary replica. The queue size does not include FILESTREAM files that are sent to a secondary.
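
-> The log send queue can be watched per database with the HADR DMVs. A minimal sketch, best run on the primary replica so that rows for all replicas are returned; values are in KB:

SELECT ag.name                   AS ag_name,
       ar.replica_server_name,
       DB_NAME(drs.database_id)  AS database_name,
       drs.log_send_queue_size   AS log_send_queue_kb,
       drs.log_send_rate         AS log_send_rate_kb_per_sec,
       drs.synchronization_state_desc
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
     ON ar.replica_id = drs.replica_id
JOIN sys.availability_groups AS ag
     ON ag.group_id = drs.group_id
ORDER BY drs.log_send_queue_size DESC;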

-> I faced this issue recently on a synchronous replica. We knew why the log send queue had increased: an Alter Index job that runs as part of the weekly maintenance. But we wanted to understand whether anything could be done to speed up the catch-up. It largely depends on the network link; if the link has already reached its maximum throughput, there is not much to do other than wait.

-> I tried the below actions:

Restart the Always On endpoints on the secondary replica and then on the primary replica:

ALTER ENDPOINT endpoint_name STATE = STOPPED
ALTER ENDPOINT endpoint_name STATE = STARTED
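
Here endpoint_name is a placeholder; the actual Always On endpoint (commonly named Hadr_endpoint) and its state can be looked up with a query along these lines:

SELECT dme.name, dme.state_desc, dme.role_desc, te.port
FROM sys.database_mirroring_endpoints AS dme
JOIN sys.tcp_endpoints AS te
     ON te.endpoint_id = dme.endpoint_id;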

-> After the above action, the log send queue gradually decreased and the Always On databases came back into a synchronized state after some time.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties and confer no rights.