Environment

-> The application team reported that their application could not connect to a database that is part of an Always On availability group.
-> The Always On availability group was in the Resolving state.

-> The Always On availability group roles were in a Pending state in Failover Cluster Manager.

-> The following messages were found in Event Viewer:
Event ID: 41144
The local availability replica of availability group ‘JBSAG’ is in a failed state. The replica failed to read or update the persisted configuration data (SQL Server error: 41005). To recover from this failure, either restart the local Windows Server Failover Clustering (WSFC) service or restart the local instance of SQL Server.
Event ID: 1205
The Cluster service failed to bring clustered role ‘JBSAG’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.
Event ID: 1069
Cluster resource ‘JBSAG’ of type ‘SQL Server Availability Group’ in clustered role ‘JBSAG’ failed.
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
Event ID: 7043
The Cluster Service service did not shut down properly after receiving a preshutdown control.
-> The SQL Server error log contained the following errors:
Error: 41022, Severity: 16, State: 0.
Failed to create a Windows Server Failover Clustering (WSFC) notification port with notification filter 778567686 and notification key 3 (Error code 5073). If this is a WSFC availability group, the WSFC service may not be running or may not be accessible in its current state, or the specified arguments are invalid. Otherwise, contact your primary support provider. For information about this error code, see “System Error Codes” in the Windows Development documentation.
Always On: The availability replica manager is going offline because the local Windows Server Failover Clustering (WSFC) node has lost quorum. This is an informational message only. No user action is required.
Always On: The local replica of availability group ‘JBSAG’ is stopping. This is an informational message only. No user action is required.
Error: 41066, Severity: 16, State: 0.
Cannot bring the Windows Server Failover Clustering (WSFC) resource (ID ‘ee50bbc1-93ab-4f25-85e5-a7d245555183’) online (Error code 126). If this is a WSFC availability group, the WSFC service may not be running or may not be accessible in its current state, or the WSFC resource may not be in a state that could accept the request. Otherwise, contact your primary support provider. For information about this error code, see “System Error Codes” in the Windows Development documentation.
Error: 41160, Severity: 16, State: 0.
Failed to designate the local availability replica of availability group ‘JBSAG’ as the primary replica. The operation encountered SQL Server error 41066 and has been terminated. Check the preceding error and the SQL Server error log for more details about the error and corrective actions.
Error: 41017, Severity: 16, State: 1.
Failed to add a node to the possible owner list of a Windows Server Failover Clustering (WSFC) resource (Error code 5908). If this is a WSFC availability group, the WSFC service may not be running or may not be accessible in its current state, or the specified cluster resource or node handle is invalid. Otherwise, contact your primary support provider.
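The messages above each end with a Windows error code in parentheses, and those codes can be pulled out and looked up separately from the SQL Server error numbers. The following is only a minimal sketch of that idea: the function names, the `KNOWN_CODES` table, and the shortened `SAMPLE` text are illustrative assumptions, not part of any SQL Server tooling. The only code annotated is 126, which is the Windows system error ERROR_MOD_NOT_FOUND ("The specified module could not be found"); the other codes are left for lookup in "System Error Codes".

```python
import re

# Hypothetical lookup table. Only 126 is annotated here; other codes
# should be looked up in the Windows "System Error Codes" documentation.
KNOWN_CODES = {
    126: "ERROR_MOD_NOT_FOUND: The specified module could not be found.",
}

# Matches the "(Error code NNNN)" fragment used in the messages above.
PATTERN = re.compile(r"\(Error code (\d+)\)")


def extract_error_codes(log_text):
    """Return the distinct Windows error codes in order of first appearance."""
    seen = []
    for match in PATTERN.finditer(log_text):
        code = int(match.group(1))
        if code not in seen:
            seen.append(code)
    return seen


def describe(code):
    """Return a short note for a code, or a pointer to the documentation."""
    return KNOWN_CODES.get(code, "look up in System Error Codes")


# Shortened excerpts of the error log messages quoted above.
SAMPLE = """
Failed to create a WSFC notification port (Error code 5073).
Cannot bring the WSFC resource online (Error code 126).
Failed to add a node to the possible owner list (Error code 5908).
"""

if __name__ == "__main__":
    for code in extract_error_codes(SAMPLE):
        print(code, "-", describe(code))
```

A "module not found" code on a resource that will not come online is a hint to check the files that the cluster resource depends on.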
-> Cluster.log did not have much detail other than the AG role failing.
-> All of the messages and errors were generic and did not point to a cause.
-> We ran Process Monitor and found that the DBA team had renamed C:\windows\system32\hadrres.dll to hadrres_old.dll on both JBSAG1 and JBSAG2 to work around a patching error. They had to rename it because patching was failing with an error that hadrres.dll was in use by another process. The DBA team forgot to rename it back to hadrres.dll, and this caused the issue. We renamed the file back to hadrres.dll, which resolved the issue.
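A check like the one that finally found the problem can be scripted and run on each node. This is a rough sketch, assuming Python is available on the servers: `check_resource_dll` is a hypothetical helper that reports whether hadrres.dll exists in a given folder and lists any similarly named files, such as a renamed copy. The System32 path is taken from the `SystemRoot` environment variable, with C:\Windows assumed as a fallback.

```python
import os
from pathlib import Path


def check_resource_dll(system32, dll_name="hadrres.dll"):
    """Return (present, lookalikes) for dll_name inside the given folder.

    present    -- True if a file named dll_name exists (case-insensitive)
    lookalikes -- sorted names of other files that start with the same stem,
                  for example a renamed copy such as hadrres_old.dll
    """
    stem = Path(dll_name).stem.lower()
    present = False
    lookalikes = []
    for entry in Path(system32).iterdir():
        name = entry.name.lower()
        if name == dll_name.lower():
            present = True
        elif name.startswith(stem):
            lookalikes.append(entry.name)
    return present, sorted(lookalikes)


if __name__ == "__main__":
    # Assumption: SystemRoot is set, as it normally is on Windows.
    folder = Path(os.environ.get("SystemRoot", r"C:\Windows")) / "System32"
    if folder.is_dir():
        present, lookalikes = check_resource_dll(folder)
        print("hadrres.dll present:", present)
        print("similarly named files:", lookalikes)
    else:
        print("System32 folder not found:", folder)
```

On a node in the state described above, the expected result would be that hadrres.dll is reported as missing while hadrres_old.dll appears in the list of similarly named files.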


Thank You,
Vivek Janakiraman
Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.