-> I was working on an Always On failover issue. The Always On availability group was failing over every day, sometime between 22:00 and 23:00.
-> The following messages were found in the Event Viewer logs:

Log Name: Application
Source: MSSQL$SQL01
Date: 6/01/2020 10:29:21 PM
Event ID: 41144
Task Category: Server
Level: Error
Keywords: Classic
User: N/A
Computer: JBSERVER1.JBS.COM
Description:
The local availability replica of availability group ‘JBAG’ is in a failed state. The replica failed to read or update the persisted configuration data (SQL Server error: 41005). To recover from this failure, either restart the local Windows Server Failover Clustering (WSFC) service or restart the local instance of SQL Server.

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 6/01/2020 10:29:18 PM
Event ID: 1561
Task Category: Startup/Shutdown
Level: Critical
Keywords:
User: SYSTEM
Computer: JBSERVER1.JBS.COM
Description:
Cluster service failed to start because this node detected that it does not have the latest copy of cluster configuration data. Changes to the cluster occurred while this node was not in membership and as a result was not able to receive configuration data updates.
Guidance:
Attempt to start the cluster service on all nodes in the cluster so that nodes with the latest copy of the cluster configuration data can first form the cluster. This node will then be able to join the cluster and will automatically obtain the updated cluster configuration data. If there are no nodes available with the latest copy of the cluster configuration data, run the ‘Start-ClusterNode -FQ’ Windows PowerShell cmdlet. Using the ForceQuorum (FQ) parameter will start the cluster service and mark this node’s copy of the cluster configuration data to be authoritative. Forcing quorum on a node with an outdated copy of the cluster database may result in cluster configuration changes that occurred while the node was not participating in the cluster to be lost.

Log Name: System
Source: Service Control Manager
Date: 6/01/2020 10:29:21 PM
Event ID: 7024
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: JBSERVER1.JBS.COM
Description:
The Cluster Service service terminated with the following service-specific error:
A quorum of cluster nodes was not present to form a cluster.

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 7/01/2020 11:45:47 AM
Event ID: 1146
Task Category: Resource Control Manager
Level: Critical
Keywords:
User: SYSTEM
Computer: JBSERVER2.JBS.COM
Description:
The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 6/01/2020 10:28:25 PM
Event ID: 1135
Task Category: Node Mgr
Level: Critical
Keywords:
User: SYSTEM
Computer: JBSERVER2.JBS.COM
Description:
Cluster node ‘JBSERVER1’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
-> The following messages were found in Cluster.log:
[System] 00002420.00002004::2020/01/01-00:40:48.745 DBG Cluster node ‘JBSERVER3’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
[System] 00002420.00002004::2020/01/01-00:40:48.746 DBG Cluster node ‘JBSERVER2’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
[System] 00002420.00004598::2020/01/01-00:40:48.809 DBG The Cluster service was halted to prevent an inconsistency within the failover cluster. The error code was ‘1359’.
[System] 00002420.0000438c::2020/01/01-00:40:49.173 DBG The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
[System] 00002420.00005e5c::2020/01/01-00:40:49.174 DBG The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
-> These messages indicate that the Always On availability group failover may be caused by a network issue. I requested help from the network team and was advised that there were no network issues.
-> I configured verbose logging for the Always On availability group using this article and generated Cluster.log the next time the issue occurred.
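-> In outline, that setup looks like the sketch below. Treat it as a sketch of the usual approach rather than the article's exact commands: the resource name ‘JBAG’ and the destination folder C:\temp are this environment's values, and the VerboseLogging resource property and log levels shown are assumptions based on the standard FailoverClusters cmdlets.

Import-Module FailoverClusters

# Raise the cluster log verbosity (the default level is 3; 5 is the most verbose).
Set-ClusterLog -Level 5

# Enable verbose logging on the availability group's cluster resource.
Get-ClusterResource -Name 'JBAG' | Set-ClusterParameter -Name VerboseLogging -Value 1

# After the issue reproduces, generate Cluster.log on every node, covering
# the last 60 minutes, with timestamps converted to local time.
Get-ClusterLog -Destination 'C:\temp' -TimeSpan 60 -UseLocalTime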
-> I started a continuous ping with a timestamp embedded in each line, using the PowerShell command below, and left it running until the issue occurred again. From JBSERVER1 I pinged JBSERVER2, JBSERVER3, and the file share witness server; from JBSERVER2 I pinged JBSERVER1, JBSERVER3, and the file share witness server; and from JBSERVER3 I pinged JBSERVER1, JBSERVER2, and the file share witness server.
# Run one of these per target from each node (the file share witness follows
# the same pattern); each ping reply is prefixed with a timestamp.
ping.exe -t JBSERVER1 | ForEach-Object { "{0} - {1}" -f (Get-Date), $_ } > C:\temp\ping\JBSERVER1.txt
ping.exe -t JBSERVER2 | ForEach-Object { "{0} - {1}" -f (Get-Date), $_ } > C:\temp\ping\JBSERVER2.txt
ping.exe -t JBSERVER3 | ForEach-Object { "{0} - {1}" -f (Get-Date), $_ } > C:\temp\ping\JBSERVER3.txt
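-> Rather than keeping several consoles open on every node, the same monitoring can run as background jobs. A minimal sketch, assuming the hypothetical witness name FSW01 and that C:\temp\ping already exists:

# One timestamped ping per target, each in its own background job.
# 'FSW01' is a placeholder for the file share witness server name.
$targets = 'JBSERVER2', 'JBSERVER3', 'FSW01'
foreach ($t in $targets) {
    Start-Job -Name "ping_$t" -ScriptBlock {
        param($target)
        ping.exe -t $target |
            ForEach-Object { "{0} - {1}" -f (Get-Date), $_ } |
            Out-File -FilePath "C:\temp\ping\$target.txt"
    } -ArgumentList $t
}

# Stop and clean up the jobs once the issue has reproduced:
Get-Job -Name 'ping_*' | Stop-Job -PassThru | Remove-Job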
-> The issue happened again the next day; below are the relevant SQL Server error log entries:
2020-01-06 22:28:16.580 spid22s The state of the local availability replica in availability group ‘JBAG’ has changed from ‘PRIMARY_NORMAL’ to ‘RESOLVING_NORMAL’. The state changed because the local instance of SQL Server is shutting down. For more information, see the SQL Server
2020-01-06 22:29:02.950 spid47s The state of the local availability replica in availability group ‘JBAG’ has changed from ‘RESOLVING_NORMAL’ to ‘SECONDARY_NORMAL’. The state changed because the availability group state has changed in Windows Server Failover Clustering (WSFC). For
-> I checked the ping results, which showed “Request timed out” entries around the time of the failover.

-> I provided these results to the network team and asked why there were “Request timed out” responses if there were no network issues.
-> While the network team was investigating, I asked the infrastructure team to check whether the network card drivers and firmware were up to date. They confirmed that the latest versions were installed.
-> I also wanted to rule out the anti-virus software, so I asked to uninstall it temporarily and verify, but this request was denied.
-> In the meantime, the application team asked for a temporary workaround or fix until the network team completed their troubleshooting.
-> I advised that, until we found the root cause of the network issue, we could increase the values of the cluster heartbeat properties below (Delay is the interval in milliseconds between heartbeats; Threshold is the number of consecutive missed heartbeats before a node is removed from cluster membership). I made it clear to the application team that the default values are the recommended ones, and that raising them as shown below increases the RTO (Recovery Time Objective), because failover will be delayed in a genuine server or SQL Server outage. This change only masks or delays the problem; it never fixes it. The right approach is to find and fix the root cause of the heartbeat failures. The application team understood the risk and accepted the change as a temporary measure.

PS C:\Windows\system32> (Get-Cluster).SameSubnetDelay = 2000
PS C:\Windows\system32> (Get-Cluster).SameSubnetThreshold = 20
PS C:\Windows\system32> (Get-Cluster).CrossSubnetDelay = 4000
PS C:\Windows\system32> (Get-Cluster).CrossSubnetThreshold = 20
-> You can check these values before and after the change with the command below:
PS C:\Windows\system32> Get-Cluster | fl *subnet*
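-> With the increased values in place, the output should look something like this (illustrative; the wildcard also matches the PlumbAllCrossSubnetRoutes property):

CrossSubnetDelay          : 4000
CrossSubnetThreshold      : 20
PlumbAllCrossSubnetRoutes : 0
SameSubnetDelay           : 2000
SameSubnetThreshold       : 20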
-> This gave us some temporary relief. After a week, the infrastructure team advised that a VM-level backup ran through Commvault every day at that time, which could freeze the servers for 4 or 5 seconds. They suspended it, as it was no longer required.
-> Around the same time, the network team advised that they had fixed the network issue and asked us to monitor.
-> I reverted SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, and CrossSubnetThreshold to their default values (see the sketch below). There were no issues after this. Everyone was happy!
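-> For reference, the revert is the same set of assignments with the defaults. A sketch, assuming Windows Server 2016 or later (earlier versions shipped with lower threshold defaults):

# Windows Server 2016+ defaults; on 2012 R2 and earlier, both thresholds default to 5.
(Get-Cluster).SameSubnetDelay = 1000
(Get-Cluster).SameSubnetThreshold = 10
(Get-Cluster).CrossSubnetDelay = 1000
(Get-Cluster).CrossSubnetThreshold = 20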
Thank You,
Vivek Janakiraman
Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties and confer no rights.