Always On – Availability group not failing over automatically

-> Client applications not able to connect to the listener.

-> Environment setup is as below,

Backup_Setup

-> When I try connecting to the Listener using SQL Server management Studio (SSMS), I get below error,

Blog1_0

TITLE: Connect to Server
——————————
Cannot connect to JBAPP.
——————————
ADDITIONAL INFORMATION:
A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 – Could not open a connection to SQL Server) (Microsoft SQL Server, Error: 2)
For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft%20SQL%20Server&EvtSrc=MSSQLServer&EvtID=2&LinkId=20476
——————————
The system cannot find the file specified
——————————
BUTTONS:
OK
——————————

-> I tried connecting to Database Server JBAG1 and JBAG2 manually using SSMS. Connection to JBAG1 worked, But connection to JBAG2 failed with below error,

Blog1_2

TITLE: Connect to Server
——————————
Cannot connect to JBAG2\IN01.
——————————
ADDITIONAL INFORMATION:
A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 – The wait operation timed out.) (Microsoft SQL Server, Error: 258)
For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft%20SQL%20Server&EvtSrc=MSSQLServer&EvtID=258&LinkId=20476
——————————
The wait operation timed out
——————————
BUTTONS:
OK
——————————

-> I connected to JBAG1 and opened a new query windows to one of the user database and tried a select statement, it worked. But queries are failing with below error when performing insert, delete or update.

Blog1_1

Msg 3906, Level 16, State 2, Line 4
Failed to update database “JBDB” because the database is read-only.

-> Latency between primary and secondary datacentre is 2 ms. Hence the setup is a synchronous replica with automatic failover. Please note that this setup will be a bad design if latency between datacentres are more and if there are frequent network glitches.

-> I checked further and below were my observations,

  1. JBAG2 is the PRIMARY Replica and SQL Server was down on it.
  2. JBAG1 was the secondary replica. It seems like automatic failover did not happen.

-> Always On Availability group on JBAG1 was in resolving state,

Blog1_3

-> I started the SQL Services on JBAG2 and in some time everything started working fine including Always On Availability groups.

-> Now comes the question, why automatic failover did not happen?

-> I opened cluadmin.msc and opened “Cluster Events” and found below errors,

Blog1_4

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 4/11/2020 8:29:08 AM
Event ID: 1069
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer: JBAG2.JBS.COM
Description:
Cluster resource ‘JBAG’ of type ‘SQL Server Availability Group’ in clustered role ‘JBAG’ failed.
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
Event Xml:
<Event xmlns=”http://schemas.microsoft.com/win/2004/08/events/event”&gt;
<System>
<Provider Name=”Microsoft-Windows-FailoverClustering” Guid=”{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}” />
<EventID>1069</EventID>
<Version>1</Version>
<Level>2</Level>
<Task>3</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime=”2020-04-11T02:59:08.867171300Z” />
<EventRecordID>2818</EventRecordID>
<Correlation />
<Execution ProcessID=”2464″ ThreadID=”3496″ />
<Channel>System</Channel>
<Computer>JBAG2.JBS.COM</Computer>
<Security UserID=”S-1-5-18″ />
</System>
<EventData>
<Data Name=”ResourceName”>JBAG</Data>
<Data Name=”ResourceGroup”>JBAG</Data>
<Data Name=”ResTypeDll”>SQL Server Availability Group</Data>
</EventData>
</Event>

Blog1_5

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 4/11/2020 8:29:09 AM
Event ID: 1254
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer: JBAG2.JBS.COM
Description:
Clustered role ‘JBAG’ has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. No additional attempts will be made to bring the role online or fail it over to another node in the cluster. Please check the events associated with the failure. After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.
Event Xml:
<Event xmlns=”http://schemas.microsoft.com/win/2004/08/events/event”&gt;
<System>
<Provider Name=”Microsoft-Windows-FailoverClustering” Guid=”{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}” />
<EventID>1254</EventID>
<Version>0</Version>
<Level>2</Level>
<Task>3</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime=”2020-04-11T02:59:09.469214500Z” />
<EventRecordID>2822</EventRecordID>
<Correlation />
<Execution ProcessID=”2464″ ThreadID=”3496″ />
<Channel>System</Channel>
<Computer>JBAG2.JBS.COM</Computer>
<Security UserID=”S-1-5-18″ />
</System>
<EventData>
<Data Name=”ResourceGroup”>JBAG</Data>
</EventData>
</Event>

-> Above error shows that the failover did not happen since it reached the failover threshold. Checking cluster.log in JBAG2 to confirm this. Refer to this article if you want to know the command to generate cluster.log.

-> Cluster.log provides the same reason,

[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 INFO [RCM] Resource JBAG is causing group JBAG to failover.
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 INFO [RCM] rcm::RcmGroup::Failover: (JBAG)
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 DBG [RCM] rcm::RcmGroup::FailedDueToError=> (JBAG, 5963, false)
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 DBG [RCM] rcm::RcmGroup::UpdateAndGetFailoverCount=> (1, 2020/04/11-08:25:33.204)
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 DBG [RCM] rcm::RcmGroup::ComputeFailoverThreshold=> (JBAG, 1, computed)
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 WARN [RCM] Not failing over group JBAG, failoverCount 2, failoverThresholdSetting 4294967295, lastFailover 2020/04/11-08:25:33.204
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 DBG [RCM] rcm::RcmGroup::FailedDueToError=> (JBAG, 5963, false)
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 DBG [RCM] rcm::RcmResource::DelayedRestart(JBAG_192.168.0.45)

-> Increasing the Failover threshold will fix this issue.

-> Open cluster administrator (cluadmin.msc). Click on Roles. Right Click the ROLE and click properties,

Blog1_6

-> As per below settings, Only 1 failover is allowed in last 6 hours. This makes sense why automatic failover did not happen.

Blog1_7

-> In my case I changed the value from 1 to 5. This resolved my issue with automatic failover.

Blog1_8

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.

Designing High Availability and Disaster Recovery using SQL Server Standard Edition

-> Option 1

Option1

Option 1 utilizes standalone database server JBSERVER1 and JBSERVER2 participating in a basic availability group using synchronous commit with automatic failover for high availability in primary datacentre. Logshipping is configured between Primary Replica in Primary datacentre to JBSERVER3 in secondary datacentre for disaster recovery. DNS alias should be configured and used in application connection string and should be modified to JBSERVER3 in case of a disaster recovery, there will be no changes required for failover in primary datacentre as Listener will be utilized.

It should be noted that in basic availability group each availability group can contain only one database. In an event of some databases failing over to secondary replica and other databases residing on primary replica can result in issues. You can workaround this by failing over all availability group in an event of single availability group failover using this article.

-> Option2

Option2

Option 2 utilizes SQL Server installed on a failover cluster with shared storage. JBSERVER1 and JBSERVER2 will be the physical nodes with a virtual SQL Server installed and configured for high availability. Basic availability group using Asynchronous commit with manual failover will be configured between the virtual SQL Server in Primary datacentre to JBSERVER3 in secondary datacentre. DNS alias should be configured and used in application connection string and should be modified to JBSERVER3 in case of a disaster recovery, there will be no changes required for failover in primary datacentre as virtual SQL Server name will be utilized.

-> Option 3

Option3

Option 2 utilizes SQL Server installed on a failover cluster with shared storage. JBSERVER1 and JBSERVER2 will be the physical nodes with a virtual SQL Server installed and configured for high availability. Logshipping is configured between the virtual SQL Server in Primary datacentre to JBSERVER3 in secondary datacentre for disaster recovery. DNS alias should be configured and used in application connection string and should be modified to JBSERVER3 in case of a disaster recovery, there will be no changes required for failover in primary datacentre as virtual SQL Server name will be utilized.

It is prudent to test high availability and disaster recovery to make sure how long does it take and if it is acceptable to your business. It is mandatory that the design can achieve the required RTO and RPO.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.

Always On – Event ID: 1135 – Cluster node ‘Node’ was removed from the active failover cluster membership

-> I was working on an Always On failover issue. Always on Availability group was failing over everyday anytime between 22:00 to 23:00.

-> Below messages were found in Event viewer logs,

Blog71_1.PNG

Log Name: Application
Source: MSSQL$SQL01
Date: 6/01/2020 10:29:21 PM
Event ID: 41144
Task Category: Server
Level: Error
Keywords: Classic
User: N/A
Computer: JBSERVER1.JBS.COM
Description:
The local availability replica of availability group ‘JBAG’ is in a failed state. The replica failed to read or update the persisted configuration data (SQL Server error: 41005). To recover from this failure, either restart the local Windows Server Failover Clustering (WSFC) service or restart the local instance of SQL Server.
Blog71_2.PNG
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 6/01/2020 10:29:18 PM
Event ID: 1561
Task Category: Startup/Shutdown
Level: Critical
Keywords:
User: SYSTEM
Computer: JBSERVER1.JBS.COM
Description:
Cluster service failed to start because this node detected that it does not have the latest copy of cluster configuration data. Changes to the cluster occurred while this node was not in membership and as a result was not able to receive configuration data updates.
Guidance:
Attempt to start the cluster service on all nodes in the cluster so that nodes with the latest copy of the cluster configuration data can first form the cluster. This node will then be able join the cluster and will automatically obtain the updated cluster configuration data. If there are no nodes available with the latest copy of the cluster configuration data, run the ‘Start-ClusterNode -FQ’ Windows PowerShell cmdlet. Using the ForceQuorum (FQ) parameter will start the cluster service and mark this node’s copy of the cluster configuration data to be authoritative. Forcing quorum on a node with an outdated copy of the cluster database may result in cluster configuration changes that occurred while the node was not participating in the cluster to be lost.

Blog71_3.PNG

Log Name: System
Source: Service Control Manager
Date: 6/01/2020 10:29:21 PM
Event ID: 7024
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: JBSERVER1.JBS.COM
Description:
The Cluster Service service terminated with the following service-specific error:
A quorum of cluster nodes was not present to form a cluster.

Blog71_4.PNG

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 7/01/2020 11:45:47 AM
Event ID: 1146
Task Category: Resource Control Manager
Level: Critical
Keywords:
User: SYSTEM
Computer: JBSERVER2.JBS.COM
Description:
The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.

Blog71_5.PNG

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 6/01/2020 10:28:25 PM
Event ID: 1135
Task Category: Node Mgr
Level: Critical
Keywords:
User: SYSTEM
Computer: JBSERVER2.JBS.COM
Description:
Cluster node ‘JBSERVER1’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

-> Below messages were found in Cluster.log

[System] 00002420.00002004::2020/01/01-00:40:48.745 DBG Cluster node ‘JBSERVER3’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
[System] 00002420.00002004::2020/01/01-00:40:48.746 DBG Cluster node ‘JBSERVER2’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
[System] 00002420.00004598::2020/01/01-00:40:48.809 DBG The Cluster service was halted to prevent an inconsistency within the failover cluster. The error code was ‘1359’.
[System] 00002420.0000438c::2020/01/01-00:40:49.173 DBG The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
[System] 00002420.00005e5c::2020/01/01-00:40:49.174 DBG The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.

-> The messages indicate that the Always On Availability group failover may be due to a network issue. I requested help from my networking team and was advised that there were no network issues.

-> I configured verbose logging for Always On Availability group using this article and generated cluster.log when the issue happened next time.

-> I started a continuous ping with a timestamp embedded into it till the issue occurred next time using below powershell command. From JBSERVER1, I started pinging JBSERVER2, JBSERVER3, File share witness server. From JBSERVER2, I started pinging JBSERVER1, JBSERVER3, File share witness server. From JBSERVER3, I started pinging JBSERVER1, JBSERVER2, File share witness server. 


ping.exe -t JBSERVER1|Foreach{"{0} - {1}" -f (Get-Date),$_} > C:\temp\ping\JBSERVER1.txt


ping.exe -t JBSERVER2|Foreach{"{0} - {1}" -f (Get-Date),$_} > C:\temp\ping\JBSERVER2.txt


ping.exe -t JBSERVER3|Foreach{"{0} - {1}" -f (Get-Date),$_} > C:\temp\ping\JBSERVER3.txt

-> The issue happened next day and below is the SQL Server error log details,

2020-01-06 22:28:16.580 spid22s The state of the local availability replica in availability group ‘JBAG’ has changed from ‘PRIMARY_NORMAL’ to ‘RESOLVING_NORMAL’. The state changed because the local instance of SQL Server is shutting down. For more information, see the SQL Server
2020-01-06 22:29:02.950 spid47s The state of the local availability replica in availability group ‘JBAG’ has changed from ‘RESOLVING_NORMAL’ to ‘SECONDARY_NORMAL’. The state changed because the availability group state has changed in Windows Server Failover Clustering (WSFC). For

-> I checked the ping results,

Blog71_6.PNG

-> I provided these results to the Network team and requested the reason why there is a “Request timed out” if there are no network issues.

-> While the Network team was investigating I requested my Infrastructure team to check if the network card and firmware drivers were up to date. I got an update that they were latest.

-> I also wanted to ensure Anti-virus software is not a problem. Hence wanted to uninstall and verify. But this request was denied.

-> In the meantime, Application team requested for any temporary workaround or fix till the network team complete their troubleshooting.

-> I advised them that we can increase the values of below properties till we get to the root cause of network issue. I have clearly advised the application team that the default values for below properties are the recommended values and changing these to a higher value as recommended below can increase the RTO (Recovery Time Objective) as there will be a delay in failover in case of a genuine server/SQL down scenario. It just masks or delays the problem, but will never completely fix the issue. The best thing to do is find out the root cause of the heartbeat failures and get it fixed. Application team understood the risk and accepted to increase the values as it will be temporary.

Blog71_7.PNG

PS C:\Windows\system32> (get-cluster).SameSubnetDelay = 2000
PS C:\Windows\system32> (get-cluster).SameSubnetThreshold = 20
PS C:\Windows\system32> (get-cluster).CrossSubnetDelay = 4000
PS C:\Windows\system32> (get-cluster).CrossSubnetThreshold = 20

-> You can check what these values are before and after the change using below command,

PS C:\Windows\system32> get-cluster | fl *subnet*

-> This gave us some temporary relief. After 1 week, Infrastructure team have advised that there was a VM level backup happening at that time everyday through Commvault which may/can freeze the servers for 4 or 5 seconds. It seems like they have suspended it as it was not required anymore.

-> Same time, Network team advised that they have fixed the network issue and updated us to monitor.

-> I changed SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold to its default value. There were no issues after this. Everyone were happy!

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.