Always On – Availability group not failing over automatically

-> Client applications not able to connect to the listener.

-> Environment setup is as below,

Backup_Setup

-> When I try connecting to the Listener using SQL Server management Studio (SSMS), I get below error,

Blog1_0

TITLE: Connect to Server
——————————
Cannot connect to JBAPP.
——————————
ADDITIONAL INFORMATION:
A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 – Could not open a connection to SQL Server) (Microsoft SQL Server, Error: 2)
For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft%20SQL%20Server&EvtSrc=MSSQLServer&EvtID=2&LinkId=20476
——————————
The system cannot find the file specified
——————————
BUTTONS:
OK
——————————

-> I tried connecting to Database Server JBAG1 and JBAG2 manually using SSMS. Connection to JBAG1 worked, But connection to JBAG2 failed with below error,

Blog1_2

TITLE: Connect to Server
——————————
Cannot connect to JBAG2\IN01.
——————————
ADDITIONAL INFORMATION:
A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 – The wait operation timed out.) (Microsoft SQL Server, Error: 258)
For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft%20SQL%20Server&EvtSrc=MSSQLServer&EvtID=258&LinkId=20476
——————————
The wait operation timed out
——————————
BUTTONS:
OK
——————————

-> I connected to JBAG1 and opened a new query windows to one of the user database and tried a select statement, it worked. But queries are failing with below error when performing insert, delete or update.

Blog1_1

Msg 3906, Level 16, State 2, Line 4
Failed to update database “JBDB” because the database is read-only.

-> Latency between primary and secondary datacentre is 2 ms. Hence the setup is a synchronous replica with automatic failover. Please note that this setup will be a bad design if latency between datacentres are more and if there are frequent network glitches.

-> I checked further and below were my observations,

  1. JBAG2 is the PRIMARY Replica and SQL Server was down on it.
  2. JBAG1 was the secondary replica. It seems like automatic failover did not happen.

-> Always On Availability group on JBAG1 was in resolving state,

Blog1_3

-> I started the SQL Services on JBAG2 and in some time everything started working fine including Always On Availability groups.

-> Now comes the question, why automatic failover did not happen?

-> I opened cluadmin.msc and opened “Cluster Events” and found below errors,

Blog1_4

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 4/11/2020 8:29:08 AM
Event ID: 1069
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer: JBAG2.JBS.COM
Description:
Cluster resource ‘JBAG’ of type ‘SQL Server Availability Group’ in clustered role ‘JBAG’ failed.
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
Event Xml:
<Event xmlns=”http://schemas.microsoft.com/win/2004/08/events/event”&gt;
<System>
<Provider Name=”Microsoft-Windows-FailoverClustering” Guid=”{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}” />
<EventID>1069</EventID>
<Version>1</Version>
<Level>2</Level>
<Task>3</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime=”2020-04-11T02:59:08.867171300Z” />
<EventRecordID>2818</EventRecordID>
<Correlation />
<Execution ProcessID=”2464″ ThreadID=”3496″ />
<Channel>System</Channel>
<Computer>JBAG2.JBS.COM</Computer>
<Security UserID=”S-1-5-18″ />
</System>
<EventData>
<Data Name=”ResourceName”>JBAG</Data>
<Data Name=”ResourceGroup”>JBAG</Data>
<Data Name=”ResTypeDll”>SQL Server Availability Group</Data>
</EventData>
</Event>

Blog1_5

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 4/11/2020 8:29:09 AM
Event ID: 1254
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer: JBAG2.JBS.COM
Description:
Clustered role ‘JBAG’ has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. No additional attempts will be made to bring the role online or fail it over to another node in the cluster. Please check the events associated with the failure. After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.
Event Xml:
<Event xmlns=”http://schemas.microsoft.com/win/2004/08/events/event”&gt;
<System>
<Provider Name=”Microsoft-Windows-FailoverClustering” Guid=”{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}” />
<EventID>1254</EventID>
<Version>0</Version>
<Level>2</Level>
<Task>3</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime=”2020-04-11T02:59:09.469214500Z” />
<EventRecordID>2822</EventRecordID>
<Correlation />
<Execution ProcessID=”2464″ ThreadID=”3496″ />
<Channel>System</Channel>
<Computer>JBAG2.JBS.COM</Computer>
<Security UserID=”S-1-5-18″ />
</System>
<EventData>
<Data Name=”ResourceGroup”>JBAG</Data>
</EventData>
</Event>

-> Above error shows that the failover did not happen since it reached the failover threshold. Checking cluster.log in JBAG2 to confirm this. Refer to this article if you want to know the command to generate cluster.log.

-> Cluster.log provides the same reason,

[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 INFO [RCM] Resource JBAG is causing group JBAG to failover.
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 INFO [RCM] rcm::RcmGroup::Failover: (JBAG)
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 DBG [RCM] rcm::RcmGroup::FailedDueToError=> (JBAG, 5963, false)
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 DBG [RCM] rcm::RcmGroup::UpdateAndGetFailoverCount=> (1, 2020/04/11-08:25:33.204)
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 DBG [RCM] rcm::RcmGroup::ComputeFailoverThreshold=> (JBAG, 1, computed)
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 WARN [RCM] Not failing over group JBAG, failoverCount 2, failoverThresholdSetting 4294967295, lastFailover 2020/04/11-08:25:33.204
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 DBG [RCM] rcm::RcmGroup::FailedDueToError=> (JBAG, 5963, false)
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 DBG [RCM] rcm::RcmResource::DelayedRestart(JBAG_192.168.0.45)

-> Increasing the Failover threshold will fix this issue.

-> Open cluster administrator (cluadmin.msc). Click on Roles. Right Click the ROLE and click properties,

Blog1_6

-> As per below settings, Only 1 failover is allowed in last 6 hours. This makes sense why automatic failover did not happen.

Blog1_7

-> In my case I changed the value from 1 to 5. This resolved my issue with automatic failover.

Blog1_8

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.