Always ON – The connection to the primary replica is not active. The command cannot be processed. (Microsoft SQL Server, Error: 35250)

Environment

-> I was trying to add a new database to availability group an got below error,

TITLE: Microsoft SQL Server Management Studio

Failed to join the database ‘JBDB’ to the availability group ‘JBSAG’ on the availability replica ‘JBSAG2’. (Microsoft.SqlServer.Management.HadrModel)

For help, click: https://go.microsoft.com/fwlink?ProdName=Microsoft+SQL+Server&ProdVer=16.100.37971.0&EvtSrc=Microsoft.SqlServer.Management.Smo.ExceptionTemplates.FailedOperationExceptionText&LinkId=20476

ADDITIONAL INFORMATION:
An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)

The connection to the primary replica is not active. The command cannot be processed. (Microsoft SQL Server, Error: 35250)
For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft%20SQL%20Server&ProdVer=14.00.3294&EvtSrc=MSSQLServer&EvtID=35250&LinkId=20476

BUTTONS:
OK

-> Executing below code to check if endpoint is started on all nodes in SQLCMD mode,

:Connect JBSAG1
select name, state_desc, port from sys.tcp_endpoints where type_desc='DATABASE_MIRRORING'
go
:Connect JBSAG2
select name, state_desc, port from sys.tcp_endpoints where type_desc='DATABASE_MIRRORING'
go
:Connect JBSAG3
select name, state_desc, port from sys.tcp_endpoints where type_desc='DATABASE_MIRRORING'
go

-> Endpoint on JBSAG2 is down.

-> I have started it with below code in SQLCMD mode,

:Connect JBSAG2
alter endpoint [Hadr_endpoint] state = started
go

-> Adding database to Availability group works fine without issues after above fix.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.

Always ON Availability group in RESOLVING state

Environment

-> I tried failing over Availability group from JBSAG1 to JBSAG2.

-> The failover wizard started executing and was taking much time,

-> Below was the Availability group state,

-> It failed after some time with below message,

Manual Failover failed (Microsoft.SqlServer.Management.HadrTasks)

TITLE: Microsoft SQL Server Management Studio
ADDITIONAL INFORMATION:

Failed to perform a manual failover of the availability group ‘JBSAG’ to server instance ‘JBSAG2’. (Microsoft.SqlServer.Management.HadrModel)
For help, click: https://go.microsoft.com/fwlink?ProdName=Microsoft+SQL+Server&ProdVer=16.100.37971.0&EvtSrc=Microsoft.SqlServer.Management.Smo.ExceptionTemplates.FailedOperationExceptionText&LinkId=20476

An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)

Failed to bring availability group ‘JBSAG’ online. The operation timed out. If this is a Windows Server Failover Clustering (WSFC) availability group, verify that the local WSFC node is online. Then verify that the availability group resource exists in the WSFC cluster. If the problem persists, you might need to drop the availability group and create it again. (Microsoft SQL Server, Error: 41131)

For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft%20SQL%20Server&ProdVer=14.00.3294&EvtSrc=MSSQLServer&EvtID=41131&LinkId=20476

BUTTONS:
OK

-> Checking SQL Server Errorlog,

JBSAG2

2020-07-25 16:02:03.210 spid81 The state of the local availability replica in availability group ‘JBSAG’ has changed from ‘SECONDARY_NORMAL’ to ‘RESOLVING_PENDING_FAILOVER’. The state changed because of a user initiated failover. For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
2020-07-25 16:02:03.790 Logon Error: 18456, Severity: 14, State: 5.
2020-07-25 16:02:03.790 Logon Login failed for user ‘JBS\JBSAG2$’. Reason: Could not find a login matching the name provided. [CLIENT: ]
2020-07-25 16:02:13.270 spid82s A connection timeout has occurred on a previously established connection to availability replica ‘JBSAG1’ with id [850DE16E-2C2B-43B1-BF52-6D8C5E2046FF]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.
2020-07-25 16:08:58.730 spid81 The state of the local availability replica in availability group ‘JBSAG’ has changed from ‘RESOLVING_PENDING_FAILOVER’ to ‘RESOLVING_NORMAL’. The state changed because of a user initiated failover. For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
2020-07-25 16:08:58.730 spid83s Always On Availability Groups connection with primary database terminated for secondary database ‘JBSDBDW’ on the availability replica ‘JBSAG1’ with Replica ID: {850de16e-2c2b-43b1-bf52-6d8c5e2046ff}. This is an informational message only. No user action is required.
2020-07-25 16:08:58.730 spid25s Always On Availability Groups connection with primary database terminated for secondary database ‘StackOverflow2010’ on the availability replica ‘JBSAG1’ with Replica ID: {850de16e-2c2b-43b1-bf52-6d8c5e2046ff}. This is an informational message only. No user action is required.
2020-07-25 16:08:58.730 spid82s The availability group database “JBSDBDW” is changing roles from “SECONDARY” to “RESOLVING” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
2020-07-25 16:08:58.730 spid25s The availability group database “StackOverflow2010” is changing roles from “SECONDARY” to “RESOLVING” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
2020-07-25 16:08:58.740 spid25s The availability group database “JBSDBDW” is changing roles from “RESOLVING” to “SECONDARY” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
2020-07-25 16:08:58.740 spid82s The availability group database “StackOverflow2010” is changing roles from “RESOLVING” to “SECONDARY” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.

-> This will affect automatic failover also incase primary replica goes down.

-> Checked if NT Authority\System is created and has appropriate permission.

-> It seems like NT Authority\System was removed from SQL Server Instance.

-> Recreated NT Authority\System on SQL Server instance and provide required permission.

USE [master]
GO
CREATE LOGIN [NT AUTHORITY\SYSTEM] FROM WINDOWS WITH DEFAULT_DATABASE=[master]
GO 

GRANT ALTER ANY AVAILABILITY GROUP TO [NT AUTHORITY\SYSTEM]
GO
GRANT CONNECT SQL TO [NT AUTHORITY\SYSTEM]
GO
GRANT VIEW SERVER STATE TO [NT AUTHORITY\SYSTEM]
GO 

-> I tried the failover again after creating NT Authority\System and it worked fine,

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.

A connection timeout has occurred while attempting to establish a connection to availability replica ‘Node’ with id []. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.

Environment

-> Below message can be seen in the SQL Server errorlog that has Always ON Availability group configured.

A connection timeout has occurred while attempting to establish a connection to availability replica ‘JBSAG3’ with id [322C3054-0352-4F45-A2C2-9328C354051F]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.

-> When a Session timeout occurs between two availability replicas, the availability replicas assume that a failure has occurred and declares a soft error.

-> Failure due to soft error that might cause timeouts are as below,

  • Network errors such as TCP link time-outs, dropped or corrupted packets, or packets that are in an incorrect order.
  • An operating system, server, or database that is not responding.
  • A Windows server timing out.
  • Insufficient computing resources, such as a CPU or disk overload, the transaction log filling up, or the system is running out of memory or threads. In these cases, you must increase the time-out period, reduce the workload, or change the hardware to handle the workload.

-> Above details with regards to soft error is copied from this article.

-> Session timeout value essentially protects SQL Server instance from soft errors by avoiding it to indefinitely wait for a response from other replicas. Always on Availability group uses session time out mechanism by sending out a ping to all connected replicas at a fixed interval. The Replicas ping each other to show that they are still active. If no ping is received from the other replica within the session timeout period, the connection times out and the replica enters the disconnected state.

-> Check the existing Session time out value. Right click Availability group and click properties,

-> The default value for Session timeout is 10 seconds.

-> This message below is logged in SQL Server Error log usually when no ping is received from other replica.

A connection timeout has occurred while attempting to establish a connection to availability replica ‘JBSAG3’ with id [322C3054-0352-4F45-A2C2-9328C354051F]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.

-> To identify the root cause of this issue we can start a continuous ping with a timestamp embedded into it till the issue occurs next time using below powershell command. 

From JBSAG1, ping JBSAG2, JBSAG3. 
From JBSAG2, ping JBSAG1, JBSAG2. 
From JBSAG3, ping JBSAG1, JBSAG2. 

ping.exe -t JBSAG1|Foreach{"{0} - {1}" -f (Get-Date),$_} > C:\temp\ping\JBSAG1.txt
ping.exe -t JBSAG2|Foreach{"{0} - {1}" -f (Get-Date),$_} > C:\temp\ping\JBSAG2.txt
ping.exe -t JBSAG3|Foreach{"{0} - {1}" -f (Get-Date),$_} > C:\temp\ping\JBSAG3.txt

-> When you check the output file after the issue occurred or message is evident in the SQL Server errorlog, you will be able to see a “request timed out” message in the output file.

-> With this details networking team can be contacted to understand the reason why there are ping drops.

-> It could be due to a temporary network issue or some VM level backups happening. Root cause and fix will be required from Networking team.

-> In the meantime, you can increase the session timeout value to a higher value. Increasing the value controls how many seconds that an availability replica waits for a ping response from a connected replica before considering the connection to have failed. By default it is 10 seconds.

-> I increased it to 60 till the network team came out with a solution. In my case it was a temporary ping drop, so 60 seconds value helped me.

-> How to change the session timeout value. Right click availability group and click properties,

-> Once the packet drop was identified (due to VM snapshot in my case) and fixed. I changed the session timeout value back to 10.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.