Always ON – Growing LOG SEND QUEUE on Asynchronous replica

Environment

-> Database server JBSAG2 and JBSAG1 are on primary datacentre located in East US.

-> Database server JBSAG3 is on secondary datacentre in Korea Central.

-> Database servers JBSAG1, JBSAG2 and JBSAG3 are part of failover cluster without shared storage.

-> The Log send queue on Asynchronous replica JBSAG3 was continuously growing as shown in below screenshot,

-> It was very clear that there was no issue on the Synchronous replica JBSAG1, as the log send queue was well under control.

-> On the other hand log send queue is huge on Asynchronous replica JBSAG3.

-> I have captured perfmon on database servers JBSAG1, JBSAG2 and JBSAG3. Below are the analysis,

Primary Replica -> JBSAG2

1) Primary JBSAG2 to its Log file

(SQLServer:Databases – Log Bytes Flushed/sec)  refers to how much primary is flushing per second to the log file. In this scenario, we are able to see that for databases JBDB, JBSDBDW and StackOverflow2010 Primary is flushing close to 5+ MB/sec maximum and 1 MB/sec Average to the log file.
We will have to sum Log Bytes Flushed/sec for each database to come to a final value.

2) Primary JBSAG2 to JBSAG1 via network

(SQLServer:Availability Replica -> Bytes Sent to Replica/Sec) refers to how much is sent across the network to the secondary replica. In our case we are able to see that maximum of 5+ MB/sec and 1+ MB/sec on an average sent to secondary JBSAG1 from Primary JBSAG2. This is same as SQLServer:Databases – Log Bytes Flushed/sec which is good.

3) Primary JBSAG2 to JBSAG3 via network

(SQLServer:Availability Replica -> Bytes Sent to Replica/Sec) refers to how much is sent across the network to the secondary replica. In our case we are able to see that maximum of 600+ KB/sec and 400+ KB/sec on an average sent to secondary JBSAG3 from Primary JBSAG2.

It is pretty clear that Secondary JBSAG1 receives data to the tune of 5+ MB/sec which is close to what the primary flushes, whereas Secondary JBSAG3 receives only 600 + KB/sec which is comparably very less.

Secondary Replica -> JBSAG1

1) Secondary JBSAG1 Received from Primary JBSAG2

(SQLServer:Database Replica -> Log Bytes Received/Sec) refers to how much data we receive in Secondary from Primary replica. In our case we are able to see that maximum of 5+ MB/sec and 500+ KB/sec on an average sent to secondary JBSAG1 from Primary JBSAG2. We will have to sum Log Bytes Received/sec for each database to come to a final value.

This is same as SQLServer:Databases – Log Bytes Flushed/sec in Primary, which is good.

2) Redoing received data on Secondary

(SQLServer:Database Replica->Redone Bytes/Sec) refers to how much data will be redone on secondary. In our case we are able to see that maximum of 5+ MB/sec and 500+ KB/sec on an average was redone on secondary JBSAG1. We will have to sum Redone Bytes/sec for each database to come to a final value.

This is same as Log Bytes Received/sec which is good.

Secondary Replica -> JBSAG3

1) Secondary JBSAG3 Received from Primary JBSAG2

(SQLServer:Database Replica -> Log Bytes Received/Sec) refers to how much data we receive in Secondary from Primary replica. In our case we are able to see that maximum of 1+ MB/sec and 200+ KB/sec on an average sent to secondary JBSAG3 from Primary JBSAG2. We will have to sum Log Bytes Received/sec for each database to come to a final value.

This is much lower than what we sent from primary as marked as SQLServer:Databases – Log Bytes Flushed/sec in Primary. We sent 5 MB/sec, but we received just 1 MB/sec. So there is a lag of 4+ MB/sec which is not good.

2) Redoing received data on Secondary

(SQLServer:Database Replica->Redone Bytes/Sec) refers to how much data will be redone on secondary. In our case we are able to see that maximum of 1+ MB/sec and 200+ KB/sec on an average was redone on secondary JBSAG3. We will have to sum Redone Bytes/sec for each database to come to a final value. We are redoing what we received from the primary. Redone Bytes/Sec is same as Log Bytes Received/Sec.

From the above analysis we understand below,

JBSAG2 -> Primary flushes 5 MB/sec to log file -> Primary sends 5 MB/sec to secondary via network.

JBSAG1 -> Secondary receives 5 MB/sec via network -> Secondary redo is also around 5 MB/sec -> Log send queue is around 120+ KB.

JBSAG3 -> Secondary receives 1 MB/sec via network -> Secondary redo is also around 1 MB/sec -> Log send queue is around 750+ MB.

From the above analysis it is clear that Primary JBSAG2 is able to send 5 MB/sec to JBSAG1. but only 1 MB/sec to JBSAG3 and this results in Log send queue buildup in JBSAG3.

-> It seems like it is a network issue between primary and secondary datacentre.

-> We tried a robocopy from JBSAG2 to JBSAG1 and JBSAG3 using a 10 GB single file.

robocopy c:\temp\ \\jbsag1\temp file.zip
robocopy c:\temp\ \\jbsag3\temp file.zip

-> Robocopy indicated speed of 23 MB/sec for copy between JBSAG2 to JBSAG1.

-> Robocopy exhibited speed of 1 MB/sec for copy between JBSAG2 to JBSAG3.

-> Robocopy tests were performed just to rule out any SQL Server issue and to advise network team that a normal copy operation also performs badly on JBSAG3.

-> I involved network and Platforms team to check this further.

-> Network team advised that the network link seems normal and there is no bandwidth issue. Network team advised Platforms team to check this further as it seems like a guest machine issue.

-> In the meantime, we found a different 3 node Always on setup same as this setup with same primary and secondary datacentre. The other 3 node Always On setup did not have this issue. We found that the Asynchronous secondary server in working setup was on a different host machine in the secondary datacentre than the problematic Asynchronous secondary server where the problem is visible.

-> Platforms team migrated the problematic Asynchronous secondary server to the host machine where the working setup’s Asynchronous secondary server was present.

-> This did not resolve the issue. Platforms team took a netsh trace from JBSAG2 to JBSAG1, JBSAG3.

-> The traces showed that during the three way handshake, the TCP window sizes are negotiated as no scaling factor negotiated. This lead to the suspicion with the TCP global parameters for auto tuning could be a cause of this issue. By default, this setting is enabled.

-> We checked this parameter using below command on servers JBSAG1, JBSAG2 and JBSAG3.

netsh interface tcp show global

-> TCP global parameter for auto tuning was in disabled state on JBSAG1 and JBSAG3.

-> TCP global parameter for auto tuning was enabled along with RSS using below command on JBSAG1 and JBSAG3,

netsh interface tcp set global rss=enabled autotuninglevel=Normal

-> Once these were enabled the transfer speeds increased and the TCP window size was negotiated to the highest.

-> This resolved our issue and the Log send queue size on JBSAG3 came under control.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.

Always ON – The connection to the primary replica is not active. The command cannot be processed. (Microsoft SQL Server, Error: 35250)

Environment

-> I was trying to add a new database to availability group an got below error,

TITLE: Microsoft SQL Server Management Studio

Failed to join the database ‘JBDB’ to the availability group ‘JBSAG’ on the availability replica ‘JBSAG2’. (Microsoft.SqlServer.Management.HadrModel)

For help, click: https://go.microsoft.com/fwlink?ProdName=Microsoft+SQL+Server&ProdVer=16.100.37971.0&EvtSrc=Microsoft.SqlServer.Management.Smo.ExceptionTemplates.FailedOperationExceptionText&LinkId=20476

ADDITIONAL INFORMATION:
An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)

The connection to the primary replica is not active. The command cannot be processed. (Microsoft SQL Server, Error: 35250)
For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft%20SQL%20Server&ProdVer=14.00.3294&EvtSrc=MSSQLServer&EvtID=35250&LinkId=20476

BUTTONS:
OK

-> Executing below code to check if endpoint is started on all nodes in SQLCMD mode,

:Connect JBSAG1
select name, state_desc, port from sys.tcp_endpoints where type_desc='DATABASE_MIRRORING'
go
:Connect JBSAG2
select name, state_desc, port from sys.tcp_endpoints where type_desc='DATABASE_MIRRORING'
go
:Connect JBSAG3
select name, state_desc, port from sys.tcp_endpoints where type_desc='DATABASE_MIRRORING'
go

-> Endpoint on JBSAG2 is down.

-> I have started it with below code in SQLCMD mode,

:Connect JBSAG2
alter endpoint [Hadr_endpoint] state = started
go

-> Adding database to Availability group works fine without issues after above fix.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.

Always ON Availability group in RESOLVING state

Environment

-> I tried failing over Availability group from JBSAG1 to JBSAG2.

-> The failover wizard started executing and was taking much time,

-> Below was the Availability group state,

-> It failed after some time with below message,

Manual Failover failed (Microsoft.SqlServer.Management.HadrTasks)

TITLE: Microsoft SQL Server Management Studio
ADDITIONAL INFORMATION:

Failed to perform a manual failover of the availability group ‘JBSAG’ to server instance ‘JBSAG2’. (Microsoft.SqlServer.Management.HadrModel)
For help, click: https://go.microsoft.com/fwlink?ProdName=Microsoft+SQL+Server&ProdVer=16.100.37971.0&EvtSrc=Microsoft.SqlServer.Management.Smo.ExceptionTemplates.FailedOperationExceptionText&LinkId=20476

An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)

Failed to bring availability group ‘JBSAG’ online. The operation timed out. If this is a Windows Server Failover Clustering (WSFC) availability group, verify that the local WSFC node is online. Then verify that the availability group resource exists in the WSFC cluster. If the problem persists, you might need to drop the availability group and create it again. (Microsoft SQL Server, Error: 41131)

For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft%20SQL%20Server&ProdVer=14.00.3294&EvtSrc=MSSQLServer&EvtID=41131&LinkId=20476

BUTTONS:
OK

-> Checking SQL Server Errorlog,

JBSAG2

2020-07-25 16:02:03.210 spid81 The state of the local availability replica in availability group ‘JBSAG’ has changed from ‘SECONDARY_NORMAL’ to ‘RESOLVING_PENDING_FAILOVER’. The state changed because of a user initiated failover. For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
2020-07-25 16:02:03.790 Logon Error: 18456, Severity: 14, State: 5.
2020-07-25 16:02:03.790 Logon Login failed for user ‘JBS\JBSAG2$’. Reason: Could not find a login matching the name provided. [CLIENT: ]
2020-07-25 16:02:13.270 spid82s A connection timeout has occurred on a previously established connection to availability replica ‘JBSAG1’ with id [850DE16E-2C2B-43B1-BF52-6D8C5E2046FF]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.
2020-07-25 16:08:58.730 spid81 The state of the local availability replica in availability group ‘JBSAG’ has changed from ‘RESOLVING_PENDING_FAILOVER’ to ‘RESOLVING_NORMAL’. The state changed because of a user initiated failover. For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
2020-07-25 16:08:58.730 spid83s Always On Availability Groups connection with primary database terminated for secondary database ‘JBSDBDW’ on the availability replica ‘JBSAG1’ with Replica ID: {850de16e-2c2b-43b1-bf52-6d8c5e2046ff}. This is an informational message only. No user action is required.
2020-07-25 16:08:58.730 spid25s Always On Availability Groups connection with primary database terminated for secondary database ‘StackOverflow2010’ on the availability replica ‘JBSAG1’ with Replica ID: {850de16e-2c2b-43b1-bf52-6d8c5e2046ff}. This is an informational message only. No user action is required.
2020-07-25 16:08:58.730 spid82s The availability group database “JBSDBDW” is changing roles from “SECONDARY” to “RESOLVING” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
2020-07-25 16:08:58.730 spid25s The availability group database “StackOverflow2010” is changing roles from “SECONDARY” to “RESOLVING” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
2020-07-25 16:08:58.740 spid25s The availability group database “JBSDBDW” is changing roles from “RESOLVING” to “SECONDARY” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
2020-07-25 16:08:58.740 spid82s The availability group database “StackOverflow2010” is changing roles from “RESOLVING” to “SECONDARY” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.

-> This will affect automatic failover also incase primary replica goes down.

-> Checked if NT Authority\System is created and has appropriate permission.

-> It seems like NT Authority\System was removed from SQL Server Instance.

-> Recreated NT Authority\System on SQL Server instance and provide required permission.

USE [master]
GO
CREATE LOGIN [NT AUTHORITY\SYSTEM] FROM WINDOWS WITH DEFAULT_DATABASE=[master]
GO 

GRANT ALTER ANY AVAILABILITY GROUP TO [NT AUTHORITY\SYSTEM]
GO
GRANT CONNECT SQL TO [NT AUTHORITY\SYSTEM]
GO
GRANT VIEW SERVER STATE TO [NT AUTHORITY\SYSTEM]
GO 

-> I tried the failover again after creating NT Authority\System and it worked fine,

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.