Application Timeout with Multi-subnet Always On Availability Group

-> I recently worked on an Application timeout issue involving Multi-subnet Always On availability group.

Environment

AG_Multisubnet.PNG

-> Environment consists of two (2) Windows 2016 virtual machines. One server is located in the Primary Datacentre and the second in the Secondary Datacentre.

-> These servers were configured as nodes of a two (2) node Windows Server Failover Cluster (without shared storage). An instance of SQL Server 2016 Enterprise Edition is installed on each server; each instance act as an Always-On Availability Group Replica.

-> The replica in the Primary Datacentre will be a member of an Automatic Failover Set with synchronous commit to the replica in secondary Datacentre. Latency between Primary and secondary Datacentre is less than 2 MS.

-> This environment is configured as a Multi-Subnet environment.

-> Application team advised of intermittent connectivity issue .

Issue

-> The availability group Listener is configured with an IP address from each defined  subnet. This means that the Availability group listener will have an IP address of 192.150.10.15 when it resides in Primary datacentre and 192.150.0.15 when it resides on Secondary datacentre.

-> Client Operating system queries the DNS server to resolve the Listener name to IP address. DNS will return 2 IP address in this environment with one (1) IP address that the subnet currently hosting AG Primary replica will be online and the other IP Address that the subnet hosting secondary replica will be offline. Client application using the AlwaysOn Listener can have connectivity issues while connecting to it.

-> Application tries all IP Address one by one and connects to the one that is online. Since this is done serially, there is high possibility that the application reaches its timeout value and the connection terminates with a timeout error.

Workaround\Fix

-> Adding MultiSubnetFailover parameter to True in Application connection string. When True, Application connection tries all AG Listener IP Address in parallel. This will avoid the connection timeout caused when trying serially.

-> If adding MultiSubnetFailover parameter to the Application Connection String is not possible. Then Below should be performed.

-> Parameter RegisterAllprovidersIP should be set to 0. When RegisterAllprovidersIP is set to 0, only the active Listener IP Address is registered in the DNS. When set to 1 (DEFAULT), all of the IP Address the Listener is dependent on is registered with the DNS.


# Note down the LISTENER name from the output of below command
PS C:\Windows\system32> Get-ClusterResource

#  Replace LISTENERNAME_FROM_OUTPUT with the value noted from above output. Check RegisterAllprovidersIP value and it should be 1
PS C:\Windows\system32> Get-ClusterResource -Name LISTENERNAME_FROM_OUTPUT | Get-ClusterParameter

# Execute below to set RegisterAllProvidersIP to 0
Get-ClusterResource -Name LISTENERNAME_FROM_OUTPUT | Set-ClusterParameter RegisterAllProvidersIP 0

# Check RegisterAllprovidersIP value and see if it set to 0
PS C:\Windows\system32> Get-ClusterResource -Name LISTENERNAME_FROM_OUTPUT | Get-ClusterParameter

-> Parameter HostRecordTTL should be set with a value 60 to 300 from the default value of 1200 (20 Minutes). HostRecordTTL parameter decides how long in seconds the Client operating system will query the DNS for the current IP address. Reducing this value can have an adverse effect on your DNS server if there are several servers connecting to the DNS to resolve the Listener IP Address. Hence it is advised as 60 to 300. Personally, I have set this value to 60 and never seen any issue.


# Note down the LISTENER name from the output of below command.
PS C:\Windows\system32> Get-ClusterResource

# Replace LISTENERNAME_FROM_OUTPUT with the value noted from above output. Check HostRecordTTL value and it should be 1200
PS C:\Windows\system32> Get-ClusterResource -Name LISTENERNAME_FROM_OUTPUT | Get-ClusterParameter

# Execute below to set HostRecordTTL to 60
Get-ClusterResource -Name LISTENERNAME_FROM_OUTPUT | Set-ClusterParameter HostRecordTTL 60

# Check HostRecordTTL value and see if it set to 60
PS C:\Windows\system32> Get-ClusterResource -Name LISTENERNAME_FROM_OUTPUT | Get-ClusterParameter

-> Failover the Availability group to have the above changes take effect.

-> Speak to your DNS team and check for the Listener’s Host (A) record in the DNS Server and confirm that these records are not static.

Testing

-> Once the above tasks are performed. Run a “nslookup <Listener_Name>” on each of the replica and make sure you see the IP Address with respect to the subnet currently hosting AG Primary replica on all your replica. It may take sometime to get the correct IP Address on the replica that is not in the subnet currently hosting AG Primary replica. This depends on the DNS replication schedule. Note down the delay and speak to your DNS and Application team. If they are fine with this delay, I think everything is set. If they are not fine, then your DNS Team will basically have to change the DNS Sync as appropriate.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.

Advertisements

Finding Transaction latency in an Always on Synchronous replica

1) Execute below query on your Always on Primary replica,


-- Create required objects to collect the data
use [Tempdb]
GO
IF OBJECT_ID('Perfcounters') IS NOT NULL
DROP TABLE Perfcounters
GO
CREATE TABLE Perfcounters
(CapturedTime DATETIME NOT NULL,
[Transaction Delay] BIGINT, [Mirrored Write Transactions/sec] BIGINT
)
GO

ALTER TABLE [dbo].Perfcounters
ADD [Transaction Delay in MS]
AS (convert(decimal(5,2),[Transaction Delay]*1.0/[Mirrored Write Transactions/sec]*1.0))
PERSISTED NOT NULL

2) Below query can be executed in a query window or as a SQL Server agent job in primary replica. Remember to change “Instance_Name=’Alwayson Database Name'” to appropriate Alwayson database name before executing the query,


use [Tempdb]
GO
set nocount on
while (1=1)
begin
insert into Perfcounters
SELECT CURRENT_TIMESTAMP as Time
,pt.* FROM(SELECT RTRIM(object_name) + ' : ' + counter_name CounterName
, cntr_value from sys.dm_os_performance_counters
where Instance_Name='Alwayson Database Name' and counter_name in ('Transaction Delay','Mirrored Write Transactions/sec')) as SourceData
PIVOT
(SUM(cntr_value)
FOR CounterName
IN ([SQLServer:Database Replica : Transaction Delay],[SQLServer:Database Replica : Mirrored Write Transactions/sec])
) pt
waitfor delay '00:00:03'  -- Change this as per your requirement for collection
End

3) While the above query is getting executed from your query window or SQL Server agent job, you can execute the below query to get the latency values,


use [Tempdb]
GO
select * from Perfcounters order by CapturedTime desc

4) Remember to stop the query running as part of step 2 in this article on query window or as a SQL Agent job when not required.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.

Listener entry in sys.availability_group_listeners exists even after related Availability group is deleted

Environment

Alwayson_Visio.JPG

-> A request comes in to drop the existing Always-on availability group JBDB from above environment.

-> Alwayson Availability group should be deleted/dropped by connecting to Primary Replica JBAG1 using SQL Server Management Studio.

-> But, someone tried deleting the Availability group from SQL Server management studio connected using the LISTENER JBAPP and got below error,

TITLE: Microsoft SQL Server Management Studio
——————————
Drop failed for Availability Group ‘JBDB’. (Microsoft.SqlServer.Management.SDK.TaskForms)
For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft+SQL+Server&ProdVer=14.0.17224.0+((SSMS_Rel_17_4).180212-1043)&EvtSrc=Microsoft.SqlServer.Management.Smo.ExceptionTemplates.FailedOperationExceptionText&EvtID=Drop+AvailabilityGroup&LinkId=20476
——————————
ADDITIONAL INFORMATION:
An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)
——————————
A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 – Could not open a connection to SQL Server) (Microsoft SQL Server, Error: 53)
For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft%20SQL%20Server&EvtSrc=MSSQLServer&EvtID=53&LinkId=20476
——————————
The network path was not found
——————————
BUTTONS:
OK
——————————

Blog50_1.JPG

-> When “Always On High Availability” folder for SQL Server Instance JBAG1 which is the primary is refreshed from SQL Server management studio, “Availability group” JBDB is not available.

-> But you will see Availability group JBDB “Resolving” at the secondary replica “JBAG2”. At this point you delete the Availability group JBDB from your secondary replica JBAG2 from SSMS.

-> When connected to cluster administrator (cluadmin.msc), you will not see roles related to Availability group JBDB visible.

-> Lets try creating another Availability group with same name JBDB that contains listener JBAPP. It fails with below error,

TITLE: Microsoft SQL Server Management Studio
——————————
Create failed for Availability Group ‘JBDB’. (Microsoft.SqlServer.Management.HadrModel)
For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft+SQL+Server&ProdVer=14.0.17224.0+((SSMS_Rel_17_4).180212-1043)&EvtSrc=Microsoft.SqlServer.Management.Smo.ExceptionTemplates.FailedOperationExceptionText&EvtID=Create+AvailabilityGroup&LinkId=20476
——————————
ADDITIONAL INFORMATION:
An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)
——————————
Failed to create availability group ‘JBDB’, because a Windows Server Failover Cluster (WSFC) group with the specified name already exists. The operation has been rolled back successfully. To retry creating an availability group, either remove or rename the existing WSFC group, or retry the operation specifying a different availability group name. (Microsoft SQL Server, Error: 41171)
For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft%20SQL%20Server&ProdVer=14.00.1000&EvtSrc=MSSQLServer&EvtID=41171&LinkId=20476
——————————
BUTTONS:
OK
——————————

Blog50_2.JPG

-> The error advises us that the ROLE JBDB is already exists in cluster administrator. But we have checked it and it is not present.

-> Execute below query on JBAG1 which was acting as the primary replica when Availability group JBDB was deleted from SSMS using Listener name,


select * from sys.availability_group_listeners
go
select * from sys.availability_groups

Blog50_3.JPG

-> You see an orphaned listener entry under system table sys.availability_group_listeners. But no corresponding entries in sys.availability_groups.

-> Lets first try restarting the SQL Services on JBAG1 and then try running the below queries,

 

select * from sys.availability_group_listeners
go
select * from sys.availability_groups 

-> The orphaned entry is no longer present in sys.availability_group_listeners after the SQL Server restart.

-> Lets try creating Availability group with same name JBDB that contains listener JBAPP and it works fine this time.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.