Proactively Managing Transactional Replication Latency with SQL Server

Transactional replication is a critical component of many SQL Server environments, providing high availability, load balancing, and other essential benefits. However, managing replication latency, the delay between an action occurring on the publisher and it being reflected on the subscriber, is vital for ensuring system performance and data integrity. In this blog post, we’ll explore a proactive approach to monitor and alert on replication latency, helping database administrators (DBAs) maintain optimal system health.

The Issue:

Replication latency can sometimes go unnoticed until it impacts the system performance or data accuracy, leading to potential data loss or business disruptions. Traditional monitoring techniques may not provide real-time alerts or may require significant manual intervention, making them less effective for immediate latency identification and resolution.

The Script:

To address this challenge, we introduce a SQL script designed by Vivek Janakiraman from JBSWiki, specifically crafted to monitor transactional replication latency in SQL Server environments. This script efficiently posts tracer tokens to specified publications and measures the time taken for these tokens to move through the replication components, providing a clear picture of any latency present in the system.

/*
Author: Vivek Janakiraman
Company: JBSWiki
Description: This script is used to alert in case there is Transactional replication Log reader or distribution agent latency.
It posts tracer tokens to specified publications and measures the latency to the distributor and subscriber.
*/

-- Switch to the publisher database to insert tracer tokens.
USE [Publisher_Database_Here] -- Use your publisher database name here.
-- Insert tracer tokens into the specified publications.
EXEC sys.sp_posttracertoken @publication = 'Publication_Name' -- Change appropriate Publication that should be monitored.
EXEC sys.sp_posttracertoken @publication = 'Publication_Name1' -- Change appropriate Publication that should be monitored.
-- Wait for 5 minutes to allow the tokens to propagate.
WAITFOR DELAY '00:05:00'

-- Switch to the distribution database to query latency information.
USE distribution
;WITH LatestEntries AS (
-- Select the latest entries for each publication and agent.
SELECT publication_id, agent_id, MAX(publisher_commit) AS MaxDate
FROM MStracer_tokens t
JOIN MStracer_history h ON t.tracer_id = h.parent_tracer_id
GROUP BY publication_id, agent_id
)
-- Select latency information for the latest tokens.
SELECT c.name, t.publication_id, h.agent_id, t.publisher_commit,
ISNULL(DATEDIFF(s,t.publisher_commit,t.distributor_commit), 299) as 'Time To Dist (sec)',
ISNULL(DATEDIFF(s,t.distributor_commit,h.subscriber_commit), 299) as 'Time To Sub (sec)'
INTO #REPL_LATENCY
FROM MStracer_tokens t
JOIN MStracer_history h ON t.tracer_id = h.parent_tracer_id
JOIN distribution.dbo.MSdistribution_agents c ON h.agent_id = c.id
JOIN LatestEntries le ON t.publication_id = le.publication_id AND h.agent_id = le.agent_id AND t.publisher_commit = le.MaxDate
ORDER BY t.publisher_commit DESC

-- Check if there is any latency beyond acceptable limits and select those records.
IF EXISTS (SELECT 1 FROM #REPL_LATENCY WHERE ([Time To Dist (sec)] > 30 OR [Time To Sub (sec)] > 30))
BEGIN
SELECT name, publication_id, agent_id, publisher_commit, [Time To Dist (sec)], [Time To Sub (sec)]
INTO #REPL_LATENCY_Email
FROM #REPL_LATENCY
WHERE ([Time To Dist (sec)] > 30 OR [Time To Sub (sec)] > 30)
END

-- Prepare the HTML body content for the email alert.
DECLARE @body_content NVARCHAR(MAX);
SET @body_content = N'
<style>
table.GeneratedTable {
width: 100%;
background-color: #D3D3D3;
border-collapse: collapse;
border-width: 2px;
border-color: #A9A9A9;
border-style: solid;
color: #000000;
}
table.GeneratedTable td, table.GeneratedTable th {
border-width: 2px;
border-color: #A9A9A9;
border-style: solid;
padding: 3px;
}
table.GeneratedTable thead {
background-color: #A9A9A9;
}
</style>
<table class="GeneratedTable">
<thead>
<tr>
<th>name</th>
<th>publication_id</th>
<th>agent_id</th>
<th>publisher_commit</th>
<th>[Time To Dist (sec)]</th>
<th>[Time To Sub (sec)]</th>
</tr>
</thead>
<tbody>' +
CAST(
(SELECT td = name, '',
td = publication_id, '',
td = agent_id, '',
td = publisher_commit, '',
td = [Time To Dist (sec)], '',
td = [Time To Sub (sec)], ''
FROM [dbo].#REPL_LATENCY_Email
FOR XML PATH('tr'), TYPE
) AS NVARCHAR(MAX)
) +
N'</tbody>
</table>';

-- Send an email alert if there is any latency issue found.
IF EXISTS (SELECT TOP 1 * FROM [dbo].#REPL_LATENCY_Email)
BEGIN
EXEC msdb.dbo.sp_send_dbmail @profile_name = 'JBSWIKI',
@body = @body_content,
@body_format = 'HTML',
@recipients = 'jvivek2k1@yahoo.com',
@subject = 'ALERT: Transactional Replication Latency Alert';
END

-- Cleanup temporary tables.
DROP TABLE #REPL_LATENCY
DROP TABLE #REPL_LATENCY_Email

The Solution:

The script works by first posting tracer tokens to the specified publications within the publisher database. It then waits for a predetermined amount of time (defaulted to 5 minutes in the script) to allow the tokens to propagate through the system. Following this, the script measures the latency to the distributor and subscriber, providing a detailed report of the time taken in each stage of the replication process.

This information is then used to generate an HTML-formatted email alert if the latency exceeds predefined thresholds (30 seconds in the provided script), allowing for immediate action to be taken. The use of HTML formatting in the email ensures that the information is presented in an easily digestible format, facilitating quick understanding and response by the DBA.

Conclusion:

Proactive monitoring and management of transactional replication latency are paramount for maintaining the health and performance of SQL Server environments. The script provided offers a straightforward and effective solution for DBAs to stay ahead of potential replication issues. By automating the process of latency detection and alerting, this approach not only saves valuable time but also helps in preventing the negative impact of replication latency on business operations.

Remember, while this script serves as a valuable tool in your monitoring arsenal, it’s also important to tailor the solution to your specific environment and requirements. Regularly reviewing and adjusting the latency thresholds and monitoring frequency will ensure you continue to get the most out of your replication setup.

For more tutorials and tips on SQL Server, including performance tuning and database management, be sure to check out our JBSWiki YouTube channel.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.

SQL Server Always On Series: Availability group DDL operations are permitted only when you are using the master database


Introduction

During a recent attempt to perform a manual failover from Always On Availability Group from JBSAG1 to JBSAG2 using Microsoft SQL Server Management Studio (SSMS), an unexpected error disrupted the process. The error, specifically identified as Microsoft SQL Server Error 35208, posed a challenge to the manual failover operation. In this detailed account, we’ll explore the encountered issue, outline the steps taken to address it, and ultimately achieve a successful manual failover.

Issue
Upon initiating the Always On Availability Group manual failover, the process encountered a hurdle with the following error:

TITLE: Microsoft SQL Server Management Studio
Manual Failover failed (Microsoft.SqlServer.Management.HadrTasks)

ADDITIONAL INFORMATION:
Failed to perform a manual failover of the availability group ‘JBSWiki’ to server instance ‘JBSAG2’. (Microsoft.SqlServer.Management.HadrModel)

For help, click: https://go.microsoft.com/fwlink?ProdName=Microsoft+SQL+Server&ProdVer=16.100.47021.0&EvtSrc=Microsoft.SqlServer.Management.Smo.ExceptionTemplates.FailedOperationExceptionText&LinkId=20476

An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)

Availability group DDL operations are permitted only when you are using the master database. Run the ‘USE master’ statement, and retry your availability group DDL statement. (Microsoft SQL Server, Error: 35208)
For help, click: https://docs.microsoft.com/sql/relational-databases/errors-events/mssqlserver-35208-database-engine-error

Navigating through this error was the initial challenge, particularly when attempting the failover using the SSMS graphical user interface (GUI). Executing the ‘USE master’ statement in this context presented uncertainties, prompting a reevaluation of the failover approach.

Solution
To address the dilemma, the following steps were taken:

GUI Failover Attempt:

  • Initially, the failover was attempted through the SSMS GUI, raising questions about how to execute the ‘USE master’ statement within the graphical interface.

Scripted Failover Action:

  • The failover wizard was restarted, and instead of concluding the process through the GUI, the failover action was scripted for manual execution. I started the failover wizard again and this time instead of clicking finish at the end, tried scripting the failover action.

Manual Execution of Failover Command:

  • The failover command was manually executed from the SSMS SQLCMD query window, successfully completing the failover process.

Database Context Discovery:

  • Investigation revealed that the database context for a new query window was set to a user database (JBDB) instead of ‘master,’ leading to a pivotal realization. I tried clicking on a new query window and I saw that the database context for that query window was set to an user database JBDB and not master.

Connection Options Adjustment:

  • This is when I realized that I might have connected to the SQL Server instance with an User Database specified on the “Connect to database” in “Options <<” as part of making a connection from SSMS. Please check screenshot below,

After realizing this, I disconnected the existing sessions on SSMS and changed “Connect to database” in “Options <<” to “Master” and connected to SQL server JBSAG1 and JBSAG2.

Success After Correction:

  • Following these corrective actions, subsequent manual failover and failback attempts via SSMS were executed seamlessly.

Summary
In summary, this journey through troubleshooting manual failover with error 35208 underscored the importance of the database context, especially when initiating DDL operations. The solution involved a meticulous adjustment of connection options within SSMS, ensuring a connection to the ‘master’ database before attempting manual failover. This article provides a detailed account of the encountered challenge, the thought process behind the solution, and the successful resolution achieved through careful steps and insights.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.