Nurturing the Heartbeat: A Comprehensive Guide to Always On Availability Group Health Checks with T-SQL

Introduction 🌐

In the dynamic landscape of SQL Server, ensuring the robust health of your Always On Availability Groups (AG) is crucial for maintaining high availability and minimizing downtime. This blog serves as your compass to navigate the intricacies of AG health checks, providing insights into the vitality of your primary replica. Let’s embark on a journey to fortify the resilience of your SQL Server environment.

Requirement 🚦

Maintaining the optimal health of your Always On Availability Group demands continuous vigilance. Proactive monitoring enables database administrators to identify potential issues before they escalate. The T-SQL script shared below equips you with a powerful diagnostic tool to assess the health of your AGs, ensuring a reliable and resilient SQL Server environment.

T-SQL Script πŸ”

DECLARE @HADRName varchar(25)

SET @HADRName = @@SERVERNAME

SELECT
n.group_name,
n.replica_server_name,
n.node_name,
rs.role_desc,
db_name(drs.database_id) AS 'DBName',
drs.synchronization_state_desc,
drs.synchronization_health_desc
FROM
sys.dm_hadr_availability_replica_cluster_nodes n
JOIN sys.dm_hadr_availability_replica_cluster_states cs ON n.replica_server_name = cs.replica_server_name
JOIN sys.dm_hadr_availability_replica_states rs ON rs.replica_id = cs.replica_id
JOIN sys.dm_hadr_database_replica_states drs ON rs.replica_id = drs.replica_id
WHERE
n.replica_server_name <> @HADRName

Script Explanation

  • group_name: Name of the availability group.
  • replica_server_name: Name of the replica server.
  • node_name: Name of the node in the cluster.
  • role_desc: Describes the role of the replica.
  • DBName: Name of the database.
  • synchronization_state_desc: Describes the synchronization state.
  • synchronization_health_desc: Describes the synchronization health.

Conclusion πŸŽ“

Executing the provided T-SQL script on the primary replica serves as a proactive measure to assess the health of your Always On Availability Groups. By scrutinizing critical metrics such as synchronization state and health, database administrators can anticipate and resolve issues, ensuring a resilient SQL Server environment.

In conclusion, incorporating regular health checks into your SQL Server maintenance routine is akin to nurturing the heartbeat of your database infrastructure. This script is not just a diagnostic tool; it’s a guardian of high availability and a testament to your commitment to database reliability. πŸ›‘οΈ

Stay tuned for more insights into SQL Server best practices, tips, and advanced techniques. Don’t forget to share this knowledge with your fellow SQL Server enthusiasts. Happy scripting! πŸ’» #SQLServer #AlwaysOn #HighAvailability #DatabaseManagement #TSQLScript

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided β€œAS IS” with no warranties, and confers no rights.

Ensuring Always On: A Guide to Verifying Availability Replica Connections with T-SQL πŸš€

Introduction 🌐

In the dynamic realm of SQL Server high availability, maintaining the connectivity of Always On Availability Groups is paramount. This blog serves as a comprehensive guide to ensuring your Availability Replicas are securely connected through the utilization of a T-SQL script. Let’s embark on a journey to validate the heartbeat of your SQL Server infrastructure.

Requirement 🚦

Ensuring the seamless operation of Always On Availability Groups demands continuous monitoring of the connection status between replicas. This becomes particularly crucial in scenarios where maintaining high availability and minimizing downtime are non-negotiable. The T-SQL script provided below equips you with a diagnostic tool to assess the connection health of your Availability Replicas.

T-SQL Script πŸ”

Checking Availability Replica Connections:

select r.replica_server_name, r.endpoint_url,

rs.connected_state_desc, rs.last_connect_error_description,
rs.last_connect_error_number, rs.last_connect_error_timestamp
from sys.dm_hadr_availability_replica_states rs join sys.availability_replicas r
on rs.replica_id=r.replica_id
where rs.is_local=1

Script Explanation

  • replica_server_name: Identifies the name of the Availability Replica.
  • endpoint_url: Specifies the URL of the endpoint.
  • connected_state_desc: Describes the current state of the connection.
  • last_connect_error_description: Provides a description of the last connection error.
  • last_connect_error_number: Displays the error number of the last connection error.
  • last_connect_error_timestamp: Timestamp of the last connection error.

Conclusion πŸŽ“

Regularly executing the provided T-SQL script on the Primary Replica allows for proactive identification and resolution of potential connectivity issues within your SQL Server Always On Availability Groups. Monitoring the connected state, error descriptions, and timestamps empowers database administrators to take preemptive actions, ensuring high availability and minimal disruptions.

In conclusion, this straightforward yet powerful script should be an integral part of your SQL Server maintenance toolkit. By embracing a proactive approach to availability, you safeguard the robustness of your database infrastructure, providing users with a seamless experience and guaranteeing the continuity of critical operations. πŸ›‘οΈ

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided β€œAS IS” with no warranties, and confers no rights.

Always ON Availability group in RESOLVING state

Environment

-> Application team advised that their application is not able to connect to the database that is part of Always on availability group.

-> Always on Availability group was in resolving state as shown below,

-> Always on Availability group roles were in pending state on the Cluster administrator,

-> Below were the messages found on event viewer,

Event ID: 41144
The local availability replica of availability group ‘JBSAG’ is in a failed state. The replica failed to read or update the persisted configuration data (SQL Server error: 41005). To recover from this failure, either restart the local Windows Server Failover Clustering (WSFC) service or restart the local instance of SQL Server.

Event ID: 1205
The Cluster service failed to bring clustered role ‘JBSAG’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role. Event ID: 1069
Cluster resource ‘JBSAG’ of type ‘SQL Server Availability Group’ in clustered role ‘JBSAG’ failed.
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Event ID: 7043
The Cluster Service service did not shut down properly after receiving a preshutdown control.

-> Below errors were there on SQL Server error log,

Error: 41022, Severity: 16, State: 0.
Failed to create a Windows Server Failover Clustering (WSFC) notification port with notification filter 778567686 and notification key 3 (Error code 5073). If this is a WSFC availability group, the WSFC service may not be running or may not be accessible in its current state, or the specified arguments are invalid. Otherwise, contact your primary support provider. For information about this error code, see “System Error Codes” in the Windows Development documentation.
Always On: The availability replica manager is going offline because the local Windows Server Failover Clustering (WSFC) node has lost quorum. This is an informational message only. No user action is required.
Always On: The local replica of availability group ‘JBSAG’ is stopping. This is an informational message only. No user action is required.

Error: 41066, Severity: 16, State: 0.
Cannot bring the Windows Server Failover Clustering (WSFC) resource (ID ‘ee50bbc1-93ab-4f25-85e5-a7d245555183’) online (Error code 126). If this is a WSFC availability group, the WSFC service may not be running or may not be accessible in its current state, or the WSFC resource may not be in a state that could accept the request. Otherwise, contact your primary support provider. For information about this error code, see “System Error Codes” in the Windows Development documentation.

Error: 41160, Severity: 16, State: 0.
Failed to designate the local availability replica of availability group ‘JBSAG’ as the primary replica. The operation encountered SQL Server error 41066 and has been terminated. Check the preceding error and the SQL Server error log for more details about the error and corrective actions. Error: 41017, Severity: 16, State: 1.
Failed to add a node to the possible owner list of a Windows Server Failover Clustering (WSFC) resource (Error code 5908). If this is a WSFC availability group, the WSFC service may not be running or may not be accessible in its current state, or the specified cluster resource or node handle is invalid. Otherwise, contact your primary support provider.

-> Cluster.log did not have much details other than the AG group failing.

-> All messages or errors were all generic. We could not get much clue.

-> We executed Process monitor and found that DBA team have renamed C:\windows\system32\hadrres.dll to hadrres_old.dll due to a patching error on both JBSAG1 and JBSAG2. They had to rename it as patching was failing with an error that hadrres.dll is used by another process. DBA team forgot to rename it back to hadrres.dll and this caused the issues. We renamed the file back to hadrres.dll and that solved the issue.

Watch this video to grasp a real-time understanding of this matter.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided β€œAS IS” with no warranties, and confers no rights.