Always On – SQL Server hosting availability group ‘JBAG’ did not receive a process event signal from the Windows Server Failover Cluster within the lease timeout period

Environment

-> I was working on an application connectivity issue. It was advised that the application had trouble connecting to database server.

-> When we checked the database server, database that the application connects to is part of Always On availability group. This database was online and synchronized. We did not see any issue at a higher level.

-> In the meantime, application team advised back in sometime while we were checking that Application services were restarted and now Application is working fine.

-> I started checking the SQL Server error log and could see below messages,

2020-12-08 02:11:25.51 Server Error: 19421, Severity: 16, State: 1.
2020-12-08 02:11:25.51 Server SQL Server hosting availability group ‘JBAG’ did not receive a process event signal from the Windows Server Failover Cluster within the lease timeout period.

2020-12-08 02:11:25.60 spid466 Remote harden of transaction ‘implicit_transaction’ (ID 0x00000001d7cac619 0005:b58e22ae) started at Dec 8 2020 02:11AM in database ‘JBDB1’ at LSN (1430698:161982:3) failed.

2020-12-08 02:11:25.88 Server Error: 19407, Severity: 16, State: 1.
2020-12-08 02:11:25.88 Server The lease between availability group ‘JBAG’ and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.

2020-12-08 02:11:25.98 Server Always On: The local replica of availability group ‘JBAG’ is going offline because either the lease expired or lease renewal failed. This is an informational message only. No user action is required.
2020-12-08 02:11:26.47 Server The state of the local availability replica in availability group ‘JBAG’ has changed from ‘PRIMARY_NORMAL’ to ‘RESOLVING_NORMAL’. The state changed because the lease between the local availability replica and Windows Server Failover Clustering (WSFC) has expired. For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log.

2020-12-08 02:11:26.47 spid214s Always On Availability Groups connection with secondary database terminated for primary database ‘JBDB1’ on the availability replica ‘JBSAG2’ with Replica ID: {9c105a39-d2cc-4b31-a913-b21b02b3ddf4}. This is an informational message only. No user action is required.

2020-12-08 02:11:26.47 spid531 Remote harden of transaction ‘implicit_transaction’ (ID 0x00000001d7cac805 0005:b58e22b8) started at Dec 8 2020 02:11AM in database ‘JBDB1’ at LSN (1430698:162003:15) failed.

2020-12-08 02:11:26.48 spid285s The availability group database “JBDB2” is changing roles from “PRIMARY” to “RESOLVING” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.

2020-12-08 02:11:26.48 spid428s The availability group database “JBDB” is changing roles from “PRIMARY” to “RESOLVING” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.

2020-12-08 02:11:26.48 spid300s The availability group database “JBDB1” is changing roles from “PRIMARY” to “RESOLVING” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.

2020-12-08 02:11:26.48 spid75 Remote harden of transaction ‘INSERT’ (ID 0x00000001d7caca99 0000:01445700) started at Dec 8 2020 02:11AM in database ‘JBDB’ at LSN (1099:57413:3) failed.

2020-12-08 02:11:26.48 spid248 Remote harden of transaction ‘implicit_transaction’ (ID 0x00000001d7cac8e7 0005:b58e22ba) started at Dec 8 2020 02:11AM in database ‘JBDB1’ at LSN (1430698:162008:9) failed.

2020-12-08 02:11:26.54 Logon Error: 983, Severity: 14, State: 1.
2020-12-08 02:11:26.54 Logon Unable to access availability database ‘JBDB1’ because the database replica is not in the PRIMARY or SECONDARY role. Connections to an availability database is permitted only when the database replica is in the PRIMARY or SECONDARY role. Try the operation again later.


2020-12-08 02:11:26.54 Backup Error: 18210, Severity: 16, State: 1.
2020-12-08 02:11:30.79 spid52 Always On: The local replica of availability group ‘JBAG’ is preparing to transition to the resolving role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

2020-12-08 02:11:39.53 Server Started listening on virtual network name ‘JBDBAG’. No user action is required.
2020-12-08 02:11:39.54 Server The lease worker of availability group ‘JBAG’ is now sleeping the excess lease time (156313 ms) supplied during online. This is an informational message only. No user action is required.

-> From the messages, I could see that there was a lease time out. LeaseTimeout which happens only on the primary replica controls the lease mechanism and when the lease expires there is a very high probability of a system wide event taking place.

-> The SQL Server resource DLL is responsible for the lease heartbeat activity. Every 1/4 of the LeaseTimeout setting the dedicated, lease thread wakes up and attempts to renew the lease. The lease is a simple handshake between the resource DLL and the SQL Server instance supporting the AG on the same node. For a lease to expire the server lease worker is not able to wait on and respond to the event handshake when it is dedicated to the task and running at a priority above most threads on the system.

-> Details from SQL Server error log,

Initial failure

2020-12-08 02:11:25.51 Server SQL Server hosting availability group ‘JBAG’ did not receive a process event signal from the Windows Server Failover Cluster within the lease timeout period.

2020-12-08 02:11:25.88 Server Error: 19407, Severity: 16, State: 1.
2020-12-08 02:11:25.88 Server The lease between availability group ‘JBAG’ and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.

2020-12-08 02:11:25.98 Server Always On: The local replica of availability group ‘JBAG’ is going offline because either the lease expired or lease renewal failed. This is an informational message only. No user action is required.

2020-12-08 02:11:26.47 Server The state of the local availability replica in availability group ‘JBAG’ has changed from ‘PRIMARY_NORMAL’ to ‘RESOLVING_NORMAL’. The state changed because the lease between the local availability replica and Windows Server Failover Clustering (WSFC) has expired. For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log.

Availability group getting online automatically

2020-12-08 02:11:39.53 spid54 The state of the local availability replica in availability group ‘JBAG’ has changed from ‘RESOLVING_NORMAL’ to ‘PRIMARY_PENDING’. The state changed because the availability group is coming online. For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log.

2020-12-08 02:11:39.53 Server Started listening on virtual network name ‘JBDBAG’. No user action is required.

2020-12-08 02:11:40.68 Server The state of the local availability replica in availability group ‘JBAG’ has changed from ‘PRIMARY_PENDING’ to ‘PRIMARY_NORMAL’. The state changed because the local replica has completed processing Online command from Windows Server Failover Clustering (WSFC). For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log.

2020-12-01 18:06:34.90 spid48s The availability group database “JBDB” is changing roles from “RESOLVING” to “PRIMARY” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.

-> Details from Cluster.log

You can generate cluster.log using this article.

539453 00003926.00009z60::2020/12/07-20:41:26.485 ERR [RES] SQL Server Availability Group : [hadrag] Availability Group lease is no longer valid

539454 00003926.00009z60::2020/12/07-20:41:26.485 ERR [RES] SQL Server Availability Group : [hadrag] Resource Alive result 0.

539455 00003926.00009z60::2020/12/07-20:41:26.485 ERR [RES] SQL Server Availability Group : [hadrag] Availability Group lease is no longer valid

539456 00003926.00009z60::2020/12/07-20:41:26.485 ERR [RES] SQL Server Availability Group : [hadrag] Resource Alive result 0.

539458 00002546.00008ge4::2020/12/07-20:41:26.485 INFO [RCM] HandleMonitorReply: FAILURENOTIFICATION for ‘JBAG’, gen(0) result 1/0.

539472 00003926.00005h38::2020/12/07-20:41:26.544 ERR [RES] SQL Server Availability Group : [hadrag] Lease renewal failed because the existing lease is no longer valid.

539473 00003926.00005h38::2020/12/07-20:41:26.544 ERR [RES] SQL Server Availability Group : [hadrag] The lease is expired. The lease should have been renewed by 2020/12/07-20:41:15.482

-> It is clear from the entries in cluster.log that SQL Server did not complete the handshake with resource DLL and this caused the lease time out. This resulted in Availability group going into resolving state and the database was not accessible.

-> In this case Availability group JBAG went to resolving state and then recovered automatically in sometime. There are many reason why lease time out fails, but most of the time it is due to a resource issue on the Primary replica.

-> Fortunately in my case we capture performance monitor logs on all database server. I checked the logs and was able to see that CPU utilization at that time was 100% when the lease timeout happened.

-> High resource utilization on the database server would have resulted SQL server not to respond for sometime and this SQL Server freeze like situation would have resulted in lease timeout.

-> I checked the process counter in Perfmon BLG file and was able to see SQL Server utilizing most of the CPU and some system processes “Privileged time” was also high. We get to know that application load at that time was very high and that could be a contributing factor. But we couldn’t get more details as what queries within SQL server could have increased SQL Server CPU utilization as we did not have any trace at that time.

-> In case this issue occurrence is frequent, we can try increasing the LeaseTimeout value.

-> But increasing LeaseTimeout value will just be a workaround for time being, but the actual resolution will be to understand the reason behind high CPU utilization. A typical performance troubleshooting on the database server will be required/enough to understand the reason and fix it.

More Reads

How It Works: SQL Server AlwaysOn Lease Timeout | Microsoft Docs
Diagnose Unexpected Failover or Availability Group in RESOLVING State | Microsoft Docs

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.