Environment

-> Database server JBSAG1 and JBSAG2 are located in East US.
-> Database servers JBSAG1 and JBSAG2are part of failover cluster without shared storage.
-> Below is the status of Always On availability group currently,


-> Below performance monitor logs are currently executing on JBSAG1 and JBSAG2,

-> Rebuild index is executing on JBSAG1.
-> JBSAG1 goes down. An automatic failover of Availability group JBSWIKI happens from JBSAG1 to JBSAG2.
-> Database Server JBSAG1 comes online. JBSAG2 becomes current primary and JBSAG1 becomes secondary.
-> Checking the status of Always On availability group,


-> On Secondary JBSAG1, database JBDB is in “Reverting / In Recovery” State.
-> Executing below query on secondary replica JBSAG1,
select @@servername as SQLName,db_name(database_id) as DBName,synchronization_state_desc,database_state_desc
from sys.dm_hadr_database_replica_states where is_local=1

-> Executing same query on primary replica JBSAG2,

-> Below are the messages found in Always On Dashboard,
The availability group is not ready for automatic failover. The primary replica and a secondary replica are configured for automatic failover, however, the secondary replica is not ready for an automatic failover. Possibly the secondary replica is unavailable, or its data synchronization state is currently not in the SYNCHRONIZED synchronization state.
In this availability group, at least one secondary replica has a NOT SYNCHRONIZING synchronization state and is not receiving data from the primary replica.
In this availability group, at least one synchronous replica is not currently synchronized. The replica synchronization state could be either SYNCHONIZING or NOT SYNCHRONIZING.
-> Checking SQL Server errorlog,
JBSAG1
2021-02-28 06:21:25.340 spid9s Always On: The availability replica manager is going offline because SQL Server is shutting down. This is an informational message only. No user action is required.
-> Availability group is going offline as per above message.
2021-02-28 06:21:25.340 spid9s Always On: The local replica of availability group ‘JBSWIKI’ is stopping. This is an informational message only. No user action is required.
2021-02-28 06:21:25.340 Server The availability group ‘JBSWIKI’ is being asked to stop the lease renewal because the availability group is going offline. This is an informational message only. No user action is required.
2021-02-28 06:21:25.340 spid9s SQL Server is terminating in response to a ‘stop’ request from Service Control Manager. This is an informational message only. No user action is required.
2021-02-28 06:21:25.450 spid25s The Database Mirroring endpoint has stopped listening for connections.
2021-02-28 06:21:25.450 spid25s Service Broker manager has shut down.
2021-02-28 06:21:35.450 spid9s .NET Framework runtime has been stopped.
2021-02-28 06:21:38.110 spid9s SQL Trace was stopped due to server shutdown. Trace ID = ‘1’. This is an informational message only; no user action is required.
2021-02-28 06:21:38.710 spid19s The SQL Server Network Interface library successfully deregistered the Service Principal Name (SPN) [ MSSQLSvc/JBSAG1.JBS.COM:1433 ] for the SQL Server service.
-> SQL Services comes online on JBSAG1 and below are the logs,
2021-02-28 06:22:11.890 spid25s Always On Availability Groups: Waiting for local Windows Server Failover Clustering service to start. This is an informational message only. No user action is required.
2021-02-28 06:22:11.890 spid29s Starting up database ‘JBDB’.
2021-02-28 06:22:11.890 spid11s Starting up database ‘mssqlsystemresource’.
2021-02-28 06:22:11.890 spid25s Always On Availability Groups: Local Windows Server Failover Clustering service started. This is an informational message only. No user action is required.
2021-02-28 06:22:11.890 spid25s Always On Availability Groups: Waiting for local Windows Server Failover Clustering node to start. This is an informational message only. No user action is required.
2021-02-28 06:22:11.890 spid25s Always On Availability Groups: Local Windows Server Failover Clustering node started. This is an informational message only. No user action is required.
-> Windows cluster service is online.
2021-02-28 06:22:11.900 spid29s Skipping the default startup of database ‘JBDB’ because the database belongs to an availability group (Group ID: 65536). The database will be started by the availability group. This is an informational message only. No user action is required.
2021-02-28 06:22:11.900 spid25s Always On: The availability replica manager is waiting for the instance of SQL Server to allow client connections. This is an informational message only. No user action is required.
2021-02-28 06:22:11.900 spid25s Always On: The local replica of availability group ‘JBSWIKI’ is starting. This is an informational message only. No user action is required.
2021-02-28 06:22:11.910 spid25s The state of the local availability replica in availability group ‘JBSWIKI’ has changed from ‘NOT_AVAILABLE’ to ‘RESOLVING_NORMAL’. The state changed because the local instance of SQL Server is starting up. For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
2021-02-28 06:22:11.940 spid11s The resource database build version is 15.00.4073. This is an informational message only. No user action is required.
2021-02-28 06:22:11.940 spid31s The state of the local availability replica in availability group ‘JBSWIKI’ has changed from ‘RESOLVING_NORMAL’ to ‘SECONDARY_NORMAL’. The state changed because the availability group state has changed in Windows Server Failover Clustering (WSFC). For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
-> Replica JBSAG1 becomes secondary.
2021-02-28 06:22:12.530 spid32s Parallel redo is started for database ‘JBDB’ with worker pool size [4].
2021-02-28 06:22:12.530 spid32s Recovery of database ‘JBDB’ (5) is 0% complete (approximately 3464 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
-> Recovery starts as per above message.
2021-02-28 06:22:21.950 spid33s A connection timeout has occurred while attempting to establish a connection to availability replica ‘JBSAG2’ with id [531FB0D2-4930-4E75-B1FE-2875DCCB9FFB]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.
2021-02-28 06:22:22.020 spid25s A connection for availability group ‘JBSWIKI’ from availability replica ‘JBSAG1’ with id [1B5043DB-0107-4299-A2BA-9D156E056542] to ‘JBSAG2’ with id [531FB0D2-4930-4E75-B1FE-2875DCCB9FFB] has been successfully established. This is an informational message only. No user action is required.
2021-02-28 06:22:32.540 spid32s Recovery of database ‘JBDB’ (5) is 10% complete (approximately 180 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2021-02-28 06:22:52.540 spid32s Recovery of database ‘JBDB’ (5) is 20% complete (approximately 159 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2021-02-28 06:23:12.540 spid32s Recovery of database ‘JBDB’ (5) is 30% complete (approximately 139 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2021-02-28 06:23:13.770 spid32s Recovery of database ‘JBDB’ (5) is 30% complete (approximately 138 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2021-02-28 06:23:13.780 spid32s Always On Availability Groups connection with primary database established for secondary database ‘JBDB’ on the availability replica ‘JBSAG2’ with Replica ID: {531fb0d2-4930-4e75-b1fe-2875dccb9ffb}. This is an informational message only. No user action is required.
-> Database JBDB is restarted on JBSAG1 to synchronize with primary.
2021-02-28 06:39:31.040 spid32s Error: 35278, Severity: 17, State: 1.
2021-02-28 06:39:31.040 spid32s Availability database ‘JBDB’, which is in the secondary role, is being restarted to resynchronize with the current primary database. This is an informational message only. No user action is required.
2021-02-28 06:39:31.040 spid32s Parallel redo is shutdown for database ‘JBDB’ with worker pool size [4].
2021-02-28 06:39:31.650 spid32s Starting up database ‘JBDB’.
2021-02-28 06:39:35.240 spid32s Parallel redo is started for database ‘JBDB’ with worker pool size [4].
2021-02-28 06:39:35.270 spid32s Recovery of database ‘JBDB’ (5) is 0% complete (approximately 608 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2021-02-28 06:39:43.800 spid32s Recovery of database ‘JBDB’ (5) is 25% complete (approximately 26 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2021-02-28 06:39:44.200 spid32s Always On Availability Groups connection with primary database established for secondary database ‘JBDB’ on the availability replica ‘JBSAG2’ with Replica ID: {531fb0d2-4930-4e75-b1fe-2875dccb9ffb}. This is an informational message only. No user action is required.
2021-02-28 06:39:44.200 spid32s The recovery LSN (1558:4889:2) was identified for the database with ID 5. This is an informational message only. No user action is required.
2021-02-28 06:39:44.210 spid32s Recovery completed for database JBDB (database ID 5) in 13 second(s) (analysis 25 ms, redo 8533 ms, undo 0 ms [system undo 222122029433 ms, regular undo 217368571615 ms].) This is an informational message only. No user action is required.
2021-02-28 06:39:44.220 spid46s Always On Availability Groups connection with primary database established for secondary database ‘JBDB’ on the availability replica ‘JBSAG2’ with Replica ID: {531fb0d2-4930-4e75-b1fe-2875dccb9ffb}. This is an informational message only. No user action is required.
2021-02-28 06:39:44.220 spid46s The recovery LSN (1558:4889:2) was identified for the database with ID 5. This is an informational message only. No user action is required.
From above logs, below details are clear.
-> Failover was initiated at 28 Feb 2021 06:21:25.
-> Secondary database on JBSAG1 is available and synchronized with primary around 28 Feb 2021 06:39:44.
-> Secondary database JBDB on JBSAG1 synchronized with Primary after 18 + minutes. It seems like most of the time was spent in “Reverting \ In Recovery”.
-> The access to readable secondary database JBDB will have issues till the state is “Reverting \ In Recovery”.
-> The rollback on secondary replica should complete before the synchronization process from Primary can start. Until then the secondary database can not be accessible.
-> Lets check the error logs in JBSAG2.
JBSAG2
2021-02-28 06:21:25.410 spid47s Always On: The local replica of availability group ‘JBSWIKI’ is going offline because the corresponding resource in the Windows Server Failover Clustering (WSFC) cluster is no longer online. This is an informational message only. No user action is required.
2021-02-28 06:21:25.410 spid47s The state of the local availability replica in availability group ‘JBSWIKI’ has changed from ‘SECONDARY_NORMAL’ to ‘RESOLVING_NORMAL’. The state changed because the availability group state has changed in Windows Server Failover Clustering (WSFC). For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
2021-02-28 06:21:25.410 spid20s The availability group database “JBDB” is changing roles from “SECONDARY” to “RESOLVING” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
2021-02-28 06:21:25.410 spid20s State information for database ‘JBDB’ – Hardened Lsn: ‘(1557:127872:1)’ Commit LSN: ‘(1557:126710:63)’ Commit Time: ‘Feb 28 2021 6:20AM’
2021-02-28 06:21:25.710 spid82s Always On Availability Groups connection with primary database terminated for secondary database ‘JBDB’ on the availability replica ‘JBSAG1’ with Replica ID: {1b5043db-0107-4299-a2ba-9d156e056542}. This is an informational message only. No user action is required.
2021-02-28 06:21:26.730 spid131 Always On: The local replica of availability group ‘JBSWIKI’ is preparing to transition to the primary role. This is an informational message only. No user action is required.
2021-02-28 06:21:27.350 spid131 The state of the local availability replica in availability group ‘JBSWIKI’ has changed from ‘RESOLVING_NORMAL’ to ‘PRIMARY_PENDING’. The state changed because the availability group is coming online. For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
2021-02-28 06:21:27.350 Server The lease worker of availability group ‘JBSWIKI’ is now sleeping the excess lease time (164250 ms) supplied during online. This is an informational message only. No user action is required.
2021-02-28 06:21:28.500 Server The state of the local availability replica in availability group ‘JBSWIKI’ has changed from ‘PRIMARY_PENDING’ to ‘PRIMARY_NORMAL’. The state changed because the local replica has completed processing Online command from Windows Server Failover Clustering (WSFC). For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
2021-02-28 06:21:28.500 spid20s The availability group database “JBDB” is changing roles from “RESOLVING” to “PRIMARY” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
-> Replica JBSAG2 becomes primary as shown in above logs.
2021-02-28 06:21:28.940 spid73s Recovery of database ‘JBDB’ (5) is 69% complete (approximately 26 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
2021-02-28 06:22:08.480 spid73s Recovery of database ‘JBDB’ (5) is 17% complete (approximately 473 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
2021-02-28 06:22:22.020 spid21s A connection for availability group ‘JBSWIKI’ from availability replica ‘JBSAG2’ with id [531FB0D2-4930-4E75-B1FE-2875DCCB9FFB] to ‘JBSAG1’ with id [1B5043DB-0107-4299-A2BA-9D156E056542] has been successfully established. This is an informational message only. No user action is required.
2021-02-28 06:22:28.480 spid73s Recovery of database ‘JBDB’ (5) is 36% complete (approximately 207 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
2021-02-28 06:22:48.480 spid73s Recovery of database ‘JBDB’ (5) is 72% complete (approximately 51 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
2021-02-28 06:23:08.490 spid73s Recovery of database ‘JBDB’ (5) is 99% complete (approximately 2 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
2021-02-28 06:23:08.660 spid73s 12 transactions rolled back in database ‘JBDB’ (5:0). This is an informational message only. No user action is required.
2021-02-28 06:23:08.660 spid73s Recovery completed for database JBDB (database ID 5) in 100 second(s) (analysis 0 ms, redo 0 ms, undo 99714 ms [system undo 53953 ms, regular undo 45747 ms].) This is an informational message only. No user action is required.
-> Database JBDB becomes active as per above messages around 28 Feb 2021 06:23:08.
-> Lets check the performance monitor logs on Primary replica JBSAG2 and Secondary replica JBSAG1.

-> Recovery Queue – Total number of hardened log in kilobytes that is waiting to be redone on the secondary.
-> Redone Bytes/Sec – Amount of log records redone in the last second to catch up the database replica.
-> Log Remaining for undo – The amount of log in kilobytes remaining to finish the undo phase.
-> From the performance monitor logs, it is pretty clear that we have spent most of our time during undo in secondary database in replica JBSAG1. Log Remaining for undo decreases as shown in the screenshot above and reaches to 0 around 6:39. This is the time when the secondary database comes online as per details in SQL Server error log. Hence it is pretty clear that a large transaction (Rebuild Index in our case) that is terminated due to failover has caused this issue.
-> Log Remaining for undo can be used to understand how much log in KB is remaining before the database can come to synchronized state from “Reverting \ In Recovery” state.
Thank You,
Vivek Janakiraman
Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.