Slow query performance with Table Spool operator

-> Table Spool operator is used to store a copy of data in a worktable that can be used at a later stage in same or another part of the execution plan without having to produce the same data by calling its child operator all over again.

-> Table spool operator in SQL Server Management Studio graphical execution plan looks as below,

-> Table spool operators helps query performance most of the time by storing data in a worktable without requiring SQL Server to recompute the required data multiple times.

-> We will discuss a scenario where this operator slows down a query execution,

Create a database called JBSWIKI,

create database JBSWIKI

Create required objects,

use JBSWIKI
GO

create table Table1(
COL1 int identity(1,1),
Col2 nvarchar(max))

create table Table2(
COL1 int identity(1,1),
Col2 nvarchar(max))

create table Table3(
COL1 int identity(1,1),
Col2 nvarchar(max))

Populate the objects,

set nocount on

insert into Table1 
SELECT '+' + convert(varchar(15),convert(bigint,RAND() * 10000000000))
GO 500000

insert into Table2
SELECT '+' + convert(varchar(15),convert(bigint,RAND() * 10000000000))
GO 50

insert into Table3
SELECT '+' + convert(varchar(15),convert(bigint,RAND() * 10000000000))
GO 120

Execute below query,

set statistics time,io on
select * into #Temp_Table from Table1 where Col2 not in (select SUBSTRING(Col2,CHARINDEX(':',Col2)+1,len(Col2)-CHARINDEX(':',Col2)) from [dbo].Table2)
and Col2 not in (select SUBSTRING(Col2,CHARINDEX(':',Col2)+1,len(Col2)-CHARINDEX(':',Col2)) from [dbo].Table3)
drop table #Temp_Table
set statistics time,io off

-> As per below details this query took 6.8 seconds duration and utilized 35439 MS CPU. Logical reads performed on Table3 and Worktable is around 500009 and 1000792 respectively.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Table ‘Table3‘. Scan count 17, logical reads 500009, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table ‘Worktable‘. Scan count 8, logical reads 1000792, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table ‘Table2’. Scan count 8, logical reads 8, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table ‘Table1’. Scan count 9, logical reads 2514, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table ‘Workfile’. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
SQL Server Execution Times:
CPU time = 35439 ms, elapsed time = 6851 ms.

-> From the execution plan, Table spool operator stands out when compared to other operators. The logical reads performed for Worktable is comparatively very high.

-> Zooming out specifically to the Spool operator in the execution plan,

-> Object Table3 has 120 rows. Table spool operator estimates and processes 25 Million rows and this explains high logical reads for Worktable.

-> Lets try removing the table spool operator from execution plan using trace flag 8690 as advised in this article.


set statistics time,io on
select * into #Temp_Table from Table1 where Col2 not in (select SUBSTRING(Col2,CHARINDEX(':',Col2)+1,len(Col2)-CHARINDEX(':',Col2)) from [dbo].Table2)
and Col2 not in (select SUBSTRING(Col2,CHARINDEX(':',Col2)+1,len(Col2)-CHARINDEX(':',Col2)) from [dbo].Table3) OPTION (QUERYTRACEON 8690)
drop table #Temp_Table
set statistics time,io off

-> Table spool operator is no longer present. But the resource utilization is really high. This query now takes 2 Minutes 16 Seconds duration and 941780 MS CPU.

Table ‘Table3‘. Scan count 8, logical reads 500000, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table ‘Table2‘. Scan count 8, logical reads 500000, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table ‘Table1’. Scan count 9, logical reads 2514, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
SQL Server Execution Times:
CPU time = 941780 ms, elapsed time = 136280 ms.

-> Removing Table Spool operator from execution plan using trace flag 8690 did not help.

-> I rewrote the query to use a JOIN instead of NOT IN. Below are the results,

set statistics time,io on
select a.* into #Temp_Table from [dbo].Table1 a LEFT OUTER JOIN [dbo].Table2 b ON a.Col2 = SUBSTRING(b.Col2,CHARINDEX(':',b.Col2)+1,len(b.Col2)-CHARINDEX(':',b.Col2))
LEFT OUTER JOIN [dbo].Table3 c ON a.Col2 =SUBSTRING(c.Col2,CHARINDEX(':',c.Col2)+1,len(c.Col2)-CHARINDEX(':',c.Col2)) 
drop table #Temp_Table
set statistics time,io off

-> Table spool operator is not present. The resource utilization has completely come down. This query now takes just 960 MS duration and 1 logical reads on object Table3. I dont see any logical reads for worktable at all.

Table ‘Table3‘. Scan count 1, logical reads 1, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table ‘Table2’. Scan count 1, logical reads 1, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table ‘Table1’. Scan count 9, logical reads 2514, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table ‘Workfile’. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

SQL Server Execution Times:
CPU time = 4500 ms, elapsed time = 960 ms.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.

Always On – One or more databases are not synchronized or have not joined the availability group. If the availability replica uses the asynchronous-commit mode, consider performing a forced manual failover (with possible data loss).

Environment

-> Database server JBSAG1 and JBSAG2 are located in East US.

-> Database servers JBSAG1 and JBSAG2 are part of failover cluster without shared storage.

-> Database Server JBSAG1 which is the primary replica goes down around 28 Feb 2021 14:27:13.060. Application team advised that their application had issue connecting to the database JBDB after JBSAG1 went down.

-> Upon checking, Availability group status is “Resolving” and below is the status of replica JBSAG2,

-> In the meantime, Database server JBSAG1 comes online and Availability group comes online on JBSAG1 again. Application team were able to access the database without any issues. So it is clear that database JBDB as part of availability group JBSWIKI was not accessible for 24 minutes between 28 Feb 2021 14:27:13.060 to 28 Feb 2021 14:52:41.920

-> We need to understand the reason why Availability group did not failover automatically to Synchronous secondary replica JBSAG2 when database server JBSAG1 went down.

-> Checking SQL Server error log,

JBSAG1JBSAG2
2021-02-28 14:26:59.400 spid9s Always On: The availability replica manager is going offline because SQL Server is shutting down. This is an informational message only. No user action is required.
2021-02-28 14:26:59.400 Server The availability group 'JBSWIKI' is being asked to stop the lease renewal because the availability group is going offline. This is an informational message only. No user action is required.
2021-02-28 14:26:59.400 spid9s Always On: The local replica of availability group 'JBSWIKI' is stopping. This is an informational message only. No user action is required.
2021-02-28 14:26:59.400 spid9s SQL Server is terminating in response to a 'stop' request from Service Control Manager. This is an informational message only. No user action is required.
2021-02-28 14:26:59.520 spid35s The Database Mirroring endpoint has stopped listening for connections.
2021-02-28 14:26:59.520 spid35s Service Broker manager has shut down.
2021-02-28 14:27:09.510 spid9s .NET Framework runtime has been stopped.
2021-02-28 14:27:13.060 spid9s SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.


-> As per above logs, it is clear that Availability Group goes offline due to database server/SQL server instance going down around 28 Feb 2021 14:26:59.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

2021-02-28 14:51:18.480 spid25s Starting up database 'JBDB'.
2021-02-28 14:51:18.480 spid26s Starting up database 'msdb'.
2021-02-28 14:51:18.480 spid27s Always On: The availability replica manager is starting. This is an informational message only. No user action is required.
2021-02-28 14:51:18.480 spid11s Starting up database 'mssqlsystemresource'.
2021-02-28 14:51:18.480 spid27s Always On Availability Groups: Waiting for local Windows Server Failover Clustering node to start. This is an informational message only. No user action is required.
2021-02-28 14:51:18.480 spid27s Always On Availability Groups: Local Windows Server Failover Clustering node started. This is an informational message only. No user action is required.
2021-02-28 14:51:18.480 spid27s Always On Availability Groups: Waiting for local Windows Server Failover Clustering node to come online. This is an informational message only. No user action is required.
2021-02-28 14:51:18.480 spid27s Always On Availability Groups: Local Windows Server Failover Clustering node is online. This is an informational message only. No user action is required.
2021-02-28 14:51:18.480 spid25s Skipping the default startup of database 'JBDB' because the database belongs to an availability group (Group ID: 65536). The database will be started by the availability group. This is an informational message only. No user action is required.
2021-02-28 14:51:18.480 spid27s Always On: The availability replica manager is waiting for the instance of SQL Server to allow client connections. This is an informational message only. No user action is required.
2021-02-28 14:51:18.480 spid27s Always On: The local replica of availability group 'JBSWIKI' is starting. This is an informational message only. No user action is required.
2021-02-28 14:51:18.490 spid27s The state of the local availability replica in availability group 'JBSWIKI' has changed from 'NOT_AVAILABLE' to 'RESOLVING_NORMAL'. The state changed because the local instance of SQL Server is starting up. For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
2021-02-28 14:51:18.500 spid11s The resource database build version is 15.00.4073. This is an informational message only. No user action is required.
2021-02-28 14:51:18.540 spid11s Starting up database 'model'.
2021-02-28 14:51:18.560 spid51 Error: 46906, Severity: 16, State: 1.
2021-02-28 14:51:18.560 spid51 Unable to retrieve registry value 'NodeRole' from Windows registry key 'Software\Microsoft\Microsoft SQL Server\MSSQL15.MSSQLSERVER\Polybase\Configuration': (null).
2021-02-28 14:51:18.700 spid52 Always On: The local replica of availability group 'JBSWIKI' is preparing to transition to the primary role. This is an informational message only. No user action is required.
2021-02-28 14:51:18.720 spid52 The state of the local availability replica in availability group 'JBSWIKI' has changed from 'RESOLVING_NORMAL' to 'PRIMARY_PENDING'. The state changed because the availability group is coming online. For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.

.

.
2021-02-28 14:51:18.750 Server The state of the local availability replica in availability group 'JBSWIKI' has changed from 'PRIMARY_PENDING' to 'PRIMARY_NORMAL'. The state changed because the local replica has completed processing Online command from Windows Server Failover Clustering (WSFC). For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
2021-02-28 14:51:18.760 spid25s DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint: 531FB0D2-4930-4E75-B1FE-2021-02-28 14:51:18.790 spid25s Starting up database 'JBDB'.
2021-02-28 14:51:19.470 spid25s Parallel redo is started for database 'JBDB' with worker pool size [4].
2021-02-28 14:51:21.470 spid25s DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint: 531FB0D2-4930-4E75-B1FE-2875DCCB9FFB:1
2021-02-28 14:51:21.480 spid25s DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint: 531FB0D2-4930-4E75-B1FE-2875DCCB9FFB:1
2021-02-28 14:51:21.480 spid25s Recovery of database 'JBDB' (5) is 0% complete (approximately 3276 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
.

2021-02-28 14:51:31.560 spid9s Recovery is complete. This is an informational message only. No user action is required.
2021-02-28 14:51:41.490 spid25s Recovery of database 'JBDB' (5) is 10% complete (approximately 191 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2021-02-28 14:51:41.840 spid68s A connection for availability group 'JBSWIKI' from availability replica 'JBSAG1' with id [1B5043DB-0107-4299-A2BA-9D156E056542] to 'JBSAG2' with id [531FB0D2-4930-4E75-B1FE-2875DCCB9FFB] has been successfully established. This is an informational message only. No user action is required.
2021-02-28 14:52:01.500 spid25s Recovery of database 'JBDB' (5) is 19% complete (approximately 171 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2021-02-28 14:52:21.510 spid25s Recovery of database 'JBDB' (5) is 29% complete (approximately 152 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2021-02-28 14:52:41.100 spid25s Recovery of database 'JBDB' (5) is 38% complete (approximately 131 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2021-02-28 14:52:41.140 spid25s Recovery of database 'JBDB' (5) is 38% complete (approximately 131 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
2021-02-28 14:52:41.910 spid66s 12 transactions rolled back in database 'JBDB' (5:0). This is an informational message only. No user action is required.
2021-02-28 14:52:41.910 spid66s Recovery is writing a checkpoint in database 'JBDB' (5). This is an informational message only. No user action is required.
2021-02-28 14:52:41.920 spid66s Recovery completed for database JBDB (database ID 5) in 83 second(s) (analysis 1975 ms, redo 79613 ms, undo 764 ms [system undo 140 ms, regular undo 604 ms].) This is an informational message only. No user action is required.


-> Availability group JBSWIKI and database JBDB is online as per above logs.
2021-02-28 14:26:59.520 spid29s Always On: The local replica of availability group 'JBSWIKI' is going offline because the corresponding resource in the Windows Server Failover Clustering (WSFC) cluster is no longer online. This is an informational message only. No user action is required.
2021-02-28 14:26:59.520 spid29s The state of the local availability replica in availability group 'JBSWIKI' has changed from 'SECONDARY_NORMAL' to 'RESOLVING_NORMAL'. The state changed because the availability group state has changed in Windows Server Failover Clustering (WSFC). For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
2021-02-28 14:26:59.520 spid65s The availability group database "JBDB" is changing roles from "SECONDARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
2021-02-28 14:26:59.520 spid65s State information for database 'JBDB' - Hardened Lsn: '(1558:871882:1)' Commit LSN: '(1558:175123:24)' Commit Time: 'Feb 28 2021 2:24PM'
2021-02-28 14:26:59.530 spid62s Always On Availability Groups connection with primary database terminated for secondary database 'JBDB' on the availability replica 'JBSAG1' with Replica ID: {1b5043db-0107-4299-a2ba-9d156e056542}. This is an informational message only. No user action is required.
2021-02-28 14:27:16.380 spid67 Always On: The local replica of availability group 'JBSWIKI' is preparing to transition to the primary role. This is an informational message only. No user action is required.
2021-02-28 14:27:16.380 spid67 The availability replica for availability group 'JBSWIKI' on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group. If the availability replica uses the asynchronous-commit mode, consider performing a forced manual failover (with possible data loss). Otherwise, once all local secondary databases are joined and synchronized, you can perform a planned manual failover to this secondary replica (without data loss). For more information, see SQL Server Books Online.
 


-> Above logs clearly mentions that a failover attempt was performed. But JBSAG2 cannot become primary because one or more databases were not Synchronized.


2021-02-28 14:27:17.290 spid58 Always On: The local replica of availability group ‘JBSWIKI’ is preparing to transition to the resolving role. This is an informational message only. No user action is required.
2021-02-28 14:27:17.570 spid58 Always On: The local replica of availability group ‘JBSWIKI’ is preparing to transition to the primary role. This is an informational message only. No user action is required.
2021-02-28 14:27:17.580 spid58 The availability replica for availability group ‘JBSWIKI’ on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group. If the availability replica uses the asynchronous-commit mode, consider performing a forced manual failover (with possible data loss). Otherwise, once all local secondary databases are joined and synchronized, you can perform a planned manual failover to this secondary replica (without data loss). For more information, see SQL Server Books Online.
2021-02-28 14:27:18.500 spid58 Always On: The local replica of availability group ‘JBSWIKI’ is preparing to transition to the resolving role. This is an informational message only. No user action is required.
. . . 2021-02-28 14:27:18.500 spid77       Always On: The local replica of availability group ‘JBSWIKI’ is preparing to transition to the primary role. This is an informational message only. No user action is required.

.
.
.

.
.
.
.
.
.

.
.

.
.

.
.

.
.
.

.
.

.
.

.
.
.
.
.
.
.
.
.
.
.

2021-02-28 14:50:08.230 spid77       The availability replica for availability group ‘JBSWIKI’ on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group. If the availability replica uses the asynchronous-commit mode, consider performing a forced manual failover (with possible data loss). Otherwise, once all local secondary databases are joined and synchronized, you can perform a planned manual failover to this secondary replica (without data loss). For more information, see SQL Server Books Online.
2021-02-28 14:50:09.150 spid77       Always On: The local replica of availability group ‘JBSWIKI’ is preparing to transition to the resolving role. This is an informational message only. No user action is required.  

-> Message “The availability replica for availability group ‘JBSWIKI’ on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group.” is present from 28 Feb 2021 14:27:16 to 14:50:09.                  

2021-02-28 14:51:18.730 spid29s      The state of the local availability replica in availability group ‘JBSWIKI’ has changed from ‘RESOLVING_NORMAL’ to ‘SECONDARY_NORMAL’.  The state changed because the availability group state has changed in Windows Server Failover Clustering (WSFC).  For more information, see the SQL Server error log or cluster log.  If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
2021-02-28 14:51:18.730 spid71s      The availability group database “JBDB” is changing roles from “RESOLVING” to “SECONDARY” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.  

-> JBSAG2 becomes secondary as per above logs on 28 Feb 2021 14:51:18.



















































































-> From above log analysis, It is pretty clear that the automatic failover of Availability group from JBSAG1 to JBSAG2 did not complete due to below message,

The availability replica for availability group 'JBSWIKI' on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group. If the availability replica uses the asynchronous-commit mode, consider performing a forced manual failover (with possible data loss). Otherwise, once all local secondary databases are joined and synchronized, you can perform a planned manual failover to this secondary replica (without data loss). For more information, see SQL Server Books Online.

-> Automatic failover of Availability group JBSWIKI did not complete as availability database JBDB was not in SYNCHRONIZED state between JBSAG1 and JBSAG2. The Availability database should be synchronized to avoid any potential data loss during failovers.

-> The database JBDB may not be synchronized due to a large transaction that was happening on the Primary replica when the failover was performed or the secondary synchronous replica was not available due to a network issue or some other reason. In our case the huge transaction was an Index rebuild operation on a very huge Index when failover was attempted when Primary replica JBSAG1 went down.

-> There is nothing much we can do in these scenarios other than waiting for the primary replica to be online or perform a forced failover with data loss.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.

Always On Secondary Database in “Reverting \ In Recovery” state

Environment

-> Database server JBSAG1 and JBSAG2 are located in East US.

-> Database servers JBSAG1 and JBSAG2are part of failover cluster without shared storage.

-> Below is the status of Always On availability group currently,

-> Below performance monitor logs are currently executing on JBSAG1 and JBSAG2,

-> Rebuild index is executing on JBSAG1.

-> JBSAG1 goes down. An automatic failover of Availability group JBSWIKI happens from JBSAG1 to JBSAG2.

-> Database Server JBSAG1 comes online. JBSAG2 becomes current primary and JBSAG1 becomes secondary.

-> Checking the status of Always On availability group,

-> On Secondary JBSAG1, database JBDB is in “Reverting / In Recovery” State.

-> Executing below query on secondary replica JBSAG1,

select @@servername as SQLName,db_name(database_id) as DBName,synchronization_state_desc,database_state_desc
from sys.dm_hadr_database_replica_states where is_local=1

-> Executing same query on primary replica JBSAG2,

-> Below are the messages found in Always On Dashboard,

The availability group is not ready for automatic failover. The primary replica and a secondary replica are configured for automatic failover, however, the secondary replica is not ready for an automatic failover. Possibly the secondary replica is unavailable, or its data synchronization state is currently not in the SYNCHRONIZED synchronization state.

In this availability group, at least one secondary replica has a NOT SYNCHRONIZING synchronization state and is not receiving data from the primary replica.

In this availability group, at least one synchronous replica is not currently synchronized. The replica synchronization state could be either SYNCHONIZING or NOT SYNCHRONIZING.

-> Checking SQL Server errorlog,

JBSAG1

2021-02-28 06:21:25.340 spid9s Always On: The availability replica manager is going offline because SQL Server is shutting down. This is an informational message only. No user action is required.

-> Availability group is going offline as per above message.
2021-02-28 06:21:25.340 spid9s Always On: The local replica of availability group ‘JBSWIKI’ is stopping. This is an informational message only. No user action is required.
2021-02-28 06:21:25.340 Server The availability group ‘JBSWIKI’ is being asked to stop the lease renewal because the availability group is going offline. This is an informational message only. No user action is required.
2021-02-28 06:21:25.340 spid9s SQL Server is terminating in response to a ‘stop’ request from Service Control Manager. This is an informational message only. No user action is required.
2021-02-28 06:21:25.450 spid25s The Database Mirroring endpoint has stopped listening for connections.
2021-02-28 06:21:25.450 spid25s Service Broker manager has shut down.
2021-02-28 06:21:35.450 spid9s .NET Framework runtime has been stopped.
2021-02-28 06:21:38.110 spid9s SQL Trace was stopped due to server shutdown. Trace ID = ‘1’. This is an informational message only; no user action is required.
2021-02-28 06:21:38.710 spid19s The SQL Server Network Interface library successfully deregistered the Service Principal Name (SPN) [ MSSQLSvc/JBSAG1.JBS.COM:1433 ] for the SQL Server service.

-> SQL Services comes online on JBSAG1 and below are the logs,

2021-02-28 06:22:11.890 spid25s Always On Availability Groups: Waiting for local Windows Server Failover Clustering service to start. This is an informational message only. No user action is required.
2021-02-28 06:22:11.890 spid29s Starting up database ‘JBDB’.
2021-02-28 06:22:11.890 spid11s Starting up database ‘mssqlsystemresource’.
2021-02-28 06:22:11.890 spid25s Always On Availability Groups: Local Windows Server Failover Clustering service started. This is an informational message only. No user action is required.

2021-02-28 06:22:11.890 spid25s Always On Availability Groups: Waiting for local Windows Server Failover Clustering node to start. This is an informational message only. No user action is required.
2021-02-28 06:22:11.890 spid25s Always On Availability Groups: Local Windows Server Failover Clustering node started. This is an informational message only. No user action is required.


-> Windows cluster service is online.

2021-02-28 06:22:11.900 spid29s Skipping the default startup of database ‘JBDB’ because the database belongs to an availability group (Group ID: 65536). The database will be started by the availability group. This is an informational message only. No user action is required.
2021-02-28 06:22:11.900 spid25s Always On: The availability replica manager is waiting for the instance of SQL Server to allow client connections. This is an informational message only. No user action is required.
2021-02-28 06:22:11.900 spid25s Always On: The local replica of availability group ‘JBSWIKI’ is starting. This is an informational message only. No user action is required.
2021-02-28 06:22:11.910 spid25s The state of the local availability replica in availability group ‘JBSWIKI’ has changed from ‘NOT_AVAILABLE’ to ‘RESOLVING_NORMAL’. The state changed because the local instance of SQL Server is starting up. For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
2021-02-28 06:22:11.940 spid11s The resource database build version is 15.00.4073. This is an informational message only. No user action is required.
2021-02-28 06:22:11.940 spid31s The state of the local availability replica in availability group ‘JBSWIKI’ has changed from ‘RESOLVING_NORMAL’ to ‘SECONDARY_NORMAL’. The state changed because the availability group state has changed in Windows Server Failover Clustering (WSFC). For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.


-> Replica JBSAG1 becomes secondary.

2021-02-28 06:22:12.530 spid32s Parallel redo is started for database ‘JBDB’ with worker pool size [4].
2021-02-28 06:22:12.530 spid32s Recovery of database ‘JBDB’ (5) is 0% complete (approximately 3464 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.


-> Recovery starts as per above message.

2021-02-28 06:22:21.950 spid33s A connection timeout has occurred while attempting to establish a connection to availability replica ‘JBSAG2’ with id [531FB0D2-4930-4E75-B1FE-2875DCCB9FFB]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.
2021-02-28 06:22:22.020 spid25s A connection for availability group ‘JBSWIKI’ from availability replica ‘JBSAG1’ with id [1B5043DB-0107-4299-A2BA-9D156E056542] to ‘JBSAG2’ with id [531FB0D2-4930-4E75-B1FE-2875DCCB9FFB] has been successfully established. This is an informational message only. No user action is required.
2021-02-28 06:22:32.540 spid32s Recovery of database ‘JBDB’ (5) is 10% complete (approximately 180 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2021-02-28 06:22:52.540 spid32s Recovery of database ‘JBDB’ (5) is 20% complete (approximately 159 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2021-02-28 06:23:12.540 spid32s Recovery of database ‘JBDB’ (5) is 30% complete (approximately 139 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2021-02-28 06:23:13.770 spid32s Recovery of database ‘JBDB’ (5) is 30% complete (approximately 138 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2021-02-28 06:23:13.780 spid32s Always On Availability Groups connection with primary database established for secondary database ‘JBDB’ on the availability replica ‘JBSAG2’ with Replica ID: {531fb0d2-4930-4e75-b1fe-2875dccb9ffb}. This is an informational message only. No user action is required.

-> Database JBDB is restarted on JBSAG1 to synchronize with primary.

2021-02-28 06:39:31.040 spid32s Error: 35278, Severity: 17, State: 1.
2021-02-28 06:39:31.040 spid32s Availability database ‘JBDB’, which is in the secondary role, is being restarted to resynchronize with the current primary database. This is an informational message only. No user action is required.
2021-02-28 06:39:31.040 spid32s Parallel redo is shutdown for database ‘JBDB’ with worker pool size [4].

2021-02-28 06:39:31.650 spid32s Starting up database ‘JBDB’.
2021-02-28 06:39:35.240 spid32s Parallel redo is started for database ‘JBDB’ with worker pool size [4].
2021-02-28 06:39:35.270 spid32s Recovery of database ‘JBDB’ (5) is 0% complete (approximately 608 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2021-02-28 06:39:43.800 spid32s Recovery of database ‘JBDB’ (5) is 25% complete (approximately 26 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2021-02-28 06:39:44.200 spid32s Always On Availability Groups connection with primary database established for secondary database ‘JBDB’ on the availability replica ‘JBSAG2’ with Replica ID: {531fb0d2-4930-4e75-b1fe-2875dccb9ffb}. This is an informational message only. No user action is required.
2021-02-28 06:39:44.200 spid32s The recovery LSN (1558:4889:2) was identified for the database with ID 5. This is an informational message only. No user action is required.
2021-02-28 06:39:44.210 spid32s Recovery completed for database JBDB (database ID 5) in 13 second(s) (analysis 25 ms, redo 8533 ms, undo 0 ms [system undo 222122029433 ms, regular undo 217368571615 ms].) This is an informational message only. No user action is required.
2021-02-28 06:39:44.220 spid46s Always On Availability Groups connection with primary database established for secondary database ‘JBDB’ on the availability replica ‘JBSAG2’ with Replica ID: {531fb0d2-4930-4e75-b1fe-2875dccb9ffb}. This is an informational message only. No user action is required.
2021-02-28 06:39:44.220 spid46s The recovery LSN (1558:4889:2) was identified for the database with ID 5. This is an informational message only. No user action is required.

From above logs, below details are clear.

-> Failover was initiated at 28 Feb 2021 06:21:25.

-> Secondary database on JBSAG1 is available and synchronized with primary around 28 Feb 2021 06:39:44.

-> Secondary database JBDB on JBSAG1 synchronized with Primary after 18 + minutes. It seems like most of the time was spent in “Reverting \ In Recovery”.

-> The access to readable secondary database JBDB will have issues till the state is “Reverting \ In Recovery”.

-> The rollback on secondary replica should complete before the synchronization process from Primary can start. Until then the secondary database can not be accessible.

-> Lets check the error logs in JBSAG2.

JBSAG2

2021-02-28 06:21:25.410 spid47s Always On: The local replica of availability group ‘JBSWIKI’ is going offline because the corresponding resource in the Windows Server Failover Clustering (WSFC) cluster is no longer online. This is an informational message only. No user action is required.
2021-02-28 06:21:25.410 spid47s The state of the local availability replica in availability group ‘JBSWIKI’ has changed from ‘SECONDARY_NORMAL’ to ‘RESOLVING_NORMAL’. The state changed because the availability group state has changed in Windows Server Failover Clustering (WSFC). For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
2021-02-28 06:21:25.410 spid20s The availability group database “JBDB” is changing roles from “SECONDARY” to “RESOLVING” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
2021-02-28 06:21:25.410 spid20s State information for database ‘JBDB’ – Hardened Lsn: ‘(1557:127872:1)’ Commit LSN: ‘(1557:126710:63)’ Commit Time: ‘Feb 28 2021 6:20AM’
2021-02-28 06:21:25.710 spid82s Always On Availability Groups connection with primary database terminated for secondary database ‘JBDB’ on the availability replica ‘JBSAG1’ with Replica ID: {1b5043db-0107-4299-a2ba-9d156e056542}. This is an informational message only. No user action is required.
2021-02-28 06:21:26.730 spid131 Always On: The local replica of availability group ‘JBSWIKI’ is preparing to transition to the primary role. This is an informational message only. No user action is required.
2021-02-28 06:21:27.350 spid131 The state of the local availability replica in availability group ‘JBSWIKI’ has changed from ‘RESOLVING_NORMAL’ to ‘PRIMARY_PENDING’. The state changed because the availability group is coming online. For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
2021-02-28 06:21:27.350 Server The lease worker of availability group ‘JBSWIKI’ is now sleeping the excess lease time (164250 ms) supplied during online. This is an informational message only. No user action is required.
2021-02-28 06:21:28.500 Server The state of the local availability replica in availability group ‘JBSWIKI’ has changed from ‘PRIMARY_PENDING’ to ‘PRIMARY_NORMAL’. The state changed because the local replica has completed processing Online command from Windows Server Failover Clustering (WSFC). For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
2021-02-28 06:21:28.500 spid20s The availability group database “JBDB” is changing roles from “RESOLVING” to “PRIMARY” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.

-> Replica JBSAG2 becomes primary as shown in above logs.

2021-02-28 06:21:28.940 spid73s Recovery of database ‘JBDB’ (5) is 69% complete (approximately 26 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
2021-02-28 06:22:08.480 spid73s Recovery of database ‘JBDB’ (5) is 17% complete (approximately 473 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
2021-02-28 06:22:22.020 spid21s A connection for availability group ‘JBSWIKI’ from availability replica ‘JBSAG2’ with id [531FB0D2-4930-4E75-B1FE-2875DCCB9FFB] to ‘JBSAG1’ with id [1B5043DB-0107-4299-A2BA-9D156E056542] has been successfully established. This is an informational message only. No user action is required.
2021-02-28 06:22:28.480 spid73s Recovery of database ‘JBDB’ (5) is 36% complete (approximately 207 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
2021-02-28 06:22:48.480 spid73s Recovery of database ‘JBDB’ (5) is 72% complete (approximately 51 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
2021-02-28 06:23:08.490 spid73s Recovery of database ‘JBDB’ (5) is 99% complete (approximately 2 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
2021-02-28 06:23:08.660 spid73s 12 transactions rolled back in database ‘JBDB’ (5:0). This is an informational message only. No user action is required.
2021-02-28 06:23:08.660 spid73s Recovery completed for database JBDB (database ID 5) in 100 second(s) (analysis 0 ms, redo 0 ms, undo 99714 ms [system undo 53953 ms, regular undo 45747 ms].) This is an informational message only. No user action is required.

-> Database JBDB becomes active as per above messages around 28 Feb 2021 06:23:08.

-> Lets check the performance monitor logs on Primary replica JBSAG2 and Secondary replica JBSAG1.

-> Recovery Queue – Total number of hardened log in kilobytes that is waiting to be redone on the secondary.

-> Redone Bytes/Sec – Amount of log records redone in the last second to catch up the database replica.

-> Log Remaining for undo – The amount of log in kilobytes remaining to finish the undo phase.

-> From the performance monitor logs, it is pretty clear that we have spent most of our time during undo in secondary database in replica JBSAG1. Log Remaining for undo decreases as shown in the screenshot above and reaches to 0 around 6:39. This is the time when the secondary database comes online as per details in SQL Server error log. Hence it is pretty clear that a large transaction (Rebuild Index in our case) that is terminated due to failover has caused this issue.

-> Log Remaining for undo can be used to understand how much log in KB is remaining before the database can come to synchronized state from “Reverting \ In Recovery” state.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.