SQL Server Setup – Could not allocate space for object ‘dbo.Large Object Storage System object: 422212467425280’ in database ‘tempdb’ because the ‘PRIMARY’ filegroup is full

-> I installed SQL Server 2019 and then tried applying Cumulative Update 5 (CU5) on top of it.

-> The SQL Server 2019 CU5 setup failed with the error below,

TITLE: SQL Server Setup failure.
SQL Server Setup has encountered the following error:
The NT service ‘SQLAgent$IN01’ could not be started.
Error code 0x84B20001.


For help, click: https://go.microsoft.com/fwlink?LinkID=20476&ProdName=Microsoft%20SQL%20Server&EvtSrc=setup.rll&EvtID=50000&EvtType=0xAE7B8BF3%25400xF762992B%25401202%25401

BUTTONS:
OK

-> SQL Server was down. I received the error below when trying to start the SQL Server service manually from Configuration Manager,

SQL Server Configuration Manager

The request failed or the service did not respond in a timely fashion. Consult the event log or other applicable error logs for details.

OK

-> Since the above error is very generic, I checked the SQL Server error log,

2020-08-02 14:51:26.78 Server Microsoft SQL Server 2019 (RTM-CU5) (KB4552255) – 15.0.4043.16 (X64)
Jun 10 2020 18:25:25
Copyright (C) 2019 Microsoft Corporation
Developer Edition (64-bit) on Windows Server 2016 Datacenter 10.0 (Build 14393: ) (Hypervisor)

-> The SQL Server error log shows that SQL Server was upgraded to build 15.0.4043.

-> Checking further in the SQL Server error log,

2020-08-02 14:51:30.26 spid9s Database ‘master’ is upgrading script ‘msdb110_upgrade.sql’ from level 251660240 to level 251662283.
2020-08-02 14:51:30.26 spid9s ———————————-
2020-08-02 14:51:30.26 spid9s Starting execution of PRE_MSDB.SQL
2020-08-02 14:51:30.26 spid9s ———————————-
2020-08-02 14:51:30.42 spid9s Setting database option COMPATIBILITY_LEVEL to 100 for database ‘msdb’.
2020-08-02 14:51:30.46 spid9s —————————————–
2020-08-02 14:51:30.46 spid9s Starting execution of PRE_SQLAGENT100.SQL
2020-08-02 14:51:30.46 spid9s —————————————–
.
.
.
2020-08-02 14:51:35.25 spid9s Creating procedure sp_purge_jobhistory…
2020-08-02 14:51:35.29 Logon Error: 18401, Severity: 14, State: 1.
2020-08-02 14:51:35.29 Logon Login failed for user ‘JBSQL1\jvivek2k1’. Reason: Server is in script upgrade mode. Only administrator can connect at this time. [CLIENT: ]
2020-08-02 14:51:35.29 spid9s
2020-08-02 14:51:35.29 spid9s Creating procedure sp_help_jobhistory…
2020-08-02 14:51:35.29 Logon Error: 18401, Severity: 14, State: 1.
2020-08-02 14:51:35.29 Logon Login failed for user ‘JBSQL1\jvivek2k1’. Reason: Server is in script upgrade mode. Only administrator can connect at this time. [CLIENT: ]
.
.
.
2020-08-02 15:08:51.69 spid9s Error: 1105, Severity: 17, State: 2.
2020-08-02 15:08:51.69 spid9s Could not allocate space for object ‘dbo.Large Object Storage System object: 422212467294208’ in database ‘tempdb’ because the ‘PRIMARY’ filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

2020-08-02 15:08:51.69 spid9s Error: 7399, Severity: 16, State: 1.
2020-08-02 15:08:51.69 spid9s The OLE DB provider “BULK” for linked server “(null)” reported an error. The provider did not give any information about the error.
2020-08-02 15:08:51.69 spid9s Error: 7330, Severity: 16, State: 2.
2020-08-02 15:08:51.69 spid9s Cannot fetch a row from OLE DB provider “BULK” for linked server “(null)”.
2020-08-02 15:08:51.69 spid9s Error: 912, Severity: 21, State: 2.
2020-08-02 15:08:51.69 spid9s Script level upgrade for database ‘master’ failed because upgrade step ‘msdb110_upgrade.sql’ encountered error 7330, state 2, severity 16. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the ‘master’ database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.
2020-08-02 15:08:51.70 spid9s Error: 3417, Severity: 21, State: 3.
2020-08-02 15:08:51.70 spid9s Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.
2020-08-02 15:08:51.73 spid9s SQL Trace was stopped due to server shutdown. Trace ID = ‘1’. This is an informational message only; no user action is required.

-> The script upgrade fails due to insufficient space in tempdb.

-> I temporarily added trace flag 902, which bypasses script upgrade mode at startup so the instance can be brought online for troubleshooting.

-> I started the SQL Server service with trace flag 902 and checked why tempdb was full.
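-> For reference, a named instance can be started with the trace flag from an elevated command prompt with something like the below (the instance name IN01 is inferred from the SQLAgent$IN01 service name earlier; adjust for your instance),

net start MSSQL$IN01 /T902

Alternatively, -T902 can be added to the startup parameters in SQL Server Configuration Manager and removed once troubleshooting is complete.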

-> Autogrowth was disabled for the tempdb files, and the size was just 8 MB, as I had left the default value during the SQL Server install.

-> It seems I had disabled autogrowth for tempdb but forgot to increase the tempdb file size.

-> I enabled autogrowth for the tempdb files.
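-> The tempdb file sizes and growth settings can be inspected and corrected with T-SQL similar to the below. This is a sketch: the logical file names tempdev and templog are the installation defaults, and the sizes shown are illustrative, not recommendations,

-- size is reported in 8 KB pages, so size*8/1024 gives MB
select name, size*8/1024 as size_mb, growth, is_percent_growth
from tempdb.sys.database_files;

-- enable autogrowth and set a larger initial size (illustrative values)
alter database tempdb modify file (name = tempdev, size = 512MB, filegrowth = 64MB);
alter database tempdb modify file (name = templog, size = 256MB, filegrowth = 64MB);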

-> I removed trace flag 902 from the SQL Server service startup parameters and restarted the service.

-> It started fine this time, and I could log in to SQL Server without any issues.

-> I checked the SQL Server error log and there were no errors.

-> Since the CU5 install had crashed with a "Could not start SQL Server" error, I decided to repair the SQL Server instance, just to make sure there were no remaining issues.

-> SQL Server repair completed fine.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided "AS IS" with no warranties and confer no rights.

Always ON – Growing LOG SEND QUEUE on Asynchronous replica

Environment

-> Database servers JBSAG2 and JBSAG1 are in the primary datacentre, located in East US.

-> Database server JBSAG3 is in the secondary datacentre, in Korea Central.

-> Database servers JBSAG1, JBSAG2 and JBSAG3 are part of a failover cluster without shared storage.

-> The log send queue on the asynchronous replica JBSAG3 was continuously growing, as shown in the screenshot below,

-> It was very clear that there was no issue on the synchronous replica JBSAG1, as its log send queue was well under control.

-> On the other hand, the log send queue was huge on the asynchronous replica JBSAG3.
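-> The log send queue and redo queue can also be checked per database and replica with a query along the lines of the below (a sketch using sys.dm_hadr_database_replica_states; queue sizes are reported in KB),

select ar.replica_server_name,
       db_name(drs.database_id) as database_name,
       drs.log_send_queue_size,  -- KB of log not yet sent to this replica
       drs.log_send_rate,
       drs.redo_queue_size,      -- KB of log received but not yet redone
       drs.redo_rate
from sys.dm_hadr_database_replica_states drs
join sys.availability_replicas ar
  on drs.replica_id = ar.replica_id;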

-> I captured perfmon data on database servers JBSAG1, JBSAG2 and JBSAG3. Below is the analysis,

Primary Replica -> JBSAG2

1) Primary JBSAG2 to its Log file

(SQLServer:Databases – Log Bytes Flushed/sec) shows how much the primary flushes to the log file per second. In this scenario, for the databases JBDB, JBSDBDW and StackOverflow2010, the primary is flushing close to 5+ MB/sec at maximum and around 1 MB/sec on average to the log file.
We have to sum Log Bytes Flushed/sec across all databases to arrive at a final value.
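The same counter can also be sampled from T-SQL via sys.dm_os_performance_counters, for example (a sketch; the object name prefix differs for named instances, e.g. MSSQL$IN01:Databases, and per-second counters are cumulative, so sample twice and divide by the interval),

select instance_name as database_name, cntr_value
from sys.dm_os_performance_counters
where counter_name = 'Log Bytes Flushed/sec'
  and object_name like '%:Databases%';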

2) Primary JBSAG2 to JBSAG1 via network

(SQLServer:Availability Replica -> Bytes Sent to Replica/sec) shows how much is sent across the network to the secondary replica. In our case, a maximum of 5+ MB/sec and 1+ MB/sec on average is sent to secondary JBSAG1 from primary JBSAG2. This matches SQLServer:Databases – Log Bytes Flushed/sec, which is good.

3) Primary JBSAG2 to JBSAG3 via network

(SQLServer:Availability Replica -> Bytes Sent to Replica/sec) shows how much is sent across the network to the secondary replica. In our case, a maximum of 600+ KB/sec and 400+ KB/sec on average is sent to secondary JBSAG3 from primary JBSAG2.

It is pretty clear that secondary JBSAG1 receives data at around 5+ MB/sec, close to what the primary flushes, whereas secondary JBSAG3 receives only 600+ KB/sec, which is comparatively very low.

Secondary Replica -> JBSAG1

1) Secondary JBSAG1 Received from Primary JBSAG2

(SQLServer:Database Replica -> Log Bytes Received/sec) shows how much data the secondary receives from the primary replica. In our case, a maximum of 5+ MB/sec and 500+ KB/sec on average is received by secondary JBSAG1 from primary JBSAG2. We have to sum Log Bytes Received/sec across all databases to arrive at a final value.

This matches SQLServer:Databases – Log Bytes Flushed/sec on the primary, which is good.

2) Redoing received data on Secondary

(SQLServer:Database Replica -> Redone Bytes/sec) shows how much data is redone on the secondary. In our case, a maximum of 5+ MB/sec and 500+ KB/sec on average was redone on secondary JBSAG1. We have to sum Redone Bytes/sec across all databases to arrive at a final value.

This matches Log Bytes Received/sec, which is good.

Secondary Replica -> JBSAG3

1) Secondary JBSAG3 Received from Primary JBSAG2

(SQLServer:Database Replica -> Log Bytes Received/sec) shows how much data the secondary receives from the primary replica. In our case, a maximum of 1+ MB/sec and 200+ KB/sec on average is received by secondary JBSAG3 from primary JBSAG2. We have to sum Log Bytes Received/sec across all databases to arrive at a final value.

This is much lower than what the primary sent, as measured by SQLServer:Databases – Log Bytes Flushed/sec. We sent 5 MB/sec but received just 1 MB/sec, so there is a lag of 4+ MB/sec, which is not good.

2) Redoing received data on Secondary

(SQLServer:Database Replica -> Redone Bytes/sec) shows how much data is redone on the secondary. In our case, a maximum of 1+ MB/sec and 200+ KB/sec on average was redone on secondary JBSAG3. We have to sum Redone Bytes/sec across all databases to arrive at a final value. The secondary redoes what it receives from the primary, so Redone Bytes/sec matches Log Bytes Received/sec.

From the above analysis we understand the following,

JBSAG2 -> Primary flushes 5 MB/sec to log file -> Primary sends 5 MB/sec to secondary via network.

JBSAG1 -> Secondary receives 5 MB/sec via network -> Secondary redo is also around 5 MB/sec -> Log send queue is around 120+ KB.

JBSAG3 -> Secondary receives 1 MB/sec via network -> Secondary redo is also around 1 MB/sec -> Log send queue is around 750+ MB.

From the above analysis it is clear that primary JBSAG2 is able to send 5 MB/sec to JBSAG1, but only 1 MB/sec to JBSAG3, and this results in the log send queue buildup on JBSAG3.

-> It looked like a network issue between the primary and secondary datacentres.

-> We tried a robocopy of a single 10 GB file from JBSAG2 to JBSAG1 and JBSAG3,

robocopy c:\temp\ \\jbsag1\temp file.zip
robocopy c:\temp\ \\jbsag3\temp file.zip

-> Robocopy indicated speed of 23 MB/sec for copy between JBSAG2 to JBSAG1.

-> Robocopy exhibited speed of 1 MB/sec for copy between JBSAG2 to JBSAG3.

-> The robocopy tests were performed just to rule out any SQL Server issue and to show the network team that a plain file copy also performs badly against JBSAG3.

-> I involved the network and platforms teams to check this further.

-> The network team advised that the network link seemed normal and there was no bandwidth issue. They suggested the platforms team check further, as it seemed like a guest machine issue.

-> In the meantime, we found another 3-node Always On setup with the same primary and secondary datacentres. That setup did not have this issue. Its asynchronous secondary server was on a different host machine in the secondary datacentre than the problematic asynchronous secondary server.

-> The platforms team migrated the problematic asynchronous secondary server to the host machine where the working setup's asynchronous secondary server was running.

-> This did not resolve the issue. The platforms team then took a netsh trace from JBSAG2 to JBSAG1 and JBSAG3.

-> The traces showed that no TCP window scaling factor was negotiated during the three-way handshake. This led to the suspicion that the TCP global parameter for receive window auto-tuning could be the cause of this issue. By default, this setting is enabled.

-> We checked this parameter using the command below on servers JBSAG1, JBSAG2 and JBSAG3,

netsh interface tcp show global

-> The TCP global parameter for receive window auto-tuning was disabled on JBSAG1 and JBSAG3.

-> Auto-tuning was enabled, along with RSS, using the command below on JBSAG1 and JBSAG3,

netsh interface tcp set global rss=enabled autotuninglevel=Normal

-> Once these were enabled, the transfer speeds increased and the TCP window scaling factor was negotiated to the highest value.

-> This resolved our issue, and the log send queue size on JBSAG3 came under control.


Always ON – The connection to the primary replica is not active. The command cannot be processed. (Microsoft SQL Server, Error: 35250)

Environment

-> I was trying to add a new database to an availability group and got the error below,

TITLE: Microsoft SQL Server Management Studio

Failed to join the database ‘JBDB’ to the availability group ‘JBSAG’ on the availability replica ‘JBSAG2’. (Microsoft.SqlServer.Management.HadrModel)

For help, click: https://go.microsoft.com/fwlink?ProdName=Microsoft+SQL+Server&ProdVer=16.100.37971.0&EvtSrc=Microsoft.SqlServer.Management.Smo.ExceptionTemplates.FailedOperationExceptionText&LinkId=20476

ADDITIONAL INFORMATION:
An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)

The connection to the primary replica is not active. The command cannot be processed. (Microsoft SQL Server, Error: 35250)
For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft%20SQL%20Server&ProdVer=14.00.3294&EvtSrc=MSSQLServer&EvtID=35250&LinkId=20476

BUTTONS:
OK

-> Execute the code below in SQLCMD mode to check whether the database mirroring endpoint is started on all nodes,

:Connect JBSAG1
select name, state_desc, port from sys.tcp_endpoints where type_desc='DATABASE_MIRRORING'
go
:Connect JBSAG2
select name, state_desc, port from sys.tcp_endpoints where type_desc='DATABASE_MIRRORING'
go
:Connect JBSAG3
select name, state_desc, port from sys.tcp_endpoints where type_desc='DATABASE_MIRRORING'
go

-> The endpoint on JBSAG2 was stopped.

-> I started it with the code below in SQLCMD mode,

:Connect JBSAG2
alter endpoint [Hadr_endpoint] state = started
go

-> Adding the database to the availability group worked fine after the above fix.
