SQL resources not coming online after patching

-> I had a requirement to install service pack on a 2-node cluster server. It was an active-passive server.

-> I installed the service pack on the passive node successfully. I tried failing over the SQL services from Active node to Passive node and it failed.

-> Checked the cluster.log and found the below errors,

0001164.00001b70::2016/12/01-05:10:32.854 WARN [ClNet] Tcpip is not bound to adapter E8BF15A3-3617-4818-94E4-7BCC37A3B45D.
000014dc.00000868::2016/12/01-05:12:41.025 ERR  SQL Server <SQL Server>: [sqsrvres] OnlineThread: SQLClusterResourceWorkerWaitForCompletion failed (status 0)
0000101c.00001078::2016/12/01-05:12:41.025 WARN [FM] FmpHandleResourceTransition: Resource Name = 3c028454-7dc4-4943-8bb9-89712945d2f2 [SQL Server] old state=129 new state=4
0000101c.00001078::2016/12/01-05:12:41.025 WARN [FM] FmpHandleResourceTransition: Resource failed, post a work item
000014dc.00001b1c::2016/12/01-05:14:43.117 ERR  SQL Server <SQL Server>: [sqsrvres] OnlineThread: SQLClusterResourceWorkerWaitForCompletion failed (status 0)
0000101c.00001078::2016/12/01-05:14:43.117 WARN [FM] FmpHandleResourceTransition: Resource Name = 3c028454-7dc4-4943-8bb9-89712945d2f2 [SQL Server] old state=129 new state=4
0000101c.00001078::2016/12/01-05:14:43.117 WARN [FM] FmpHandleResourceTransition: Resource failed, post a work item
0000101c.00001078::2016/12/01-05:14:43.117 WARN [FM] Group failure for group <56e7dc42-6936-4227-9ed7-eea24233be74>. Create thread to take offline and move.
00001164.000013ac::2016/12/01-05:14:44.086 WARN Physical Disk <R: >: Offline, Locking volume failed, error 5.
00001164.000013ac::2016/12/01-05:14:45.164 WARN Physical Disk <R: >: Offline, Locking volume failed, error 5.
0000101c.00001028::2016/12/01-05:16:20.834 WARN [INIT] The cluster service is shutting down.
0000101c.00001028::2016/12/01-05:16:20.834 WARN [FM] Shutdown: Failover Manager requested to shutdown groups.

-> Eventviewer log contains the error as below,

 OnlineThread: SQLClusterResourceWorkerWaitForCompletion failed (status 0)

-> Tried starting the SQL service from services.msc. The SQL server came online after script upgrade mode. Connected the SQL server instance from SQL server management studio and the version changed as appropriate.

-> Enabled verbose logging for cluster by setting VerboseLogging to 1 in HKEY_LOCAL_MACHINE\Cluster\Groups\<GUID>\Parameters\VerboseLogging.

-> Started the SQL services again from cluadmin.msc. It failed and got the below message on Cluster.log.

00001184.000019d4::2016/12/04-07:20:30.771 INFO SQL Server <SQL Server>: [sqsrvres] Worker Thread (F2DF70): Worker thread starting …
00001184.000019d4::2016/12/04-07:20:30.771 INFO SQL Server <SQL Server>: [sqsrvres] Worker Thread (F2DF70): Entering SQLClusterSharedDataUpgradeWorker thread.
00001184.000019d4::2016/12/04-07:20:30.771 INFO SQL Server <SQL Server>: [sqsrvres] Worker Thread (F2DF70): SqlDataRoot = ‘R:\MSSQL10.MSSQLSERVER\MSSQL’,         CurrentPatchLevel = ‘10.3.5500.0’,         SharedDataPatchLevel = ‘10.1.2531.0’
00001184.000019d4::2016/12/04-07:20:30.771 INFO SQL Server <SQL Server>: [sqsrvres] Worker Thread (F2DF70): FTDataPath = ‘R:\MSSQL10.MSSQLSERVER\MSSQL\FTData’,         SQLGroup = ‘S-1-5-21-1897246555-1303978887-2451887125-1005’
00001184.000019d4::2016/12/04-07:20:30.771 INFO SQL Server <SQL Server>: [sqsrvres] Worker Thread (F2DF70): Entering DoSQLDataRootApplyACL (R:\MSSQL10.MSSQLSERVER\MSSQL\Data).
00001184.000019d4::2016/12/04-07:20:30.771 INFO SQL Server <SQL Server>: [sqsrvres] Worker Thread (F2DF70): ACL-ing (R:\MSSQL10.MSSQLSERVER\MSSQL\Data) as full-control to SQLGroup.
00001184.000019d4::2016/12/04-07:20:30.787 WARN SQL Server <SQL Server>: [sqsrvres] Worker Thread (F2DF70): DoSQLDataRootApplyACL : Failed to apply security to SQLDataRoot (122).
00001184.000019d4::2016/12/04-07:20:30.787 INFO SQL Server <SQL Server>: [sqsrvres] Worker Thread (F2DF70): Exiting DoSQLDataRootApplyACL (0).
00001184.000019d4::2016/12/04-07:20:30.787 INFO SQL Server <SQL Server>: [sqsrvres] Worker Thread (F2DF70): Entering DoIFTSSharedDataUpgrade.
00001184.00001e54::2016/12/04-07:20:30.975 INFO SQL Server <SQL Server>: [sqsrvres] Worker Thread (F2DF70): Calling SQLClusterResourceWorker::WaitForCompletion (200)

-> The cause seems to be that the SQL resource DLL detects the installation of a patch during the first failover to the newly patched node. Thus it spawns a worker thread to update the databases accordingly (e.g. create updated system stored procedures and views, etc.). As a first step this worker thread will re-create the NTFS file permissions on the data folders of this SQL instance. As the resource DLL runs in the process RHS.EXE (Failover Cluster Resource Host Subsystem), it runs with the credentials of NT AUTHORITY\SYSTEM.

-> Provided full control to SYSTEM, cluster account, SQL service account to below location.

R:\MSSQL10.MSSQLSERVER\MSSQL\Data
R:\MSSQL10.MSSQLSERVER\MSSQL\
R:\

-> Tried running icacls R:\MSSQL10.MSSQLSERVER\MSSQL\Data, icacls R:\MSSQL10.MSSQLSERVER\MSSQL and icacls R:\ and was able to see that all the above account were having full permission.

-> Started a procmon and failed over the SQL services from active node to passive node. The SQL services failed in cluadmin.msc, but still clusterresourcemonitor was working with files on folder R:\MSSQL10.MSSQLSERVER\MSSQL\LOG.

-> It seems like the worker thread that recreates the NTFS file permissions on the log folder of this SQL instance is causing the issue.

-> The log folder was having 373275 files and was of size 23GB.

-> Tried renaming the folder R:\MSSQL10.MSSQLSERVER\MSSQL\LOG to R:\MSSQL10.MSSQLSERVER\MSSQL\LOG_old and created a dummy LOG folder.

-> Failed over the SQL services from active to passive node. SQL services came online without any issues.

-> It seems that during the first failover after a patch the worker thread while re-creating the NTFS file permissions on MSSQL folder had to move through huge lot of files and eventually timed out and that is the reason why this failed out earlier.

-> Patched the active node. Failover and failback worked fine.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s