Rule “Valid DSN” and Rule “Valid Database compatibility level and successful connection” failed.

-> I was performing an In-Place upgrade from SQL Server 2014 to SQL Server 2016.

-> Below rules failed,

Blog91_1

-> Rule Check 1 -> Valid DSN

Blog91_2

—————————
Rule Check Result
—————————
Rule “Valid DSN” failed.

The report server configuration is not complete or is invalid. Use Reporting Services Configuration Manager to verify the report server configuration.
—————————
OK
—————————

-> Rule Check 2 -> Valid Database compatibility level and successful connection

Blog91_3
—————————
Rule Check Result
—————————
Rule “Valid Database compatibility level and successful connection” failed.

The report server database is not a supported compatibility level or a connection cannot be established. Use Reporting Services Configuration Manager to verify the report server configuration and SQL Server management tools to verify the compatibility level.
—————————
OK
—————————

-> SQL Server Reporting Services is installed but not configured, and it seems like that's the reason for this issue.
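
-> Before fixing the configuration, you can verify the compatibility level and state of the existing report server databases from SSMS. A quick check, assuming the default database names ReportServer and ReportServerTempDB:

SELECT name, compatibility_level, state_desc
FROM sys.databases
WHERE name IN (N'ReportServer', N'ReportServerTempDB')
GO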

-> I configured SQL Server Reporting Services as below,

Blog91_4

Blog91_5

Blog91_6

Blog91_7

Blog91_8

Blog91_9

Blog91_10

Blog91_11

Blog91_12

-> Refresh the rules window in the upgrade setup again and the checks will succeed this time,

Blog91_13

-> This allowed me to complete the upgrade without any further issue.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties and confer no rights.

 

Always On – Availability group not failing over automatically

-> Client applications not able to connect to the listener.

-> Environment setup is as below,

Backup_Setup

-> When I try connecting to the listener using SQL Server Management Studio (SSMS), I get the below error,

Blog1_0

TITLE: Connect to Server
——————————
Cannot connect to JBAPP.
——————————
ADDITIONAL INFORMATION:
A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 – Could not open a connection to SQL Server) (Microsoft SQL Server, Error: 2)
For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft%20SQL%20Server&EvtSrc=MSSQLServer&EvtID=2&LinkId=20476
——————————
The system cannot find the file specified
——————————
BUTTONS:
OK
——————————

-> I tried connecting to the database servers JBAG1 and JBAG2 manually using SSMS. The connection to JBAG1 worked, but the connection to JBAG2 failed with the below error,

Blog1_2

TITLE: Connect to Server
——————————
Cannot connect to JBAG2\IN01.
——————————
ADDITIONAL INFORMATION:
A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 – The wait operation timed out.) (Microsoft SQL Server, Error: 258)
For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft%20SQL%20Server&EvtSrc=MSSQLServer&EvtID=258&LinkId=20476
——————————
The wait operation timed out
——————————
BUTTONS:
OK
——————————

-> I connected to JBAG1, opened a new query window against one of the user databases and tried a SELECT statement; it worked. But queries failed with the below error when performing an INSERT, DELETE or UPDATE.

Blog1_1

Msg 3906, Level 16, State 2, Line 4
Failed to update database “JBDB” because the database is read-only.

-> The latency between the primary and secondary datacentres is 2 ms. Hence the setup uses synchronous-commit replicas with automatic failover. Please note that this setup would be a bad design if the latency between the datacentres were higher or if there were frequent network glitches.
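
-> To confirm how each replica is configured, the availability mode and failover mode can be checked with a query like the below (a minimal sketch; run it on any replica that is online):

SELECT ag.name AS availability_group,
ar.replica_server_name,
ar.availability_mode_desc,
ar.failover_mode_desc
FROM sys.availability_replicas ar
JOIN sys.availability_groups ag ON ag.group_id = ar.group_id
GO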

-> I checked further and below were my observations,

  1. JBAG2 was the PRIMARY replica and SQL Server was down on it.
  2. JBAG1 was the secondary replica. It seems that automatic failover did not happen.

-> The Always On Availability Group on JBAG1 was in the RESOLVING state,

Blog1_3
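
-> The role of the local replica can also be confirmed from T-SQL. A minimal sketch using the Always On DMVs; in this situation role_desc shows RESOLVING on JBAG1:

SELECT ag.name AS availability_group,
ars.role_desc,
ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_groups ag ON ag.group_id = ars.group_id
WHERE ars.is_local = 1
GO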

-> I started the SQL Server service on JBAG2 and after some time everything started working fine, including the Always On Availability Group.

-> Now comes the question: why did automatic failover not happen?

-> I opened cluadmin.msc, went to “Cluster Events” and found the below errors,

Blog1_4

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 4/11/2020 8:29:08 AM
Event ID: 1069
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer: JBAG2.JBS.COM
Description:
Cluster resource ‘JBAG’ of type ‘SQL Server Availability Group’ in clustered role ‘JBAG’ failed.
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name=”Microsoft-Windows-FailoverClustering” Guid=”{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}” />
<EventID>1069</EventID>
<Version>1</Version>
<Level>2</Level>
<Task>3</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime=”2020-04-11T02:59:08.867171300Z” />
<EventRecordID>2818</EventRecordID>
<Correlation />
<Execution ProcessID=”2464″ ThreadID=”3496″ />
<Channel>System</Channel>
<Computer>JBAG2.JBS.COM</Computer>
<Security UserID=”S-1-5-18″ />
</System>
<EventData>
<Data Name=”ResourceName”>JBAG</Data>
<Data Name=”ResourceGroup”>JBAG</Data>
<Data Name=”ResTypeDll”>SQL Server Availability Group</Data>
</EventData>
</Event>

Blog1_5

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 4/11/2020 8:29:09 AM
Event ID: 1254
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer: JBAG2.JBS.COM
Description:
Clustered role ‘JBAG’ has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. No additional attempts will be made to bring the role online or fail it over to another node in the cluster. Please check the events associated with the failure. After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name=”Microsoft-Windows-FailoverClustering” Guid=”{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}” />
<EventID>1254</EventID>
<Version>0</Version>
<Level>2</Level>
<Task>3</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime=”2020-04-11T02:59:09.469214500Z” />
<EventRecordID>2822</EventRecordID>
<Correlation />
<Execution ProcessID=”2464″ ThreadID=”3496″ />
<Channel>System</Channel>
<Computer>JBAG2.JBS.COM</Computer>
<Security UserID=”S-1-5-18″ />
</System>
<EventData>
<Data Name=”ResourceGroup”>JBAG</Data>
</EventData>
</Event>

-> The above error shows that the failover did not happen because the failover threshold was reached. I checked cluster.log on JBAG2 to confirm this. Refer to this article if you want to know the command to generate cluster.log.

-> Cluster.log provides the same reason,

[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 INFO [RCM] Resource JBAG is causing group JBAG to failover.
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 INFO [RCM] rcm::RcmGroup::Failover: (JBAG)
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 DBG [RCM] rcm::RcmGroup::FailedDueToError=> (JBAG, 5963, false)
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 DBG [RCM] rcm::RcmGroup::UpdateAndGetFailoverCount=> (1, 2020/04/11-08:25:33.204)
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 DBG [RCM] rcm::RcmGroup::ComputeFailoverThreshold=> (JBAG, 1, computed)
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 WARN [RCM] Not failing over group JBAG, failoverCount 2, failoverThresholdSetting 4294967295, lastFailover 2020/04/11-08:25:33.204
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 DBG [RCM] rcm::RcmGroup::FailedDueToError=> (JBAG, 5963, false)
[Verbose] 000009a0.000014c4::2020/04/11-08:31:18.014 DBG [RCM] rcm::RcmResource::DelayedRestart(JBAG_192.168.0.45)

-> Increasing the Failover threshold will fix this issue.

-> Open Failover Cluster Manager (cluadmin.msc). Click Roles. Right-click the role and click Properties,

Blog1_6

-> As per the below settings, only 1 failover is allowed in a 6-hour period. This explains why automatic failover did not happen.

Blog1_7

-> In my case I changed the value from 1 to 5. This resolved my issue with automatic failover.

Blog1_8

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties and confer no rights.

Resizing SQL Server Database from a single data file to multiple data files

-> The requirement is to move the data in a database from 1 data file to 4 data files.

-> Existing setup,

SQL Server : SQL Server 2017
Database Size : 2 TB
Number of Data file(s) : 1
Data file size : 1.8 TB
Log file size : 200 GB

-> Solution requirement,

Number of Data files : 4
Data File 1 Size : 650 GB
Data File 2 Size : 650 GB
Data File 3 Size : 650 GB
Data File 4 Size : 650 GB
Log file size : 200 GB

-> Below tasks were undertaken on a test server initially.

-> The production database was restored on a test server. 3 additional data drives of 700 GB each were added.

-> The database recovery model was changed from Full to Simple.
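
-> A minimal sketch of this step, using the same DATABASE_NAME placeholder that appears in the commands later in this post:

USE [master]
GO
ALTER DATABASE [DATABASE_NAME] SET RECOVERY SIMPLE
GO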

-> Added 3 additional data files of 650 GB each, one on each of the 3 additional drives that were added.
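
-> The additional files can be created with ALTER DATABASE ... ADD FILE. They must go into the same filegroup as the existing data file (PRIMARY here) so that the EMPTYFILE step below can move data into them. The logical names, drive letters and paths in this sketch are hypothetical placeholders:

USE [master]
GO
ALTER DATABASE [DATABASE_NAME]
ADD FILE (NAME = N'DATA_FILE_2', FILENAME = N'E:\Data\DATA_FILE_2.ndf', SIZE = 650GB) TO FILEGROUP [PRIMARY]
GO
ALTER DATABASE [DATABASE_NAME]
ADD FILE (NAME = N'DATA_FILE_3', FILENAME = N'F:\Data\DATA_FILE_3.ndf', SIZE = 650GB) TO FILEGROUP [PRIMARY]
GO
ALTER DATABASE [DATABASE_NAME]
ADD FILE (NAME = N'DATA_FILE_4', FILENAME = N'G:\Data\DATA_FILE_4.ndf', SIZE = 650GB) TO FILEGROUP [PRIMARY]
GO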

-> Executed the below command on the primary data file. This command moves the data for all user objects from the primary data file into the additional data files that were added.

USE [DATABASE_NAME]
GO
DBCC SHRINKFILE (N'PRIMARY_DATA_FILE' , EMPTYFILE)
GO

-> The above command will be very slow. In my case it took close to 13 hours to complete. While the above command was executing I used below code to check the progress,


IF CONVERT(varchar(20), SERVERPROPERTY('productversion')) LIKE '8%'
    SELECT [name], fileid, filename,
        [size]/128.0 AS 'Total Size in MB',
        [size]/128.0 - CAST(FILEPROPERTY(name, 'SpaceUsed') AS int)/128.0 AS 'Available Space In MB',
        CAST(FILEPROPERTY(name, 'SpaceUsed') AS int)/128.0 AS 'Used Space In MB',
        (100 - ((([size]/128.0 - CAST(FILEPROPERTY(name, 'SpaceUsed') AS int)/128.0) / ([size]/128.0)) * 100.0)) AS 'Percentage Used'
    FROM sysfiles
ELSE
    SELECT @@SERVERNAME AS 'ServerName', DB_NAME() AS DBName, [name], file_id, physical_name,
        [size]/128.0 AS 'Total Size in MB',
        [size]/128.0 - CAST(FILEPROPERTY(name, 'SpaceUsed') AS int)/128.0 AS 'Available Space In MB',
        CAST(FILEPROPERTY(name, 'SpaceUsed') AS int)/128.0 AS 'Used Space In MB',
        (100 - ((([size]/128.0 - CAST(FILEPROPERTY(name, 'SpaceUsed') AS int)/128.0) / ([size]/128.0)) * 100.0)) AS 'Percentage Used'
    FROM sys.database_files
GO

-> Reviewing the output from the above query, specifically “Used Space in MB” and “Percentage Used”, will show whether the process is progressing.

-> I stopped the resizing query when the primary data file’s “Used space in MB” reached 471,860 MB.

-> I stopped this in between to make sure I was not moving all the data out of the primary data file, which would leave it mostly empty and result in too much new data being inserted into the primary data file later.

-> Shrank the primary data file from 1.8 TB to 650 GB.

-> There are instances where the shrink can take several hours if the data movement has not completed fully. In my case it completed in 3 minutes using the below command,


USE [DATABASE_NAME]
GO
DBCC SHRINKFILE (N'PRIMARY_DATA_FILE' , 665600, TRUNCATEONLY)
GO

-> In case shrinking the primary data file is very slow, you should allow the EMPTYFILE operation to complete fully. You will get the below error message when it completes,

Msg 1119, Level 16, State 1, Line 20
Removing IAM page (3:5940460) failed because someone else is using the object that this IAM page belongs to.

-> You can get more details about above error message from this article.

-> Reissue the shrink command and it will complete soon.

-> The problem with allowing the data movement to complete fully is that the primary data file will then have the most free space, so you will experience more writes on the primary data file than on the other 3 data files, and this can result in sub-optimal performance.

-> Perform a reindex on the database to ensure you remove any fragmentation as a result of resizing.
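
-> One rough way to do the reindex is to rebuild every index in the database. The sketch below uses sp_MSforeachtable, which is undocumented but present in SQL Server, so treat it as an assumption and substitute your own index maintenance routine if you have one:

USE [DATABASE_NAME]
GO
-- Rebuild all indexes on every user table
EXEC sp_MSforeachtable 'ALTER INDEX ALL ON ? REBUILD'
GO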

-> Changed the recovery model for the database from Simple to Full and performed a full backup.
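
-> A minimal sketch of this last step; the backup path is a hypothetical placeholder:

USE [master]
GO
ALTER DATABASE [DATABASE_NAME] SET RECOVERY FULL
GO
BACKUP DATABASE [DATABASE_NAME]
TO DISK = N'H:\Backup\DATABASE_NAME_Full.bak'
WITH COMPRESSION, STATS = 10
GO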

-> This method worked out well for me in the test environment.

-> This method was replicated on our production environment after 6 months. It ran into issues on the production environment due to the below reasons,

  1. The scripts to add and size the data files in the production environment were a copy of the scripts used in the test environment.
  2. The data growth in production over those 6 months was not taken into account, and the additional data file sizes added could not cope with the additional data.

-> Due to the above issues, the behavior on the production database server was as below,

  1. When the resizing was started, we checked the progress using the query provided above. We found that the additional data files’ “Used Space in MB” was increasing and “Available Space In MB” was decreasing.
  2. But in the primary data file, “Used Space in MB” and “Available Space In MB” did not change; they were static. The expected result was that “Available Space In MB” should increase and “Used Space in MB” should decrease.
  3. It was stopped after 10 hours. We then realized that after the shrink with EMPTYFILE command was terminated, the primary data file’s “Used Space in MB” started coming down and “Available Space In MB” started increasing. This took 1 more hour, and we were able to see some data moved from the primary data file to the secondary data files.
  4. We then increased the additional data file sizes appropriately and started executing the command again. It moved the required amount of data and worked as expected. I stopped the resize at a point where the data files had the same amount of data and performed an index optimize.

-> In my case I was lucky that we took downtime for a whole weekend.

-> The whole process will not be possible on a production environment if no downtime is allowed.

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties and confer no rights.