Transactional replication – Administering distribution database size

Environment

-> Below is the replication topology,

-> I have three (3) publications. Publication JBDBREPL has 3 articles, Publication Jimburepl has 1 article an Publication Jalorepl has 1 article.

-> Distribution database size is too huge for such a simple replication setup. Distribution database is close to 20 GB.

-> Checking replication related object details,

-> MSrepl_Commands and MSrepl_transactions table are too huge. They are close to 55+ Million rows.

-> First thing that comes to my mind is that Distribution cleanup job is either failing or getting blocked. Check Distribution cleanup job history,

-> The job is executing fine every 10 mins. Lets check if cleanup is happening from the job history,

Date 11/8/2020 12:05:00 AM
Log Job History (Distribution clean up: distribution)

Step ID 1
Server JBRESEARCH\IN2019
Job Name Distribution clean up: distribution
Step Name Run agent.
Duration 00:00:28
Sql Severity 0
Sql Message ID 21010
Operator Emailed
Operator Net sent
Operator Paged
Retries Attempted 0

Message
Executed as user: NT SERVICE\SQLAgent$IN2019. Deleted 0 row(s) per millisecond from MSrepl_commands [SQLSTATE 01000] (Message 22121) Deleted 0 row(s) per millisecond from MSrepl_transactions [SQLSTATE 01000] (Message 22121) Deleted 0 row(s) per millisecond from MSrepl_commands [SQLSTATE 01000] (Message 22121) Deleted 0 row(s) per millisecond from MSrepl_transactions [SQLSTATE 01000] (Message 22121) Deleted 0 row(s) per millisecond from MSrepl_commands [SQLSTATE 01000] (Message 22121) Deleted 0 row(s) per millisecond from MSrepl_transactions [SQLSTATE 01000] (Message 22121) Deleted 0 row(s) per millisecond from MSrepl_commands [SQLSTATE 01000] (Message 22121) Deleted 0 row(s) per millisecond from MSrepl_transactions [SQLSTATE 01000] (Message 22121) Removed 0 replicated transactions consisting of 0 statements in 19304 milliseconds (0 rows/millisec). [SQLSTATE 01000] (Message 21010). The step succeeded.

-> Lets execute below query manually,

EXEC dbo.sp_MSdistribution_cleanup @min_distretention = 0, @max_distretention = 72

Deleted 0 row(s) per millisecond from MSrepl_commands
Deleted 0 row(s) per millisecond from MSrepl_transactions
Deleted 0 row(s) per millisecond from MSrepl_commands
Deleted 0 row(s) per millisecond from MSrepl_transactions
Deleted 0 row(s) per millisecond from MSrepl_commands
Deleted 0 row(s) per millisecond from MSrepl_transactions
Deleted 0 row(s) per millisecond from MSrepl_commands
Deleted 0 row(s) per millisecond from MSrepl_transactions
Removed 0 replicated transactions consisting of 0 statements in 17633 milliseconds (0 rows/millisec).

Completion time: 2020-11-08T00:19:44.9257397+05:30

-> It seems like the Distribution cleanup job is completing fine, but no rows are getting cleaned up.

-> Executing below query to check min distrib retention, max distrib retention and history retention,

sp_helpdistributor

-> The setting seems normal. The most common reason for distribution database growing is when immediate_sync is set to true. Lets check what immediate_sync is set to in our case,

-> It is set to 0 for all 3 publications. So it is not the cause.

-> Execute below query which will provide databases which have obsolete publications/subscriptions that are not being cleaned up, MinDate that reveals the date of oldest transaction in Msrepl_Transactions table, MaxDate that reveals the date of latest transaction in Msrepl_Transactions table and Commands refers to total data available.

 USE distribution
 GO
 SELECT
     t.publisher_database_id,
     MinDate        = MIN(entry_Time),
     MaxDate        = MAX(entry_Time), 
     Commands = COUNT_BIG(*)
 FROM
     dbo.msrepl_transactions t
     join dbo.msrepl_commands s
     ON  s.publisher_database_id = t.publisher_database_id
     AND s.xact_seqno            = t.xact_seqno 
 WHERE
     entry_time < GETDATE()
 GROUP BY
     t.publisher_database_id
 GO

-> This query execution time depends on the number of records present in table Msrepl_Transaction and Msrepl_Commands.

-> From the above data it is clear that we have 55 Million rows and that the oldest transaction is on 05 November 2020 19:55. It seems like cleanup is not happening after 05 November 2020 19:55.

-> Execute below query which I took from this article.

set nocount on
 declare    @publisher_database_id int,
         @min_cutoff_time datetime,
         @max_cleanup_xact_seqno varbinary(16),
         @retention int, 
         @rptdtl int 
 select @retention = 72  --If the retention period has been changed, then update this to that value
 select @publisher_database_id = NULL  --if you want to look for specific database, change this from NULL to value from MSPublisher_databases table.
 select @rptdtl = 2  --if you want more information about agents keeping the xact_seqno to a minimum, change this to 2.
 select @min_cutoff_time = dateadd(hour, -@retention, getdate())
 declare @min_agent_sub_xact_seqno varbinary(16)
         ,@max_agent_hist_xact_seqno varbinary(16)
         ,@active int
         ,@initiated int
         ,@agent_id int
         ,@min_xact_seqno varbinary(16)
 --Loop through each database
 if @publisher_database_id is null
 begin
    declare pub_db scroll cursor for  
    select id from MSpublisher_databases    
 end
 else
 begin
    declare pub_db scroll cursor for  
    select id from MSpublisher_databases 
    where id = @publisher_database_id
 end
 open pub_db
 fetch first from pub_db into @publisher_database_id      
 while @@FETCH_STATUS = 0
 begin      
 -- set @min_xact_seqno to NULL and reset it with the first prospect of min_seqno we found later
 select @min_xact_seqno = NULL
 select @active = 2
 select @initiated = 3
 -- -- cursor through each agent with it's smallest sub xact seqno --
 declare #tmpAgentSubSeqno cursor local forward_only  for
 select a.id, min(s2.subscription_seqno) from MSsubscriptions s2 join MSdistribution_agents a
 on (a.id = s2.agent_id) 
 where s2.status in( @active, @initiated ) and
                             /* Note must filter out virtual anonymous agents !!!
                                       a.subscriber_id <> @virtual_anonymous and */
                             -- filter out subscriptions to immediate_sync publications
 not exists (select * from MSpublications p 
             where s2.publication_id = p.publication_id 
              and p.immediate_sync = 1) 
 and a.publisher_database_id = @publisher_database_id
 group by a.id
 open #tmpAgentSubSeqno 
 fetch #tmpAgentSubSeqno into @agent_id, @min_agent_sub_xact_seqno 
 if (@@fetch_status = -1) -- rowcount = 0 (no subscriptions)
 begin
         -- If we have a publication which allows for init from backup with a min_autonosync_lsn set
         --   we don't want this proc to signal cleanup of all commands
         -- Note that if we filter out immediate_sync publications here as they will already have the
         --   desired outcome.  The difference is that those with min_autonosync_lsn set have a watermark
         --   at which to begin blocking cleanup.
     if not exists (select * from dbo.MSpublications msp join MSpublisher_databases mspd 
                    ON mspd.publisher_id = msp.publisher_id 
                     and mspd.publisher_db = msp.publisher_db
                    where mspd.id = @publisher_database_id 
                     and msp.immediate_sync = 1)
     begin
        select top(1) @min_xact_seqno = msp.min_autonosync_lsn from dbo.MSpublications msp join MSpublisher_databases mspd 
        ON mspd.publisher_id = msp.publisher_id 
        and mspd.publisher_db = msp.publisher_db
        where mspd.id = @publisher_database_id 
         and msp.allow_initialize_from_backup <> 0
          and msp.min_autonosync_lsn is not null
           and msp.immediate_sync = 0
        order by msp.min_autonosync_lsn asc
     end
  end
 while (@@fetch_status <> -1)
  begin
         --
         --always clear the local variable, next query may not return any resultset
         --
     set @max_agent_hist_xact_seqno = NULL
         --
         --find last history entry for current agent, if no history then the query below should leave @max_agent_xact_seqno as NULL
         --
     select top 1 @max_agent_hist_xact_seqno = xact_seqno from MSdistribution_history 
     where agent_id = @agent_id 
     order by timestamp desc
 if @rptdtl = 2 begin    select @publisher_database_id as Pub_DB_ID,@agent_id as AgentID,@max_agent_hist_xact_seqno as Max_Agent_Hist_Xact_Seqno,           @min_agent_sub_xact_seqno as Min_Agent_Sub_Xact_Seqno end     --     --now find the last xact_seqno this agent has delivered:     --if last history was written after initsync, use histry xact_seqno otherwise use initsync xact_seqno             -- if isnull(@max_agent_hist_xact_seqno, @min_agent_sub_xact_seqno) <= @min_agent_sub_xact_seqno  begin    set @max_agent_hist_xact_seqno = @min_agent_sub_xact_seqno end     --@min_xact_seqno was set to NULL to start with, the first time we get here, it'll gets set to a non-NULL value     --then we graduately move to the smallest hist/sub seqno if ((@min_xact_seqno is null) or (@min_xact_seqno > @max_agent_hist_xact_seqno)) begin     set @min_xact_seqno = @max_agent_hist_xact_seqno  end fetch #tmpAgentSubSeqno into @agent_id, @min_agent_sub_xact_seqno 
 end
  close #tmpAgentSubSeqno
  deallocate #tmpAgentSubSeqno
     /* 
     ** Optimized query to get the maximum cleanup xact_seqno
     /     / 
     ** If the query below returns nothing, nothing can be deleted.
     ** Reset @max_cleanup_xact_seqno to 0.
     */
  select @max_cleanup_xact_seqno = 0x00
     -- Use top 1 to avoid warning message of "Null in aggregate…" which will make
     -- sqlserver agent job having failing status
 if @min_xact_seqno is NULL and @rptdtl = 2 and @agent_id is NULL
    begin
       select @publisher_database_id as Pub_DB_ID, 'No immediate sync publications, so is not dependent on Agent History'
    end
    select top 1 @max_cleanup_xact_seqno = xact_seqno from MSrepl_transactions with (nolock)
    where publisher_database_id = @publisher_database_id 
    and (xact_seqno < @min_xact_seqno
    or @min_xact_seqno IS NULL) 
    and entry_time <= @min_cutoff_time
    order by xact_seqno desc
 select @publisher_database_id as Pub_DB_ID, @max_cleanup_xact_seqno as Max_Cleanup_Xact_Seqno
 fetch next from pub_db into @publisher_database_id    
 end
 close pub_db
 deallocate pub_db

-> Below is the output of the above query,

-> Lets use below query to check MSrepl_transactions to reveal the date when transactions were recorded in the Distribution database using Max_Agent_Hist_Xact_Seqno from above output,

SELECT [publisher_database_id], [xact_id],[xact_seqno],CONVERT(char(10), [entry_time],126) [entry_time]
 FROM [distribution].[dbo].[MSrepl_transactions]
 WHERE xact_seqno =0x000059B4000014480003000000000000 --Agent ID 11

 SELECT [publisher_database_id],[xact_id],[xact_seqno],CONVERT(char(10), [entry_time],126) [entry_time]
 FROM [distribution].[dbo].[MSrepl_transactions]
 WHERE xact_seqno =0x000059B4000014480003000000000000 --Agent ID 14

 SELECT [publisher_database_id],[xact_id],[xact_seqno],CONVERT(char(10), [entry_time],126) [entry_time]
 FROM [distribution].[dbo].[MSrepl_transactions]
 WHERE xact_seqno =0x00000053000071100009000000000000 --Agent ID 19

-> From the output above it is clear that Agent ID 19 which has the entry_time 05 November 2020 has issues.

-> Lets use below query to find the publication associated with Agent ID 19.

select id as Agent_ID, publication as Publication from MSdistribution_agents where id in (19)

-> In my case it is publication JaloRepl. Let check the replication monitor and see if there are any errors with this publication,

-> It seems like the distribution agent was stopped and the stop date is around 05 November 2020 19:55.

-> One way is to check if this publication is required and can be removed if it is not required. In my case, the publication was required.

-> I tried starting it and got below error,

Command attempted:
if @@trancount > 0 rollback tran
(Transaction sequence number: 0x0000005300007120000400000000, Command ID: 1)

Error messages:
The row was not found at the Subscriber when applying the replicated UPDATE command for Table ‘[dbo].[jalo]’ with Primary Key(s): [sno] = 2 (Source: MSSQLServer, Error number: 20598)
Get help: http://help/20598
The row was not found at the Subscriber when applying the replicated UPDATE command for Table ‘[dbo].[jalo]’ with Primary Key(s): [sno] = 2 (Source: MSSQLServer, Error number: 20598)
Get help: http://help/20598

-> It seems like the distribution agent was stopped after this error without troubleshooting further. We found the missing row and created it on the subscriber. Distribution agent started delivering data after this. Distribution cleanup started working. I checked the distribution cleanup job history and was able to confirm that.

-> I tried executing below query manually again,

EXEC dbo.sp_MSdistribution_cleanup @min_distretention = 0, @max_distretention = 72

-> I let the distribution cleanup job run for sometime and below are the MSrepl_transactions and MSrepl_commands stats below,

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s