Troubleshooting CDC enabling failure – Part 1

My customer had recently experienced 2 interesting CDC issues and I would like to share the troubleshooting process and root cause for these issues today.

 

The issue:

During the weekend, my customer tried to promote some changes to production server and while they tried to enable CDC on production server, they first encountered a security error:

 

#1: Security error:

Msg 22832, Level 16, State 1, Procedure sp_cdc_enable_table_internal, Line 622

Could not update the metadata that indicates table [dbo].[Table_Name] is enabled for Change Data Capture. The failure occurred when executing the command '[sys].[sp_cdc_add_job] @job_type = N'capture''. The error returned was 916: 'The server principal "S-1-9-3-1293060401-1231192665-3834407059-1208013724." is not able to access the database "msdb" under the current security context.'. Use the action and error to determine the cause of the failure and resubmit the request

 

My customer soon found a workaround for this issue by manually creating 2 CDC jobs:

EXEC [sys].[sp_cdc_add_job] @job_type = N'capture';

EXEC [sys].[sp_cdc_add_job] @job_type = N'cleanup';

 

After manually creating these 2 jobs, they encountered 2nd issue:

#2: Nullable column:

Msg 22832, Level 16, State 1, Procedure sp_cdc_enable_table_internal, Line 622

Could not update the metadata that indicates table [dbo].[Table_Name] is enabled for Change Data Capture. The failure occurred when executing the command 'insert into [cdc].[change_tables]'. The error returned was 515: 'Cannot insert the value NULL into column 'has_drop_pending', table 'LLCProduction.cdc.change_tables'; column does not allow nulls. INSERT fails.'. Use the action and error to determine the cause of the failure and resubmit the request.

 

Again, this issue was workaround by updating database default setting ANSI_NULL_DEFAULT to ON:

ALTER DATABASE DB_Name SET ANSI_NULL_DEFAULT ON;

 

 

The unanswered questions:

Although this customer now has CDC enabled, the questions remain unanswered – why the issues happened? How can DBAs avoid running into these issues in the future?

 

The troubleshooting:

We have verified a few things on production server:

  1. DBA has sysadmin privilege
  2. Based on this article: /en-us/sql/relational-databases/track-changes/about-change-data-capture-sql-server , SQL Server Engine Service Account and Agent Service Account are not NETWORK SERVICE

 

And we tried to reproduce these issues in test environment with same build number (10.0.4064) and CDC enabled successfully.

 

It appears we are running out of leads. I decided to take a look at the CDC creation process, especially agent job creation process and here is what I found out:

Stored Procedure "sys.sp_cdc_enable_table" calls "sys.sp_cdc_enable_table_internal" who then calls a few other stored procedures. During these calls, there are 2 times security context switched.

Below is pseudocode for this process:

"sys.sp_cdc_enable_db"

  • Create user cdc and cdc system objects

"sys.sp_cdc_enable_table"

  • Execute as User = cdc

  • Exec sys.sp_cdc_enable_internal

    à Exec sys.sp_cdc_add_job

        à Exec sys.sp_cdc_add_job_internal (execute as dbo)

            à Create msdb.dbo.cdc_jobs table & create jobs

 

The error raised in "sys.sp_cdc_add_job" and the error message is "not able to access the database 'msdb'". Access to msdb is in "sys.sp_cdc_add_job_internal" and this stored procedure is executed as user "dbo" in user database. Now this issue comes down to why 'dbo' user in user database doesn't have access to msdb database? While reviewing the msdb in the production server, my DBA friend Frank Fan pointed out the "guest" user in msdb is disabled.

 

And on other instances that CDC enabled successfully, all of which have "guest" user enabled! To be sure about msdb "guest" user could cause CDC failure, I quickly ran below test in my SQL 2016SP1 instance and suceessfully reproduce the same issue:

#1: Diable "guest" user in msdb by revoking connect permission:

revoke
connect
to guest

go

#2: Create test database and try enable CDC:

create
database cdcdb

GO

USE cdcdb

GO

 

create
table t1(id int
primary
key, msg nvarchar(20))

GO

Set
ANSI_NULL_DFLT_OFF
OFF

GO

EXEC
sys.sp_cdc_enable_db

GO

exec
sys.sp_cdc_enable_table

@source_schema =
N'dbo'

, @source_name =
N't1'

, @role_name =
N'cdc_Admin';

GO

CDC failed to enable and gave same error message:

Msg 22832, Level 16, State 1, Procedure sp_cdc_enable_table_internal, Line 639 [Batch Start Line 27]

Could not update the metadata that indicates table [dbo].[t1] is enabled for Change Data Capture.

The failure occurred when executing the command '[sys].[sp_cdc_add_job] @job_type = N'capture''.

The error returned was 916: 'The server principal "S-1-9-3-330084520-1159009536-4224874419-612529348"

is not able to access the database "msdb" under the current security context.'. Use the action and

error to determine the cause of the failure and resubmit the request.

 

 

To fix this error, enabling "guest" user in msdb by grant connect permission to guest user, "sys.sp_cdc_enabled_table" finishes successfully.

Lesson learned: Leave "guest" user enabled in master, msdb and tempdb. See Microsoft recommendation: You should not disable the guest user in the msdb database in SQL Server

Hope this helps you in similar cases. We will discuss 2nd issue in the next blog.

Special thanks for my SQL DBA friend Frank Fan for pointing the guest user was disabled in msdb, which saved us significant time and effort during this investigation.