question

db042190-2983 asked

Is SQL Server smart enough to use indexed row versions efficiently in multi-join queries?

Hi, we will be on SQL Server 2019 in a few months. I don't know yet whether it will be Enterprise or Standard edition.

I'm a big fan of row versions (cast as bigint) over datetimes as the indexed column we use in incremental extracts to quickly identify only those records that changed since the last time we ran an extract. Please see the attached code for two sample tables and the control table we use to keep track of the last high key encountered on each table. You can see the first two records being inserted into the control table. Basically, the most current records in this table are "ended" on each extract and two new records with new high keys are inserted.

I'm looking at a relatively deep join right now and have to wonder how smart/efficient SQL Server will be in extracting only the latest changes. In the example I show two tables, but in reality it's n tables that chain to each other through various combinations of PKs and pointers, and we can assume the pointers are FKs too. For simplicity, assume each table has an indexed row version cast as bigint.

The predicate in the join might look something like:

where (research1.row_version_number > x and research1.row_version_number <= y) or
(research2.row_version_number > x1 and research2.row_version_number <= y1) or...

And the join to the 2nd table might look something like this:

select ...from research1 r1
join research2 r2
on r1.apointer=r2.pk...
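
For concreteness, the combined shape of such a query might look roughly like the sketch below (my own illustration, not the actual production query; the @x/@y pairs stand in for the per-table low/high water marks read from JobControl):

 declare @x  bigint = 0, @y  bigint = 100,    -- hypothetical water marks for research1
         @x1 bigint = 0, @y1 bigint = 100;    -- hypothetical water marks for research2

 select r1.pk, r1.aFact as fact1, r2.aFact as fact2
 from dbo.research1 r1
 join dbo.research2 r2
   on r1.aPointer = r2.pk
 where (r1.row_version_number > @x  and r1.row_version_number <= @y)
    or (r2.row_version_number > @x1 and r2.row_version_number <= @y1);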

So I'm wondering if SQL Server will throw its hands in the air and apply the worst strategy in joining the n tables, or apply some clever algorithm that even understands that the nth table changed but the 1st didn't, yet the 1st table's "related" record needs to be extracted anyway. If the answer is yes, is there a term for what it will do in its plan?

 USE [research]
 GO
    
 SET ANSI_NULLS ON
 GO
    
 SET QUOTED_IDENTIFIER ON
 GO
    
 CREATE TABLE [dbo].[research1](
     [Id] [int] IDENTITY(1,1) NOT NULL,
     pk uniqueidentifier,
     aPointer uniqueidentifier,
     [aFact] [int] NULL,
     row_version_stamp rowversion,
     row_version_number as cast(ROW_VERSION_STAMP as bigint),
        
  CONSTRAINT [PK_research1] PRIMARY KEY CLUSTERED 
 (
     [Id] ASC
 )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
 ) ON [PRIMARY]
 GO
    
 GO
    
 SET ANSI_NULLS ON
 GO
    
 SET QUOTED_IDENTIFIER ON
 GO
    
 CREATE TABLE [dbo].[research2](
     [Id] [int] IDENTITY(1,1) NOT NULL,
     pk uniqueidentifier,
     aPointer uniqueidentifier,
     [aFact] [int] NULL,
     row_version_stamp rowversion,
     row_version_number as cast(ROW_VERSION_STAMP as bigint),
        
  CONSTRAINT [PK_research2] PRIMARY KEY CLUSTERED 
 (
     [Id] ASC
 )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
 ) ON [PRIMARY]
 GO
    
    
 CREATE unique NONCLUSTERED INDEX [IX_research1_rowversionnumber] ON [dbo].[research1]
 (
         ROW_VERSION_NUMBER
 )
 WITH (PAD_INDEX = ON, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [primary]
 GO
    
 CREATE unique NONCLUSTERED INDEX [IX_research2_rowversionnumber] ON [dbo].[research2]
 (
         ROW_VERSION_NUMBER
 )
 WITH (PAD_INDEX = ON, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [primary]
 GO
    
    
 USE [research]
 GO
    
 SET ANSI_NULLS ON
 GO
    
 SET QUOTED_IDENTIFIER ON
 GO
    
 CREATE TABLE [dbo].[JobControl](
     [JobControlId] [int] IDENTITY(1,1) NOT NULL,
     [PackageName] [varchar](100) NULL,
     [ServerName] [varchar](16) NULL,
     [InstanceName] [varchar](16) NULL,
     [DatabaseName] [varchar](128) NULL,
     [SchemaName] [varchar](128) NULL,
     [TableName] [varchar](128) NULL,
     [HighKey] [varchar](100) NULL,
     [StartDate] [datetime2](7) NULL,
     [EndDate] [datetime2](7) NULL,
 PRIMARY KEY CLUSTERED 
 (
     [JobControlId] ASC
 )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
 ) ON [PRIMARY]
 GO
    
 declare @dt datetime2(7) = getdate()
    
 insert jobcontrol
 (
     [PackageName] ,
     [ServerName] ,
     [InstanceName] ,
     [DatabaseName] ,
     [SchemaName] ,
     [TableName] ,
     [HighKey] ,
     [StartDate] ,
     [EndDate]
 )
    
 select 'pkg1','server1',null,'research','dbo','research1','0',@dt,null
 union
 select 'pkg1','server1',null,'research','dbo','research2','0',@dt,null
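
For context, a minimal sketch of the per-extract bookkeeping described above (ending the currently open JobControl rows and inserting two replacements with new high keys) might look like this; the @newHighKey values are placeholders, since in practice they come from the extract query itself:

 declare @dt datetime2(7) = sysdatetime();
 declare @newHighKey1 varchar(100) = '12345';   -- hypothetical: max row_version_number seen for research1
 declare @newHighKey2 varchar(100) = '67890';   -- hypothetical: max row_version_number seen for research2

 -- "end" the currently open control rows
 update dbo.JobControl
 set EndDate = @dt
 where PackageName = 'pkg1' and EndDate is null;

 -- insert the two replacement rows carrying the new high keys
 insert dbo.JobControl (PackageName, ServerName, DatabaseName, SchemaName, TableName, HighKey, StartDate, EndDate)
 values ('pkg1', 'server1', 'research', 'dbo', 'research1', @newHighKey1, @dt, null),
        ('pkg1', 'server1', 'research', 'dbo', 'research2', @newHighKey2, @dt, null);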

TomPhillips-1744 answered

There are many variables in query plan generation that determine whether it will use the index or not. The bottom line is that you will need to test it yourself and see. It may use the index on one table and not on another table, for many reasons.

I would highly recommend adding PERSISTED to your calculated column.
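
For illustration, one way to apply that to research1 as defined above (a sketch, not Tom's code) would be to drop the dependent index and column and re-create them, since an existing computed column cannot simply be altered in place:

 -- drop the index that depends on the computed column, then the column itself
 DROP INDEX [IX_research1_rowversionnumber] ON [dbo].[research1];
 ALTER TABLE [dbo].[research1] DROP COLUMN row_version_number;

 -- re-add the column as PERSISTED and rebuild the index
 ALTER TABLE [dbo].[research1]
     ADD row_version_number AS CAST(row_version_stamp AS bigint) PERSISTED;

 CREATE UNIQUE NONCLUSTERED INDEX [IX_research1_rowversionnumber]
     ON [dbo].[research1] (row_version_number);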

However, I would not do it this way. I would use datetime and transfer all the records from the last run's START date. Yes, it may transfer a few records again, but it should be a small number. Your method requires you to store the last timestamp sent. This could be problematic and is not really any different from using datetime.


ErlandSommarskog answered

I would not have too high hopes, but it is impossible to say when seeing only fragments of a query. My experience is that conditions with OR rarely work very well, and a rewrite to a UNION query pays off, even if it is more verbose.
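
Applied to the two-table fragments in the question, such a rewrite might look roughly like this (a sketch, not Erland's code; @x/@y and @x1/@y1 stand for the stored water marks):

 select r1.pk, r1.aFact as fact1, r2.aFact as fact2
 from dbo.research1 r1
 join dbo.research2 r2 on r1.aPointer = r2.pk
 where r1.row_version_number > @x and r1.row_version_number <= @y

 union   -- UNION (not UNION ALL) removes rows picked up by both branches

 select r1.pk, r1.aFact as fact1, r2.aFact as fact2
 from dbo.research1 r1
 join dbo.research2 r2 on r1.aPointer = r2.pk
 where r2.row_version_number > @x1 and r2.row_version_number <= @y1;

Each branch then has a single sargable range predicate that the optimizer can satisfy with a seek on the corresponding rowversion index.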

While rowversion is convenient for change detection, there is a risk of missing updates if there are concurrent operations, particularly if you're using some sort of snapshot isolation.


db042190-2983 answered TomPhillips-1744 edited

Thanks, gentlemen.

I should have mentioned that getting in and out of the OLTP system as fast as possible is usually my goal in incremental ETL (near-real-time) processes.

Tom, we are persisting the computed column (it must be the default because I don't think about that anymore), and in my experience timestamps are less porous than datetimes in both directions (i.e., too many or too few records returned). And as you stated, there is no difference otherwise. In reality, timestamps always return exactly what you need, and it's impossible for the source system owners to change them for special one-time fixes, etc.

Erland, I can't imagine that SQL Server will not at least stamp two concurrent updates with different timestamps. And the beauty of this approach is the predicate "or ... > last high ETL key and <= currently selected max key", which should pick everything up. We have a journal DB, so even if SQL Server doesn't give me a consistent "state" with both the max keys and the record contents between the mins and maxes, missed timestamps (update or physical delete) will be sitting in the journal DB with their original row version.

Like both of you said, something told me SQL Server wasn't that smart. I'm not bashing SQL Server, as that would be a tall order, and I have some pretty good evidence now (after running a few queries) that it isn't that smart. Let's face it: in order to be that smart, it would have to build temporary critical-key "filters" both top down and bottom up, and the bottom-up scenarios would be necessitated by the OR conditions.

I'm going to try the following strategy next while continuing to leave the timestamp indexes in place. I'm not logged into work, but I believe the depth of the joins is no more than 15 tables, and only 6 or 7 of those tables really need a timestamp, because the others are so static and small that they can be joined separately (i.e., abstracted from the core challenge here) once the critical data is extracted.

If we call the very first table in the query "top dog", or research1, the idea would be to collect all of top dog's PK candidates using the min/max logic, in whatever direction is necessary, from the 6 or 7 tables, and then drive the query from top dog down using a WHERE EXISTS clause over that population. My expectation is that there will be no more than 80,000 candidates overall on any given run, but that we could come close to that number from any one of these 6 or 7 PK-generating result sets. My expectation also is that each PK-generating query should run (or add) no longer than 4 seconds. And my gut says that this strategy might open my options on which table is now top dog; maybe the table right in the middle (waistband?) of the original query should be top dog now, who knows?
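
For what it's worth, a rough sketch of that "collect candidate PKs, then WHERE EXISTS" idea under the two-table schema from the question might be (illustrative only; the temp table, variable names, and driving column are my assumptions, not the actual code):

 declare @x bigint = 0, @y bigint = 100, @x1 bigint = 0, @y1 bigint = 100;   -- hypothetical water marks

 -- gather candidate top-dog PKs from each table's rowversion range
 create table #candidates (pk uniqueidentifier primary key);

 insert #candidates (pk)
 select r1.pk
 from dbo.research1 r1
 where r1.row_version_number > @x and r1.row_version_number <= @y
 union
 select r1.pk
 from dbo.research1 r1
 join dbo.research2 r2 on r1.aPointer = r2.pk
 where r2.row_version_number > @x1 and r2.row_version_number <= @y1;

 -- drive the full extract from the top-dog table over that population
 select r1.*, r2.aFact
 from dbo.research1 r1
 join dbo.research2 r2 on r1.aPointer = r2.pk
 where exists (select 1 from #candidates c where c.pk = r1.pk);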


Erland, I can't imagine that SQL Server will not at least stamp two concurrent updates with different timestamps.

I suspect there is either one "not" too many or one too few here. Each rowversion value is unique, and they are assigned from a database-wide, monotonically increasing value.

While rowversion is better than datetime2(3) columns, where you can get collisions and you are at the mercy of the programmer or the person writing update statements to actually update the value, rowversion is still not foolproof, as it is not designed for this purpose.

What you need to consider is that updates can occur while you are reading. If you are using a snapshot, you also need to consider that a row with a higher rowversion value may be committed while a row with an earlier rowversion is still uncommitted. If you use the MAX from the previous read as your high-water mark, you will never get that other row, even if it commits. If you instead use MIN_ACTIVE_ROWVERSION() as your next high-water mark, your chances are better.
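
As an illustration of that suggestion (a sketch, not Erland's code; @lastHighKey stands for the value carried over from the previous run):

 declare @lastHighKey bigint = 12345;                                         -- hypothetical: HighKey stored by the previous extract
 declare @newHighKey  bigint = cast(MIN_ACTIVE_ROWVERSION() as bigint) - 1;   -- no open transaction holds a rowversion at or below this

 -- extract only rows whose rowversion falls inside the closed window
 select r1.*
 from dbo.research1 r1
 where r1.row_version_number > @lastHighKey
   and r1.row_version_number <= @newHighKey;

 -- @newHighKey then becomes the HighKey stored for the next run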

Or look at Change Tracking, which is designed for this purpose and is more robust.


As Erland said, you need to store the max row version at the START of your transfer. Otherwise, you run the risk of missing transactions in progress during the transfer (from experience). You may always run the risk of transferring some of the same records again; you need to allow for that.

I have done this many times, and both ways work. But I tend to use datetime2(7) instead so I can actually see the datetime it was changed. Yes, it does depend on the developer consistently updating the field and is not automatic. Also, if you use datetime2(7), you will want to get the start date FROM the server you are pulling data from, to avoid time skew (again from experience).

If you run SELECT * FROM table WHERE rowversion (or datetime2) > xxxxx, the query optimizer may or may not use the rowversion index, because you are returning every field. It may determine it is better to scan the clustered index on smaller tables and use the rowversion index on very large tables. As I said, you will need to test it.
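
One way to see (and, for testing, override) that choice on the sample schema is to compare the plan with and without a FORCESEEK hint naming the rowversion index; this is purely a diagnostic sketch, not a recommendation to leave the hint in production:

 -- let the optimizer choose: it may scan the clustered index if the range is not selective
 select * from dbo.research1
 where row_version_number > 12345;

 -- force a seek on the rowversion index (plus key lookups), for comparison
 select * from dbo.research1 with (forceseek (IX_research1_rowversionnumber (row_version_number)))
 where row_version_number > 12345;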

EchoLiu-msft answered EchoLiu-msft edited

Hi @db042190-2983,

No.

SQL Server's query optimizer generates a query execution plan when the SELECT query is executed. The execution plan is generated on the basis of statistical information; if the statistics are not updated in time, an inefficient execution plan may be generated. If the optimizer cannot choose the optimal plan, you need to check the query plan, statistics, supporting indexes, etc., and through the use of hints you can influence the optimizer's plan choices so that it generates a better execution plan.
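
As a generic illustration of those two levers (not specific advice from this answer), refreshing the statistics behind the rowversion index and applying a query-level hint might look like:

 -- refresh the statistics that feed cardinality estimates for the range predicate
 UPDATE STATISTICS dbo.research1 IX_research1_rowversionnumber WITH FULLSCAN;

 -- example of a query-level hint that changes how the optimizer builds the plan
 select r1.pk, r2.aFact
 from dbo.research1 r1
 join dbo.research2 r2 on r1.aPointer = r2.pk
 where r1.row_version_number > 12345
 option (recompile);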


Regards
Echo




db042190-2983 answered ErlandSommarskog commented

Thanks, all.

Erland, I'm not totally understanding what you say. I am not using a snapshot, and each time I go to extract (every 15 minutes) I'm asking for the max row version (new high-water mark) in the very same read/query (this is what I meant by "state") as the rest of the extracted data.

Tom, I'm not sure what you mean by "transfer", but I'm guessing you mean "extract", i.e., ETL. I am getting the new high value at the start of my extract, and I think you guys are saying that even that doesn't guarantee that a higher value isn't yet committed, or worse, that a lower one isn't yet committed and will therefore be ignored by the > x and <= y predicate. If a higher value isn't yet committed, it's no biggie, because 15 minutes from now it could be the new high-water mark or at minimum a candidate for extraction in the next extract. It seems to me that the same pitfalls being discussed here apply to datetimes. The same would seem to be true for CDC, because I doubt that CDC cares about uncommitted data, and in a way it's kind of too late if we get the uncommitted data 15 minutes from now. We don't really like CDC anyway.

I'll read about MIN_ACTIVE_ROWVERSION() to see if I can get on the same page as you guys. My goal is not to miss "in flux" updates whose row version is less than what SQL Server says is currently the max row version. I wonder if there is a command that can be issued that waits for those to get committed. In a way, I'm creating a snapshot of my own; the snapshot is more or less operational and started over each day. It would be best if it doesn't miss in-flux row versions that are less than the current value of the max rowversion.


Erland, I'm not totally understanding what you say. I am not using a snapshot.

So what does this query return?

SELECT is_read_committed_snapshot_on FROM sys.databases  WHERE name = 'yourdb'


My goal is not to miss "in flux" updates whose row version is less than what SQL Server says is currently the max row version. I wonder if there is a command that can be issued that waits for those to get committed.

No, there is no such command, and SQL Server does not track uncommitted rowversion values.

You need to get this into your head: the rowversion feature is not designed for what you want to do. The main purpose of rowversion is to make it easy to detect concurrent updates when you are using optimistic concurrency. Yet people have used rowversion for change detection, with mixed success. In an environment with many concurrent updates, it is very difficult to get right without missing updates. It is also difficult to understand why you miss these updates, as there are many subtle nuances.

Rowversion can be OK if the update frequency is modest and you can accept missing an occasional update, but it does not sound like that from your posts. Thus, my strong recommendation is that you abandon this idea and look at Change Tracking instead, as that feature is designed for what you are aiming at. (And don't confuse Change Tracking with Change Data Capture, which is yet another feature.)
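
For reference, enabling Change Tracking on the sample schema and reading the changes since the last synchronization version might look something like the sketch below (retention settings, snapshot isolation considerations, and error handling omitted; where @last_sync_version is stored is my assumption):

 ALTER DATABASE [research]
     SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

 ALTER TABLE [dbo].[research1] ENABLE CHANGE_TRACKING;

 -- on each extract: read everything changed since the version stored by the previous run
 DECLARE @last_sync_version bigint = 0;   -- hypothetical: persisted in JobControl between runs
 DECLARE @current_version   bigint = CHANGE_TRACKING_CURRENT_VERSION();

 SELECT ct.Id, ct.SYS_CHANGE_OPERATION, r1.*
 FROM CHANGETABLE(CHANGES dbo.research1, @last_sync_version) AS ct
 LEFT JOIN dbo.research1 AS r1 ON r1.Id = ct.Id;

 -- @current_version becomes @last_sync_version for the next run

SYS_CHANGE_OPERATION reports 'I', 'U', or 'D', so inserts and deletes are covered as well as updates.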



db042190-2983 answered TomPhillips-1744 commented

MIN_ACTIVE_ROWVERSION() seems like it would be useful in the same query, to tell me whether a row version lower than the current max is in flux, i.e., not committed. If not, it would be greater than the current max.
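
For illustration, the two values can be compared side by side (a sketch; @@DBTS returns the last rowversion value used in the current database):

 select cast(@@DBTS as bigint)                  as last_used_rowversion,
        cast(MIN_ACTIVE_ROWVERSION() as bigint) as min_active_rowversion;
 -- if min_active_rowversion <= last_used_rowversion, an open transaction still holds
 -- a rowversion at or below the current max; otherwise nothing is pending.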


I suggest you read this:
https://www.mssqltips.com/sqlservertip/3295/using-rowversion-with-sql-server-integration-services-to-reduce-load-times/

and
https://blog.devgenius.io/sql-server-overview-of-different-methods-for-change-tracking-69b168dc489a

db042190-2983 answered ErlandSommarskog commented

Thanks Tom, I'll read your link next.

Erland, I ran that query asking if read committed snapshot is on and got a zero. I'll check later whether that is the RCSI feature (14 bytes per record, virtually no deadlocks, etc.) or the other snapshot feature, which I think is more of a query-by-query thing without the 14-byte overhead per record.

I'll read up on Change Tracking first. I thought you were talking about CDC.


Erland, I ran that query asking if read committed snapshot is on and got a zero. I'll check later whether that is the RCSI feature (14 bytes per record, virtually no deadlocks, etc.) or the other snapshot feature, which I think is more of a query-by-query thing without the 14-byte overhead per record.

RCSI = Read Committed Snapshot Isolation.

Both RCSI and true snapshot isolation come with the 14-byte overhead per record.

db042190-2983 answered ErlandSommarskog commented

Erland, Change Tracking is interesting. The most obvious question to me has to do with inserts and deletes: does it track those as well?


db042190-2983 answered db042190-2983 edited

Thanks Tom, I read Biddle's tip and it's almost identical to what I do. I asked him to read the posts here and comment. Erland's points make a lot of sense, especially for the somewhat volatile data capture I'm looking at right now.

Erland, I see now that Change Tracking does track inserts and deletes as well. It's an interesting option. I like that the "uncommitted" item has to show up eventually, and that you only need to specify the tables you want included in the tracking. I wonder if our OLTP guys would freak out at my suggesting we try a synchronous tracking feature on an app that probably can't afford to perform poorly. I like that it runs on Standard edition. I'm guessing that when my process wakes up every 15 minutes and needs to read 80k "updates" per table (there are actually only 4 tables), those reads will be sub-second. I'm guessing, without getting into the details, that what I'll have are PKs from that store. I'm guessing that I (or SQL Server) need to flush that store when I'm done, and SQL Server will ensure that nothing got dropped between my read and the restart of tracking. My biggest challenge would be knowing which version of a specific PK is the one I want. When timestamps are recorded, they presumably are from when the record was built, not committed; that could be a problem no less worrisome than some of the issues brought up about row versions.


db042190-2983 answered

At my request, Robert Biddle has commented on this chain at Tom's link https://www.mssqltips.com/sqlservertip/3295/using-rowversion-with-sql-server-integration-services-to-reduce-load-times/ . I have a pending question to him there asking that he expound on "decoupled architectures".
