Clustered index with datetime/Id

Question

Naomi Nosonovsky 8,431

Hi everybody,

Since I'm here today, I thought I'd ask another question. I have a huge audit table to which all processes only ever add rows (never delete or update), with a default datetime column and a bigint identity column. We're creating a clustered index on the datetime and id columns, in that order.

My question is: given that this is a heavily used table (lots and lots of inserts), should that clustered index have both the datetime and Id columns in ASC order? We query this table ad hoc very often, and we're obviously interested in the most recent dates (the last day at most). When I tested queries I found them quite slow with the index in ASC order, but INSERT performance must take priority, since we're adding 100K or more rows per day.

Thanks in advance.

Accepted answer

Answer 1

Vladimir Moldovanenko 276

@Anonymous

I would do this

BEGIN TRANSACTION  -- wrapped in a transaction so the sample leaves nothing behind

CREATE TABLE dbo.Tbl
(
    ID bigint NOT NULL IDENTITY(1, 1)
    ,DT datetime2(3) NOT NULL
    -- Narrow clustered PK on ID only
    ,CONSTRAINT PK_ID PRIMARY KEY CLUSTERED (ID)
    WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100, OPTIMIZE_FOR_SEQUENTIAL_KEY = ON) ON [PRIMARY]
)

-- Separate nonclustered index to support searches by date
CREATE NONCLUSTERED INDEX IX_Tbl_DT ON dbo.Tbl (DT ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100)

ROLLBACK  -- undo the sample objects

OPTIMIZE_FOR_SEQUENTIAL_KEY requires SQL Server 2019; see

https://techcommunity.microsoft.com/t5/sql-server-blog/behind-the-scenes-on-optimize-for-sequential-key/ba-p/806888

I prefer a sequence object to identity.
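For readers who have not used one, here is a minimal sketch of how a sequence object could replace IDENTITY in a table like the sample. The sequence, table, and constraint names are hypothetical, and the CACHE size is only an illustrative value, not a recommendation from this thread.

```sql
-- Hypothetical names throughout; a sketch, not the answerer's actual schema.
CREATE SEQUENCE dbo.Seq_TblSeq AS bigint
    START WITH 1
    INCREMENT BY 1
    CACHE 1000;   -- illustrative cache size

CREATE TABLE dbo.TblSeq
(
    -- The default pulls the next value from the sequence instead of IDENTITY
    ID bigint NOT NULL
        CONSTRAINT DF_TblSeq_ID DEFAULT (NEXT VALUE FOR dbo.Seq_TblSeq),
    DT datetime2(3) NOT NULL,
    CONSTRAINT PK_TblSeq PRIMARY KEY CLUSTERED (ID)
);

-- ID is filled in by the default
INSERT dbo.TblSeq (DT) VALUES (sysdatetime());
```

Unlike IDENTITY, a sequence is a separate object, so the same number generator can be shared by several tables or read before the insert.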

Thanks

Naomi Nosonovsky 8,431 Reputation points

2022-02-08T21:16:05.5+00:00

Just a quick comment - on production we still use SQL 2016 as of today. Our Stage server (where I made the changes) uses SQL 2019.

Why do you suggest two separate indexes, and would a nonclustered datetime index in ASC order help to find recent rows in some range (say, the last hour)?

Vladimir Moldovanenko 276 Reputation points

2022-02-08T21:21:59.303+00:00

Also see https://learn.microsoft.com/en-US/troubleshoot/sql/performance/resolve-pagelatch-ex-contention

Vladimir Moldovanenko 276 Reputation points

2022-02-08T21:39:52.22+00:00

Because this table, my sample, is trivial.
Normally I have a lot more columns, I do need to query my tables, and I need quite a few more indexes. If the PK is smaller, then the other indexes are smaller too, since the clustered key is part of every nonclustered index.
Your situation may be different. If INSERT is your priority, and that is the only thing you care about, then having the PK on DT, ID will actually work slightly faster.

Results for a 100,000-row INSERT (DIFF_MS):

PK = ID: 270
PK = DT, ID: 196

Vladimir Moldovanenko 276 Reputation points

2022-02-08T21:51:12.227+00:00

I ran a test with 1,000,000 rows (DIFF_MS):

PK = ID: 2817
PK = DT, ID: 2552

Therefore the difference is not that large, but the index is built as well.

Naomi Nosonovsky 8,431 Reputation points

2022-02-08T21:53:33.17+00:00

Inserts are the top priority. The selects are ad hoc, and only a few procedures actually select from that table. We use the ad-hoc selects when we need to research an issue, as every process adds some info to that table.

I guess the bottom line is that I'll use the dt/id index.

Vladimir Moldovanenko 276 Reputation points

2022-02-09T12:28:12.673+00:00

@Anonymous

Some other thoughts for you in favor of having a PK on ID only.

A PK should be minimal, and all other attributes/columns must depend on the key and only on the key. If ID is the PK, DT is an attribute that depends on the PK and only on the PK. Any other columns you have should have the same dependency.

Moreover, if you need to change the time resolution (don't rule this out) and go from datetime2(3) to datetime2(7), you can do so without having to change your key.
And please use datetime2 (vs. datetime).

The PK may need to be migrated to other related tables, if you have any, hence the requirement that it be minimal. For example, say you have LogHeader and LogDetails entities. A PK on ID alone is better suited for this.

Also, according to my test results for pure inserts, the difference is very minor. We are splitting hairs here, trying to justify one design over the other purely on performance.

100K records a day is not high volume, let's not kid ourselves, yet this is the only aspect you are concentrating on.

Therefore, any minor gains in pure INSERT speed are not worth the design compromises. Changes to table structures are hard to make later, so make sure you make the best technical decision from the start, one that looks ahead to future expansion and is based on solid data modeling.

Have a wonderful day
Thanks
Vladimir

Naomi Nosonovsky 8,431 Reputation points

2022-02-09T13:38:38.403+00:00

Hi Vladimir,

This particular table is standalone; there are no dependencies on it. It is used purely for auditing and has only a few columns (process, subprocess, note, type, datetime, id, sp_id, entry type). Every process we run writes entries to this table, so it grows and grows. 100K per day was my rough estimate; it may be more than that.

I proposed to archive current data and start fresh, that's why we're thinking of changing the PK index (which is currently on ID column). Do you think clustered PK on Id + separate index on date is a better option? E.g. we can have clustered PK on Id + index on date desc, ID desc.

I need to make a decision ASAP, as I already have a script and we will most likely have a deployment this Thursday (tomorrow).

Vladimir Moldovanenko 276 Reputation points

2022-02-09T15:05:08.697+00:00

IMHO, yes, it is better, for the above reasons.
I would implement ID as PK and index(es) on other columns

You may also consider building an index on a computed column defined as 'CAST(DT AS date)', so you can search for a given date range, if that is what you do, instead of an index on DT (datetime2).
It will be smaller and may provide all you need. I don't know your exact search requirements.
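As a rough illustration of the computed-column idea, the sketch below adds a date-only column and indexes it. The table, column, and index names are hypothetical, and whether a date-level index is selective enough depends on search requirements not given in this thread.

```sql
-- Hypothetical names; a sketch of the computed-column suggestion.
CREATE TABLE dbo.TblByDate
(
    ID bigint NOT NULL IDENTITY(1, 1),
    DT datetime2(3) NOT NULL,
    -- Date-only computed column; the expression is deterministic,
    -- so the column can be indexed
    DT_Date AS CAST(DT AS date),
    CONSTRAINT PK_TblByDate PRIMARY KEY CLUSTERED (ID)
);

CREATE NONCLUSTERED INDEX IX_TblByDate_DT_Date
    ON dbo.TblByDate (DT_Date);

-- Rows from yesterday and today, located through the date-only index
SELECT ID, DT
FROM dbo.TblByDate
WHERE DT_Date >= CAST(DATEADD(DAY, -1, sysdatetime()) AS date);
```

The key stores 3 bytes per row for date instead of 7 for datetime2(3), at the cost of whole-day granularity in the seek.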

Erland Sommarskog 121.8K Reputation points MVP Volunteer Moderator

2022-02-09T22:03:20.267+00:00
"Do you think clustered PK on Id + separate index on date is a better option? E.g. we can have clustered PK on Id + index on date desc, ID desc."

This is what I would go for.
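A minimal sketch of that layout follows: a clustered PK on the id plus a nonclustered index on date DESC, id DESC. The table, column, and index names are hypothetical stand-ins for the real audit table, and the query only shows the kind of "recent rows" lookup discussed in this thread.

```sql
-- Hypothetical names; not the poster's actual audit table.
CREATE TABLE dbo.AuditDemo
(
    ID   bigint       NOT NULL IDENTITY(1, 1),
    DT   datetime2(3) NOT NULL
        CONSTRAINT DF_AuditDemo_DT DEFAULT (sysdatetime()),
    Note varchar(200) NULL,
    CONSTRAINT PK_AuditDemo PRIMARY KEY CLUSTERED (ID)
);

-- Newest-first index. The clustered key (ID) is carried in every
-- nonclustered index anyway; listing it makes the sort order explicit.
CREATE NONCLUSTERED INDEX IX_AuditDemo_DT_ID
    ON dbo.AuditDemo (DT DESC, ID DESC);

-- Ad-hoc lookup of the last hour, newest rows first
SELECT ID, DT, Note
FROM dbo.AuditDemo
WHERE DT >= DATEADD(HOUR, -1, sysdatetime())
ORDER BY DT DESC, ID DESC;
```

Because Note is not part of the index, the query needs key lookups into the clustered index for the matching rows; for a narrow recent range that is usually cheap, but the optimizer makes the final choice.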

"I went ahead and changed datetime column to be datetime2(7) and sysdatetime for default."

datetime2 with a higher scale than 3 only makes sense if you are storing timestamps from an external source. But you are using sysdatetime(), which only gives you ms precision. Check this out:

SET NOCOUNT ON
CREATE TABLE #t (t datetime2(7) NOT NULL)
DECLARE @i int = 100000
WHILE @i > 0
BEGIN
   INSERT #t (t) VALUES(sysdatetime())
   SET @i -= 1
END
SELECT t, COUNT(*) FROM #t GROUP BY t ORDER BY t
go
DROP TABLE #t

Disclaimer: this is on Windows. It may be different on Linux.

Naomi Nosonovsky 8,431 Reputation points

2022-02-09T22:12:42.493+00:00

I cannot make up my mind yet. I did change the script to use datetime2(2), but in regards to indexes I just keep PRIMARY KEY CLUSTERED ([admin_audit_dttm], [admin_audit_id]). We did discuss it yesterday and I just don't want to open up a new discussion again. I don't know if we're migrating tomorrow or not.

Answer 2

Erland Sommarskog 121.8K MVP Volunteer Moderator

If you need to know the order the rows were inserted in, it is better to rely on the id column I think. This is because time can go backwards in modern computers. That is, a row B inserted after row A, may still have an earlier timestamp, because CPUs may not be in perfect sync.

Naomi Nosonovsky 8,431 Reputation points

2022-02-08T22:47:29.9+00:00

I saw rows with the same time, that's why I think time, Id may help. We're using plain datetime and GETDATE() for default, so we may expect collisions. I think for our purposes it's not crucial if the rows will be in wrong order of insertion, we don't select from this table except for ad-hoc queries.

Erland Sommarskog 121.8K Reputation points MVP Volunteer Moderator

2022-02-08T22:53:13.287+00:00

Yes, the id is a must, since collision can happen even without clocks going backwards.

A better choice of data type is datetime2(3) and the function sysdatetime(). Then you get 1 ms in resolution (at least in theory). datetime/getdate() only give you 3.33 ms.
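To see the two options side by side, here is a small sketch that uses a temp table with hypothetical column names. It stores both defaults in one row and orders by the timestamp, with the id as tie-breaker for rows that share a timestamp.

```sql
-- Hypothetical temp table for comparison only.
CREATE TABLE #demo
(
    id     bigint       NOT NULL IDENTITY(1, 1) PRIMARY KEY,
    dt_old datetime     NOT NULL DEFAULT (getdate()),      -- 3.33 ms steps
    dt_new datetime2(3) NOT NULL DEFAULT (sysdatetime())   -- 1 ms steps
);

-- Several rows inserted back to back may well share a timestamp
INSERT #demo DEFAULT VALUES;
INSERT #demo DEFAULT VALUES;
INSERT #demo DEFAULT VALUES;

-- id breaks ties between rows that have the same timestamp
SELECT id, dt_old, dt_new
FROM #demo
ORDER BY dt_new, id;

DROP TABLE #demo;
```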
