Safe way to remove uniqueidentifier which is a primary key (clustered index) on a large 300 GB table in SQL Azure DB

Question

Safe way to remove uniqueidentifier which is a primary key (clustered index) on a large 300 GB table in SQL Azure DB

mo boy 396

Dear All,

I have this table on an Azure SQL DB (S4 compute tier) which is 300 GB in size.
On checking, I saw that the primary key is on an uniqueidetifier ID column. The fragmentation is over 99% as expected. As we know, it is not recommended to have this uniqueidentifier column as a primary key, I am planning to remove it. So, I am thinking of having a new bigint column as an identity column. I am selecting bigint as this table has around 900 million rows.

Below are the steps, I am thinking of performing. I am planning to upgrade to P15 while performing these steps.
Could you please advise if this is a safe approach?

There are no other constraints in this table.

---Drop primary key to drop uniqueidentified column as primary key
ALTER TABLE LogAudit
DROP CONSTRAINT PK__LogAudit__3214EC0704BD7CFB
GO

--- Add new Identify column
ALTER TABLE dbo.LogAudit
ADD ID1 BIGINT IDENTITY(1,1)

--Make the Identity column as the primary key
ALTER TABLE dbo.LogAudit
ADD CONSTRAINT PK_LogAuditID1 PRIMARY KEY(ID1)

Mike Ubezzi 2,776 Reputation points

2020-08-25T01:47:24.773+00:00

Hi @mo boy - Did you have any additional questions or need more information than what is being provided here, to help you with your task?
mo boy 396 Reputation points

2020-08-26T08:55:52.767+00:00

No, thanks Mike. I am happy with the answers :-)

Accepted answer

4 additional answers

Your answer

Mike Ubezzi 2,776 Reputation points

2020-08-25T01:47:24.773+00:00

Hi @mo boy - Did you have any additional questions or need more information than what is being provided here, to help you with your task?
mo boy 396 Reputation points

2020-08-26T08:55:52.767+00:00

No, thanks Mike. I am happy with the answers :-)

Answer 1

Dan Guzman 9,406

Given this is an audit table rather than a relational table (i.e. no referential constraints), I suggest you don't create a primary key at all. This may seem like heresy to some but introducing a surrogate key that is never used in queries or referenced by foreign key constraints serves no purpose. Consider creating a non-unique clustered index on a column like AuditTimestamp that may be useful to SELECT queries and incremental as well. This will improve performance and buffer efficiency.

Dropping and recreating a clustered index is expensive by nature because the table will be converted to a heap during the drop and back to a normal table when the clustered index is created. P15 will help reduce the duration but be aware the table will be unavailable during the operation due when you specify ONLINE=OFF. You could specify ONLINE=ON for both the DROP and CREATE if you want to allow concurrent access to the table during the changes. However, if you add a new IDENTITY column, that will require exclusive access to the table while the column is added to every row.

mo boy 396 Reputation points

2020-08-16T12:10:40.493+00:00

Thank you for your reply. Very useful suggestions indeed.

I can see that this table has two columns below. I think I can consider using one of them as a non-unique clustered index.
I have few proposals to provide my team :-)
But out of curiosity, why not a unique clusteredindex on the timestamp?
Also, is there chances of fragmentation if we use timestamp as non unique clusteredindex as the data type is date time

EventDate
Eventtime
Ronen Ariely 15,206 Reputation points

2020-08-20T05:15:06.97+00:00

By the way Dan @DanGuzman-5735 , why do you not edit the automatic name that the system gave you, and remove the number of the name? You can use "DanGuzman" instead of "DanGuzman-5735"

If you did not know, then here is how to do this:

Click on your image on the right upper corner and select "Profile" ->in the profile page on the left menu click on "Setting"

Here you can change your "Username" and your "Display Name".

Note: I checked and your name is free to be used at this time: https://learn.microsoft.com/en-us/users/danguzman/
mo boy 396 Reputation points

2020-08-26T05:30:27.317+00:00

Thanks, I tried this approach and this appears to have slowed down IO considerably as usually around 2 million records get loaded in this table per day. Overall system performance has improved

Answer 2

Ronen Ariely 15,206

Bug while posting the answer...

The short respond is that your approach is very bad and I explain why in the file 🙂

I attach the answer in text file

You will not see the image but the rest is clear

17810-forumanswer67646.txt

mo boy 396 Reputation points

2020-08-16T10:46:09.107+00:00

Thank you, I have provided additional details
Ronen Ariely 15,206 Reputation points

2020-08-20T04:54:15.96+00:00

Great :-)

I'm back !

Sorry took me long time to be back. It is almost impossible to follow our threads in this system.

Let me read the update information and see how I contribute a bit more

Answer 3

Thank you so much for your detailed response. Sorry, I should have been more clear.
I totally agree with your feedback.

The clustered index is defined on the primary key column (uniqueidentifier). Below is the scripted index.

/****** Object: Index [PK_dbo.LogAudit] ******/
ALTER TABLE [dbo].[LogAudit] ADD CONSTRAINT [PK_dbo.LogAudit] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]
GO

The primary key is on this column. There are no other indexes on this table. There is only an additional constraint which has a default for value 0 for one of the columns on this table.

Based on this, we are thinking of making below changes in order to remove the clustered index on this uniqueidentifier column, then add a new identity column and place clustered index on it. All of this will be done on the P15 compute tier for the duration of the change as this table is around 300 GB in size. Please share your thoughts based on this approach.

---Drop clustered index on the table--
ALTER TABLE LogAudit
DROP CONSTRAINT [PK_dbo.LogAudit]
GO

--- Add new Identify column---
ALTER TABLE dbo.LogAudit
ADD ID1 BIGINT IDENTITY(1,1)

--Make the new Identity column as the clustered index--
ALTER TABLE [dbo].[LogAudit] ADD CONSTRAINT [PK_dbo.LogAudit] PRIMARY KEY CLUSTERED
(
[ID1] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]
GO

Answer 4

Dan Guzman 9,406

But out of curiosity, why not a unique clusteredindex on the timestamp?

UNIQUE is appropriate if you can ensure the timestamp value is indeed unique. I don't know details of your audit data but consider scenarios where the same timestamp value could be generated, such as by concurrent sessions, a single session with audited actions in quick succession, or multiple audit rows for the same audited action/event.

Answer 5

My second 2 cents :-)

ADD CONSTRAINT [PK_dbo.LogAudit]

Off-topic: I highly recommend NOT to use dot in entities names. The dot character has a meaning/convention of a "child". FOr example X.Y usually mean that Y is a child of the entity X

@DanGuzman-5735: Given this is an audit table rather than a relational table (i.e. no referential constraints), I suggest you don't create a primary key at all.

Who need Keys?!?

Candidate key is a unique identifier of the row/record. A surrogate key is a type of Candidate key which does not have any contextual or business meaning (like "id" for example or uniqueidetifier). A natural key is also a type of Candidate key that has contextual or business meaning. Both may or may not be used as the primary key.

A "Primary key" is one of the "Candidate keys", which is used in the relational model for references to this row in the table. So this is basically true. If your table does not have relations with other tables then you do not need a Primary Key.

There is still need for at least one Candidate key in order to recognize specific row in the table (or record on the disk), but SQL Server take care of this if there is no clustered Index.

Clustered Indexes is something else! Tables without clustered indexes are called "heaps". This can improve INSERT dramatically since there is no need for any order for the records on the disk. The server can simply write the records wherever is the the fastest. Unfortunately, nothing is free and this approach slow any query that need to find the row (select, update, any use of filter or sorting). Heap table might fit for staging tables and Log tables for example. An audit table is usually a type of log table.

Note: SQL Server has a mechanism to identify the physical address of a row. It is undocumented but we can also use this information using "%%physloc%%" like it was a column name (For example select *, %%physloc%%, sys.fn_physLocFormatter(%%physloc%%) from T) - remember that this value can be changed since it based on the physical location of the row.

back to the original question and the approach you present

You should take into consideration how you use the data before you make any change like removing a column. Dropping the column can impact your application. If you are using this column in any references then you might want to keep the columns data as it is. Removing the primary key and the Clustered Index can be done without removing the column.

Moreover, removing a column (without any other action) will not remove the column under the scenes, and we will even be able to read the data from that column (using undocumented tools) after it was dropped. This is one of the topics which I will demonstrate and explain in my lecture at the data platform summit 2020.

mo boy 396 Reputation points

2020-08-26T05:31:05.12+00:00

thank you again for your detailed explanation.

I learnt a lot :)

Share via

Safe way to remove uniqueidentifier which is a primary key (clustered index) on a large 300 GB table in SQL Azure DB

4 additional answers

Who need Keys?!?

back to the original question and the approach you present

Your answer