Safe way to remove uniqueidentifier which is a primary key (clustered index) on a large 300 GB table in SQL Azure DB

mo boy 396 Reputation points
2020-08-16T01:34:15.4+00:00

Dear All,

I have this table on an Azure SQL DB (S4 compute tier) which is 300 GB in size.
On checking, I saw that the primary key is on an uniqueidetifier ID column. The fragmentation is over 99% as expected. As we know, it is not recommended to have this uniqueidentifier column as a primary key, I am planning to remove it. So, I am thinking of having a new bigint column as an identity column. I am selecting bigint as this table has around 900 million rows.

Below are the steps, I am thinking of performing. I am planning to upgrade to P15 while performing these steps.
Could you please advise if this is a safe approach?

There are no other constraints in this table.

---Drop primary key to drop uniqueidentified column as primary key
ALTER TABLE LogAudit
DROP CONSTRAINT PK__LogAudit__3214EC0704BD7CFB
GO

--- Add new Identify column
ALTER TABLE dbo.LogAudit
ADD ID1 BIGINT IDENTITY(1,1)

--Make the Identity column as the primary key
ALTER TABLE dbo.LogAudit
ADD CONSTRAINT PK_LogAuditID1 PRIMARY KEY(ID1)

Azure SQL Database
SQL Server | Other
{count} votes

Accepted answer
  1. Dan Guzman 9,406 Reputation points
    2020-08-16T11:35:20.437+00:00

    Given this is an audit table rather than a relational table (i.e. no referential constraints), I suggest you don't create a primary key at all. This may seem like heresy to some but introducing a surrogate key that is never used in queries or referenced by foreign key constraints serves no purpose. Consider creating a non-unique clustered index on a column like AuditTimestamp that may be useful to SELECT queries and incremental as well. This will improve performance and buffer efficiency.

    Dropping and recreating a clustered index is expensive by nature because the table will be converted to a heap during the drop and back to a normal table when the clustered index is created. P15 will help reduce the duration but be aware the table will be unavailable during the operation due when you specify ONLINE=OFF. You could specify ONLINE=ON for both the DROP and CREATE if you want to allow concurrent access to the table during the changes. However, if you add a new IDENTITY column, that will require exclusive access to the table while the column is added to every row.


4 additional answers

Sort by: Most helpful
  1. Ronen Ariely 15,206 Reputation points
    2020-08-16T08:13:30.537+00:00

    Bug while posting the answer...

    The short respond is that your approach is very bad and I explain why in the file 🙂

    I attach the answer in text file

    You will not see the image but the rest is clear

    17810-forumanswer67646.txt

    1 person found this answer helpful.

  2. mo boy 396 Reputation points
    2020-08-16T10:38:07.123+00:00

    Thank you so much for your detailed response. Sorry, I should have been more clear.
    I totally agree with your feedback.

    The clustered index is defined on the primary key column (uniqueidentifier). Below is the scripted index.

    /****** Object: Index [PK_dbo.LogAudit] ******/
    ALTER TABLE [dbo].[LogAudit] ADD CONSTRAINT [PK_dbo.LogAudit] PRIMARY KEY CLUSTERED
    (
    [Id] ASC
    )WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]
    GO

    The primary key is on this column. There are no other indexes on this table. There is only an additional constraint which has a default for value 0 for one of the columns on this table.

    Based on this, we are thinking of making below changes in order to remove the clustered index on this uniqueidentifier column, then add a new identity column and place clustered index on it. All of this will be done on the P15 compute tier for the duration of the change as this table is around 300 GB in size. Please share your thoughts based on this approach.

    ---Drop clustered index on the table--
    ALTER TABLE LogAudit
    DROP CONSTRAINT [PK_dbo.LogAudit]
    GO

    --- Add new Identify column---
    ALTER TABLE dbo.LogAudit
    ADD ID1 BIGINT IDENTITY(1,1)

    --Make the new Identity column as the clustered index--
    ALTER TABLE [dbo].[LogAudit] ADD CONSTRAINT [PK_dbo.LogAudit] PRIMARY KEY CLUSTERED
    (
    [ID1] ASC
    )WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]
    GO

    0 comments No comments

  3. Dan Guzman 9,406 Reputation points
    2020-08-16T12:34:56.323+00:00

    But out of curiosity, why not a unique clusteredindex on the timestamp?

    UNIQUE is appropriate if you can ensure the timestamp value is indeed unique. I don't know details of your audit data but consider scenarios where the same timestamp value could be generated, such as by concurrent sessions, a single session with audited actions in quick succession, or multiple audit rows for the same audited action/event.

    0 comments No comments

  4. Ronen Ariely 15,206 Reputation points
    2020-08-20T07:00:15.517+00:00

    My second 2 cents :-)

    ADD CONSTRAINT [PK_dbo.LogAudit]

    Off-topic: I highly recommend NOT to use dot in entities names. The dot character has a meaning/convention of a "child". FOr example X.Y usually mean that Y is a child of the entity X

    @DanGuzman-5735: Given this is an audit table rather than a relational table (i.e. no referential constraints), I suggest you don't create a primary key at all.

    Who need Keys?!?

    Candidate key is a unique identifier of the row/record. A surrogate key is a type of Candidate key which does not have any contextual or business meaning (like "id" for example or uniqueidetifier). A natural key is also a type of Candidate key that has contextual or business meaning. Both may or may not be used as the primary key.

    A "Primary key" is one of the "Candidate keys", which is used in the relational model for references to this row in the table. So this is basically true. If your table does not have relations with other tables then you do not need a Primary Key.

    There is still need for at least one Candidate key in order to recognize specific row in the table (or record on the disk), but SQL Server take care of this if there is no clustered Index.

    Clustered Indexes is something else! Tables without clustered indexes are called "heaps". This can improve INSERT dramatically since there is no need for any order for the records on the disk. The server can simply write the records wherever is the the fastest. Unfortunately, nothing is free and this approach slow any query that need to find the row (select, update, any use of filter or sorting). Heap table might fit for staging tables and Log tables for example. An audit table is usually a type of log table.

    Note: SQL Server has a mechanism to identify the physical address of a row. It is undocumented but we can also use this information using "%%physloc%%" like it was a column name (For example select *, %%physloc%%, sys.fn_physLocFormatter(%%physloc%%) from T) - remember that this value can be changed since it based on the physical location of the row.

    back to the original question and the approach you present

    You should take into consideration how you use the data before you make any change like removing a column. Dropping the column can impact your application. If you are using this column in any references then you might want to keep the columns data as it is. Removing the primary key and the Clustered Index can be done without removing the column.

    Moreover, removing a column (without any other action) will not remove the column under the scenes, and we will even be able to read the data from that column (using undocumented tools) after it was dropped. This is one of the topics which I will demonstrate and explain in my lecture at the data platform summit 2020.


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.