Identity Column in Synapse

Nandan Hegde 30,556 Reputation points MVP
2020-08-21T06:13:14.877+00:00

Hello,
In Synapse, an identity column is not an auto-incrementing column like the identity columns in a normal SQL database.
It provides a unique integer value, but not necessarily an incremented one (if the previous value is 10, the next value is not guaranteed to be 11; it may be any seemingly random value).

We have a requirement wherein, based on batch job data, we need to populate a new column with incremental unique integer values.

So is there any way we can achieve this? Or is there any property we need to enable on the identity column to make it +1 incremental rather than a random unique value?

Azure Synapse Analytics

Accepted answer
  1. HarithaMaddi-MSFT 10,136 Reputation points
    2020-08-21T11:44:20.947+00:00

    Hi @Nandan Hegde ,

    Welcome to Microsoft Q&A Platform.

    Thanks for posting the query. As per my understanding, there is no property that enables +1 incremental values for identity columns in Synapse. I will re-confirm with the product team and update you if they suggest any automatic way. One possible workaround is to take the MAX(id) of the column and increment it before inserting the data.

    Hope this helps! Please let us know if our understanding of the scenario is incorrect and we will be glad to assist.

    -- Seed the next id from the current maximum (0 if the table is empty)
    DECLARE @MaxRowId int  
    SET @MaxRowId = (SELECT ISNULL(MAX(id), 0) + 1 FROM dbo.test)  
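
    As an illustration only (dbo.test and the staging table dbo.test_stage are placeholder names based on the snippet above), the seeded maximum can be combined with ROW_NUMBER() to assign consecutive ids to an entire batch in a single INSERT...SELECT. Note that this sketch is not safe if two batches load concurrently, as discussed in the other answers:

    -- Sketch: assign consecutive ids to a staged batch (all names are examples only)
    DECLARE @MaxRowId int
    SET @MaxRowId = (SELECT ISNULL(MAX(id), 0) FROM dbo.test)

    INSERT INTO dbo.test (id, payload)
    SELECT @MaxRowId + ROW_NUMBER() OVER (ORDER BY payload) AS id,
           payload
    FROM dbo.test_stage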
    

3 additional answers

  1. Ronen Ariely 15,096 Reputation points
    2020-08-26T13:32:40.707+00:00

    Hi,

    Sorry for the very late response. It is impossible (at least for me) to follow my threads in the Q&A system, since there is no option to filter only the threads I participate in, sorted by update time, as in other forums.

    a unique integer value created for an integer column that cannot be derived back.

    This sounds like a simple surrogate key, which IDENTITY in Azure Synapse covers perfectly (this is the exact use case for IDENTITY).
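
    For reference, declaring such a surrogate key in a dedicated SQL pool looks much like it does in SQL Server; the table below is a made-up example:

    -- Hypothetical dimension table with an IDENTITY surrogate key
    CREATE TABLE dbo.DimCustomer
    (
        CustomerKey  int IDENTITY(1,1) NOT NULL,  -- unique, but not guaranteed consecutive
        CustomerCode nvarchar(20)      NOT NULL
    )
    WITH (DISTRIBUTION = HASH(CustomerCode), CLUSTERED COLUMNSTORE INDEX)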

    We have 2 environments which we keep in sync. And in synapse as the identity column is random , the data across the 2 environments might differ.

    I am not sure how you sync the data, but why can you sync the other columns and not the IDENTITY column?!?

    What is the difference between Synapse and Azure SQL Database on this point?!? How do you plan to sync two databases that support ordered IDENTITY, like Azure SQL Database or an on-premises SQL Server database? Do you count on inserting new rows without the IDENTITY value and hope that the value will be the same?


  2. Ronen Ariely 15,096 Reputation points
    2020-08-27T17:05:58.987+00:00

    Hi,

    whereas in case of any Azure SQLDatabase, it would be incremental and since the source file is same across both the environments , the values would always be in sync

    Your assumption regarding what will happen in Azure SQL Database or SQL Server on-premises is totally wrong. IDENTITY does not guarantee that the values will be consecutive!!! This means that your expected result is wrong. In time, you will probably get different IDENTITY values for the same rows in the different databases.

    This is actually well documented in the IDENTITY documentation for SQL Server:

    https://learn.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql-identity-property?view=sql-server-ver15

    (screenshot of the note in the documentation stating that IDENTITY does not guarantee consecutive values)
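
    As a quick illustration of that note (a sketch for SQL Server or Azure SQL Database; the table name is invented), a rolled-back insert still consumes an identity value and leaves a gap:

    CREATE TABLE dbo.GapDemo (Id int IDENTITY(1,1), Val varchar(10));

    INSERT INTO dbo.GapDemo (Val) VALUES ('a');   -- gets Id = 1

    BEGIN TRANSACTION;
    INSERT INTO dbo.GapDemo (Val) VALUES ('b');   -- consumes Id = 2
    ROLLBACK TRANSACTION;                         -- the row is gone, but value 2 is not reused

    INSERT INTO dbo.GapDemo (Val) VALUES ('c');   -- gets Id = 3

    SELECT Id, Val FROM dbo.GapDemo;              -- returns Ids 1 and 3 only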

    It seems like your design of the system is wrong.

    If you need to sync data, then you should sync it from one place (the source data) to the other place (the target data). You cannot count on two separate actions which insert data to produce the same IDENTITY values, unless you explicitly INSERT these values.

    And if you explicitly INSERT these values, then it works exactly the same in SQL Server as it does in Azure Data Warehouse.
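
    If you do explicitly insert the key values as part of the sync, a sketch of that pattern (table names here are assumptions; dedicated SQL pools document SET IDENTITY_INSERT for exactly this case) might look like:

    -- Copy rows from the source environment while preserving its key values
    SET IDENTITY_INSERT dbo.DimCustomer ON;

    INSERT INTO dbo.DimCustomer (CustomerKey, CustomerCode)
    SELECT CustomerKey, CustomerCode
    FROM   stg.DimCustomer;

    SET IDENTITY_INSERT dbo.DimCustomer OFF;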


  3. Ronen Ariely 15,096 Reputation points
    2020-08-21T14:35:12.887+00:00

    Good day,

    Let me be a bit harsh and start with my personal golden rule: in most cases, if you are using Azure Data Warehouse and thinking about using incrementing integers (in the same way the IDENTITY property is used in SQL Server), then you probably do not understand the idea of big data and data warehousing. This is not necessarily true for you, but it is my golden rule and I find it usually fits. Such a request implies an improper architecture of your data warehouse system, or using the wrong tool for the task!

    It is a red light for me if one of my clients has this request while designing a system based on such a database (like MongoDB, Azure Data Warehouse, Cosmos DB, Azure Table Storage...).

    Why? Because these are big data / data warehouse solutions and are NOT equivalent to a database which is stored and managed on a single machine. These solutions are meant to be hosted and managed by multiple machines and used by millions of users at the same time. There is a very good reason why none of these products supports this feature in the same way SQL Server implements IDENTITY.

    You need this ID only for analysis, not for data storage. It is unwise to use incremental integer sequences in most cases when you have millions of users using a database which is sharded across multiple machines.

    it may provide any random value

    It is not fully random, but as you said, it is not guaranteed to be incremental.

    So is there any way we can achieve this?

    There is always a way to "achieve" a specific request, but it is usually not free, and in this case you might pay dramatically in performance.

    You should remember that your so-called database in a data warehouse is in fact a sharded database which might be divided across multiple machines. To manage something like SQL Server's IDENTITY means that you must synchronize all of those machines.

    One possible workaround is to take the MAX(id) of the column and increment it before inserting the data.

    (1) And what if two users try to INSERT a row at the same time? And what if a million users do it?!?

    In order to implement a solution you can count on, you will have to manage locks to prevent users from reading the current max value at the same time. If user A reads the current max value and user B reads the same value, and you only implement locks on writes, then these two users read the same max value before the first insert and both will try to insert the same "next value". This is something that simpler databases like SQL Server do implement, but it is much easier to implement when you have a single machine and everything is managed on that single machine.

    (2) Using the max value is not recommended in most cases. This is also called the "findAndModify" approach. You pay in this case for the "find", which is usually not recommended.

    Another approach, which is much better, is to use COUNT + 1. In order to count the rows, the server does not really need to read any data from the rows! This can be implemented if you do not delete rows from your database, which by the way is recommended anyway (I highly recommend NOT deleting rows in a data warehouse, usually!).

    You can also use a function together with a counters sequence (store the value outside the record set). This way you do not need to use the "findAndModify" approach. In fact, this is a commonly recommended solution for MongoDB as well, for example (if you must have such a feature!).
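
    A minimal sketch of that counter-table pattern in T-SQL (dbo.IdCounter is an invented example; verify which constructs your target service supports, since dedicated SQL pools do not expose every SQL Server feature):

    -- One-row counter table holding the last value handed out
    CREATE TABLE dbo.IdCounter (LastId bigint NOT NULL);
    INSERT INTO dbo.IdCounter (LastId) VALUES (0);

    -- Reserve the next value: increment first, then read it back
    DECLARE @NextId bigint;

    BEGIN TRANSACTION;
        UPDATE dbo.IdCounter SET LastId = LastId + 1;
        SET @NextId = (SELECT LastId FROM dbo.IdCounter);
    COMMIT TRANSACTION;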

    This feature might be supported in the future, but I doubt it will be, and even if it is, it will probably reduce performance.

    1 person found this answer helpful.