Share via

Databricks SQL warehouse with inconsistent order of data

Sandeep 21 Reputation points
2023-11-13T18:22:02.4366667+00:00

We have a use case where clients are reading the data from Databricks SQL warehouse.

Client fire a query to read 1000 records per request from a table of 60k records.

Below is the sample query they use

--> select id,name,city from employee where name is not null LIMIT 1000 OFFSET 1000

When client calls 60times to read the complete data set they are seeing duplicates as though the actual data set is not having any duplicates.

What i read is when we use offset clause order-by is mandatory but we cannot ask the consumers to do that. Is there is way where we can maintain the ordering with in warehouse than asking client to update the query ?

Azure Databricks
Azure Databricks

An Apache Spark-based analytics platform optimized for Azure.


1 answer

Sort by: Most helpful
  1. Boris Von Dahle 3,226 Reputation points
    2023-11-13T20:12:51.3633333+00:00

    Hello,

    In order to maintain the ordering, you could create a view or a materialized view on the employee table that includes an ORDER BY clause :

    CREATE VIEW ordered_employee AS
    SELECT id, name, city
    FROM employee
    WHERE name IS NOT NULL
    ORDER BY id;
    

    Otherwise if you have control over the API or middleware that clients use to access the Databricks SQL warehouse, you could implement a layer that automatically adds an ORDER BY clause to incoming queries.

    Also you could create a stored procedure that clients can call instead of querying.

    Hope this helps

    If this answer helped you, please accept it, so others can benefit from it too.

    Regards

    Was this answer helpful?


Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.