paginated query to a big external_table

Question

_Esteban Bett 35

Use case: Azure Data Explorer external tables connected to Azure Data Lake Storage.

I am consuming large external tables (Data Lake storage, >1 TB) from Java using JPA/Hibernate against an ADX cluster, following the patterns described here:

https://www.baeldung.com/spring-data-jpa-iterate-large-result-sets

The partition keys are grid_id_partition and year_partition.

The first page is fast because it just adds a "top" to the select:

-- Page 0
select top(?) * 
from TwcHistoricalDaily 
where grid_id_partition = ? and year_partition = ? and metric_date >= ? and metric_date <= ?

The problem comes when I try to fetch the next pages. JPA generates this query, and it takes a very long time to complete:

-- Page > 0
with query_ as (
  select row_.*, row_number() over (order by current_timestamp) as rownumber_
  from (
    select * from TwcHistoricalDaily
    where grid_id_partition = ? and year_partition = ? and metric_date >= ? and metric_date <= ?
  ) row_
)
select * from query_ where rownumber_ >= ? and rownumber_ < ?

I am using the SQL Server driver with a page size of 50.

Java query:

@Query(value = "select * " + FROM_QUERY, nativeQuery = true)
Slice<TwcHistoricalDailyMetric> findByGridIdPartitionAndYearPartitionAndDateBetween(
  String gridIdPartition, 
  String yearPartition, 
  Date startDate, 
  Date endDate, 
  Pageable pageable);

What can we do to improve query performance for pages after the first?

Thanks

_Esteban Bett 35 Reputation points

2023-10-26T20:12:12.91+00:00

Could the "over (order by current_timestamp)" part be causing the overhead?

PRADEEPCHEEKATLA 90,651 Reputation points Moderator

2023-10-27T03:43:20.4933333+00:00

@_Esteban Bett - Thanks for the question and for using the MS Q&A platform.

Could you please tell us which Azure service you are using?

_Esteban Bett 35 Reputation points

2023-10-27T14:22:08.4033333+00:00

Azure Data Explorer external tables connected to Azure Data Lake Storage (I updated the question with more details).

PRADEEPCHEEKATLA 90,651 Reputation points Moderator

2023-10-30T11:13:58.14+00:00

@_Esteban Bett - Just checking in to see if the below answer provided by @Barry Evanz helped.

If this answers your query, please click "Accept Answer" and "Yes" for "Was this answer helpful". If you have any further queries, do let us know.

_Esteban Bett 35 Reputation points

2023-10-30T18:07:57.64+00:00

Could you provide an example query to get page 5 using a page size of 100?

Answer 1

Barry Evanz 235

Certainly. To improve the performance of paginated queries against large external tables in a Java application using JPA/Hibernate, consider using a different window function in place of row_number() in the SQL query. With row_number(), every row that matches the filter has to be read and numbered before a page can be selected, which leads to performance issues on later pages. Instead, use the lag() window function to calculate the previous metric date within the partitioned data. This query is intended to speed up data retrieval, especially for multiple pages, without the need for a full table scan. Here's the SQL code:

with query_ as (
  select
    row_.*,
    lag(metric_date, 1, NULL) over (partition by grid_id_partition, year_partition order by metric_date) as previous_metric_date
  from TwcHistoricalDaily row_  -- alias added so that row_.* resolves
  where grid_id_partition = ? and year_partition = ? and metric_date >= ? and metric_date <= ?
)
-- T-SQL has no LIMIT clause, so the page size is passed to TOP instead;
-- this parameter is now bound before the previous_metric_date parameter
select top (?) *
from query_
where previous_metric_date is null or previous_metric_date < ?
order by metric_date

This code is a concise and effective solution to enhance the performance of paginated queries for large datasets, ensuring quicker and more resource-efficient data retrieval.
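
For comparison, a common alternative for deep pages is keyset (seek) pagination: rather than numbering rows, each request asks for the first N rows whose metric_date comes after the last value already returned. This is a different technique from the lag() query above, not a restatement of it. The sketch below is a minimal Java illustration under stated assumptions: metric_date is unique within one grid_id_partition/year_partition pair, the class and method names (KeysetPaging, pageSql, nextPage) are hypothetical, and whether ADX pushes the filter and ordering down efficiently for external tables is not confirmed here.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class KeysetPaging {

    // Hypothetical helper: SQL text for one page. Parameters bind in this order:
    // page size, grid_id_partition, year_partition, lower bound, end date.
    // First page: the lower bound is the start date (inclusive).
    // Later pages: the lower bound is the last metric_date already returned (exclusive).
    public static String pageSql(boolean firstPage) {
        String lowerBound = firstPage ? "metric_date >= ?" : "metric_date > ?";
        return "select top(?) * from TwcHistoricalDaily"
             + " where grid_id_partition = ? and year_partition = ?"
             + " and " + lowerBound + " and metric_date <= ?"
             + " order by metric_date";
    }

    // In-memory stand-in for the query, to show the paging logic only.
    // Assumes sortedDates is ascending and contains no duplicates.
    public static List<LocalDate> nextPage(List<LocalDate> sortedDates, LocalDate lastSeen, int pageSize) {
        if (pageSize < 1) {
            throw new IllegalArgumentException("pageSize must be at least 1");
        }
        List<LocalDate> page = new ArrayList<>();
        for (LocalDate d : sortedDates) {
            if (lastSeen != null && !d.isAfter(lastSeen)) {
                continue; // already returned on an earlier page
            }
            page.add(d);
            if (page.size() == pageSize) {
                break;
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<LocalDate> days = new ArrayList<>();
        for (int i = 1; i <= 7; i++) {
            days.add(LocalDate.of(2023, 1, i));
        }
        LocalDate lastSeen = null; // null means "first page"
        List<LocalDate> page;
        while (!(page = nextPage(days, lastSeen, 3)).isEmpty()) {
            System.out.println(page);
            lastSeen = page.get(page.size() - 1); // cursor for the next request
        }
    }
}
```

The trade-off is that keyset paging only moves forward from a known cursor value, so it suits iterating over a whole result set rather than jumping straight to an arbitrary page number.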

_Esteban Bett 35 Reputation points

2023-10-30T17:51:16.7166667+00:00

Thanks for the response. What should I set for the parameter in "previous_metric_date < ?": the start date or the end date? It is not clear to me how to do pagination with this query. Could you provide an example, say page 5 with a page size of 50?

PRADEEPCHEEKATLA 90,651 Reputation points Moderator

2023-11-06T06:39:55.88+00:00

@_Esteban Bett - The previous_metric_date < ? condition in your query is used to filter the results based on the metric_date column. The ? is a placeholder for the value of the previous_metric_date parameter that you pass to the query. This parameter should be set to the metric_date value of the last record on the previous page.

To implement pagination with this query, you can use the row_number() function to assign a unique number to each row in the result set, and then filter the rows based on this number. Here's an example of how you can modify your query to fetch page 5 with a page size of 50:

with query_ as (
  select row_.*, row_number() over (order by metric_date) as rownumber_
  from (
    select * from TwcHistoricalDaily
    where grid_id_partition = ? and year_partition = ? and metric_date >= ? and metric_date <= ?
  ) row_
  where previous_metric_date < ?
)
select * from query_ where rownumber_ >= 201 and rownumber_ <= 250

In this example, the previous_metric_date parameter should be set to the metric_date value of the last record on page 4. The rownumber_ column is used to filter the rows based on their position in the result set. The where clause filters the rows to include only those with rownumber_ between 201 and 250, which corresponds to page 5 with a page size of 50.
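
The row-number range for any page follows from simple arithmetic. As a minimal sketch, assuming pages are counted from 1 (as in "page 5" above) and using the hypothetical names PageBounds, firstRow, and lastRow:

```java
public class PageBounds {

    // First row number (1-based, inclusive) on a page that is itself counted from 1.
    public static long firstRow(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be at least 1");
        }
        return (long) (page - 1) * pageSize + 1;
    }

    // Last row number (1-based, inclusive) on the same page.
    public static long lastRow(int page, int pageSize) {
        return firstRow(page, pageSize) + pageSize - 1;
    }

    public static void main(String[] args) {
        // Page 5 with a page size of 50, as in the query above
        System.out.println(firstRow(5, 50) + ".." + lastRow(5, 50));   // prints 201..250
        // Page 5 with a page size of 100
        System.out.println(firstRow(5, 100) + ".." + lastRow(5, 100)); // prints 401..500
    }
}
```

Note that Spring Data's PageRequest.of takes a zero-based page index, so "page 5" in this 1-based sense corresponds to PageRequest.of(4, 50).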
