paginated query to a big external_table

_Esteban Bett 35 Reputation points
2023-10-26T20:08:43.8+00:00

Use case: Azure data explorer external tables connected with Azure data lake storage.

I am consuming big external tables (DataLake storage >1TB) from java using JPA/Hibernate, to ADX cluster, following these patterns:

https://www.baeldung.com/spring-data-jpa-iterate-large-result-sets

The partition keys are grid_id_partition and year_partition.

The first page goes faster because it just add a "top" to the select.

-- Page 0
select top(?) * 
from TwcHistoricalDaily 
where grid_id_partition = ? and year_partition = ? and metric_date >= ? and metric_date <= ?


The problem comes when I try to fetch next pages, JPA make this query and it takes so much time to complete:

-- Page > 0
with query_ as (    
 select row_.*,row_number() over (order by current_timestamp) as rownumber_ 
 from (        
    select * from TwcHistoricalDaily         
    where grid_id_partition = ? and year_partition = ? and metric_date >= ? and metric_date <= ?             ) row_
) 
select * from query_ where rownumber_>=? and rownumber_<?

I am using SqlServer driver with page size of 50 .

Java query:

@Query(value = "select * " + FROM_QUERY, nativeQuery = true)
Slice<TwcHistoricalDailyMetric> findByGridIdPartitionAndYearPartitionAndDateBetween(
  String gridIdPartition, 
  String yearPartition, 
  Date startDate, 
  Date endDate, 
  Pageable pageable);

What we can do to improve query performance to >1 pages?

Thanks

Azure Data Explorer
Azure Data Explorer
An Azure data analytics service for real-time analysis on large volumes of data streaming from sources including applications, websites, and internet of things devices.
577 questions
{count} votes

1 answer

Sort by: Most helpful
  1. Barry Evanz 235 Reputation points
    2023-10-27T14:51:50.1266667+00:00

    Certainly, to improve the performance of paginated queries for large external tables in a Java application using JPA/Hibernate, consider using a windowed function instead of the row_number() function in the SQL query. The row_number() function requires scanning the entire table, leading to performance issues. Instead, use the lag() windowed function to calculate the previous metric date within partitioned data. This optimized query enhances data retrieval, especially for multiple pages, without the need for a full table scan. Here's the SQL code:

    with query_ as (
      select
        row_.*,
        lag(metric_date, 1, NULL) over (partition by grid_id_partition, year_partition order by metric_date) as previous_metric_date
      from TwcHistoricalDaily
      where grid_id_partition = ? and year_partition = ? and metric_date >= ? and metric_date <= ?
    )
    select *
    from query_
    where previous_metric_date is null or previous_metric_date < ?
    order by metric_date
    limit ?
    

    This code is a concise and effective solution to enhance the performance of paginated queries for large datasets, ensuring quicker and more resource-efficient data retrieval.


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.