partition by condition

Shambhu Rai 1,411 Reputation points
2023-07-26T14:29:41.2833333+00:00

Hi Expert,

i am having 30 millions records of last 20 years on day wise and wants to fetch any 4 months records in fraction of seconds .. is there any way to get it

Azure Data Lake Storage
Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
1,559 questions
Azure Data Explorer
Azure Data Explorer
An Azure data analytics service for real-time analysis on large volumes of data streaming from sources including applications, websites, and internet of things devices.
576 questions
Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
5,373 questions
Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,514 questions
Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
11,623 questions
{count} votes

1 answer

Sort by: Most helpful
  1. QuantumCache 20,366 Reputation points Moderator
    2023-07-26T20:58:21.14+00:00

    Hello @Shambhu Rai Thanks for posting this query on this forum.

    There are many factors in this scenario which guides the implementation, which also requires iterations!

    Sharing few reading resources which might be helpful with initial query!

    Dated Oct 14 2020: Performance Tuning ADF Data Flow Sources and Sinks :By Mark Kromer

    Performance tuning steps

    Use partitioning: You can partition your data based on the date column so that each partition contains data for a specific date range. This will allow you to query only the partitions that contain the data you need, which will significantly reduce the amount of data that needs to be scanned. ADF supports partitioning in various data sources, including Azure Blob Storage, Azure Data Lake Storage, and Azure SQL Database.

    For source partitioning, the I/O of the SQL Server is the bottleneck. Adding too many partitions may saturate your source database. Generally four or five partitions is ideal when using this option.

    Source partitioning

    Also check: Isolation level Assuming the source is Azure SQL

    Use indexing: You can create indexes on the date column to speed up the query performance. Indexing allows the database engine to quickly locate the data that matches the query criteria, which can significantly reduce the query execution time. ADF supports indexing in various data sources, including Azure SQL Database and Azure Cosmos DB.

    Use caching: You can cache the frequently accessed data in memory or disk to reduce the query execution time. Caching allows the data to be retrieved from the cache instead of the data source, which can significantly reduce the query latency. ADF supports caching in various data sources, including Azure Redis Cache and Azure SQL Database.

    Use parallelism: You can split the query into multiple smaller queries and execute them in parallel to speed up the query performance. Parallelism allows the queries to be executed concurrently, which can significantly reduce the query execution time. ADF supports parallelism in various data sources, including Azure SQL Database and Azure Data Lake Storage.

    Mapping data flows performance and tuning guide

    -->Scale up the self-hosted IR: Increase the number of concurrent jobs that can run on a node

    -->Scale out the self-hosted IR: Add more nodes (machines)

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.