Hi azure_learner,
Welcome to Microsoft Q&A, thanks for posting your query.
When designing a high-performance data lake, partitioning is a crucial part of achieving fast queries. While partitioning by date helps, it might not cover all of your access patterns. Since you don't have a "region" field to use, try partitioning by other fields that your queries frequently filter on.
Here are some key practices for designing a high-performance data lake:
- Partition your data by date and another frequently filtered, low-cardinality field such as category or type; this speeds up queries by letting the engine skip irrelevant partitions (see the first sketch after this list).
- Use columnar storage formats such as Parquet or ORC to optimize storage and query execution.
- Apply compression codecs such as Snappy or Gzip to improve read and write performance and to reduce storage costs; Snappy favors speed, while Gzip favors a higher compression ratio.
- To balance performance and cost, optimize file sizes by avoiding very large and very small files; files in the range of a few hundred megabytes generally work well for analytics engines.
- Ensure data security through proper access control and encryption.
- Limit the number of small files and directories to speed up listing and metadata operations; you can merge small files into larger ones (see the second sketch after this list).
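To make the partitioning, format, and compression points concrete, here is a minimal sketch using PySpark. The ADLS Gen2 paths (with an `<account>` placeholder) and the "event_date" and "category" column names are hypothetical; substitute the fields your queries actually filter on:

```python
# Minimal sketch, assuming PySpark with hypothetical columns
# ("event_date", "category") and hypothetical ADLS Gen2 paths.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

# Read the raw data (path and source format are placeholders).
df = spark.read.json("abfss://raw@<account>.dfs.core.windows.net/events/")

# Write columnar Parquet, partitioned by date plus a low-cardinality
# field, with Snappy compression.
(df.write
   .mode("overwrite")
   .partitionBy("event_date", "category")
   .option("compression", "snappy")
   .parquet("abfss://curated@<account>.dfs.core.windows.net/events/"))

# Queries that filter on partition columns read only the matching
# directories (partition pruning).
recent = spark.read.parquet(
    "abfss://curated@<account>.dfs.core.windows.net/events/"
).where("event_date = '2024-06-01' AND category = 'clicks'")
```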
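And a sketch of the small-file compaction mentioned in the last bullet, under the same assumptions. Hash-repartitioning by the partition columns routes all rows for a given (event_date, category) pair to one task, so each partition directory is rewritten as one larger file instead of many small ones:

```python
# Minimal compaction sketch; paths and column names are the same
# hypothetical placeholders as above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

df = spark.read.parquet("abfss://curated@<account>.dfs.core.windows.net/events/")

# Repartition by the partition columns so each (event_date, category)
# combination is written by a single task, yielding fewer, larger files.
(df.repartition("event_date", "category")
   .write
   .mode("overwrite")
   .partitionBy("event_date", "category")
   .option("compression", "snappy")
   .parquet("abfss://curated@<account>.dfs.core.windows.net/events_compacted/"))
```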
Partitioning should align with how the data is queried. If the current fields are not suitable for partitioning, it may be worth evaluating whether a new or derived field, chosen based on your access patterns, would improve query performance.
Reference:
https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices
Please let us know if you have any further queries; I'm happy to assist.
Please do not forget to "Accept the answer" and "Up-vote" wherever the information provided helps you, as this can benefit other community members.