Data preparation cheat sheet

High-end BI performance depends on how effectively data is prepared and delivered from the Lakehouse. By adopting architectural patterns, applying a semantic structure, and using targeted optimizations, you can reduce query complexity, improve dashboard responsiveness, and lower compute costs.

The following table summarizes recommended practices, their expected impact, related documentation, and associated action items. This content is intended for data engineers, BI developers, and dashboard authors who design, optimize, and maintain analytics workloads in the Lakehouse.

Data prep

Best practice	Impact	Docs	Action items
Adopt a medallion architecture	Speeds up turning raw data into ready-to-use, reliable data products for easy consumption.	Medallion Architecture Medallion Architecture glossary Data warehousing concepts	Review and implement medallion layers
Use liquid clustering	Improves query performance with file and data skipping.	Liquid clustering	Apply to large tables with filter patterns
Use managed tables	Azure Databricks auto-governs and optimizes the storage layer and query performance.	Unity Catalog managed tables in Azure Databricks	Create managed tables for your data
Use predictive optimization or optimize tables manually	Enables better query performance by optimizing file sizes and layout, deleting old files, and updating statistics.	Predictive optimization Optimize file layout Analyze table compute storage metrics	Enable for production tables or schedule regular optimization and analyze tables after data changes
Model data in a star schema pattern	Makes the data easy to query and consume.	Star schema glossary Five simple steps for implementing a star schema	Design fact and dimension tables
Avoid wide data types and high-cardinality columns	Optimizes data model size and memory consumption, and improves query efficiency.	Optimize data workloads guide	Review data types and cardinality
Declare Primary and Foreign Keys (with RELY)	Optimizes queries by eliminating unnecessary joins and aggregations.	Query optimization with constraints	Define keys on fact and dimension tables
Use auto-generated columns	Reduces the need to calculate values at query time.	Generated columns	Identify frequently calculated fields
Use materialized views and persisted tables	Improves performance by pre-aggregating data for the most common and resource-intensive queries.	Tables Materialized views Use materialized views in Databricks SQL	Create aggregated views for common queries

Feedback

Was this page helpful?

Last updated on 2026-02-17

Share via

Data preparation cheat sheet

Data prep

Feedback

Additional resources