The Lakehouse and the Delta Lake table format are central to Microsoft Fabric, so ensuring that tables are optimized for analytics is a key requirement. This guide covers Delta Lake table optimization concepts and configurations, and how to apply them to the most common Big Data usage patterns.
V-Order is a write time optimization to the parquet file format that enables lightning-fast reads under the Microsoft Fabric compute engines, such as Power BI, SQL, Spark, and others.
Power BI and SQL engines make use of Microsoft Verti-Scan technology and V-Ordered parquet files to achieve in-memory-like data access times. Spark and other non-Verti-Scan compute engines also benefit from the V-Ordered files with an average of 10% faster read times, with some scenarios up to 50%.
V-Order works by applying special sorting, row group distribution, dictionary encoding and compression on parquet files, thus requiring less network, disk, and CPU resources in compute engines to read it, providing cost efficiency and performance. V-Order sorting has a 15% impact on average write times but provides up to 50% more compression.
It's 100% open-source parquet format compliant; all parquet engines can read it as regular parquet files. Delta tables are more efficient than ever; features such as Z-Order are compatible with V-Order. Table properties and optimization commands can be used to control V-Order on its partitions.
V-Order is applied at the parquet file level. Delta tables and their features, such as Z-Order, compaction, vacuum, time travel, etc., are orthogonal to V-Order and, as such, are compatible and can be used together for extra benefits.
V-Order is enabled by default in Microsoft Fabric, and in Apache Spark it's controlled by the following configurations.
| Configuration | Default value | Description |
|---|---|---|
| spark.sql.parquet.vorder.enabled | true | Controls session-level V-Order writing. |
| TBLPROPERTIES("delta.parquet.vorder.enabled") | false | Default V-Order mode on tables. |
| Dataframe writer option: parquet.vorder.enabled | unset | Control V-Order writes using the Dataframe writer. |
Use the following commands to control usage of V-Order writes.
%%sql
-- Check the current session-level V-Order setting
SET spark.sql.parquet.vorder.enabled
%%sql
-- Disable V-Order writing for the current session
SET spark.sql.parquet.vorder.enabled=FALSE
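The same session-level control is available from PySpark through the standard spark.conf API. The following is a minimal sketch; the spark object is the session available in a Fabric notebook:
# Check the current session-level V-Order setting (returns None if unset).
print(spark.conf.get("spark.sql.parquet.vorder.enabled", None))
# Disable or re-enable V-Order writing for the current session.
spark.conf.set("spark.sql.parquet.vorder.enabled", "false")
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")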
Important
When enabled at the session level, all parquet writes are made with V-Order enabled. This includes non-Delta parquet tables and Delta tables with the parquet.vorder.enabled table property set to either true or false.
%%sql
-- Enable V-Order writing for the current session
SET spark.sql.parquet.vorder.enabled=TRUE
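With the session setting enabled, even a plain (non-Delta) parquet write is V-Ordered, as the note above describes. A minimal sketch, where df_source is any DataFrame and the output path is only a placeholder:
# Sketch only: when the session-level setting is true, non-Delta parquet output is also V-Ordered.
df_source.write.format("parquet").mode("overwrite").save("Files/people_parquet")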
Enable the V-Order table property during table creation:
%%sql
CREATE TABLE person (id INT, name STRING, age INT) USING parquet TBLPROPERTIES("delta.parquet.vorder.enabled" = "true");
Important
When the table property is set to true, the INSERT, UPDATE, and MERGE commands behave as expected and perform the write-time optimization. If the V-Order session configuration is set to true, or the DataFrame writer option enables it, the writes are V-Ordered even if TBLPROPERTIES is set to false.
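As an illustration of that precedence, the following sketch assumes a hypothetical table myschema.person whose delta.parquet.vorder.enabled property is set to false; because the session configuration is on, appended files are still written with V-Order:
# Sketch only: the session-level setting overrides a table property of false.
# df_updates and myschema.person are placeholder names, not from the original article.
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
df_updates.write\
.format("delta")\
.mode("append")\
.saveAsTable("myschema.person")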
Enable or disable V-Order by altering the table property:
%%sql
ALTER TABLE person SET TBLPROPERTIES("delta.parquet.vorder.enabled" = "true");
ALTER TABLE person SET TBLPROPERTIES("delta.parquet.vorder.enabled" = "false");
ALTER TABLE person UNSET TBLPROPERTIES("delta.parquet.vorder.enabled");
After you enable or disable V-Order using table properties, only future writes to the table are affected. Parquet files keep the ordering that was used when they were created. To change the current physical structure to apply or remove V-Order, read the "Control V-Order when optimizing a table" section below.
All Apache Spark write commands inherit the session setting if not set explicitly. All of the following commands write using V-Order by implicitly inheriting the session configuration.
# Append to an existing Delta table; V-Order is inherited from the session configuration.
df_source.write\
.format("delta")\
.mode("append")\
.saveAsTable("myschema.mytable")
# Define or replace a Delta table; data written to it later inherits the session V-Order setting.
from delta.tables import DeltaTable
DeltaTable.createOrReplace(spark)\
.addColumn("id","INT")\
.addColumn("firstName","STRING")\
.addColumn("middleName","STRING")\
.addColumn("lastName","STRING",comment="surname")\
.addColumn("birthDate","TIMESTAMP")\
.location("Files/people")\
.execute()
# Overwrite only the rows matched by the replaceWhere predicate; those files inherit the session setting.
df_source.write\
.format("delta")\
.mode("overwrite")\
.option("replaceWhere","start_date >= '2017-01-01' AND end_date <= '2017-01-31'")\
.saveAsTable("myschema.mytable")
Important
V-Order only applies to files affected by the predicate.
In a session where spark.sql.parquet.vorder.enabled is unset or set to false, the following commands would write using V-Order:
# The writer option forces V-Order for this write regardless of the session setting.
df_source.write\
.format("delta")\
.mode("overwrite")\
.option("replaceWhere","start_date >= '2017-01-01' AND end_date <= '2017-01-31'")\
.option("parquet.vorder.enabled","true")\
.saveAsTable("myschema.mytable")
# The writer option also forces V-Order when defining a table with the DeltaTable builder.
from delta.tables import DeltaTable
DeltaTable.createOrReplace(spark)\
.addColumn("id","INT")\
.addColumn("firstName","STRING")\
.addColumn("middleName","STRING")\
.addColumn("lastName","STRING",comment="surname")\
.addColumn("birthDate","TIMESTAMP")\
.option("parquet.vorder.enabled","true")\
.location("Files/people")\
.execute()
Analytical workloads on Big Data processing engines such as Apache Spark perform most efficiently when using standardized larger file sizes. The relationship between the file size, the number of files, the number of Spark workers, and their configurations plays a critical role in performance. Ingesting data into data lake tables might have the inherent characteristic of constantly writing lots of small files; this scenario is commonly known as the "small file problem."
Optimize Write is a Delta Lake on Microsoft Fabric and Azure Synapse Analytics feature in the Apache Spark engine that reduces the number of files written and aims to increase individual file size of the written data. The target file size can be changed per workload requirements using configurations.
The feature is enabled by default in Microsoft Fabric Runtime for Apache Spark. To learn more about Optimize Write usage scenarios, read the article The need for optimize write on Apache Spark.
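As a sketch of how the feature can be tuned per session, the following uses the Optimize Write configuration names documented for Delta Lake on Synapse/Fabric Spark; treat the exact names and the bin size value as assumptions to verify against your runtime version:
# Sketch only: enable Optimize Write and adjust the target bin (file) size for the session.
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.microsoft.delta.optimizeWrite.binSize", "1073741824")  # ~1 GB; value assumed to be bytes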
The Delta Lake MERGE command allows users to update a Delta table with advanced conditions, merging data from a source table, view, or DataFrame into a target table. However, the current algorithm in the open-source distribution of Delta Lake isn't fully optimized for handling unmodified rows. The Microsoft Spark Delta team implemented a custom Low Shuffle Merge optimization: unmodified rows are excluded from the expensive shuffling operation that is needed for updating matched rows.
The implementation is controlled by the spark.microsoft.delta.merge.lowShuffle.enabled configuration, enabled by default in the runtime. It requires no code changes and is fully compatible with the open-source distribution of Delta Lake. To learn more about Low Shuffle Merge usage scenarios, read the article Low Shuffle Merge optimization on Delta tables.
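For context, a merge that benefits from the optimization looks like the following PySpark sketch; the table name, the df_updates DataFrame, and the id join key are illustrative assumptions, and Low Shuffle Merge applies transparently because the configuration is on by default:
# Sketch only: a standard Delta MERGE; Low Shuffle Merge is applied automatically
# when spark.microsoft.delta.merge.lowShuffle.enabled is true (the runtime default).
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "myschema.mytable")
(target.alias("t")
    .merge(df_updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())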
As Delta tables change, performance and storage cost efficiency tend to degrade: write, update, and delete operations accumulate many small parquet files, and data files that are no longer referenced remain in storage.
To keep tables in the best state for performance, perform bin-compaction and vacuuming operations on the Delta tables. Bin-compaction is achieved by the OPTIMIZE command; it merges all changes into bigger, consolidated parquet files. Dereferenced storage clean-up is achieved by the VACUUM command.
The table maintenance commands OPTIMIZE and VACUUM can be used within notebooks and Spark Job Definitions, and then orchestrated using platform capabilities. The Lakehouse in Fabric also offers the ability to perform ad-hoc table maintenance through the user interface, as explained in the Delta Lake table maintenance article.
Important
Properly designing the table physical structure, based on the ingestion frequency and expected read patterns, is likely more important than running the optimization commands described in this section.
The following command structures bin-compact and rewrite all affected files using V-Order, independent of the TBLPROPERTIES setting or session configuration setting:
%%sql
OPTIMIZE <table|fileOrFolderPath> VORDER;
OPTIMIZE <table|fileOrFolderPath> WHERE <predicate> VORDER;
OPTIMIZE <table|fileOrFolderPath> WHERE <predicate> [ZORDER BY (col_name1, col_name2, ...)] VORDER;
When ZORDER and VORDER are used together, Apache Spark performs bin-compaction, ZORDER, and VORDER sequentially.
The following commands bin-compact and rewrite all affected files according to the TBLPROPERTIES setting. If TBLPROPERTIES is set to true for V-Order, all affected files are written as V-Order. If TBLPROPERTIES is unset or set to false for V-Order, the session setting is inherited; so, to remove V-Order from the table, set the session configuration to false.
%%sql
OPTIMIZE <table|fileOrFolderPath>;
OPTIMIZE <table|fileOrFolderPath> WHERE predicate;
OPTIMIZE <table|fileOrFolderPath> WHERE predicate [ZORDER BY (col_name1, col_name2, ...)];
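Dereferenced files are then removed with VACUUM; a minimal sketch, where the table name is a placeholder and the 168-hour retention shown is simply the Delta Lake default of seven days:
# Sketch only: remove files no longer referenced by the Delta log.
# 168 hours (7 days) is the default retention; shorter values require overriding the safety check.
spark.sql("VACUUM myschema.mytable RETAIN 168 HOURS")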
Documentation
- Delta table maintenance in Microsoft Fabric: Learn about the Lakehouse Delta table maintenance feature, which lets you efficiently manage Delta tables and keep them always ready for analytics.
- Lakehouse and Delta tables: The Delta Lake table format is the unified format of the Lakehouse, the data architecture platform for managing data in Microsoft Fabric.
- Lakehouse Load to Delta Lake tables: Learn about the lakehouse Load to Delta Table feature, including feature guidelines and capabilities.