This article is for data engineers who need to create materialized lake views using PySpark instead of Spark SQL. Use PySpark when your transformations require complex logic, reusable functions, external Python libraries, or custom UDFs that are difficult to express in SQL.
Note
PySpark-based materialized lake views are currently in preview.
Create a materialized lake view
The fmlv module provides the functions for creating materialized lake views using PySpark. Import it before defining a view.
import fmlv
Use the @fmlv.materialized_lake_view decorator to define a materialized lake view. The following code shows the syntax:
@fmlv.materialized_lake_view(
    name="<[workspace.lakehouse.schema].MLV_Identifier>",
    comment="<Description>",
    table_properties={"key1": "value1"},
    partition_cols=["<col1>", "<col2>"],
    replace=<True|False>
)
@fmlv.check("constraint_name", "condition_expression", "action")
def <function_name>():
    dfObject = <logic and definition>
    return dfObject
Arguments
The following table describes the parameters for the @fmlv.materialized_lake_view decorator.
| Parameter | Description |
|---|---|
| name | Name of the materialized lake view. Required. |
| comment | Description of the materialized lake view. Optional. |
| partition_cols | List of columns to partition the materialized lake view by. Optional. |
| replace | Whether to replace the existing view definition. Defaults to False. |
| table_properties | Dictionary of key-value pairs that define the properties of the materialized lake view. Optional. |
| function definition | Function that returns a Spark DataFrame defining the view logic. Required. |
| check | Decorator (@fmlv.check) that defines a data quality constraint. Optional. |
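The name parameter is a dot-separated identifier with optional qualifiers in front of the view name. The following is a minimal sketch of how such a name breaks into parts. `ViewName` and `parse_view_name` are hypothetical helpers written for illustration and aren't part of fmlv. The sketch assumes that qualifiers can be omitted from the left, as in the two-part and three-part names used elsewhere in this article; whether one-part or four-part names are accepted everywhere is an assumption.

```python
from typing import NamedTuple, Optional


class ViewName(NamedTuple):
    """Hypothetical container for the parts of a view name (not an fmlv type)."""
    workspace: Optional[str]
    lakehouse: Optional[str]
    schema: Optional[str]
    view: str


def parse_view_name(name: str) -> ViewName:
    """Split a dot-separated name into optional qualifiers and the view identifier."""
    parts = name.split(".")
    if not 1 <= len(parts) <= 4 or any(not part for part in parts):
        raise ValueError(f"expected 1 to 4 non-empty dot-separated parts, got {name!r}")
    # Assumption: qualifiers are optional from the left, so pad missing ones with None.
    padded = [None] * (4 - len(parts)) + parts
    return ViewName(*padded)


print(parse_view_name("LH1.silver.customer_silver"))
# prints ViewName(workspace=None, lakehouse='LH1', schema='silver', view='customer_silver')
```

Because the period acts as the separator, a period can't also appear inside any single part of the name.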
Examples
The following examples demonstrate common patterns for creating materialized lake views using PySpark.
Basic definition
This example creates a simple materialized lake view that reads from a bronze table, filters out null values, and adds a calculated column.
import fmlv
from pyspark.sql import functions as F

@fmlv.materialized_lake_view(name="LH1.silver.customer_silver")
def customer_silver():
    # Read the source bronze table
    df = spark.read.table("bronze.customer_bronze")
    # Keep only rows that have a sales value
    cleaned_df = df.filter(F.col("sales").isNotNull())
    # Add the calculated column
    enriched_df = cleaned_df.withColumn("sales_in_usd", F.col("sales") * 1.0)
    return enriched_df
With partitions and table properties
This example creates a materialized lake view with partitioning on year and city columns, and enables change data feed for downstream consumers.
import fmlv
from pyspark.sql import functions as F

@fmlv.materialized_lake_view(
    name="LH1.silver.customer_enriched",
    partition_cols=["year", "city"],
    table_properties={"delta.enableChangeDataFeed": "true"}
)
def customer_enriched():
    # Read the source bronze table from another lakehouse
    df = spark.read.table("LH2.bronze.customer_bronze")
    # Keep only rows that have a sales value
    cleaned_df = df.filter(F.col("sales").isNotNull())
    # Add the calculated column
    enriched_df = cleaned_df.withColumn("sales_in_usd", F.col("sales") * 1.0)
    return enriched_df
To create a PySpark-based materialized lake view, run the notebook once to register the definition, and then use lineage scheduling for subsequent refreshes.
Notebook organization and refresh behavior for PySpark materialized lake views
Important
PySpark materialized lake view refresh requires a notebook, unlike Spark SQL. During refresh, Fabric identifies the notebook that defines the view and executes the relevant cells.
Best practices for defining PySpark-based materialized lake views
Follow these guidelines to organize your notebooks and avoid common errors.
- Include all required cells in the same notebook, with dependency cells positioned above the @fmlv decorator cell.
- Define only one @fmlv decorator per cell. Use separate cells for multiple materialized lake views.
- Don't delete the notebook where the materialized lake view is defined. Scheduled refresh fails without it.
- You can use multiple notebooks to create separate materialized lake views.
- To update an existing definition, modify the function and run with replace=True.
- After changing the @fmlv decorator, re-execute the notebook. Otherwise, the next refresh uses the latest code and might fail.
- Don't define materialized lake views in notebooks with unrelated code.
- Don't use variables for @fmlv parameter values. All parameters must be hardcoded.
Consider the following example of defining a materialized lake view in a notebook:
Cell 1: Definition of function
Define helper functions in cells above the materialized lake view definition.
from pyspark.sql import functions as F
# Simple helper function: concatenate columns as string
def concat_name_age(df):
    return df.withColumn(
        "name_age",
        F.concat(F.coalesce(F.col("name"), F.lit("")),
                 F.lit("-"),
                 F.coalesce(F.col("age").cast("string"), F.lit("")))
    )
Cell 2: Materialized lake view creation
Create the materialized lake view using the @fmlv.materialized_lake_view decorator. This view uses the helper function defined in Cell 1.
import fmlv
@fmlv.materialized_lake_view(
    name="LH1.silver.customer_silver"
)
def customer_silver():
    # Read bronze table
    bronze_df = spark.read.table("customer_bronze")
    # Apply helper function
    enriched_df = concat_name_age(bronze_df)
    # Add uppercase name
    enriched_df = enriched_df.withColumn("name_upper", F.upper(F.col("name")))
    return enriched_df
Cell 3: Another materialized lake view creation
Define more materialized lake views in separate cells within the same notebook.
import fmlv
@fmlv.materialized_lake_view(
    name="LH1.silver.customer_enriched"
)
def customer_enriched():
    df = spark.read.table("customer_bronze")
    enriched_df = df.filter(F.col("sales").isNotNull())
    return enriched_df
When to use PySpark
PySpark is the better choice when you need:
- Complex transformation logic that is difficult to express in SQL.
- Reusable functions: define common transformation logic once and call it from multiple views.
- External Python libraries: use specialized libraries such as pandas or NumPy that aren't available in SQL.
- Custom UDFs: package custom logic as .jar files or Python UDFs when built-in functions are insufficient.
Trade-offs when using PySpark
- No incremental refresh: every refresh is either a full refresh or no refresh. See Current limitations for details.
- Lineage-schedule refresh only: you can't refresh on demand from a notebook, as you can with Spark SQL-based views.
Decision flowchart
Use this flowchart to decide which approach to use:
        ┌─────────────────────────────────┐
        │  Do you need optimal refresh?   │
        └───────────────┬─────────────────┘
                        │
        ┌───────────────┴───────────────┐
        │                               │
       Yes                              No
        │                               │
        ▼                               ▼
┌───────────────┐         ┌───────────────────────────┐
│ Use Spark SQL │         │ Does the transformation   │
└───────────────┘         │ require UDFs or external  │
                          │ libraries?                │
                          └─────────────┬─────────────┘
                                        │
                           ┌────────────┴────────────┐
                           │                         │
                          Yes                        No
                           │                         │
                           ▼                         ▼
                    ┌─────────────┐          ┌───────────────┐
                    │ Use PySpark │          │ Use Spark SQL │
                    └─────────────┘          └───────────────┘
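The flowchart above reduces to two yes/no questions, which can be written as a small function. This is an illustrative sketch only; `choose_approach` and its parameter names are hypothetical and not part of any Fabric API.

```python
def choose_approach(needs_optimal_refresh: bool, needs_udfs_or_libraries: bool) -> str:
    """Mirror the decision flowchart and return "Spark SQL" or "PySpark"."""
    if needs_optimal_refresh:
        # Optimal refresh takes priority over every other consideration.
        return "Spark SQL"
    if needs_udfs_or_libraries:
        return "PySpark"
    return "Spark SQL"


print(choose_approach(needs_optimal_refresh=False, needs_udfs_or_libraries=True))
# prints PySpark
```

PySpark is the outcome in only one of the four combinations: when optimal refresh isn't needed and the transformation requires UDFs or external libraries.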
Drop a materialized lake view
You can drop a materialized lake view by using the Delete option in the lakehouse object explorer or by running the following command in the notebook:
spark.sql("DROP MATERIALIZED LAKE VIEW IF EXISTS <materialized_lake_view_Identifier>")
Here's an example:
spark.sql("DROP MATERIALIZED LAKE VIEW IF EXISTS silver.customer_enriched")
Note
Dropping or renaming a materialized lake view affects the lineage view and scheduled refresh. Be sure to update the reference in all dependent materialized lake views.
Current limitations
- The incremental refresh strategy in optimal refresh isn't supported for PySpark-based materialized lake views. Every refresh is either a full refresh or no refresh.
- You can only refresh PySpark materialized lake views through the lineage schedule.
- Renaming is supported only through lakehouse explorer. To rename via notebook, drop and recreate the view with the new name.
- Only %%pyspark and %%sql magic commands are supported, and they must appear at the top of a notebook cell. Magic commands placed elsewhere within the same cell aren't supported.
- The name of a materialized lake view can include special characters except for periods.
- The @fmlv decorator doesn't support dynamic parameters or variables. All parameters must be hardcoded in the decorator definition. For example, the following code doesn't work because it uses a variable to pass the name parameter value:
view_name = "LH1.silver.customer_silver"
@fmlv.materialized_lake_view(name=view_name)
def customer_silver():
    df = spark.read.table("bronze.customer_bronze")
    cleaned_df = df.filter(F.col("sales").isNotNull())
    enriched_df = cleaned_df.withColumn("sales_in_usd", F.col("sales") * 1.0)
    return enriched_df
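The magic-command limitation in the list above can also be checked mechanically. The following is a rough sketch, not a Fabric feature: `check_magics` is a hypothetical helper, and it assumes that "top of the cell" means the first non-blank line. It inspects text only, so a line inside a multi-line string that happens to start with %% is also reported.

```python
SUPPORTED_MAGICS = ("%%pyspark", "%%sql")  # the two magics this article lists as supported


def check_magics(cell_source: str) -> list:
    """Return a list of problems with cell-magic usage in one notebook cell."""
    problems = []
    lines = cell_source.splitlines()
    # Index of the first non-blank line, the only place a cell magic is allowed.
    first = next((i for i, line in enumerate(lines) if line.strip()), None)
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped.startswith("%%"):
            continue
        magic = stripped.split()[0]
        if magic not in SUPPORTED_MAGICS:
            problems.append(f"line {i + 1}: unsupported magic {magic}")
        elif i != first:
            problems.append(f"line {i + 1}: {magic} must be at the top of the cell")
    return problems


print(check_magics("import fmlv\n%%sql\nSELECT 1\n"))
# prints ['line 2: %%sql must be at the top of the cell']
```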