What's new and planned for Synapse Data Engineering in Microsoft Fabric

Important

The release plans describe functionality that may or may not have been released yet. The delivery timelines and projected functionality may change or may not ship. Refer to Microsoft policy for more information.

Synapse Data Engineering empowers data engineers to transform their data at scale using Spark and build out their lakehouse architecture. There are five key areas of Synapse Data Engineering that you can start using for your projects today:

Lakehouse for all your organizational data: The lakehouse is a new item in Fabric that combines the best of the data lake and the data warehouse in a single experience. It enables you to ingest, prepare, and share organizational data in an open format in the lake, and then access it through multiple engines such as Spark, T-SQL, and Power BI. It provides various data integration options such as dataflows and pipelines, shortcuts to external data sources, and data product sharing capabilities.

Performant Spark engine & runtime: Synapse Data Engineering provides customers with an optimized Spark runtime with the latest versions of Spark (3.3.1), Delta (2.2), and Python (3.10). It uses Delta Lake as the common table format for all engines, enabling easy data sharing and reporting with no data movement. The runtime comes with Spark optimizations that enhance your query performance without any configuration. It also offers starter pools and a high-concurrency mode to speed up and reuse your Spark sessions, saving you time and cost.

Spark Admin & configurations: Workspace admins with appropriate permissions can create and configure custom pools to optimize the performance and cost of their Spark workloads. They can also install libraries, select the runtime version, and set Spark properties to customize the Spark environment. These settings apply to all notebooks and Spark jobs in the workspace, unless otherwise specified.

Developer Experience: Developers can use notebooks, Spark jobs, or their preferred IDE to author and execute Spark code in Fabric. They can natively access the lakehouse data, collaborate with others, install libraries, track history, do in-line monitoring, and get recommendations from the Spark advisor. They can also use Data Wrangler to easily prepare data with a low-code UI.

Platform Integration: All Synapse data engineering items, including notebooks, Spark jobs, pipelines, and lakehouses, are integrated deeply into the Fabric platform (enterprise information management capabilities, lineage, sensitivity labels, and endorsements). This integration continues to deepen this semester with many new investments.

To learn more, see the documentation and visit our announcement blog.

Investment areas

Feature Estimated release timeline
Lakehouse data security (Public Preview) Q2 2024
Schema support for Lakehouse Q2 2024
Spark autotune (Public Preview) Q1 2024
High concurrency in pipelines Q2 2024
Optimistic Job Admission for Fabric Spark Q2 2024
Single Node Support for Starter Pools Q2 2024
Job Queueing for Notebook Jobs Q3 2024
Public monitoring APIs Q2 2024
Snapshot for in progress Notebook runs Q2 2024
Spark Policy management Q2 2024
Create and attach environments (GA) Q2 2024
Dynamic lineage of data engineering items Q3 2024
Spark Connector for Fabric Data Warehouse (Public Preview) Q2 2024
Lakehouse metadata on git and deployment pipelines (Public Preview) Q2 2024
Load to Tables improvements and automatic mode Q3 2024
Delta Lake improvements in Spark experiences Q3 2024
Lakehouse Automatic Table Maintenance Q4 2024
T-SQL notebook (Public Preview) Q3 2024
Python notebook (Public Preview) Q4 2024
VS Code for the Web - debugging support (Public Preview) Q3 2024

Lakehouse data security (Public Preview)

Estimated release timeline: Q2 2024

You'll have the ability to apply file, folder, and table (object-level) security in the lakehouse. You'll also be able to control who can access data in the lakehouse and the level of permissions they have. For example, you can grant read permissions on files, folders, and tables. Once permissions are applied, they're automatically synchronized across all engines, which means permissions are consistent across Spark, SQL, Power BI, and external engines.

Schema support for Lakehouse

Estimated release timeline: Q2 2024

The lakehouse will support a three-part naming convention. It enables you to add schemas to your lakehouses, which is consistent with the current warehouse experience.
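
As a rough sketch, three-part (lakehouse.schema.table) naming from Spark could look like the following; the lakehouse, schema, and table names are placeholders, and the code assumes a Fabric notebook where a `spark` session is predefined.

```python
# Hypothetical three-part naming once schema support is available.
# "SalesLakehouse", "dbo", and "orders" are placeholder names.
df = spark.sql("SELECT * FROM SalesLakehouse.dbo.orders LIMIT 10")

# Writing to a schema-qualified table follows the same convention.
df.write.mode("overwrite").saveAsTable("SalesLakehouse.dbo.orders_snapshot")
```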

Spark autotune (Public Preview)

Estimated release timeline: Q1 2024

Autotune uses machine learning to automatically analyze previous runs of your Spark jobs and tune the configurations to optimize performance. It configures how your data is partitioned, joined, and read by Spark, which can significantly improve performance. We have seen customer jobs run 2x faster with this capability.
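
The settings autotune adjusts correspond to standard Spark SQL properties that you would otherwise tune by hand. The sketch below illustrates this; the `spark.ms.autotune.enabled` switch is an assumed property name, and the code assumes a Fabric notebook where a `spark` session is predefined.

```python
# Assumed property name for turning autotune on at the session level.
spark.conf.set("spark.ms.autotune.enabled", "true")

# Without autotune, these are the kinds of standard Spark SQL settings
# you would tune manually for partitioning, joins, and file reads:
spark.conf.set("spark.sql.shuffle.partitions", "200")            # partitions after shuffles
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")   # broadcast-join cutoff
spark.conf.set("spark.sql.files.maxPartitionBytes", "128MB")     # split size when reading files
```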

High concurrency in pipelines

Estimated release timeline: Q2 2024

In addition to high concurrency in notebooks, we'll also enable high concurrency in pipelines. This capability will allow you to run multiple notebooks in a pipeline with a single session.

Optimistic Job Admission for Fabric Spark

Estimated release timeline: Q2 2024

With Optimistic Job Admission, Fabric Spark only reserves the minimum number of cores that a job needs to start, based on the minimum number of nodes that the job can scale down to. This allows more jobs to be admitted if there are enough resources to meet the minimum requirements. If a job needs to scale up later, the scale-up request is approved or rejected based on the available cores in the capacity.
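
As a back-of-the-envelope illustration of the admission math, the sketch below compares cores reserved up front under a reserve-the-maximum model versus optimistic admission; the node counts and core size are made-up values, not Fabric defaults.

```python
# Illustrative only: cores reserved at admission time under two models.
cores_per_node = 8             # assumed node size
min_nodes, max_nodes = 1, 10   # assumed autoscale bounds for the job

reserve_max = max_nodes * cores_per_node   # classic model: 80 cores held up front
reserve_min = min_nodes * cores_per_node   # optimistic admission: 8 cores held up front

print(f"Reserve-the-maximum admission holds {reserve_max} cores")
print(f"Optimistic admission holds {reserve_min} cores")
# Scale-up beyond the minimum is then approved or rejected based on the
# cores still available in the capacity at that moment.
```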

Single Node Support for Starter Pools

Estimated release timeline: Q2 2024

This feature lets you set your starter pool to a maximum of one node and get super-fast session start times for your Spark sessions. With this support, the system allocates the driver and executor with 4 cores each and 56 GB of memory, so that they fit within a single medium node configuration for Starter Pools. This improves session start times to around 5 seconds in single-node starter pool configurations.

Job Queueing for Notebook Jobs

Estimated release timeline: Q3 2024

This feature enables queueing for notebook jobs that are triggered from pipelines or through the job scheduler. When the Fabric capacity is at its maximum compute utilization, these jobs are queued instead of being rejected and run automatically once capacity becomes available.

Public monitoring APIs

Estimated release timeline: Q2 2024

The public monitoring APIs will allow you to programmatically retrieve the status of Spark jobs, job summaries, and the corresponding driver and executor logs.

Snapshot for in progress Notebook runs

Estimated release timeline: Q2 2024

In addition to snapshots for completed Notebook runs, snapshots for in-progress notebook runs allow you to view the original code, cell output, and monitor the status and progress at the Notebook cell level.

Spark Policy management

Estimated release timeline: Q2 2024

Workspace admins will be able to author and enforce policies based on Spark properties, ensuring that workloads comply with certain rules. For example, they can limit the resources or time that a workload can consume, or prevent users from changing certain Spark settings. This will enhance the governance and security of your Spark workloads.

Create and attach environments (GA)

Estimated release timeline: Q2 2024

To customize your Spark experiences at a more granular level, you can create and attach environments to your notebooks and Spark jobs. In an environment, you can install libraries, configure a new pool, set Spark properties, and upload scripts to a file system. This gives you more flexibility and control over your Spark workloads, without affecting the default settings of the workspace. As part of GA, we're making various improvements to environments including API support and CI/CD integration.

Dynamic lineage of data engineering items

Estimated release timeline: Q3 2024

You'll be able to trace lineage within Fabric across code items, such as notebooks and Spark jobs, and data items, such as lakehouses. This lineage will be dynamic, meaning that if the code adds or removes references to lakehouses, the change is reflected in the lineage view.

Spark Connector for Fabric Data Warehouse (Public Preview)

Estimated release timeline: Q2 2024

The Spark Connector for Fabric DW (Data Warehouse) empowers a Spark developer or a data scientist to access and work on data from a Fabric Data Warehouse with a simplified Spark API that works with just one line of code. It queries the data from the Fabric data warehouse in parallel, so that it scales with increasing data volume, and it honors the security model (OLS/RLS/CLS) defined at the data warehouse level when accessing a table or view. This first release supports reading data only; support for writing data back will come later.
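
A minimal PySpark sketch of what the read path could look like; the import path and `synapsesql` reader are assumptions based on the existing Synapse connector family, and `MyWarehouse.dbo.Sales` is a placeholder warehouse table.

```python
# Assumes a Fabric notebook with a predefined `spark` session and the
# Fabric Spark connector available in the runtime (assumed import path).
import com.microsoft.spark.fabric  # noqa: F401 -- registers the synapsesql reader

# One-line read of a warehouse table; OLS/RLS/CLS defined in the warehouse
# apply to the querying identity. The three-part name is a placeholder.
df = spark.read.synapsesql("MyWarehouse.dbo.Sales")
df.show(10)
```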

Lakehouse metadata on git and deployment pipelines (Public Preview)

Estimated release timeline: Q2 2024

To deliver a compelling application lifecycle management story, tracking object metadata in git and supporting deployment pipelines is imperative. In the Data Engineering modules, as workspaces are integrated with git, OneLake shortcuts and folders will automatically be deployed across pipeline stages and workspaces. Shortcut connections can be remapped across stages, ensuring the isolation and environment segmentation customers expect. Data Engineers and DevOps Engineers will be able to integrate all Lakehouse object types, such as tables and views, with Azure DevOps and deployment pipelines by using a REST API flow. Data copies can be achieved by invoking notebooks, Spark job definitions, and data pipelines using the same APIs.

Load to Tables improvements and automatic mode

Estimated release timeline: Q3 2024

The Load to Tables API experience is being expanded to support key customer requests. It will support the MERGE pattern, which is a building block of data lakes and lakehouses, enabling change data capture constructs. By using the right set of table properties in your Delta Lake table in the Lakehouse, it will be possible to orchestrate all common ingestion patterns: APPEND, OVERWRITE, and MERGE. We're also releasing an automatic table loading functionality that allows you to set and forget your predefined table load pattern; the Lakehouse will detect changes and run it for you.
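
The MERGE ingestion pattern itself can already be expressed with the open Delta Lake API; the sketch below shows the shape of it, with a placeholder table name, source path, and key column, assuming a Fabric Spark notebook with a predefined `spark` session.

```python
from delta.tables import DeltaTable

# Placeholder target table and incoming files.
target = DeltaTable.forName(spark, "orders")
updates = spark.read.parquet("Files/incoming/orders/")

# Upsert incoming rows into the target on a placeholder key column.
(target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```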

Delta Lake improvements in Spark experiences

Estimated release timeline: Q3 2024

Having proper defaults and aligning with the latest standards are of the utmost importance for Delta Lake in Microsoft Fabric. INT64 will be the new default encoding type for all timestamp values, moving away from the INT96 encoding that the Apache Parquet format deprecated years ago. The change doesn't affect any reading capabilities; it's transparent and compatible by default, and it ensures that all new parquet files in your Delta Lake tables are written in a more efficient and future-proof way.

We're also releasing a faster implementation of the OPTIMIZE command that skips already V-Ordered files. Spark will also enable column mapping and deletion vectors by default in all new tables. Column mapping enables full character support in column names and makes column operations metadata-only, so they're very fast. Deletion vectors make all DML operations (UPDATE, DELETE, and MERGE) orders of magnitude faster.
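
These behaviors map to existing Spark and Delta Lake settings that you can already opt into on current tables; a sketch with a placeholder table name, assuming a Fabric Spark notebook with a predefined `spark` session.

```python
# Write timestamps as INT64 (microseconds) instead of the deprecated INT96.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

# Enable deletion vectors on an existing table (placeholder name).
spark.sql("ALTER TABLE my_table SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')")

# Enable column mapping; this requires a reader/writer protocol upgrade.
spark.sql("""
    ALTER TABLE my_table SET TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5'
    )
""")
```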

Lakehouse Automatic Table Maintenance

Estimated release timeline: Q4 2024

An optimized Delta Lake table is critical for analytics performance across Microsoft Fabric experiences. It will be possible, using an API, code, or the Data Engineering Lakehouse user interface, to configure scheduled and fully automatic Delta Lake table maintenance. The capability will compact your Delta Lake table's parquet files for better analytics, delete stale references to clean up storage, and merge deletion vector files back into the main parquet files, all to guarantee the most optimized Delta Lake table state. In fully automatic mode, the feature will track changes to the table and detect the most appropriate time to optimize it.
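
Until the automatic mode ships, the same maintenance maps to Delta commands you can run yourself; a sketch with a placeholder table name, assuming a Fabric Spark notebook with a predefined `spark` session.

```python
# Compact small parquet files into larger ones for better scan performance.
spark.sql("OPTIMIZE my_lakehouse_table")

# Remove files no longer referenced by the table to reclaim storage
# (7-day retention shown; pick a retention that fits your recovery needs).
spark.sql("VACUUM my_lakehouse_table RETAIN 168 HOURS")
```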

T-SQL notebook (Public Preview)

Estimated release timeline: Q3 2024

Fabric notebooks will support the T-SQL language for working with data in the Data Warehouse. SQL developers will be able to read data against the SQL endpoint and will have full read and write capability when working against the Data Warehouse. T-SQL notebooks offer SQL users a great authoring alternative to existing tools and include Fabric-native features such as sharing, Git integration, and collaboration.

Python notebook (Public Preview)

Estimated release timeline: Q4 2024

Fabric notebooks will support a pure Python experience. This new solution targets BI developers and data scientists working with smaller datasets (up to a few GB) who use pandas, with Python as their primary language. Through this new experience, they'll benefit from the native Python language and its features and libraries out of the box, be able to switch from one Python version to another (initially two versions will be supported), and benefit from better resource utilization by using a smaller 2-vCore machine.
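
A sketch of the pandas-first workflow this experience targets; the lakehouse file path and column names below are placeholders/assumptions.

```python
import pandas as pd

# Placeholder path to a file in the attached lakehouse.
df = pd.read_parquet("/lakehouse/default/Files/sales/orders.parquet")

# Typical small-data exploration with pandas.
summary = df.groupby("region", as_index=False)["amount"].sum()
print(summary.head())
```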

VS Code for the Web - debugging support (Public Preview)

Estimated release timeline: Q3 2024

Visual Studio Code for the Web is currently supported in preview for authoring and execution scenarios. We're adding the ability to debug notebook code using this extension.