Compare Fabric Data Engineering and Azure Synapse Spark

This comparison between Fabric Data Engineering and Azure Synapse Spark provides a summary of key features and an in-depth analysis across various categories, which include Spark pools, configuration, libraries, notebooks, and Spark job definitions.

The following table compares Azure Synapse Spark and Fabric Spark across different categories:

| Category | Azure Synapse Spark | Fabric Spark |
|---|---|---|
| Spark pools | Spark pool | Starter pool / Custom pool |
| | - | V-Order |
| | - | High concurrency |
| Spark configurations | Pool level | Environment level |
| | Notebook or Spark job definition level | Notebook or Spark job definition level |
| Spark libraries | Workspace level packages | - |
| | Pool level packages | Environment libraries |
| | Inline packages | Inline libraries |
| Resources | Notebook (Python, Scala, Spark SQL, R, .NET) | Notebook (Python, Scala, Spark SQL, R) |
| | Spark job definition (Python, Scala, .NET) | Spark job definition (Python, Scala, R) |
| | Synapse data pipelines | Data Factory data pipelines |
| | Pipeline activities (notebook, SJD) | Pipeline activities (notebook, SJD) |
| Data | Primary storage (ADLS Gen2) | Primary storage (OneLake) |
| | Data residency (cluster/region based) | Data residency (capacity/region based) |
| Metadata | Internal Hive Metastore (HMS) | Internal HMS (lakehouse) |
| | External HMS (using Azure SQL DB) | - |
| Connections | Connector type (linked services) | Connector type (DMTS) |
| | Data sources | Data sources |
| | Data source connection with workspace identity | - |
| Security | RBAC and access control | RBAC and access control |
| | Storage ACLs (ADLS Gen2) | OneLake RBAC |
| | Private Links | Private Links |
| | Managed VNet (network isolation) | Managed VNet |
| | Synapse workspace identity | Workspace identity |
| | Data Exfiltration Protection (DEP) | - |
| | Service tags | Service tags |
| | Key Vault (via mssparkutils/linked service) | Key Vault (via mssparkutils) |
| DevOps | Azure DevOps integration | Azure DevOps integration |
| | CI/CD (no built-in support) | Deployment pipelines |
| Developer experience | IDE integration (IntelliJ) | IDE integration (VS Code) |
| | Synapse Studio UI | Fabric UI |
| | Collaboration (workspaces) | Collaboration (workspaces and sharing) |
| | Livy API | - |
| | API/SDK | API/SDK |
| | mssparkutils | mssparkutils |
| Logging and monitoring | Spark Advisor | Spark Advisor |
| | Built-in monitoring pools and jobs (through Synapse Studio) | Built-in monitoring pools and jobs (through Monitoring hub) |
| | Spark history server | Spark history server |
| | Prometheus/Grafana | - |
| | Log Analytics | - |
| | Storage Account | - |
| | Event Hubs | - |
| Business continuity and disaster recovery (BCDR) | BCDR (data) ADLS Gen2 | BCDR (data) OneLake |

Considerations and limitations:

  • DMTS integration: You can't use the DMTS via notebooks and Spark job definitions.

  • Workload level RBAC: Fabric supports four different workspace roles. For more information, see Roles in workspaces in Microsoft Fabric.

  • Managed identity: Currently, Fabric doesn't support running notebooks and Spark job definitions using the workspace identity, or using a managed identity for Azure Key Vault in notebooks.

  • CI/CD: You can use the Fabric API/SDK and deployment pipelines.

  • Livy API and how to submit and manage Spark jobs: The Livy API is on the roadmap but isn't exposed yet in Fabric. You must create notebooks and Spark job definitions with the Fabric UI.

  • Spark logs and metrics: In Azure Synapse, you can emit Spark logs and metrics to your own destinations, such as Log Analytics, blob storage, and Event Hubs. You can also get a list of Spark applications for the workspace from the API. Currently, neither of these capabilities is available in Fabric.

  • Other considerations:

    • JDBC: JDBC connection support isn't currently available in Fabric.

Spark pool comparison

The following table compares Azure Synapse Spark and Fabric Spark pools.

| Spark setting | Azure Synapse Spark | Fabric Spark |
|---|---|---|
| Live pool (pre-warm instances) | - | Yes, Starter pools |
| Custom pool | Yes | Yes |
| Spark versions (runtime) | 2.4, 3.1, 3.2, 3.3, 3.4 | 3.3, 3.4, 3.5 |
| Autoscale | Yes | Yes |
| Dynamic allocation of executors | Yes, up to 200 | Yes, based on capacity |
| Adjustable node sizes | Yes, 3-200 | Yes, 1 to a capacity-based maximum |
| Minimum node configuration | 3 nodes | 1 node |
| Node size family | Memory Optimized, GPU accelerated | Memory Optimized |
| Node size | Small-XXXLarge | Small-XXLarge |
| Autopause | Yes, customizable minimum 5 minutes | Yes, noncustomizable 2 minutes |
| High concurrency | No | Yes |
| V-Order | No | Yes |
| Spark autotune | No | Yes |
| Native Execution Engine | No | Yes |
| Concurrency limits | Fixed | Variable based on capacity |
| Multiple Spark pools | Yes | Yes (environments) |
| Intelligent cache | Yes | Yes |
| API/SDK support | Yes | Yes |

  • Runtime: Fabric doesn't support Spark 2.4, 3.1, and 3.2 versions. Fabric Spark supports Spark 3.3 with Delta 2.2 within Runtime 1.1, Spark 3.4 with Delta 2.4 within Runtime 1.2, and Spark 3.5 with Delta 3.1 within Runtime 1.3.

  • Autoscale: In Azure Synapse Spark, the pool can scale up to 200 nodes regardless of the node size. In Fabric, the maximum number of nodes depends on the node size and the provisioned capacity. See the following example for the F64 SKU.

    | Spark pool size | Azure Synapse Spark | Fabric Spark (Custom Pool, SKU F64) |
    |---|---|---|
    | Small | Min: 3, Max: 200 | Min: 1, Max: 32 |
    | Medium | Min: 3, Max: 200 | Min: 1, Max: 16 |
    | Large | Min: 3, Max: 200 | Min: 1, Max: 8 |
    | X-Large | Min: 3, Max: 200 | Min: 1, Max: 4 |
    | XX-Large | Min: 3, Max: 200 | Min: 1, Max: 2 |

  • Adjustable node sizes: In Azure Synapse Spark, you can go up to 200 nodes. In Fabric, the number of nodes you can have in your custom Spark pool depends on your node size and Fabric capacity. Capacity is a measure of how much computing power you can use in Azure. One capacity unit equals two Spark vCores (a unit of computing power for Spark). For example, a Fabric capacity SKU F64 has 64 capacity units, which is equivalent to 128 Spark vCores. The number of nodes available is the total vCores in the capacity divided by the vCores per node for the chosen node size. So, if you choose a small node size (4 vCores per node), you can have up to 32 nodes in your pool (128/4 = 32). For more information, see Spark compute.

  • Node size family: Fabric Spark pools only support the Memory Optimized node size family for now. If you're using a GPU-accelerated SKU Spark pool in Azure Synapse, that node family isn't available in Fabric.

  • Node size: The XX-Large node size comes with 432 GB of memory in Azure Synapse, while the same node size in Fabric has 512 GB of memory and 64 vCores. The rest of the node sizes (Small through X-Large) have the same vCores and memory in both Azure Synapse and Fabric.

  • Automatic pausing: If you enable it in Azure Synapse Spark, the Apache Spark pool automatically pauses after a specified amount of idle time. This setting is configurable in Azure Synapse (minimum 5 minutes). In Fabric, custom pools have a default autopause duration of 2 minutes after the session expires, and this duration isn't customizable. The default session expiration is set to 20 minutes in Fabric.

  • High concurrency: Fabric supports high concurrency in notebooks. For more information, see High concurrency mode in Fabric Spark.

  • Concurrency limits: In terms of concurrency, Azure Synapse Spark has a limit of 50 simultaneous running jobs per Spark pool and 200 queued jobs per Spark pool. The maximum active jobs are 250 per Spark pool and 1000 per workspace. In Microsoft Fabric Spark, capacity SKUs define the concurrency limits. SKUs have varying limits on max concurrent jobs that range from 1 to 512. Also, Fabric Spark has a dynamic reserve-based throttling system to manage concurrency and ensure smooth operation even during peak usage times. For more information, see Concurrency limits and queueing in Microsoft Fabric Spark and Fabric capacities.

  • Multiple Spark pools: If you want to have multiple Spark pools, use Fabric environments to select a pool by notebook or Spark job definition. For more information, see Create, configure, and use an environment in Microsoft Fabric.
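The node-count arithmetic from the autoscale and adjustable node sizes notes can be sketched as a small helper. The vCores-per-node values are inferred from the F64 example (128 Spark vCores divided by each maximum node count) and should be treated as assumptions. The function and dictionary names are illustrative and aren't part of any Fabric API.

```python
# Spark vCores per node size, inferred from the F64 example
# (128 Spark vCores / maximum nodes). Treat these values as assumptions.
VCORES_PER_NODE = {
    "Small": 4,
    "Medium": 8,
    "Large": 16,
    "X-Large": 32,
    "XX-Large": 64,
}

# One capacity unit is equivalent to two Spark vCores.
SPARK_VCORES_PER_CAPACITY_UNIT = 2


def max_nodes(capacity_units: int, node_size: str) -> int:
    """Return the maximum node count of a custom pool for a node size."""
    total_vcores = capacity_units * SPARK_VCORES_PER_CAPACITY_UNIT
    return total_vcores // VCORES_PER_NODE[node_size]


if __name__ == "__main__":
    # F64 has 64 capacity units, so 128 Spark vCores in total.
    for size in VCORES_PER_NODE:
        print(f"F64, {size}: up to {max_nodes(64, size)} nodes")
```

For other SKUs, pass the SKU's capacity units as the first argument. Actual limits for a given capacity should be confirmed against the Spark compute documentation.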

Spark configurations comparison

Spark configurations can be applied at different levels:

  • Environment level: These configurations are used as the default configuration for all Spark jobs in the environment.
  • Inline level: Set Spark configurations inline using notebooks and Spark job definitions.

While both options are supported in Azure Synapse Spark and Fabric, there are some considerations:

| Spark configuration | Azure Synapse Spark | Fabric Spark |
|---|---|---|
| Environment level | Yes, pools | Yes, environments |
| Inline | Yes | Yes |
| Import/export | Yes | Yes (.yml from environments) |
| API/SDK support | Yes | Yes |

  • Environment level: In Azure Synapse, you can define multiple Spark configurations and assign them to different Spark pools. You can do this in Fabric by using environments.

  • Inline: In Azure Synapse, both notebooks and Spark jobs support attaching different Spark configurations. In Fabric, session level configurations are customized by calling spark.conf.set(<conf_name>, <conf_value>). For batch jobs, you can also apply configurations via SparkConf.

  • Import/export: This option for Spark configurations is available in Fabric environments.

  • Other considerations:

    • Immutable Spark configurations: Some Spark configurations are immutable. If you get the message AnalysisException: Can't modify the value of a Spark config: <config_name>, the property in question is immutable.
    • FAIR scheduler: FAIR scheduler is used in high concurrency mode.
    • V-Order: V-Order is a write-time optimization applied to Parquet files. It's enabled by default in Fabric Spark pools.
    • Optimized Write: Optimized Write is disabled by default in Azure Synapse but enabled by default for Fabric Spark.
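The inline approach and the immutable-configuration caveat can be combined in a short sketch. The helper name `apply_session_confs` is hypothetical. It only assumes that `spark` exposes `conf.set(name, value)`, as the session object in a notebook does, and it catches a broad exception so that the block doesn't depend on a specific Spark version.

```python
def apply_session_confs(spark, confs):
    """Apply settings with spark.conf.set and report the ones Spark rejects.

    `spark` is any object exposing `conf.set(name, value)`, such as the
    Spark session that a notebook provides.
    """
    applied, rejected = {}, {}
    for name, value in confs.items():
        try:
            spark.conf.set(name, value)
            applied[name] = value
        except Exception as err:  # immutable configs surface as AnalysisException
            rejected[name] = str(err)
    return applied, rejected


# Example call inside a notebook session (the setting names are only examples):
# applied, rejected = apply_session_confs(spark, {
#     "spark.sql.shuffle.partitions": "200",
#     "spark.executor.memory": "8g",
# })
```

Settings that end up in `rejected` are candidates for the environment level instead, because they can't be changed once the session is running.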

Spark libraries comparison

You can apply Spark libraries at different levels:

  • Workspace level: In Azure Synapse, you can upload/install libraries to your workspace and later assign them to a specific Spark pool. Fabric doesn't support workspace-level libraries.
  • Environment level: You can upload/install libraries to an environment. Environment-level libraries are available to all notebooks and Spark job definitions running in the environment.
  • Inline: In addition to environment-level libraries, you can also specify inline libraries. For example, at the beginning of a notebook session.

Considerations:

| Spark library | Azure Synapse Spark | Fabric Spark |
|---|---|---|
| Workspace level | Yes | No |
| Environment level | Yes, pools | Yes, environments |
| Inline | Yes | Yes |
| Import/export | Yes | Yes |
| API/SDK support | Yes | Yes |

  • Other considerations:
    • Built-in libraries: Fabric and Azure Synapse share a common core of Spark, but their runtimes can differ slightly in the built-in libraries they support. Existing code is typically compatible, with some exceptions. In those cases, users might need to recompile code, add custom libraries, or adjust syntax. See the built-in Fabric Spark runtime libraries.
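One way to find built-in library differences before migrating is to snapshot the installed Python packages in each runtime and compare the snapshots. The sketch below uses only the standard library. The function names are illustrative, and it covers Python packages only, not JVM or R libraries.

```python
from importlib import metadata


def installed_versions():
    """Return {package name: version} for the current Python environment."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            versions[name.lower()] = dist.version
    return versions


def diff_versions(source, target):
    """Compare two snapshots; return (missing in target, version changes)."""
    missing = sorted(set(source) - set(target))
    changed = {
        name: (source[name], target[name])
        for name in source
        if name in target and source[name] != target[name]
    }
    return missing, changed
```

A possible workflow is to run `installed_versions()` in an Azure Synapse notebook and in a Fabric notebook, save each result (for example as JSON), and pass the two dictionaries to `diff_versions` to list packages that are absent or pinned to a different version in the target runtime.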

Notebook comparison

Notebooks and Spark job definitions are primary code items for developing Apache Spark jobs in Fabric. There are some differences between Azure Synapse Spark notebooks and Fabric Spark notebooks:

| Notebook capability | Azure Synapse Spark | Fabric Spark |
|---|---|---|
| Import/export | Yes | Yes |
| Session configuration | Yes, UI and inline | Yes, UI (environment) and inline |
| IntelliSense | Yes | Yes |
| mssparkutils | Yes | Yes |
| Notebook resources | No | Yes |
| Collaborate | No | Yes |
| High concurrency | No | Yes |
| .NET for Spark C# | Yes | No |
| Pipeline activity support | Yes | Yes |
| Built-in scheduled run support | No | Yes |
| API/SDK support | Yes | Yes |

  • mssparkutils: Because DMTS connections aren't supported in Fabric yet, only getToken and getSecret are supported for now in Fabric for mssparkutils.credentials.

  • Notebook resources: Fabric notebooks provide a Unix-like file system to help you manage your folders and files. For more information, see How to use Microsoft Fabric notebooks.

  • Collaborate: The Fabric notebook is a collaborative item that supports multiple users editing the same notebook. For more information, see How to use Microsoft Fabric notebooks.

  • High concurrency: In Fabric, you can attach notebooks to a high concurrency session. This option is an alternative for users using ThreadPoolExecutor in Azure Synapse. For more information, see Configure high concurrency mode for Fabric notebooks.

  • .NET for Spark C#: Fabric doesn't support .NET for Spark (C#). We recommend that users with existing workloads written in C# or F# migrate to Python or Scala.

  • Built-in scheduled run support: Fabric supports scheduled runs for notebooks.

  • Other considerations:

    • Your notebook might use features that are only supported in a specific version of Spark. Remember that Spark 2.4, 3.1, and 3.2 aren't supported in Fabric.
    • If your notebook or Spark job is using a linked service with different data source connections or mount points, you should modify your Spark jobs to use alternative methods for handling connections to external data sources and sinks. Use Spark code to connect to data sources using available Spark libraries.
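For context on the high concurrency note, the ThreadPoolExecutor pattern it refers to typically looks like the sketch below: several units of work run on a thread pool inside one notebook session. `run_task` and the task names are hypothetical stand-ins for real Spark work, so this is an illustration of the pattern rather than code taken from either product.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(name):
    """Hypothetical stand-in for the work each thread submits to Spark."""
    # In a real notebook this might call a function that runs a Spark query.
    return f"{name}: done"


def run_in_parallel(task_names, max_workers=4):
    """Run tasks on a thread pool and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, task_names))


if __name__ == "__main__":
    print(run_in_parallel(["load_orders", "load_customers", "load_products"]))
```

With a high concurrency session in Fabric, the same kind of parallel work can instead be split across separate notebooks attached to one session, which removes the need to manage threads in notebook code.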

Spark job definition comparison

Important Spark job definition considerations:

| Spark job capability | Azure Synapse Spark | Fabric Spark |
|---|---|---|
| PySpark | Yes | Yes |
| Scala | Yes | Yes |
| .NET for Spark C# | Yes | No |
| SparkR | No | Yes |
| Import/export | Yes (UI) | No |
| Pipeline activity support | Yes | Yes |
| Built-in scheduled run support | No | Yes |
| Retry policies | No | Yes |
| API/SDK support | Yes | Yes |

  • Spark jobs: You can bring your .py, .R, or .jar files. Fabric supports SparkR. A Spark job definition supports reference files, command line arguments, Spark configurations, and lakehouse references.

  • Import/export: In Azure Synapse, you can import/export JSON-based Spark job definitions from the UI. This feature isn't available yet in Fabric.

  • .NET for Spark C#: Fabric doesn't support .NET for Spark (C#). We recommend that users with existing workloads written in C# or F# migrate to Python or Scala.

  • Built-in scheduled run support: Fabric supports scheduled runs for a Spark job definition.

  • Retry policies: This option enables users to run Spark Structured Streaming jobs indefinitely.
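Because a Spark job definition accepts command line arguments, a main definition file written in Python usually starts by parsing them. The following is a minimal sketch. The argument names `--input-path` and `--output-table` are hypothetical, and the Spark calls are left as comments so that the argument handling can be checked without a Spark runtime.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line arguments passed to the job."""
    parser = argparse.ArgumentParser(description="Example Spark job entry point")
    parser.add_argument("--input-path", required=True, help="source data location")
    parser.add_argument("--output-table", required=True, help="destination table name")
    return parser.parse_args(argv)


def main(argv=None):
    """Entry point: read the arguments, then hand them to the Spark logic."""
    args = parse_args(argv)
    # Spark work would go here, for example (assumes pyspark is available):
    #   from pyspark.sql import SparkSession
    #   spark = SparkSession.builder.getOrCreate()
    #   spark.read.parquet(args.input_path).write.saveAsTable(args.output_table)
    return args


# A real main definition file would finish by calling main().
```

Keeping the argument parsing in its own function makes it possible to test the job's inputs locally, for example with `main(["--input-path", "Files/raw/orders", "--output-table", "orders"])`, before uploading the file to a Spark job definition.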

Hive Metastore (HMS) comparison

Hive Metastore (HMS) differences and considerations:

| HMS type | Azure Synapse Spark | Fabric Spark |
|---|---|---|
| Internal HMS | Yes | Yes (lakehouse) |
| External HMS | Yes | No |

  • External HMS: Fabric currently doesn't support a Catalog API or access to an external Hive Metastore (HMS).