Change data capture from SAP to Microsoft Fabric OneLake with Azure Data Factory

Azure Data Factory (ADF) and Azure Synapse Analytics come with the SAP change data capture (CDC) connector (for an introduction, see Overview and architecture of the SAP CDC capabilities or General availability of SAP CDC capabilities for Azure Data Factory and Azure Synapse Analytics), which provides built-in change data capture for a wide range of SAP sources. Many customers use this connector to establish a change data feed from SAP into Delta tables in Azure Data Lake Storage (ADLS) Gen2, which is a great storage option for the inbound (also referred to as bronze) layer in a Lakehouse architecture.

This article describes how to use the SAP CDC connector to create your inbound layer directly in Microsoft Fabric OneLake using ADF or Synapse.

The two scenarios are similar in setup, with the main difference being the sink configuration. In fact, as the following diagram suggests, you can simply clone a data flow writing into Delta tables in ADLS Gen2, change the sink configuration according to this document, and you’re ready to go.

If you plan to migrate your SAP data from ADLS Gen2 into Microsoft Fabric, you can even redirect an existing CDC dataflow from ADLS Gen2 to OneLake by adjusting the sink configuration. After changing the sink, you can resume the original CDC process, allowing you to seamlessly migrate to Fabric without a cumbersome reinitialization.

Diagram of the architecture for SAP change data capture to OneLake using Azure Data Factory.

Getting started

To follow this article step-by-step in your own environment, you need the following resources:

  1. An Azure Data Factory or Synapse Analytics workspace.
  2. A Microsoft Fabric workspace with a Lakehouse.
  3. An SAP system that satisfies the requirements for ADF’s SAP CDC connector specified here. In our scenario, we use an SAP S/4HANA on-premises 2023 FPS00 system, but up-to-date versions of SAP ECC, SAP BW, SAP BW/4HANA, and other supported SAP systems work as well.
  4. A self-hosted integration runtime (SHIR) with a current version of the SAP .NET Connector installed.

To concentrate on the connectivity part, here’s a pipeline template that covers the most straightforward scenario of extracting change data with the SAP CDC connector and merging it into a Fabric Lakehouse table without any further transformations: https://github.com/ukchrist/ADF-SAP-data-flows/blob/main/p_SAPtoFabric.zip. If you’re familiar with ADF mapping data flows and SAP CDC, you can set up the scenario from scratch yourself and skip ahead to the configuration of the Lakehouse linked service below.

To make use of the template, the following steps are required:

  1. Create three linked services to connect to the SAP source, the staging folder, and the Fabric Lakehouse.
  2. Import the template into your ADF or Synapse workspace.
  3. Configure the template with a source object from your SAP system and a sink table.

Setting up connectivity to the SAP source system

To connect your SAP source system to ADF or Synapse with the SAP CDC connector, you need a self-hosted integration runtime. The installation procedure is described here: Set up a self-hosted integration runtime for the SAP CDC connector. For the self-hosted integration runtime to connect to the SAP source system via SAP’s RFC protocol, download and install the SAP .NET Connector as described here: Download and install the SAP .NET connector.

Next, create an SAP CDC linked service. Details can be found here: Set up a linked service and dataset for the SAP CDC connector. For this, you need the SAP system connection parameters (application/message server, instance number, client ID) and user credentials to connect to the SAP system. For details on the configuration required for this SAP user read this document: Set up the SAP user.

Creating an SAP CDC dataset as described in the document on linked service configuration is optional - mapping data flows offer a leaner option to define the dataset properties inline in the data flow itself. The pipeline template provided here uses such an inline dataset definition.
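
If you prefer to script this setup, the linked service can also be deployed through the ADF management REST API. The sketch below is a minimal, hedged example: the type name SapOdp and the typeProperties names are assumptions based on the connection parameters listed above, so compare them with the JSON view of a linked service created in ADF Studio before relying on them.

```python
# Minimal sketch: deploying the SAP CDC linked service via the ADF management REST API.
# The typeProperties names below are assumptions based on the connection parameters
# described above; verify them against the JSON view of a linked service created in ADF Studio.
import requests
from azure.identity import DefaultAzureCredential

subscription_id = "<subscription ID>"
resource_group = "<resource group>"
factory_name = "<data factory name>"

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}/providers/Microsoft.DataFactory"
    f"/factories/{factory_name}/linkedservices/SapCdcLinkedService"
    "?api-version=2018-06-01"
)
body = {
    "properties": {
        "type": "SapOdp",  # linked service type used by the SAP CDC connector
        "connectVia": {
            "referenceName": "<self-hosted integration runtime name>",
            "type": "IntegrationRuntimeReference",
        },
        "typeProperties": {
            "server": "<application server host>",
            "systemNumber": "<instance number>",
            "clientId": "<client>",
            "userName": "<SAP user>",
            "password": {"type": "SecureString", "value": "<password>"},
            "subscriberName": "<ODQ subscriber name>",
        },
    }
}
requests.put(url, json=body, headers={"Authorization": f"Bearer {token}"}).raise_for_status()
```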

Setting up ADLS Gen2 connectivity for staging

Before writing the change data from the SAP source system into the sink, it's staged in a folder in ADLS Gen2. From there, the mapping data flow runtime picks up the data and processes it according to the steps defined in the data flow. The data flow provided as part of the template merges the changes with the existing data in the sink table and thus gives you an up-to-date copy of the source.
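
To make the merge behavior concrete, the following sketch shows what the data flow's sink logic is roughly equivalent to, expressed as a Delta Lake upsert in PySpark. It's purely illustrative: the template performs this inside the mapping data flow, not in a notebook, and the path, table, and key column names are placeholders.

```python
# Illustrative only: the rough logical equivalent of the data flow's merge step,
# expressed as a Delta Lake upsert in PySpark. Path, table, and key column names
# are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

changes = spark.read.format("delta").load("<staged change data>")  # change rows from the SAP source
target = DeltaTable.forName(spark, "<lakehouse table>")            # existing sink table

(
    target.alias("t")
    .merge(changes.alias("s"), "t.<key column> = s.<key column>")
    .whenMatchedUpdateAll()      # apply updates to existing rows
    .whenNotMatchedInsertAll()   # insert new rows
    .execute()
)
# Deletions signaled by the SAP change indicator would additionally need a
# whenMatchedDelete clause keyed off that indicator column.
```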

Setup of an ADLS Gen2 linked service is described here: Create an Azure Data Lake Storage Gen2 linked service using UI.

Retrieving the Fabric workspace ID and Lakehouse object ID

To collect the required Fabric workspace ID and Lakehouse object ID in Microsoft Fabric, complete the following steps:

  1. Navigate to your Lakehouse in Microsoft Fabric.

    Screenshot showing where to navigate to the Fabric Lakehouse.

  2. Once the Lakehouse experience opens in the browser, copy the browser URL. It has the following format:

    https://xxxxxx.powerbi.com/groups/<workspace ID>/lakehouses/<lakehouse ID>

    Copy the <workspace ID> and <lakehouse ID> from the URL.
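
If you prefer not to pick the IDs out of the URL by hand, a quick way to extract them (assuming the URL format shown above) is:

```python
# Extract the workspace ID and Lakehouse ID from the browser URL copied above.
import re

url = "<paste the browser URL of your Lakehouse here>"
match = re.search(r"/groups/([^/]+)/lakehouses/([^/]+)", url)
if match:
    workspace_id, lakehouse_id = match.groups()
    print("workspace ID:", workspace_id)
    print("lakehouse ID:", lakehouse_id)
```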

Configuring the service principal

Configuring the service principal requires two steps. First, create the service principal in Microsoft Entra ID. Then, add the service principal as a member to the Fabric workspace.

Let’s start with Microsoft Entra ID.

  1. Navigate to the Azure portal and select Microsoft Entra ID from the left-hand side menu. Copy the Tenant ID for later use.

    Screenshot showing where to find the Microsoft Entra ID in the Azure portal.

  2. To create the service principal, select App registrations and + New registration.

    Screenshot showing where to create a new service principal app registration.

  3. Enter a Name for the application. The name is the same as the service principal name, so copy it for later use.

  4. Select Accounts in this organizational directory only.

  5. Then select Register.

    Screenshot showing where to provide the new app registration name and account type.

  6. Copy the Application (client) ID. You need this value for the linked service definition in ADF later. Then select Add a certificate or secret.

    Screenshot showing where to find the Application (client) ID and add a certificate or secret.

  7. Select + New client secret. Add a Description and expiration policy.

    Screenshot showing where to add a new client secret, description, and expiration policy.

  8. Copy the Value of the client secret. This step completes the service principal configuration in Microsoft Entra ID.

    Screenshot showing where to find the client secret value.
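
As a quick sanity check, you can verify the tenant ID, application (client) ID, and client secret before wiring them into ADF by requesting a token with MSAL. The Fabric API scope used below is an assumption for this check.

```python
# Sanity check: confirm the tenant ID, application (client) ID, and client secret
# work together by acquiring a token. The Fabric scope below is an assumption.
import msal

tenant_id = "<tenant ID>"                 # copied in step 1
client_id = "<application (client) ID>"   # copied in step 6
client_secret = "<client secret value>"   # copied in step 8

app = msal.ConfidentialClientApplication(
    client_id,
    authority=f"https://login.microsoftonline.com/{tenant_id}",
    client_credential=client_secret,
)
result = app.acquire_token_for_client(scopes=["https://api.fabric.microsoft.com/.default"])
print("Token acquired" if "access_token" in result else result.get("error_description"))
```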

Next, we add the service principal as a contributor or admin to your Microsoft Fabric workspace.

  1. Navigate to your workspace and select Manage access.

    Screenshot showing where to manage access to a Fabric workspace.

  2. Enter the name of the service principal, select the Contributor or Admin role and select Add. The service principal can now be used to connect ADF to your workspace.

    Screenshot showing where to add your service principal to your Fabric workspace and grant it Contributor or Admin access.
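
To confirm the workspace assignment took effect, you can have the service principal list the items in the workspace through the Fabric REST API. The endpoint and scope used below are assumptions for this check; adjust them if they don't match your environment.

```python
# Check that the service principal can access the Fabric workspace by listing its items.
# The Fabric REST endpoint and scope used here are assumptions for this check.
import requests
from azure.identity import ClientSecretCredential

tenant_id = "<tenant ID>"
client_id = "<application (client) ID>"
client_secret = "<client secret value>"
workspace_id = "<workspace ID>"           # from the Lakehouse URL

credential = ClientSecretCredential(tenant_id, client_id, client_secret)
token = credential.get_token("https://api.fabric.microsoft.com/.default").token

resp = requests.get(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/items",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
for item in resp.json().get("value", []):
    print(item.get("type"), item.get("displayName"))
```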

Creating the Lakehouse linked service in ADF or Synapse

Now we’re ready to configure the Lakehouse linked service in ADF or Synapse.

  1. Open your ADF or Synapse workspace, select the Manage hub, and select Linked services. Then select + New.

    Screenshot showing where to add a new linked service in ADF or Synapse.

  2. Search for “Lakehouse”, select the Microsoft Fabric Lakehouse linked service type, and select Continue.

    Screenshot showing how to find the Microsoft Fabric Lakehouse linked service.

  3. Assign a Name to the linked service, select Enter manually and configure the Fabric workspace ID and Lakehouse object ID values copied from the Fabric URL earlier.

    • In the Tenant property, provide the Tenant ID you copied in step 1 of the service principal configuration in Microsoft Entra ID.
    • For Service principal ID, provide the Application (client) ID (not the service principal name!) copied in step 6 of the service principal configuration.
    • For Service principal key, provide the client secret value copied in step 8 of the service principal configuration.

    Verify that the connection can be established successfully and select Create.

    Screenshot showing the configuration of the Fabric Lakehouse linked service in ADF.
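
As with the SAP linked service, this configuration can also be scripted against the ADF management REST API. In the sketch below, the type name Lakehouse and the typeProperties names are assumptions; after creating the linked service in the UI, you can open its JSON view and compare.

```python
# Minimal sketch: deploying the Fabric Lakehouse linked service via the ADF management
# REST API. The type name and typeProperties below are assumptions; compare them with
# the JSON view of the linked service created in the UI.
import requests
from azure.identity import DefaultAzureCredential

subscription_id = "<subscription ID>"
resource_group = "<resource group>"
factory_name = "<data factory name>"

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}/providers/Microsoft.DataFactory"
    f"/factories/{factory_name}/linkedservices/FabricLakehouseLinkedService"
    "?api-version=2018-06-01"
)
body = {
    "properties": {
        "type": "Lakehouse",
        "typeProperties": {
            "workspaceId": "<workspace ID>",
            "artifactId": "<lakehouse ID>",
            "tenant": "<tenant ID>",
            "servicePrincipalId": "<application (client) ID>",
            "servicePrincipalKey": {"type": "SecureString", "value": "<client secret value>"},
        },
    }
}
requests.put(url, json=body, headers={"Authorization": f"Bearer {token}"}).raise_for_status()
```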

Configuring the pipeline template

With the setup of the linked services completed, you can import the template and adjust it for your source object.

  1. From the pipeline menu, choose + to add a new resource, then select Pipeline and Template gallery.

    Screenshot showing where to choose a template in the ADF studio's Template gallery.

  2. The Template gallery dialog appears. Find the Copy change data from SAP to Fabric Lakehouse table template, select it, and select Continue.

    Screenshot showing the template selected in the Template gallery.

  3. The configuration screen opens, in which you specify the linked services used to instantiate the template. Enter the linked services created in the prior sections. The first linked service is the one for the staging folder in ADLS Gen2, the second one is the connection to the SAP source, and the third one connects to the Microsoft Fabric Lakehouse:

    Screenshot showing how to configure the pipeline template.

  4. After you configure the template, ADF creates the new pipeline, and you can make any adjustments you require for your specific setup. As a first step, configure the staging folder that temporarily stores the change data from SAP before it's merged with your Delta table in Fabric. Select the data flow activity in the pipeline and then select the Settings tab. In the Staging properties, you can see the staging linked service configured in the last step. Enter a Staging storage folder.

  5. Double-click the data flow activity in the pipeline to open the mapping data flow and configure your source and sink. First, select the SAP CDC source transformation, and then select the Source options tab. Provide the details of your source object in the ODP context, ODP name, and Key columns properties (the key columns as a JSON array, for example ["MANDT", "KUNNR"]). Then select a Run mode. For details on these properties, refer to Azure Data Factory documentation for SAP change data capture capabilities.

  6. Select the sink transformation of the data flow, select the Settings tab, and enter the Table name for the Lakehouse table in your Fabric workspace. Select the Custom expression radio button for the Key columns property and enter the key columns of your source as a JSON array.

  7. Publish your changes.

Retrieve the data

  1. Navigate back to the pipeline and trigger a pipeline run. (A programmatic way to trigger and monitor the run is sketched after this list.)

  2. Switch to the monitoring experience and wait for your pipeline run to complete.

  3. Open your Lakehouse in your Fabric workspace. Under Tables, you see the newly created Lakehouse table. On the right side of your screen, a preview of the data you loaded from SAP is displayed.

    Screenshot showing the imported data from SAP in the Lakehouse table.
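
For reference, the trigger-and-monitor flow in steps 1 and 2 can also be driven programmatically with the azure-mgmt-datafactory SDK. The sketch below uses placeholder resource names and assumes you know the name of the pipeline created from the template.

```python
# Sketch: trigger the pipeline and poll its status with the azure-mgmt-datafactory SDK,
# as an alternative to ADF Studio's Trigger now and monitoring experience.
# Resource and pipeline names are placeholders.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<subscription ID>"
resource_group = "<resource group>"
factory_name = "<data factory name>"
pipeline_name = "<pipeline created from the template>"

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Trigger the run (equivalent to Trigger now in ADF Studio).
run = client.pipelines.create_run(resource_group, factory_name, pipeline_name)

# Poll the run status until it finishes (equivalent to the monitoring experience).
while True:
    status = client.pipeline_runs.get(resource_group, factory_name, run.run_id).status
    print("Pipeline run status:", status)
    if status not in ("Queued", "InProgress"):
        break
    time.sleep(30)
```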