Common Data Model format in Azure Data Factory and Synapse Analytics
APPLIES TO: Azure Data Factory Azure Synapse Analytics
Tip
Try out Data Factory in Microsoft Fabric, an all-in-one analytics solution for enterprises. Microsoft Fabric covers everything from data movement to data science, real-time analytics, business intelligence, and reporting. Learn how to start a new trial for free!
The Common Data Model (CDM) metadata system makes it possible for data and its meaning to be easily shared across applications and business processes. To learn more, see the Common Data Model overview.
In Azure Data Factory and Synapse pipelines, users can transform data from CDM entities in both model.json and manifest form stored in Azure Data Lake Storage Gen2 (ADLS Gen2) using mapping data flows. You can also sink data in CDM format by using CDM entity references that land your data in CSV or Parquet format in partitioned folders.
Mapping data flow properties
The Common Data Model is available as an inline dataset in mapping data flows as both a source and a sink.
Note
When writing CDM entities, you must have an existing CDM entity definition (metadata schema) already defined to use as a reference. The data flow sink will read that CDM entity file and import the schema into your sink for field mapping.
Source properties
The following table lists the properties supported by a CDM source. You can edit these properties in the Source options tab.
Name | Description | Required | Allowed values | Data flow script property |
---|---|---|---|---|
Format | Format must be cdm | yes | cdm | format |
Metadata format | Where the entity reference to the data is located. If using CDM version 1.0, choose manifest. If using a CDM version before 1.0, choose model.json. | yes | 'manifest' or 'model' | manifestType |
Root location: container | Container name of the CDM folder | yes | String | fileSystem |
Root location: folder path | Root folder location of the CDM folder | yes | String | folderPath |
Manifest file: Entity path | Folder path of the entity within the root folder | no | String | entityPath |
Manifest file: Manifest name | Name of the manifest file. Default value is 'default' | no | String | manifestName |
Filter by last modified | Choose to filter files based on when they were last modified | no | Timestamp | modifiedAfter, modifiedBefore |
Schema linked service | The linked service where the corpus is located | yes, if using manifest | 'adlsgen2' or 'github' | corpusStore |
Entity reference container | Container the corpus is in | yes, if using manifest and corpus in ADLS Gen2 | String | adlsgen2_fileSystem |
Entity reference repository | GitHub repository name | yes, if using manifest and corpus in GitHub | String | github_repository |
Entity reference branch | GitHub repository branch | yes, if using manifest and corpus in GitHub | String | github_branch |
Corpus folder | The root location of the corpus | yes, if using manifest | String | corpusPath |
Corpus entity | Path to the entity reference | yes | String | entity |
Allow no files found | If true, an error is not thrown if no files are found | no | true or false | ignoreNoFilesFound |
When selecting "Entity Reference" in either the Source or Sink transformation, you can select from these three options for the location of your entity reference:
- Local uses the entity defined in the manifest file already being used by the service
- Custom will ask you to point to an entity manifest file that is different from the manifest file the service is using
- Standard will use an entity reference from the standard library of CDM entities maintained in GitHub, as shown in the sketch below
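For example, a source that resolves its entity definition from a corpus stored in GitHub sets the GitHub corpus store properties listed in the table above. The following data flow script fragment is a minimal sketch; the repository, branch, corpus path, and entity values are illustrative assumptions, not values from this article:

source(allowSchemaDrift: true,
    validateSchema: false,
    format: 'cdm',
    manifestType: 'manifest',
    corpusStore: 'github',
    github_repository: 'microsoft/CDM',
    github_branch: 'master',
    corpusPath: 'schemaDocuments',
    entity: 'Account.cdm.json/Account',
    fileSystem: 'data',
    folderPath: 'AccountData') ~> GitHubCorpusSource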
Sink settings
- Point to the CDM entity reference file that contains the definition of the entity you would like to write.
- Define the partition path and format of the output files that you want the service to use for writing your entities.
- Set the output file location, as well as the location and name of the manifest file.
Import schema
CDM is only available as an inline dataset and, by default, doesn't have an associated schema. To get column metadata, click the Import schema button in the Projection tab. This will allow you to reference the column names and data types specified by the corpus. To import the schema, a data flow debug session must be active and you must have an existing CDM entity definition file to point to.
When mapping data flow columns to entity properties in the Sink transformation, click on the "Mapping" tab and select "Import Schema". The service will read the entity reference that you pointed to in your Sink options, allowing you to map to the target CDM schema.
Note
When using the model.json source type that originates from Power BI or Power Platform dataflows, you might encounter "corpus path is null or empty" errors from the source transformation. This error is likely caused by formatting issues in the partition location path in the model.json file. To fix the errors, follow these steps:
- Open the model.json file in a text editor
- Find the partitions.Location property
- Change "blob.core.windows.net" to "dfs.core.windows.net"
- Fix any "%2F" encoding in the URL to "/"
- If using Azure Data Factory data flows, special characters in the partition file path must be replaced with alphanumeric values, or switch to Azure Synapse data flows
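Steps 2 through 4 can also be scripted. Below is a minimal Python sketch of that rewrite, assuming the common model.json layout in which each entry in the "entities" array carries a "partitions" array whose items hold a "location" URL; the file path is a placeholder:

import json

path = "model.json"  # placeholder path to the exported model.json file

with open(path, encoding="utf-8") as f:
    model = json.load(f)

for entity in model.get("entities", []):
    for partition in entity.get("partitions", []):
        location = partition.get("location", "")
        # Point at the ADLS Gen2 (dfs) endpoint instead of the blob endpoint.
        location = location.replace("blob.core.windows.net", "dfs.core.windows.net")
        # Replace percent-encoded path separators with "/".
        location = location.replace("%2F", "/").replace("%2f", "/")
        partition["location"] = location

with open(path, "w", encoding="utf-8") as f:
    json.dump(model, f, indent=2)

After rewriting, verify the resulting locations against the actual folder layout in your ADLS Gen2 account before rerunning the data flow.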
CDM source data flow script example
source(output(
        ProductSizeId as integer,
        ProductColor as integer,
        CustomerId as string,
        Note as string,
        LastModifiedDate as timestamp
    ),
    allowSchemaDrift: true,
    validateSchema: false,
    entity: 'Product.cdm.json/Product',
    format: 'cdm',
    manifestType: 'manifest',
    manifestName: 'ProductManifest',
    entityPath: 'Product',
    corpusPath: 'Products',
    corpusStore: 'adlsgen2',
    adlsgen2_fileSystem: 'models',
    folderPath: 'ProductData',
    fileSystem: 'data') ~> CDMSource
Sink properties
The following table lists the properties supported by a CDM sink. You can edit these properties in the Settings tab.
Name | Description | Required | Allowed values | Data flow script property |
---|---|---|---|---|
Format | Format must be cdm | yes | cdm | format |
Root location: container | Container name of the CDM folder | yes | String | fileSystem |
Root location: folder path | Root folder location of the CDM folder | yes | String | folderPath |
Manifest file: Entity path | Folder path of the entity within the root folder | no | String | entityPath |
Manifest file: Manifest name | Name of the manifest file. Default value is 'default' | no | String | manifestName |
Schema linked service | The linked service where the corpus is located | yes | 'adlsgen2' or 'github' | corpusStore |
Entity reference container | Container the corpus is in | yes, if corpus in ADLS Gen2 | String | adlsgen2_fileSystem |
Entity reference repository | GitHub repository name | yes, if corpus in GitHub | String | github_repository |
Entity reference branch | GitHub repository branch | yes, if corpus in GitHub | String | github_branch |
Corpus folder | The root location of the corpus | yes | String | corpusPath |
Corpus entity | Path to the entity reference | yes | String | entity |
Partition path | Location where the partition will be written | no | String | partitionPath |
Clear the folder | Whether the destination folder is cleared prior to the write | no | true or false | truncate |
Format type | Choose to specify Parquet format | no | parquet if specified | subformat |
Column delimiter | If writing to DelimitedText, how to delimit columns | yes, if writing to DelimitedText | String | columnDelimiter |
First row as header | If writing to DelimitedText, whether the column names are added as a header | no | true or false | columnNamesAsHeader |
CDM sink data flow script example
The associated data flow script is:
CDMSource sink(allowSchemaDrift: true,
    validateSchema: false,
    entity: 'Product.cdm.json/Product',
    format: 'cdm',
    entityPath: 'ProductSize',
    manifestName: 'ProductSizeManifest',
    corpusPath: 'Products',
    partitionPath: 'adf',
    folderPath: 'ProductSizeData',
    fileSystem: 'cdm',
    subformat: 'parquet',
    corpusStore: 'adlsgen2',
    adlsgen2_fileSystem: 'models',
    truncate: true,
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> CDMSink
Related content
Create a source transformation in mapping data flow.