Troubleshoot connector and format issues in mapping data flows in Azure Data Factory

This article explores troubleshooting methods related to connector and format for mapping data flows in Azure Data Factory (ADF).

Azure Blob Storage

Account Storage type (general purpose v1) doesn't support service principal and MI authentication

Symptoms

In data flows, if you use Azure Blob Storage (general purpose v1) with the service principal or MI authentication, you may encounter the following error message:

com.microsoft.dataflow.broker.InvalidOperationException: ServicePrincipal and MI auth are not supported if blob storage kind is Storage (general purpose v1)

Cause

When you use the Azure Blob linked service in data flows, the managed identity or service principal authentication is not supported when the account kind is empty or "Storage". This situation is shown in Image 1 and Image 2 below.

Image 1: The account kind in the Azure Blob Storage linked service

Screenshot that shows the storage account kind in the Azure Blob Storage linked service.

Image 2: Storage account page

Screenshot that shows storage account page.

Recommendation

To solve this issue, refer to the following recommendations:

  • If the storage account kind is None in the Azure Blob linked service, specify the proper account kind, and refer to Image 3 shown below to accomplish it. Furthermore, refer to Image 2 to get the storage account kind, and check and confirm the account kind is not Storage (general purpose v1).

    Image 3: Specify the storage account kind in the Azure Blob Storage linked service

    Screenshot that shows how to specify storage account kind in Azure Blob Storage linked service.

  • If the account kind is Storage (general purpose v1), upgrade your storage account to the general purpose v2 or choose a different authentication.

    Image 4: Upgrade the storage account to general purpose v2

    Screenshot that shows how to upgrade the storage account to general purpose v2.

Azure Cosmos DB and JSON format

Support customized schemas in the source

Symptoms

When you want to use the ADF data flow to move or transfer data from Azure Cosmos DB/JSON into other data stores, some columns of the source data may be missed. 

Cause 

For the schema-free connectors (the column number, column name and column data type of each row can be different when comparing with others), by default, ADF uses sample rows (for example, top 100 or 1000 rows data) to infer the schema, and the inferred result will be used as a schema to read data. So if your data stores have extra columns that don't appear in sample rows, the data of these extra columns are not read, moved, or transferred into sink data stores.

Recommendation

To overwrite the default behavior and bring in additional fields, ADF provides options for you to customize the source schema. You can specify additional/missing columns that could be missing in schema-infer-result in the data flow source projection to read the data, and you can apply one of the following options to set the customized schema. Usually, Option-1 is more preferred.

  • Option-1: Compared with the original source data that may be one large file, table, or container that contains millions of rows with complex schemas, you can create a temporary table/container with a few rows that contain all the columns you want to read, and then move on to the following operation: 

    1. Use the data flow source Debug Settings to have Import projection with sample files/tables to get the complete schema. You can follow the steps in the following picture:

      Screenshot that shows the first part of the first option to customize the source schema.

      1. Select Debug settings in the data flow canvas.
      2. In the pop-up pane, select Sample table under the cosmosSource tab, and enter the name of your table in the Table block.
      3. Select Save to save your settings.
      4. Select Import projection.
    2. Change the Debug Settings back to use the source dataset for the remaining data movement/transformation. You can move on with the steps in the following picture:

      Screenshot that shows the second part of the first option to customize the source schema.

      1. Select Debug settings in the data flow canvas.
      2. In the pop-up pane, select Source dataset under the cosmosSource tab.
      3. Select Save to save your settings.

    Afterwards, the ADF data flow runtime will honor and use the customized schema to read data from the original data store. 

  • Option-2: If you are familiar with the schema and DSL language of the source data, you can manually update the data flow source script to add additional/missed columns to read the data. An example is shown in the following picture:

    Screenshot that shows the second option to customize the source schema.

Support map type in the source

Symptoms

In ADF data flows, map data type cannot be directly supported in Azure Cosmos DB or JSON source, so you cannot get the map data type under "Import projection".

Cause

For Azure Cosmos DB and JSON, they are schema-free connectivity and related spark connector uses sample data to infer the schema, and then that schema is used as the Azure Cosmos DB/JSON source schema. When inferring the schema, the Azure Cosmos DB/JSON Spark connector can only infer object data as a struct rather than a map data type, and that's why map type cannot be directly supported.

Recommendation 

To solve this issue, refer to the following examples and steps to manually update the script (DSL) of the Azure Cosmos DB/JSON source to get the map data type support.

Examples:

Screenshot that shows examples of updating the script (DSL) of the Azure Cosmos DB/JSON source.

Step-1: Open the script of the data flow activity.

Screenshot that shows how to open the script of the data flow activity.

Step-2: Update the DSL to get the map type support by referring to the examples above.

Screenshot that shows how to update the DSL.

The map type support:

Type Is the map type supported? Comments
Excel, CSV No Both are tabular data sources with the primitive type, so there is no need to support the map type.
Orc, Avro Yes None.
JSON Yes The map type cannot be directly supported, follow the recommendation part in this section to update the script (DSL) under the source projection.
Azure Cosmos DB Yes The map type cannot be directly supported, follow the recommendation part in this section to update the script (DSL) under the source projection.
Parquet Yes Today the complex data type is not supported on the parquet dataset, so you need to use the "Import projection" under the data flow parquet source to get the map type.
XML No None.

Consume JSON files generated by copy activities

Symptoms

If you use the copy activity to generate some JSON files, and then try to read these files in data flows, you fail with the error message: JSON parsing error, unsupported encoding or multiline

Cause

There are following limitations on JSON for copy and data flows respectively:

  • For Unicode encodings (utf-8, utf-16, utf-32) JSON files, copy activities always generate the JSON files with BOM.

  • The data flow JSON source with "Single document" enabled doesn't support Unicode encoding with BOM.

    Screenshot that shows the enabled 'Single document'.

So you will experience issues if the following criteria are met:

  • The sink dataset used by the copy activity is set to Unicode encoding (utf-8, utf-16, utf-16be, utf-32, utf-32be) or the default is used.

  • The copy sink is set to use "Array of objects" file pattern as shown in the following picture, no matter whether "Single document" is enabled or not in the data flow JSON source.

    Screenshot that shows the set 'Array of objects' pattern.

Recommendation

  • Always use the default file pattern or explicit "Set of objects" pattern in the copy sink if the generated files are used in data flows.
  • Disable the "Single document" option in the data flow JSON source.

Note

Using "Set of objects" is also the recommended practice from the performance perspective. As the "Single document" JSON in the data flow can't enable parallel reading for single large files, this recommendation does not have any negative impact.

The query with parameters does not work

Symptoms

Mapping data flows in Azure Data Factory supports the use of parameters. The parameter values are set by the calling pipeline via the Execute Data Flow activity, and using parameters is a good way to make your data flow general-purpose, flexible, and reusable. You can parameterize data flow settings and expressions with these parameters: Parameterizing mapping data flows.

After setting parameters and using them in the query of data flow source, they do not take effective.

Cause

You encounter this error due to your wrong configuration.

Recommendation

Use the following rules to set parameters in the query, and for more detailed information, refer to Build expressions in mapping data flow.

  1. Apply double quotes at the beginning of the SQL statement.
  2. Use single quotes around the parameter.
  3. Use lowercase letters for all CLAUSE statements.

For example:

Screenshot that shows the set parameter in the query.

Azure Data Lake Storage Gen1

Fail to create files with service principle authentication

Symptoms

When you try to move or transfer data from different sources into the ADLS gen1 sink, if the linked service's authentication method is service principle authentication, your job may fail with the following error message:

org.apache.hadoop.security.AccessControlException: CREATE failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.). [2b5e5d92-xxxx-xxxx-xxxx-db4ce6fa0487] failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.)

Cause

The RWX permission or the dataset property is not set correctly.

Recommendation

  • If the target folder doesn't have correct permissions, refer to this document to assign the correct permission in Gen1: Use service principal authentication.

  • If the target folder has the correct permission and you use the file name property in the data flow to target to the right folder and file name, but the file path property of the dataset is not set to the target file path (usually leave not set), as the example shown in the following pictures, you will encounter this failure because the backend system tries to create files based on the file path of the dataset, and the file path of the dataset doesn't have the correct permission.

    Screenshot that shows the file path property.

    Screenshot that shows the file name property.

    There are two methods to solve this issue:

    1. Assign the WX permission to the file path of the dataset.
    2. Set the file path of the dataset as the folder with WX permission, and set the rest folder path and file name in data flows.

Azure Data Lake Storage Gen2

Failed with an error: "Error while reading file XXX. It is possible the underlying files have been updated"

Symptoms

When you use the ADLS Gen2 as a sink in the data flow (to preview data, debug/trigger run, etc.) and the partition setting in Optimize tab in the Sink stage is not default, you may find job fail with the following error message:

Job failed due to reason: Error while reading file abfss:REDACTED_LOCAL_PART@prod.dfs.core.windows.net/import/data/e3342084-930c-4f08-9975-558a3116a1a9/part-00000-tid-7848242374008877624-5df7454e-7b14-4253-a20b-d20b63fe9983-1-1-c000.csv. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

Cause

  1. You don't assign a proper permission to your MI/SP authentication.
  2. You may have a customized job to handle files that you don't want, which will affect the data flow's middle output.

Recommendation

  1. Check if your linked service has the R/W/E permission for Gen2. If you use the MI auth/SP authentication, at least grant the Storage Blob Data Contributor role in the Access control (IAM).
  2. Confirm if you have specific jobs that move/delete files to other place whose name does not match your rule. Because data flows will write down partition files into the target folder firstly and then do the merge and rename operations, the middle file's name might not match your rule.

Azure Database for PostgreSQL

Encounter an error: Failed with exception: handshake_failure

Symptoms

You use Azure PostgreSQL as a source or sink in the data flow such as previewing data and debugging/triggering run, and you may find the job fails with following error message:

PSQLException: SSL error: Received fatal alert: handshake_failure
Caused by: SSLHandshakeException: Received fatal alert: handshake_failure.

Cause

If you use the flexible server or Hyperscale (Citus) for your Azure PostgreSQL server, since the system is built via Spark upon Azure Databricks cluster, there is a limitation in Azure Databricks blocks our system to connect to the Flexible server or Hyperscale (Citus). You can review the following two links as references.

Recommendation

You can try to use copy activities to unblock this issue.

Azure SQL Database

Unable to connect to the SQL Database

Symptoms

Your Azure SQL Database can work well in the data copy, dataset preview-data, and test-connection in the linked service, but it fails when the same Azure SQL Database is used as a source or sink in the data flow with error like Cannot connect to SQL database: 'jdbc:sqlserver://powerbasenz.database.windows.net;..., Please check the linked service configuration is correct, and make sure the SQL database firewall allows the integration runtime to access.'

Cause

There are wrong firewall settings on your Azure SQL Database server, so that it cannot be connected by the data flow runtime. Currently, when you try to use the data flow to read/write Azure SQL Database, Azure Databricks is used to build spark cluster to run the job, but it does not support fixed IP ranges. For more details, please refer to Azure Integration Runtime IP addresses.

Recommendation

Check the firewall settings of your Azure SQL Database and set it as "Allow access to Azure services" rather than set the fixed IP range.

Screenshot that shows how to allow access to Azure service in firewall settings.

Syntax error when using queries as input

Symptoms

When you use queries as input in the data flow source with the Azure SQL, you fail with the following error message:

at Source 'source1': shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax XXXXXXXX.

Screenshot that shows the error details.

Cause

The query used in the data flow source should be able to run as a sub query. The reason of the failure is that either the query syntax is incorrect or it can't be run as a sub query. You can run the following query in the SSMS to verify it:

SELECT top(0) * from ($yourQuery) as T_TEMP

Recommendation

Provide a correct query and test it in the SSMS firstly.

Failed with an error: "SQLServerException: 111212; Operation cannot be performed within a transaction."

Symptoms

When you use the Azure SQL Database as a sink in the data flow to preview data, debug/trigger run and do other activities, you may find your job fails with following error message:

{"StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: at Sink 'sink': shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException: 111212;Operation cannot be performed within a transaction.","Details":"at Sink 'sink': shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException: 111212;Operation cannot be performed within a transaction."}

Cause

The error "111212;Operation cannot be performed within a transaction." only occurs in the Synapse dedicated SQL pool. But you mistakenly use the Azure SQL Database as the connector instead.

Recommendation

Confirm if your SQL Database is a Synapse dedicated SQL pool. If so, use Azure Synapse Analytics as a connector shown in the picture below.

Screenshot that shows the Azure Synapse Analytics connector.

Data with the decimal type become null

Symptoms

You want to insert data into a table in the SQL database. If the data contains the decimal type and need to be inserted into a column with the decimal type in the SQL database, the data value may be changed to null.

If you do the preview, in previous stages, it will show the value like the following picture:

Screenshot that shows the value in the previous stages.

In the sink stage, it will become null, which is shown in the picture below.

Screenshot that shows the value in the sink stage.

Cause

The decimal type has scale and precision properties. If your data type doesn't match that in the sink table, the system will validate that the target decimal is wider than the original decimal, and the original value does not overflow in the target decimal. Therefore, the value will be cast to null.

Recommendation

Check and compare the decimal type between data and table in the SQL database, and alter the scale and precision to the same.

You can use toDecimal (IDecimal, scale, precision) to figure out if the original data can be cast to the target scale and precision. If it returns null, it means that the data cannot be cast and furthered when inserting.

Azure Synapse Analytics

Symptoms

You use the Azure Synapse Analytics and the linked service actually is a Synapse serverless pool. Its former name is SQL on-demand pool, and it can be distinguished by the server name contains ondemand, for example, space-ondemand.sql.azuresynapse.net. You may face with several unique failures as below:

  1. When you want to use Synapse serverless pool as a Sink, you face the following error:
    Sink results in 0 output columns. Please ensure at least one column is mapped
  2. When you select 'enable staging' in the Source, you face the following error: shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near 'IDENTITY'.
  3. When you want to fetch data from an external table, you face the following error: shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException: External table 'dbo' is not accessible because location does not exist or it is used by another process.
  4. When you want to fetch data from Azure Cosmos DB through Serverless pool by query/from view, you face the following error: Job failed due to reason: Connection reset.
  5. When you want to fetch data from a view, you may face with different errors.

Cause

Causes of the symptoms are stated below respectively:

  1. Serverless pool cannot be used as a sink. It doesn't support write data into the database.
  2. Serverless pool doesn't support staged data loading, so 'enable staging' is not supported.
  3. The authentication method that you use doesn't have a correct permission to the external data source where the external table referring to.
  4. There is a known limitation in Synapse serverless pool, blocking you to fetch Azure Cosmos DB data from data flows.
  5. View is a virtual table based on an SQL statement. The root cause is inside the statement of the view.

Recommendation

You can apply the following steps to solve your issues correspondingly.

  1. You should better not use serverless pool as a sink.

  2. Do not use 'enable staging' in Source for serverless pool.

  3. Only service principal/managed identity that has the permission to the external table data can query it. Grant 'Storage Blob Data Contributor' permission to the external data source for the authentication method that you use in the ADF.

    Note

    The user-password authentication can not query external tables. For more information, see Security model.

  4. You can use copy activity to fetch Azure Cosmos DB data from the serverless pool.

  5. You can provide the SQL statement that creates the view to the engineering support team, and they can help analyze if the statement hits an authentication issue or something else.

Load small size data to Data Warehouse without staging is slow

Symptoms

When you load small data to Data Warehouse without staging, it will take a long time to finish. For example, the data size 2 MB but it takes more than 1 hour to finish.

Cause

This issue is caused by the row count rather than the size. The row count has few thousand, and each insert needs to be packaged into an independent request, go to the control node, start a new transaction, get locks, and go to the distribution node repeatedly. Bulk load gets the lock once, and each distribution node performs the insert by batching into memory efficiently.

If 2 MB is inserted as just a few records, it would be fast. For example, it would be fast if each record is 500 kb * 4 rows.

Recommendation

You need to enable staging to improve the performance.

Read empty string value ("") as NULL with the enable staging

Symptoms

When you use Synapse as a source in the data flow such as previewing data and debugging/triggering run and enable staging to use the PolyBase, if your column value contains empty string value (""), it will be changed to null.

Cause

The data flow back end uses Parquet as the PolyBase format, and there is a known limitation in the Synapse SQL pool gen2, which will automatically change the empty string value to null.

Recommendation

You can try to solve this issue by the following methods:

  1. If your data size is not huge, you can disable Enable staging in the Source, but the performance will be affected.
  2. If you need to enable staging, you can use iifNull() function to manually change the specific column from null to empty string value.

Managed service identity error

Symptoms

When you use the Synapse as a source/sink in the data flow to preview data, debug/trigger run, etc. and enable staging to use the PolyBase, and the staging store's linked service (Blob, Gen2, etc.) is created to use the Managed Identity (MI) authentication, your job could fail with the following error shown in the picture:

Screenshot that shows the service identity error.

Error message

shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException: Managed Service Identity has not been enabled on this server. Please enable Managed Service Identity and try again.

Cause

  1. If the SQL pool is created from Synapse workspace, MI authentication on staging store with the PolyBase is not supported for the old SQL pool.
  2. If the SQL pool is the old Data Warehouse (DWH) version, MI of the SQL server is not assigned to the staging store.

Recommendation

Confirm the SQL pool was created from the Azure Synapse workspace.

  • If the SQL pool was created from the Azure Synapse workspace, no additional steps are necessary. You no longer need to re-register the Managed Identity (MI) of the workspace. The system assigned managed identity (SA-MI) of the workspace is a member of the Synapse Administrator role and thus has elevated privileges on the dedicated SQL pools of the workspace.
  • If the SQL pool is a dedicated SQL pool (formerly SQL DW) pre-dating Azure Synapse, only enable MI for your SQL server and assign the permission of the staging store to the MI of your SQL Server. You can refer to the steps in this article as an example: Use virtual network service endpoints and rules for servers in Azure SQL Database.

Failed with an error: "SQLServerException: Not able to validate external location because the remote server returned an error: (403)"

Symptoms

When you use SQLDW as a sink to trigger and run data flow activities, the activity may fail with error like: "SQLServerException: Not able to validate external location because the remote server returned an error: (403)"

Cause

  1. When you use the managed identity in the authentication method in the ADLS Gen2 account as staging, cx may not set the authentication configuration correctly.
  2. With the VNET integration runtime, you need to use the managed identity in the authentication method in the ADLS Gen2 account as staging. If your staging Azure Storage is configured with the VNet service endpoint, you must use managed identity authentication with "allow trusted Microsoft service" enabled on the storage account.
  3. Check whether your folder name contains the space character or other special characters, for example: Space " < > # % |. Currently folder names that contain certain special characters are not supported in the Data Warehouse copy command.

Recommendation

For Cause 1, you can refer to the following document: Use virtual network service endpoints and rules for servers in Azure SQL Database-Steps to solve this issue.

For Cause 2, work around it with one of the following options:

  • Option-1: If you use the VNET integration runtime, you need to use the managed identity in the authentication method in the ADLS GEN 2 account as staging.

  • Option-2: If your staging Azure Storage is configured with the VNet service endpoint, you must use managed identity authentication with "allow trusted Microsoft service" enabled on the storage account. You can refer to this doc: Staged copy by using PolyBase for more information.

For Cause 3, work around it with one of the following options:

  • Option-1: Rename the folder and avoid using special characters in the folder name.

  • Option-2: Remove the property allowCopyCommand:true in the data flow script, for example:

    Screenshot that shows how to remove 'allowcopycommand:true'.

Failed with an error: "This operation is not permitted on a non-empty directory"

Symptoms

When you use Azure Synapse Analytics as a sink in the data flow to preview data, debug/trigger run or do other activities and the enable staging is set to true, your job may fail with the following error message:

DF-SYS-01 at Sink 'sink': Unable to stage data before write. Check configuration/credentials of storage.
org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.

Cause

You use the Azure Blob Storage as the staging linked service to link to a storage account that has the enabled hierarchical namespace, and that account uses key authentication in the linked service.

Screenshot that shows the storage account configuration.

Recommendation

Create an Azure Data Lake Gen2 linked service for the storage, and select the Gen2 storage as the staging linked service in data flow activities.

Failed with an error: "shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException: User does not have permission to perform this action."

Symptoms

When you use Azure Synapse Analytics as a source/sink and use PolyBase staging in data flows, you meet the following error:

shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException: User does not have permission to perform this action.

Cause

PolyBase requires certain permissions in your Synapse SQL server to work.

Recommendation

Grant the permissions below in your Synapse SQL server when you use PolyBase:

ALTER ANY SCHEMA
ALTER ANY EXTERNAL DATA SOURCE
ALTER ANY EXTERNAL FILE FORMAT
CONTROL DATABASE

Common Data Model format

Model.json files with special characters

Symptoms

You may encounter an issue that the final name of the model.json file contains special characters.  

Error message  

at Source 'source1': java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: PPDFTable1.csv@snapshot=2020-10-21T18:00:36.9469086Z.  

Recommendation  

Replace the special chars in the file name, which will work in the synapse but not in ADF.  

No data output in the data preview or after running pipelines

Symptoms

When you use the manifest.json for CDM, no data is shown in the data preview or shown after running a pipeline. Only headers are shown. You can see this issue in the picture below.

Screenshot that shows the no data output symptom.

Cause

The manifest document describes the CDM folder, for example, what entities that you have in the folder, references of those entities and the data that corresponds to this instance. Your manifest document misses the dataPartitions information that indicates ADF where to read the data, and  since it is empty, it returns zero data. 

Recommendation

Update your manifest document to have the dataPartitions information, and you can refer to this example manifest document to update your document: Common Data Model metadata: Introducing manifest-Example manifest document.

JSON array attributes are inferred as separate columns

Symptoms 

You may encounter an issue where one attribute (string type) of the CDM entity has a JSON array as data. When this data is encountered, ADF infers the data as separate columns incorrectly. As you can see from the following pictures, a single attribute presented in the source (msfp_otherproperties) is inferred as a separate column in the CDM connector's preview.
 

  • In the CSV source data (refer to the second column):

    Screenshot that shows the attribute in the CSV source data.

  • In the CDM source data preview: 

    Screenshot that shows the separate column in the CDM source data.

  You may also try to map drifted columns and use the data flow expression to transform this attribute as an array. But since this attribute is read as a separate column when reading, transforming to an array does not work.  

Cause

This issue is likely caused by the commas within your JSON object value for that column. Since your data file is expected to be a CSV file, the comma indicates that it is the end of a column's value.

Recommendation

To solve this problem, you need to double quote your JSON column and avoid any of the inner quotes with a backslash (\). In this way, the contents of that column's value can be read in as a single column entirely.     

Note

The CDM doesn't inform that the data type of the column value is JSON, yet it informs that it is a string and parsed as such.

Unable to fetch data in the data flow preview

Symptoms

You use CDM with model.json generated by Power BI. When you preview the CDM data using the data flow preview, you encounter an error: No output data.

Cause

The following code exists in the partitions in the model.json file generated by the Power BI data flow.

"partitions": [  
{  
"name": "Part001",  
"refreshTime": "2020-10-02T13:26:10.7624605+00:00",  
"location": "https://datalakegen2.dfs.core.windows.net/powerbi/salesEntities/salesPerfByYear.csv @snapshot=2020-10-02T13:26:10.6681248Z"  
}  

For this model.json file, the issue is the naming schema of the data partition file has special characters, and supporting file paths with '@' do not exist currently.  

Recommendation

Remove the @snapshot=2020-10-02T13:26:10.6681248Z part from the data partition file name and the model.json file, and then try again.

The corpus path is null or empty

Symptoms

When you use CDM in the data flow with the model format, you cannot preview the data, and you encounter the error: DF-CDM_005 The corpus path is null or empty. The error is shown in the following picture:  

Screenshot that shows the corpus path error.

Cause

Your data partition path in the model.json is pointing to a blob storage location and not your data lake. The location should have the base URL of .dfs.core.windows.net for the ADLS Gen2. 

Recommendation

To solve this issue, you can refer to this article: ADF Adds Support for Inline Datasets and Common Data Model to Data Flows, and the following picture shows the way to fix the corpus path error in this article.

Screenshot that shows how to fix the corpus path error.

Unable to read CSV data files

Symptoms 

You use the inline dataset as the common data model with manifest as a source, and you have provided the entry manifest file, root path, entity name, and path. In the manifest, you have the data partitions with the CSV file location. Meanwhile, the entity schema and csv schema are identical, and all validations were successful. However, in the data preview, only the schema rather than the data gets loaded and the data is invisible, which is shown in the following picture:

Screenshot that shows the issue of unable to read data files.

Cause

Your CDM folder is not separated into logical and physical models, and only physical models exist in the CDM folder. The following two articles describe the difference: Logical definitions and Resolving a logical entity definition.
 

Recommendation

For the data flow using CDM as a source, try to use a logical model as your entity reference, and use the manifest that describes the location of the physical resolved entities and the data partition locations. You can see some samples of logical entity definitions within the public CDM github repository: CDM-schemaDocuments

A good starting point to forming your corpus is to copy the files within the schema documents folder (just that level inside the github repository), and put those files into a folder. Afterwards, you can use one of the predefined logical entities within the repository (as a starting or reference point) to create your logical model.

Once the corpus is set up, you are recommended to use CDM as a sink within data flows, so that a well-formed CDM folder can be properly created. You can use your CSV dataset as a source and then sink it to your CDM model that you created.

CSV and Excel format

Set the quote character to 'no quote char' is not supported in the CSV

Symptoms

There are several issues that are not supported in the CSV when the quote character is set to 'no quote char':

  1. When the quote character is set to 'no quote char', multi-char column delimiter can't start and end with the same letters.
  2. When the quote character is set to 'no quote char', multi-char column delimiter can't contain the escape character: \.
  3. When the quote character is set to 'no quote char', column value can't contain row delimiter.
  4. The quote character and the escape character cannot both be empty (no quote and no escape) if the column value contains a column delimiter.

Cause

Causes of the symptoms are stated below with examples respectively:

  1. Start and end with the same letters.
    column delimiter: $*^$*
    column value: abc$*^ def
    csv sink: abc$*^$*^$*def
    will be read as "abc" and "^&*def"

  2. The multi-char delimiter contains escape characters.
    column delimiter: \x
    escape char:\
    column value: "abc\\xdef"
    The escape character will either escape the column delimiter or the escape the character.

  3. The column value contains the row delimiter.
    We need quote character to tell if row delimiter is inside column value or not.

  4. The quote character and the escape character both be empty and the column value contains column delimiters.
    Column delimiter: \t
    column value: 111\t222\t33\t3
    It will be ambigious if it contains 3 columns 111,222,33\t3 or 4 columns 111,222,33,3.

Recommendation

The first symptom and the second symptom cannot be solved currently. For the third and fourth symptoms, you can apply the following methods:

  • For Symptom 3, do not use the 'no quote char' for a multiline csv file.
  • For Symptom 4, set either the quote character or the escape character as non-empty, or you can remove all column delimiters inside your data.

Read files with different schemas error

Symptoms

When you use data flows to read files such as CSV and Excel files with different schemas, the data flow debug, sandbox, or activity run will fail.

  • For CSV, the data misalignment exists when the schema of files is different.

    Screenshot that shows the first schema error.

  • For Excel, an error occurs when the schema of the file is different.

    Screenshot that shows the second schema error.

Cause

Reading files with different schemas in the data flow is not supported.

Recommendation

If you still want to transfer files such as CSV and Excel files with different schemas in the data flow, you can use the ways below to work around:

  • For CSV, you need to manually merge the schema of different files to get the full schema. For example, file_1 has columns c_1, c_2, c_3 while file_2 has columns c_3, c_4, ... c_10, so the merged and the full schema is c_1, c_2, ... c_10. Then make other files also have the same full schema even though it does not have data, for example, file_x only has columns c_1, c_2, c_3, c_4, please add columns c_5, c_6, ... c_10 in the file to make them consistent with the other files.

  • For Excel, you can solve this issue by applying one of the following options:

    • Option-1: You need to manually merge the schema of different files to get the full schema. For example, file_1 has columns c_1, c_2, c_3 while file_2 has columns c_3, c_4, ... c_10, so the merged and full schema is c_1, c_2, ... c_10. Then make other files also have the same schema even though it does not have data, for example, file_x with sheet "SHEET_1" only has columns c_1, c_2, c_3, c_4, please add columns c_5, c_6, ... c_10 in the sheet too, and then it can work.
    • Option-2: Use range (for example, A1:G100) + firstRowAsHeader=false, and then it can load data from all Excel files even though the column name and count is different.

Delta format

The sink does not support the schema drift with upsert or update

Symptoms

You may face the issue that the delta sink in mapping data flows does not support schema drift with upsert/update. The problem is that the schema drift does not work when the delta is the target in a mapping data flow and user configure an update/upsert. 

If a column is added to the source after an "initial" load to the delta, the subsequent jobs just fail with an error that it cannot find the new column, and this happens when you upsert/update with the alter row. It seems to work for inserts only.

Error message

DF-SYS-01 at Sink 'SnkDeltaLake': org.apache.spark.sql.AnalysisException: cannot resolve target.BICC_RV in UPDATE clause given columns target. 

Cause

This is an issue for delta format because of the limitation of io delta library used in the data flow runtime. This issue is still in fixing.

Recommendation

To solve this problem, you need to update the schema firstly and then write the data. You can follow the steps below: 

  1. Create one data flow that includes an insert-only delta sink with the merge schema option to update the schema. 
  2. After Step 1, use delete/upsert/update to modify the target sink without changing the schema. 

Snowflake

Unable to connect to the Snowflake linked service

Symptoms

You encounter the following error when you create the Snowflake linked service in the public network, and you use the auto-resolve integration runtime.

ERROR [HY000] [Microsoft][Snowflake] (4) REST request for URL https://XXXXXXXX.east-us- 2.azure.snowflakecomputing.com.snowflakecomputing.com:443/session/v1/login-request?requestId=XXXXXXXXXXXXXXXXXXXXXXXXX&request_guid=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Screenshot that shows the connection fail error.

Cause

You have not applied the account name in the format that is given in the Snowflake account document (including additional segments that identify the region and cloud platform), for example, XXXXXXXX.east-us-2.azure. You can refer to this document: Linked service properties for more information.

Recommendation

To solve the issue, change the account name format. The role should be one of the roles shown in the following picture, but the default one is Public.

Screenshot that shows the account roles.

SQL access control error: "Insufficient privileges to operate on schema"

Symptoms

When you try to use "import projection", "data preview", etc. in the Snowflake source of data flows, you meet errors like net.snowflake.client.jdbc.SnowflakeSQLException: SQL access control error: Insufficient privileges to operate on schema.

Cause

You meet this error because of the wrong configuration. When you use the data flow to read Snowflake data, the runtime Azure Databricks (ADB) is not directly select the query to Snowflake. Instead, a temporary stage are created, and data are pulled from tables to the stage and then compressed and pulled by ADB. This process is shown in the picture below.

 Screenshot that shows the Snowflake data read model.

So the user/role used in ADB should have necessary permission to do this in the Snowflake. But usually the user/role do not have the permission since the database is created on the share.

Recommendation

To solve this issue, you can create different database and create views on the top of the shared DB to access it from ADB. For more details, please refer to Snowflake.

Failed with an error: "SnowflakeSQLException: IP x.x.x.x is not allowed to access Snowflake. Contact your local security administrator"

Symptoms

When you use snowflake in Azure Data Factory, you can successfully use test-connection in the Snowflake linked service, preview-data/import-schema on Snowflake dataset and run copy/lookup/get-metadata or other activities with it. But when you use Snowflake in the data flow activity, you may meet error like SnowflakeSQLException: IP 13.66.58.164 is not allowed to access Snowflake. Contact your local security administrator.

Cause

The Azure Data Factory data flow does not support the use of fixed IP ranges. For more information, see Azure Integration Runtime IP addresses.

Recommendation

To solve this issue, you can change the Snowflake account firewall settings with the following steps:

  1. You can get the IP range list of service tags from the "service tags IP range download link": Discover service tags by using downloadable JSON files.

    Screenshot that shows the IP range list.

  2. If you run a data flow in the "southcentralus" region, you need to allow the access from all addresses with name "AzureCloud.southcentralus", for example:

    Screenshot that shows how to allow access from all addresses with the certain name.

Queries in the source does not work

Symptoms

When you try to read data from Snowflake with query, you may meet error like:

  1. SQL compilation error: error line 1 at position 7 invalid identifier 'xxx'
  2. SQL compilation error: Object 'xxx' does not exist or not authorized.

Cause

You encounter this error because of your wrong configuration.

Recommendation

For Snowflake, it applies the following rules for storing identifiers at creation/definition time and resolving them in queries and other SQL statements:

When an identifier (table name, schema name, column name, etc.) is unquoted, it is stored and resolved in uppercase by default, and it is case-in-sensitive. For example:

Screenshot that shows the example of unquoted identifier.

Because it is case-in-sensitive, so you can feel free to use following query to read snowflake data while the result is the same:

  • Select MovieID, title from Public.TestQuotedTable2
  • Select movieId, title from Public.TESTQUOTEDTABLE2
  • Select movieID, TITLE from PUBLIC.TESTQUOTEDTABLE2

When an identifier (table name, schema name, column name, etc.) is double-quoted, it is stored and resolved exactly as entered, including case as it is case-sensitive, and you can see an example in the following picture. For more details, please refer to this document: Identifier Requirements.

Screenshot that shows the example of double quoted identifier.

Because the case-sensitive identifier (table name, schema name, column name, etc.) has lowercase character, you must quote the identifier during data reading with the query, for example:

  • Select "movieId", "title" from Public."testQuotedTable2"

If you meet up error with the Snowflake query, check whether some identifiers (table name, schema name, column name, etc.) are case-sensitive with the following steps:

  1. Sign in to the Snowflake server (https://{accountName}.azure.snowflakecomputing.com/, replace {accountName} with your account name) to check the identifier (table name, schema name, column name, etc.).

  2. Create worksheets to test and validate the query:

    • Run Use database {databaseName}, replace {databaseName} with your database name.
    • Run a query with table, for example: select "movieId", "title" from Public."testQuotedTable2"
  3. After the SQL query of Snowflake is tested and validated, you can use it in the data flow Snowflake source directly.

The expression type does not match the column data type, expecting VARIANT but got VARCHAR

Symptoms

When you try to write data into the Snowflake table, you may meet the following error:

java.sql.BatchUpdateException: SQL compilation error: Expression type does not match column data type, expecting VARIANT but got VARCHAR

Cause

The column type of input data is string, which is different from the VARIANT type of the related column in the Snowflake sink.

When you store data with complex schemas (array/map/struct) in a new Snowflake table, the data flow type will be automatically converted into its physical type VARIANT.

Screenshot that shows the VARIANT type in a table.

The related values are stored as JSON strings, showing in the picture below.

Screenshot that shows the stored JSON string.

Recommendation

For the Snowflake VARIANT, it can only accept the data flow value that is struct or map or array type. If the value of your input data column is JSON or XML or other strings, use one of the following options to solve this issue:

  • Option-1: Use parse transformation before using Snowflake as a sink to covert the input data column value into struct or map or array type, for example:

    Screenshot that shows the parse transformation.

    Note

    The value of the Snowflake column with VARIANT type is read as string in Spark by default.

  • Option-2: Sign in to your Snowflake server (https://{accountName}.azure.snowflakecomputing.com/, replace {accountName} with your account name) to change the schema of your Snowflake target table. Apply the following steps by running the query under each step.

    1. Create one new column with VARCHAR to store the values.

      alter table tablename add newcolumnname varchar;
      
    2. Copy the value of VARIANT into the new column.

      update tablename t1 set newcolumnname = t1."details"
      
    3. Delete the unused VARIANT column.

      alter table tablename drop column "details";
      
    4. Rename the new column to the old name.

      alter table tablename rename column newcolumnname to "details";
      

Next steps

For more help with troubleshooting, see these resources: