Source value error in reusable dataflow

Kleber Rebello 0 Reputation points
2025-04-07T11:20:27.8033333+00:00

Hi,

We have a REUSABLE ADF dataflow loading a (parameterized) Synapse table from a (also parameterized) parquet file.

It failed because a source value exceeded the 8,000-character maximum that Synapse accommodates. Our source is Salesforce, which does not offer a substring function we could use to address this.

Do we have an option to have the dataflow simply truncate the source value by default?

Please advise.

Azure Data Factory

1 answer

Sort by: Most helpful
  1. AnnuKumari-MSFT 34,556 Reputation points Microsoft Employee Moderator
    2025-04-21T14:28:06.6333333+00:00

    Hi Kleber Rebello,

    Thank you for using the Microsoft Q&A platform and for posting your query here.

    As per my understanding, you want to truncate values in multiple columns of the source parquet files before loading them into Synapse.

    As mentioned by other members, there is no direct way to achieve this in a data flow; the best built-in option is the approach already shared above (typically a derived column that truncates long strings, e.g. with the left() expression function). Since your requirement is a flexible approach that handles multiple columns without knowing the data types of the columns across multiple tables, I would suggest trying custom Python code and running it in an Azure Function, which can be triggered from ADF.

    Below is sample code that would be a better fit in your case for a dynamic solution. Kindly modify it as per your source schema:

    import pandas as pd
    
    def truncate_columns(df, columns, max_length=8000):
        """
        Truncate values in specified columns to a maximum length.
        
        Parameters:
        df (pd.DataFrame): DataFrame containing the data.
        columns (list): List of column names to be truncated.
        max_length (int): Maximum length of the values. Default is 8000.
        
        Returns:
        pd.DataFrame: DataFrame with truncated values.
        """
        for column in columns:
            if column in df.columns:
                df[column] = df[column].apply(lambda x: x[:max_length] if isinstance(x, str) and len(x) > max_length else x)
        return df
    
    def process_files(input_files, output_files, columns_to_truncate):
        """
        Process multiple files to truncate values in specified columns.
        
        Parameters:
        input_files (list): List of input file paths.
        output_files (list): List of output file paths.
        columns_to_truncate (list): List of column names to be truncated.
        """
        for input_file, output_file in zip(input_files, output_files):
            # Read the data from the source file
            df = pd.read_parquet(input_file)
            
            # Truncate values in specified columns
            df = truncate_columns(df, columns_to_truncate)
            
            # Write the processed data back to a new file
            df.to_parquet(output_file)
    
    # Example usage
    input_files = ['source_file1.parquet', 'source_file2.parquet']
    output_files = ['processed_file1.parquet', 'processed_file2.parquet']
    columns_to_truncate = ['column1', 'column2', 'column3']
    
    process_files(input_files, output_files, columns_to_truncate)
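
    Note that pandas reads each parquet file fully into memory, so for very large files you may want to process the data in chunks (for example, with pyarrow's dataset API) before applying the same truncation logic.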
    
    

    You can also adapt this script to run as an Azure Function, allowing you to process data dynamically based on HTTP requests.

    import logging
    import azure.functions as func
    import pandas as pd
    
    def truncate_columns(df, columns, max_length=8000):
        # Truncate string values in the given columns to at most max_length characters
        for column in columns:
            if column in df.columns:
                df[column] = df[column].apply(lambda x: x[:max_length] if isinstance(x, str) else x)
        return df
    
    def main(req: func.HttpRequest) -> func.HttpResponse:
        logging.info('Python HTTP trigger function processed a request.')
    
        # Read the records from the JSON request body
        try:
            data = req.get_json()
        except ValueError:
            return func.HttpResponse('Request body must be valid JSON.', status_code=400)
        df = pd.DataFrame(data)
    
        # Column names arrive as a comma-separated query parameter, e.g. ?columns=column1,column2
        columns_param = req.params.get('columns')
        if not columns_param:
            return func.HttpResponse("Missing 'columns' query parameter.", status_code=400)
        columns_to_truncate = columns_param.split(',')
    
        # Truncate values in the specified columns
        df = truncate_columns(df, columns_to_truncate)
    
        # Convert the DataFrame back to JSON
        processed_data = df.to_json(orient='records')
    
        return func.HttpResponse(processed_data, mimetype='application/json')
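
    For a quick test outside ADF, the function can be invoked with a plain HTTP POST. The sketch below is a minimal example assuming a hypothetical Function App URL; the columns to truncate go in the columns query parameter and the records in the JSON body. In a pipeline, the same call can be made with the Azure Function activity or a Web activity.

    import requests
    
    # Hypothetical endpoint; replace with your deployed Function App URL and key
    url = 'https://<your-function-app>.azurewebsites.net/api/truncate'
    
    # One record with a value longer than the 8,000-character Synapse limit
    records = [{'column1': 'x' * 10000, 'column2': 'short value'}]
    
    # Pass the columns to truncate as a query parameter and the data as the JSON body
    response = requests.post(url, params={'columns': 'column1,column2'}, json=records)
    print(response.json())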
    
    
    

    Hope it helps. Kindly accept the answer by clicking on the Accept answer button. Thank you.

