How to substring a value from a column of a Parquet file

Mayur Patil 35 Reputation points
2024-12-05T13:46:23.97+00:00

Hi,

I am using the upsert method in a copy activity. We have a column with more than 4,000 characters, so we defined it as nvarchar(max) in the Synapse table. However, the copy activity runs into trouble because the upsert creates an interim table with a columnstore index. If I reduce the length to 4,000 characters, the copy fails with a string-truncation error. How can I fetch only the first 4,000 characters from the source, which is a Parquet file in ADLS, before inserting the rows into the table, without using a data flow?

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2024-12-05T17:32:57.77+00:00

    If the source allows querying (for example SQL Database or a similar interface), use a query to truncate the value to the first 4,000 characters before passing it to the copy activity.

    SELECT SUBSTRING(your_column, 1, 4000) AS truncated_column, other_columns FROM your_table
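
    In the Copy activity, paste this as the source query instead of selecting the table directly; that way the value already fits an nvarchar(4000) column by the time the upsert's interim table is created.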
    

    In the Mapping tab of the Copy Activity:

    • Map the problematic column (your_column) to a new computed column or directly to the target column.
    • For transformation, the Copy activity doesn't natively support substring functions in the mapping. However, you can manage truncation by:
      • Writing an inline query for intermediate staging.
      • Pre-processing the Parquet file (see the pandas sketch below).

    Or, if your source is Parquet and no query can be applied directly, pre-process the file outside the copy, for example with an Azure Function:

    import pandas as pd

    # Read the Parquet file (requires pyarrow or fastparquet)
    df = pd.read_parquet('path_to_file.parquet')
    # Keep only the first 4,000 characters of the oversized column
    df['your_column'] = df['your_column'].str[:4000]
    # Write the result back to Parquet (or CSV)
    df.to_parquet('output_path.parquet', index=False)
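
    To run this as part of the pipeline, the snippet can be wrapped in an HTTP-triggered Azure Function and called with an Azure Function activity before the copy. The sketch below is a minimal illustration; the function name, the request parameters, and the use of the adlfs package for abfs:// paths are assumptions, not something the original answer specifies:

    import azure.functions as func
    import pandas as pd

    # Hypothetical HTTP-triggered function; assumes adlfs is installed so
    # pandas can read and write abfs:// paths in ADLS Gen2 directly.
    def main(req: func.HttpRequest) -> func.HttpResponse:
        source = req.params.get('source')  # e.g. abfs://container/in.parquet
        dest = req.params.get('dest')      # e.g. abfs://container/out.parquet
        column = req.params.get('column')  # name of the oversized column

        df = pd.read_parquet(source)
        df[column] = df[column].str[:4000]  # keep only the first 4,000 characters
        df.to_parquet(dest, index=False)

        return func.HttpResponse(f"Truncated {column}", status_code=200)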
    

    If you prefer a SQL-based approach:

    • Load the data into a staging table whose column is NVARCHAR(MAX) (or a pre-defined length that fits the data).
    • Use an INSERT INTO SELECT query to truncate the value during the transfer to the final table:
    INSERT INTO final_table (column1, truncated_column, ...)
    SELECT column1, LEFT(your_column, 4000), ...
    FROM staging_table;
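
    Loading the staging table with a plain insert (no upsert) avoids the interim-table problem entirely, and LEFT() trims the value before it ever reaches the final table's 4,000-character column.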
    

    If your pipeline includes Synapse Analytics, execute a stored procedure that processes the data after copying it to a staging table:

    CREATE PROCEDURE TruncateLargeColumn
    AS
    BEGIN
        INSERT INTO final_table
        SELECT column1, LEFT(large_column, 4000), ...
        FROM staging_table;
    END
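
    In the pipeline, this procedure can be chained after the Copy activity with a Stored Procedure activity; to verify it manually, run EXEC TruncateLargeColumn; against the Synapse pool.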
    
