How to substring a value from a column of a Parquet file

Mayur Patil 35 Reputation points
2024-12-05T13:46:23.97+00:00

Hi,

I am using the upsert method in a copy activity. We have a column with more than 4,000 characters, so we defined it as nvarchar(max) in the Synapse table. However, the copy activity runs into trouble because the upsert creates an interim table with a columnstore index. If I reduce the length to 4,000 characters, the copy fails with a string-truncation error. How can I fetch only the first 4,000 characters from the source, which is a Parquet file in ADLS, before inserting the rows into the table, without using a data flow?

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2024-12-05T17:32:57.77+00:00

    If the source allows querying (for example SQL Database or a similar interface), use a query to truncate the value to the first 4,000 characters before passing it to the copy activity.

    SELECT SUBSTRING(your_column, 1, 4000) AS truncated_column, other_columns FROM your_table
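
    In the Copy activity, paste this as the source query instead of selecting the table directly; that way the value already fits an nvarchar(4000) column by the time the upsert's interim table is created.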
    

    In the Mapping tab of the Copy Activity:

    • Map the problematic column (your_column) to a new computed column or directly to the target column.
    • For transformation, the Copy activity doesn't natively support substring functions in the mapping. However, you can manage truncation by:
      • Writing an inline query for intermediate staging.
      • Pre-processing the Parquet file (see the pandas sketch below).

    Or, if your source is Parquet and no query can be applied directly, pre-process the file outside the copy, for example with an Azure Function:

    import pandas as pd

    # Read the Parquet file (requires pyarrow or fastparquet)
    df = pd.read_parquet('path_to_file.parquet')
    # Keep only the first 4,000 characters of the oversized column
    df['your_column'] = df['your_column'].str[:4000]
    # Write the result back to Parquet (or CSV)
    df.to_parquet('output_path.parquet', index=False)
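
    To run this as part of the pipeline, the snippet can be wrapped in an HTTP-triggered Azure Function and called with an Azure Function activity before the copy. The sketch below is a minimal illustration; the function name, the request parameters, and the use of the adlfs package for abfs:// paths are assumptions, not something the original answer specifies:

    import azure.functions as func
    import pandas as pd

    # Hypothetical HTTP-triggered function; assumes adlfs is installed so
    # pandas can read and write abfs:// paths in ADLS Gen2 directly.
    def main(req: func.HttpRequest) -> func.HttpResponse:
        source = req.params.get('source')  # e.g. abfs://container/in.parquet
        dest = req.params.get('dest')      # e.g. abfs://container/out.parquet
        column = req.params.get('column')  # name of the oversized column

        df = pd.read_parquet(source)
        df[column] = df[column].str[:4000]  # keep only the first 4,000 characters
        df.to_parquet(dest, index=False)

        return func.HttpResponse(f"Truncated {column}", status_code=200)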
    

    If you prefer a SQL-based approach:

    • Load the data into a staging table whose column is NVARCHAR(MAX) (or a pre-defined length that fits the data).
    • Use an INSERT INTO SELECT query to truncate the value during the transfer to the final table:
    INSERT INTO final_table (column1, truncated_column, ...)
    SELECT column1, LEFT(your_column, 4000), ...
    FROM staging_table;
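
    Loading the staging table with a plain insert (no upsert) avoids the interim-table problem entirely, and LEFT() trims the value before it ever reaches the final table's 4,000-character column.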
    

    If your pipeline includes Synapse Analytics, execute a stored procedure that processes the data after copying it to a staging table:

    CREATE PROCEDURE TruncateLargeColumn
    AS
    BEGIN
        INSERT INTO final_table
        SELECT column1, LEFT(large_column, 4000), ...
        FROM staging_table;
    END
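
    In the pipeline, this procedure can be chained after the Copy activity with a Stored Procedure activity; to verify it manually, run EXEC TruncateLargeColumn; against the Synapse pool.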
    
