Data Truncation Issue When Using COPY INTO in Databricks for Binary Column

Suwetha Sivakumar 0 Reputation points
2025-03-14T18:51:59.73+00:00

Issue:

When loading a long string value containing trailing null bytes (e.g., \x12\x34\x56\x00\x00\x00...) into a binary column using the COPY INTO statement, the data is truncated: only the first few bytes (\x12\x34\x56, which appears as EjRW in Base64) are stored.

Steps to Reproduce:

Table Creation:

CREATE TABLE default.sample (_c0 binary, _c1 binary, _c2 binary, _c3 int, _c4 boolean) USING DELTA;

Copy Into Statement:

COPY INTO default.sample
FROM (SELECT 
        CAST(_c0 AS BINARY) AS _c0, 
        CAST(_c1 AS BINARY) AS _c1, 
        CAST(_c2 AS BINARY) AS _c2, 
        CAST(_c3 AS INT) AS _c3, 
        CAST(_c4 AS BOOLEAN) AS _c4 
      FROM 'dbfs:/FileStore/tables/test.csv') 
FILEFORMAT = CSV 
FORMAT_OPTIONS ('multiLine'='true', 'escape'='"') 
COPY_OPTIONS ('force' = 'true');
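
A quick way to confirm what was actually stored is to read the column back and compare byte lengths. A minimal check, assuming a Databricks notebook where spark is predefined:

# Show the stored value and its byte length; base64() and length() are
# built-in Spark SQL functions (length() returns the byte count for BINARY)
spark.sql("""
    SELECT base64(_c0) AS stored_base64, length(_c0) AS stored_bytes
    FROM default.sample
""").show(truncate=False)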

CSV File:

test.txt (attached in .txt format since .csv uploads are not allowed)

Observed Behaviour:

Expected data: (screenshot: Screenshot 2025-03-15 at 12.12.06 AM)

Actual data: (screenshot: Screenshot 2025-03-15 at 12.07.05 AM)

Environment Details:

Tested the scenario in multiple runtime versions; the behaviour is as follows:

| Databricks Runtime version | Status      |
|----------------------------|-------------|
| 10.4 LTS                   | Not working |
| 11.3 LTS                   | Not working |
| 12.2 LTS                   | Not working |
| 13.3 LTS                   | Working     |
| 14.4 LTS                   | Working     |
| 15.3 LTS                   | Not working |
| 16.2                       | Not working |

2 answers

  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2025-03-16T13:58:27.6166667+00:00

    Hello Suwetha!

    Thank you for posting your issue.

    I think your issue is related to how different versions of Databricks Runtime handle binary data, especially when it contains trailing null bytes (\x00).

    Since the issue is inconsistent across different Databricks Runtime versions, you need to use a version where this functionality is known to work (in your case 13.3 LTS or 14.4 LTS).

    If you are familiar with Spark, try using spark.read.csv followed by write.saveAsTable as a workaround:

    
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Initialize the Spark session (already available as `spark` in Databricks notebooks)
    spark = SparkSession.builder.appName("LoadBinaryData").getOrCreate()

    # Read the CSV file; header=False so Spark auto-names the columns _c0.._c4,
    # matching the names used in the COPY INTO statement
    df = spark.read.csv("dbfs:/FileStore/tables/test.csv", header=False, multiLine=True, escape='"')

    # Cast columns to the appropriate types
    df = (df.withColumn("_c0", col("_c0").cast("binary"))
            .withColumn("_c1", col("_c1").cast("binary"))
            .withColumn("_c2", col("_c2").cast("binary"))
            .withColumn("_c3", col("_c3").cast("int"))
            .withColumn("_c4", col("_c4").cast("boolean")))

    # Write the DataFrame to the Delta table (overwrite, since default.sample already exists)
    df.write.format("delta").mode("overwrite").saveAsTable("default.sample")
    

    Or you can simply preprocess the CSV file:

    
    import csv

    # Use the local /dbfs mount so Python's built-in open() can reach DBFS files
    input_file = "/dbfs/FileStore/tables/test.csv"
    output_file = "/dbfs/FileStore/tables/test_processed.csv"

    with open(input_file, 'r') as infile, open(output_file, 'w', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        for row in reader:
            # Process binary data (e.g., remove trailing null bytes)
            processed_row = [cell.rstrip('\x00') if cell else cell for cell in row]
            writer.writerow(processed_row)
    
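    Once the processed file exists, the COPY INTO statement from the question can point at it. A sketch reusing the same table and options, just with the new path:

    # Re-run the original COPY INTO against the cleaned file
    spark.sql("""
        COPY INTO default.sample
        FROM (SELECT CAST(_c0 AS BINARY) AS _c0, CAST(_c1 AS BINARY) AS _c1,
                     CAST(_c2 AS BINARY) AS _c2, CAST(_c3 AS INT) AS _c3,
                     CAST(_c4 AS BOOLEAN) AS _c4
              FROM 'dbfs:/FileStore/tables/test_processed.csv')
        FILEFORMAT = CSV
        FORMAT_OPTIONS ('multiLine' = 'true', 'escape' = '"')
        COPY_OPTIONS ('force' = 'true')
    """)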

  2. phemanth 15,755 Reputation points Microsoft External Staff Moderator
    2025-03-19T07:51:54.5266667+00:00

    @Suwetha Sivakumar

    Please check the steps below:

    • Enable detailed logging when running the COPY INTO statement to capture any anomalies or patterns during the data load.
    • Monitor the logs for errors or warnings that might indicate why the truncation happens only with certain records.
    • Create a minimal reproducible example that consistently triggers the issue (see the sketch after this list). This helps the Databricks support or engineering team understand and address the problem.
    • Test with different sizes and types of binary data to identify any specific conditions that lead to truncation.
    • Provide detailed information about the issue, including the runtime versions affected, the specific scenarios where it occurs, and any logs or error messages.
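
    For the minimal reproducible example, something along these lines could work. This is a hedged sketch: the repro.csv path and the payload bytes are made up for illustration.

    # Hypothetical minimal repro: write one CSV row whose first field ends in
    # trailing null bytes, then load it with the same options as the question
    payload = b"\x12\x34\x56" + b"\x00" * 8      # binary value with trailing NULs
    row = b'"' + payload + b'",a,b,1,true\n'     # quoted so the NULs stay inside one field

    # /dbfs is the local mount of DBFS (assumes local file access is enabled)
    with open("/dbfs/FileStore/tables/repro.csv", "wb") as f:
        f.write(row)

    spark.sql("""
        COPY INTO default.sample
        FROM (SELECT CAST(_c0 AS BINARY) AS _c0, CAST(_c1 AS BINARY) AS _c1,
                     CAST(_c2 AS BINARY) AS _c2, CAST(_c3 AS INT) AS _c3,
                     CAST(_c4 AS BOOLEAN) AS _c4
              FROM 'dbfs:/FileStore/tables/repro.csv')
        FILEFORMAT = CSV
        FORMAT_OPTIONS ('multiLine' = 'true', 'escape' = '"')
        COPY_OPTIONS ('force' = 'true')
    """)

    # Comparing length(_c0) against len(payload) then shows whether truncation occurred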

    Request that this issue be prioritized for a fix in upcoming runtime versions.

    Since you prefer using the COPY INTO statement, here are a few additional suggestions:

    Data Preprocessing:

    • Preprocess the CSV file to ensure that trailing null bytes are handled correctly before using the COPY INTO statement. This can help mitigate the issue until a fix is available.
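
    A Spark-based variant of that idea scales better than a local Python loop for large files. A sketch (the output directory name is illustrative):

    from pyspark.sql.functions import col, regexp_replace

    # Strip trailing NUL bytes in Spark before rewriting the CSV; column
    # names _c0.. are Spark's defaults for a headerless CSV
    raw = spark.read.csv("dbfs:/FileStore/tables/test.csv", multiLine=True, escape='"')
    cleaned = raw.withColumn("_c0", regexp_replace(col("_c0"), "\x00+$", ""))

    # Spark writes a directory of part files rather than a single CSV
    cleaned.write.mode("overwrite").csv("dbfs:/FileStore/tables/test_processed_dir")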

    Batch Processing:

    • If the issue occurs with larger batches of records, try splitting the data into smaller batches and loading them incrementally. This might help identify if there's a specific threshold causing the problem.
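
    One way to sketch that, assuming the data has already been split into chunk files under a hypothetical chunks/ directory, is to issue one COPY INTO per file and note which batch first shows truncation:

    # Hypothetical sketch: load pre-split CSV chunks one at a time so a
    # problematic batch can be isolated; dbutils is predefined in notebooks
    chunk_paths = sorted(f.path for f in dbutils.fs.ls("dbfs:/FileStore/tables/chunks/")
                         if f.path.endswith(".csv"))

    for path in chunk_paths:
        spark.sql(f"""
            COPY INTO default.sample
            FROM (SELECT CAST(_c0 AS BINARY) AS _c0, CAST(_c1 AS BINARY) AS _c1,
                         CAST(_c2 AS BINARY) AS _c2, CAST(_c3 AS INT) AS _c3,
                         CAST(_c4 AS BOOLEAN) AS _c4
                  FROM '{path}')
            FILEFORMAT = CSV
            FORMAT_OPTIONS ('multiLine' = 'true', 'escape' = '"')
            COPY_OPTIONS ('force' = 'true')
        """)
        print(f"Loaded {path}")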

    Custom Scripts:

    • Consider writing a custom script to preprocess and load the data using the COPY INTO statement. This script can handle any specific edge cases or anomalies in the data.

    Hope this helps. Do let us know if you have any further queries.

