Data Truncation Issue When Using COPY INTO in Databricks for Binary Column

Suwetha Sivakumar 0 Reputation points
2025-03-14T18:51:59.73+00:00

Issue:

When loading a long string value containing trailing null bytes (e.g., \x12\x34\x56\x00\x00\x00...) into a binary column using the COPY INTO statement, the data is truncated: only the first few bytes (\x12\x34\x56, which appears as EjRW in Base64) are stored.

Steps to Reproduce:

Table Creation:

CREATE TABLE default.sample (_c0 binary, _c1 binary, _c2 binary, _c3 int, _c4 boolean) USING DELTA;

Copy Into Statement:

COPY INTO default.sample
FROM (SELECT 
        CAST(_c0 AS BINARY) AS _c0, 
        CAST(_c1 AS BINARY) AS _c1, 
        CAST(_c2 AS BINARY) AS _c2, 
        CAST(_c3 AS INT) AS _c3, 
        CAST(_c4 AS BOOLEAN) AS _c4 
      FROM 'dbfs:/FileStore/tables/test.csv') 
FILEFORMAT = CSV 
FORMAT_OPTIONS ('multiLine'='true', 'escape'='"') 
COPY_OPTIONS ('force' = 'true');
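
A quick way to confirm what was actually stored is to read the column back and compare byte lengths. A minimal check, assuming a Databricks notebook where spark is predefined:

# Show the stored value and its byte length; base64() and length() are
# built-in Spark SQL functions (length() returns the byte count for BINARY)
spark.sql("""
    SELECT base64(_c0) AS stored_base64, length(_c0) AS stored_bytes
    FROM default.sample
""").show(truncate=False)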

CSV File:

test.txt (attached in .txt format since .csv uploads are not allowed)

Observed Behaviour:

Expected data: (screenshot: Screenshot 2025-03-15 at 12.12.06 AM)

Actual data: (screenshot: Screenshot 2025-03-15 at 12.07.05 AM)

Environment Details:

Tested the scenario in multiple runtime versions; the behaviour is as follows:

| Databricks Runtime version | Status      |
|----------------------------|-------------|
| 10.4 LTS                   | Not working |
| 11.3 LTS                   | Not working |
| 12.2 LTS                   | Not working |
| 13.3 LTS                   | Working     |
| 14.4 LTS                   | Working     |
| 15.3 LTS                   | Not working |
| 16.2                       | Not working |

2 answers

  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2025-03-16T13:58:27.6166667+00:00

    Hello Suwetha!

    Thank you for posting your issue.

    I think your issue is related to how different versions of Databricks Runtime handle binary data, especially when it contains trailing null bytes (\x00).

    Since the issue is inconsistent across different Databricks Runtime versions, you need to use a version where this functionality is known to work (in your case 13.3 LTS or 14.4 LTS).

    If you are familiar with Spark, try using spark.read.csv followed by write.saveAsTable as a workaround:

    
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Initialize the Spark session (already available as `spark` in Databricks notebooks)
    spark = SparkSession.builder.appName("LoadBinaryData").getOrCreate()

    # Read the CSV file; header=False so Spark auto-names the columns _c0.._c4,
    # matching the names used in the COPY INTO statement
    df = spark.read.csv("dbfs:/FileStore/tables/test.csv", header=False, multiLine=True, escape='"')

    # Cast columns to the appropriate types
    df = (df.withColumn("_c0", col("_c0").cast("binary"))
            .withColumn("_c1", col("_c1").cast("binary"))
            .withColumn("_c2", col("_c2").cast("binary"))
            .withColumn("_c3", col("_c3").cast("int"))
            .withColumn("_c4", col("_c4").cast("boolean")))

    # Write the DataFrame to the Delta table (overwrite, since default.sample already exists)
    df.write.format("delta").mode("overwrite").saveAsTable("default.sample")
    

    Or you can simply preprocess the CSV file:

    
    import csv

    # Use the local /dbfs mount so Python's built-in open() can reach DBFS files
    input_file = "/dbfs/FileStore/tables/test.csv"
    output_file = "/dbfs/FileStore/tables/test_processed.csv"

    with open(input_file, 'r') as infile, open(output_file, 'w', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        for row in reader:
            # Process binary data (e.g., remove trailing null bytes)
            processed_row = [cell.rstrip('\x00') if cell else cell for cell in row]
            writer.writerow(processed_row)
    
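    Once the processed file exists, the COPY INTO statement from the question can point at it. A sketch reusing the same table and options, just with the new path:

    # Re-run the original COPY INTO against the cleaned file
    spark.sql("""
        COPY INTO default.sample
        FROM (SELECT CAST(_c0 AS BINARY) AS _c0, CAST(_c1 AS BINARY) AS _c1,
                     CAST(_c2 AS BINARY) AS _c2, CAST(_c3 AS INT) AS _c3,
                     CAST(_c4 AS BOOLEAN) AS _c4
              FROM 'dbfs:/FileStore/tables/test_processed.csv')
        FILEFORMAT = CSV
        FORMAT_OPTIONS ('multiLine' = 'true', 'escape' = '"')
        COPY_OPTIONS ('force' = 'true')
    """)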

  2. phemanth 15,755 Reputation points Microsoft External Staff Moderator
    2025-03-19T07:51:54.5266667+00:00

    @Suwetha Sivakumar

    Please check the steps below:

    • Enable detailed logging when running the COPY INTO statement to capture any anomalies or patterns during the data load.
    • Monitor the logs for errors or warnings that might indicate why the truncation happens only with certain records.
    • Create a minimal reproducible example that consistently triggers the issue (see the sketch after this list). This helps the Databricks support or engineering team understand and address the problem.
    • Test with different sizes and types of binary data to identify any specific conditions that lead to truncation.
    • Provide detailed information about the issue, including the runtime versions affected, the specific scenarios where it occurs, and any logs or error messages.
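
    For the minimal reproducible example, something along these lines could work. This is a hedged sketch: the repro.csv path and the payload bytes are made up for illustration.

    # Hypothetical minimal repro: write one CSV row whose first field ends in
    # trailing null bytes, then load it with the same options as the question
    payload = b"\x12\x34\x56" + b"\x00" * 8      # binary value with trailing NULs
    row = b'"' + payload + b'",a,b,1,true\n'     # quoted so the NULs stay inside one field

    # /dbfs is the local mount of DBFS (assumes local file access is enabled)
    with open("/dbfs/FileStore/tables/repro.csv", "wb") as f:
        f.write(row)

    spark.sql("""
        COPY INTO default.sample
        FROM (SELECT CAST(_c0 AS BINARY) AS _c0, CAST(_c1 AS BINARY) AS _c1,
                     CAST(_c2 AS BINARY) AS _c2, CAST(_c3 AS INT) AS _c3,
                     CAST(_c4 AS BOOLEAN) AS _c4
              FROM 'dbfs:/FileStore/tables/repro.csv')
        FILEFORMAT = CSV
        FORMAT_OPTIONS ('multiLine' = 'true', 'escape' = '"')
        COPY_OPTIONS ('force' = 'true')
    """)

    # Comparing length(_c0) against len(payload) then shows whether truncation occurred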

    Request that this issue be prioritized for a fix in upcoming runtime versions.

    Since you prefer using the COPY INTO statement, here are a few additional suggestions:

    Data Preprocessing:

    • Preprocess the CSV file to ensure that trailing null bytes are handled correctly before using the COPY INTO statement. This can help mitigate the issue until a fix is available.
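
    A Spark-based variant of that idea scales better than a local Python loop for large files. A sketch (the output directory name is illustrative):

    from pyspark.sql.functions import col, regexp_replace

    # Strip trailing NUL bytes in Spark before rewriting the CSV; column
    # names _c0.. are Spark's defaults for a headerless CSV
    raw = spark.read.csv("dbfs:/FileStore/tables/test.csv", multiLine=True, escape='"')
    cleaned = raw.withColumn("_c0", regexp_replace(col("_c0"), "\x00+$", ""))

    # Spark writes a directory of part files rather than a single CSV
    cleaned.write.mode("overwrite").csv("dbfs:/FileStore/tables/test_processed_dir")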

    Batch Processing:

    • If the issue occurs with larger batches of records, try splitting the data into smaller batches and loading them incrementally. This might help identify if there's a specific threshold causing the problem.
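
    One way to sketch that, assuming the data has already been split into chunk files under a hypothetical chunks/ directory, is to issue one COPY INTO per file and note which batch first shows truncation:

    # Hypothetical sketch: load pre-split CSV chunks one at a time so a
    # problematic batch can be isolated; dbutils is predefined in notebooks
    chunk_paths = sorted(f.path for f in dbutils.fs.ls("dbfs:/FileStore/tables/chunks/")
                         if f.path.endswith(".csv"))

    for path in chunk_paths:
        spark.sql(f"""
            COPY INTO default.sample
            FROM (SELECT CAST(_c0 AS BINARY) AS _c0, CAST(_c1 AS BINARY) AS _c1,
                         CAST(_c2 AS BINARY) AS _c2, CAST(_c3 AS INT) AS _c3,
                         CAST(_c4 AS BOOLEAN) AS _c4
                  FROM '{path}')
            FILEFORMAT = CSV
            FORMAT_OPTIONS ('multiLine' = 'true', 'escape' = '"')
            COPY_OPTIONS ('force' = 'true')
        """)
        print(f"Loaded {path}")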

    Custom Scripts:

    • Consider writing a custom script to preprocess and load the data using the COPY INTO statement. This script can handle any specific edge cases or anomalies in the data.

    Hope this helps. Do let us know if you have any further queries.

