Hello Suwetha!
Thank you for posting your issue.
I think your issue is related to how different versions of Databricks Runtime handle binary data, especially when it contains trailing null bytes (\x00).
Since the issue is inconsistent across different Databricks Runtime versions, you need to use a version where this functionality is known to work (in your case 13.3 LTS or 14.4 LTS).
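If it helps, you can confirm which runtime a notebook is actually running on before loading the data. A minimal check, assuming a standard Databricks notebook where spark is already defined (the cluster-tag config key is an assumption and may not be set on every cluster, hence the fallback):
# Spark version bundled with the runtime (always available)
print(spark.version)
# Databricks Runtime version from the cluster usage tags; this key is an
# assumption based on common cluster setups, so a fallback value is provided
print(spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion", "not set"))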
If you are familiar with Spark, try reading the file with spark.read.csv and then persisting it with write.saveAsTable:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize the Spark session (in a Databricks notebook, spark already exists)
spark = SparkSession.builder.appName("LoadBinaryData").getOrCreate()

# Read the CSV file; multiLine and escape handle quoted cells with embedded newlines.
# header=True takes column names from the first row, so the _c0.._c4 names below
# assume that is what the header contains (with header=False Spark generates them)
df = spark.read.csv("dbfs:/FileStore/tables/test.csv", header=True, multiLine=True, escape='"')

# Cast columns to the appropriate types
df = (df.withColumn("_c0", col("_c0").cast("binary"))
        .withColumn("_c1", col("_c1").cast("binary"))
        .withColumn("_c2", col("_c2").cast("binary"))
        .withColumn("_c3", col("_c3").cast("int"))
        .withColumn("_c4", col("_c4").cast("boolean")))

# Write the DataFrame to a Delta table
df.write.format("delta").saveAsTable("default.sample")
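Once the write finishes, a quick sanity check can confirm the binary columns survived the round trip. This sketch just reads back the table created above and renders the first binary column as hex, so any trailing \x00 bytes are visible:
from pyspark.sql.functions import col, hex as hex_

# Read the table back and inspect the schema and a few rows
result = spark.table("default.sample")
result.printSchema()
result.show(5, truncate=False)

# Render _c0 as hex; trailing null bytes show up as "00" at the end of the string
result.select(hex_(col("_c0")).alias("c0_hex")).show(5, truncate=False)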
Or you can simply preprocess the CSV file to strip the trailing null bytes before loading it:
import csv

# Plain Python open() cannot read "dbfs:/" URIs; on Databricks, DBFS is
# exposed to local file APIs through the /dbfs mount point instead
input_file = "/dbfs/FileStore/tables/test.csv"
output_file = "/dbfs/FileStore/tables/test_processed.csv"

with open(input_file, 'r') as infile, open(output_file, 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        # Remove trailing null bytes from each cell
        processed_row = [cell.rstrip('\x00') if cell else cell for cell in row]
        writer.writerow(processed_row)
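After preprocessing, you can load the cleaned file with the same spark.read.csv call as in the first example. Note that Spark reads it back through the dbfs:/ URI rather than the /dbfs mount, and the table name default.sample_clean below is just a placeholder:
# Load the cleaned CSV and persist it, reusing the read options from above
df_clean = spark.read.csv("dbfs:/FileStore/tables/test_processed.csv",
                          header=True, multiLine=True, escape='"')
df_clean.write.format("delta").saveAsTable("default.sample_clean")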