In PySpark, dataframe.write.json() adds an extra empty line at the end. Is it a good idea to write this many records using df.write.json?

Code Heaven

We have a DataFrame with 5 columns and nearly 1.5 billion records.

We need to write it out as a single line (a single record) of JSON.

We are facing two issues:

1. df.write.format('json') writes everything as a single line (single record), but a second, empty line appears at the end of the file, and we need to avoid it.
2. df.write.format('json').save('somedir_in_HDFS') gives an error.

We want to save the output as a single file so that the downstream application can read it.
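For context on the first issue: Spark's JSON sink writes the line-delimited "JSON Lines" format, where every record is serialized on its own line and terminated with "\n". Even a one-record output therefore ends with a newline, which many editors display as an extra empty line. A plain-Python illustration of the same framing (not the Spark job itself, just the format):

import json

# Each record becomes one line terminated by "\n" - exactly the framing
# Spark's JSON writer uses. A single-record payload still ends in "\n".
records = [{"author": "author1", "title": "title1", "pages": 1}]
payload = "".join(json.dumps(r) + "\n" for r in records)

print(repr(payload))           # note the trailing "\n"
print(payload.endswith("\n"))  # True - the newline is the record terminator

So the trailing newline is part of the format, not a bug in the write.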

Here is the sample code.

from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.types import *

schema = StructType([
    StructField("author", StringType(), False),
    StructField("title", StringType(), False),
    StructField("pages", IntegerType(), False),
    StructField("email", StringType(), False)
])

# likewise we have 1.5 billion records
data = [
    ["author1", "title1", 1, "author1@Stuff .com"],
    ["author2", "title2", 2, "author2@Stuff .com"],
    ["author3", "title3", 3, "author3@Stuff .com"],
    ["author4", "title4", 4, "author4@Stuff .com"]
]

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(data, schema)
    dfprocessed = df  # here we are doing lots of joins with other tables
    dfprocessed = dfprocessed.agg(
        f.collect_list(f.struct(f.col('author'), f.col('title'), f.col('pages'), f.col('email'))).alias("list-item"))
    dfprocessed = dfprocessed.withColumn("version", f.lit("1.0").cast(IntegerType()))
    dfprocessed.write.format("json").mode("overwrite").option("escape", "").save('./TestJson')
    # the above write adds one extra empty line
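If the downstream application truly cannot tolerate the terminating newline, one possible workaround (a sketch, not the only option) is to write with coalesce(1) so Spark produces a single part file, then post-process that file and strip the final newline. The helper below and its paths (out_dir, dest) are hypothetical names for this sketch:

import glob
import os

def strip_trailing_newline(out_dir, dest):
    """Copy the single Spark part file into `dest` without the final newline.

    Assumes the job wrote exactly one part file (e.g. after coalesce(1)).
    `out_dir` and `dest` are placeholder paths for this sketch.
    """
    part_files = glob.glob(os.path.join(out_dir, "part-*"))
    assert len(part_files) == 1, "expected exactly one part file"
    with open(part_files[0], "rb") as src:
        data = src.read().rstrip(b"\n")  # drop only the trailing record terminator
    with open(dest, "wb") as out:
        out.write(data)
    return dest

Note that coalescing 1.5 billion records into one partition funnels the whole write through a single task, so this only makes sense if the aggregated output (here, one collect_list row) is already small.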
