
DataFrameWriter class

Interface used to write a DataFrame to external storage systems (e.g., file systems, key-value stores).

Supports Spark Connect

Syntax

Use DataFrame.write to access this interface.

Methods

  • mode(saveMode): Specifies the behavior when data or a table already exists.
  • format(source): Specifies the underlying output data source.
  • option(key, value): Adds an output option for the underlying data source.
  • options(**options): Adds output options for the underlying data source.
  • partitionBy(*cols): Partitions the output by the given columns on the file system.
  • bucketBy(numBuckets, col, *cols): Buckets the output by the given columns.
  • sortBy(col, *cols): Sorts the output in each bucket by the given columns on the file system.
  • clusterBy(*cols): Clusters the data by the given columns to optimize query performance.
  • save(path, format, mode, partitionBy, **options): Saves the contents of the DataFrame to a data source.
  • insertInto(tableName, overwrite): Inserts the content of the DataFrame into the specified table.
  • saveAsTable(name, format, mode, partitionBy, **options): Saves the content of the DataFrame as the specified table.
  • json(path, mode, compression, ...): Saves the content of the DataFrame in JSON format at the specified path.
  • parquet(path, mode, partitionBy, compression): Saves the content of the DataFrame in Parquet format at the specified path.
  • text(path, compression, lineSep): Saves the content of the DataFrame in a text file at the specified path.
  • csv(path, mode, compression, sep, ...): Saves the content of the DataFrame in CSV format at the specified path.
  • xml(path, rowTag, mode, ...): Saves the content of the DataFrame in XML format at the specified path.
  • orc(path, mode, partitionBy, compression): Saves the content of the DataFrame in ORC format at the specified path.
  • excel(path, mode, dataAddress, headerRows): Saves the content of the DataFrame in Excel format at the specified path.
  • jdbc(url, table, mode, properties): Saves the content of the DataFrame to an external database table via JDBC.

Save Modes

The mode() method accepts the following save modes:

  • append: Append contents of this DataFrame to existing data.
  • overwrite: Overwrite existing data.
  • error or errorifexists: Throw an exception if data already exists (default).
  • ignore: Silently ignore this operation if data already exists.

Examples

Writing to different data sources

# Access the DataFrameWriter through the DataFrame.write property
df = spark.createDataFrame([{"name": "Alice", "age": 30}])
writer = df.write

# Write in JSON format (the output path becomes a directory of part files)
df.write.json("path/to/output.json")

# Write in CSV format with a header option
df.write.option("header", "true").csv("path/to/output.csv")

# Write in Parquet format
df.write.parquet("path/to/output.parquet")

# Write to a table
df.write.saveAsTable("table_name")

Using format and save

# Specify format explicitly
df.write.format("json").save("path/to/output.json")

# With options
df.write.format("csv") \
    .option("header", "true") \
    .option("compression", "gzip") \
    .save("path/to/output.csv")

Specifying save mode

# Overwrite existing data
df.write.mode("overwrite").parquet("path/to/output.parquet")

# Append to existing data
df.write.mode("append").parquet("path/to/output.parquet")

# Ignore if data exists
df.write.mode("ignore").json("path/to/output.json")

# Error if data exists (default)
df.write.mode("error").csv("path/to/output.csv")

Partitioning data

# Partition by single column
df.write.partitionBy("year").parquet("path/to/output.parquet")

# Partition by multiple columns
df.write.partitionBy("year", "month").parquet("path/to/output.parquet")

# Bucketing (bucketBy and sortBy require saveAsTable)
df.write \
    .bucketBy(10, "id") \
    .sortBy("age") \
    .saveAsTable("bucketed_table")

Writing to JDBC

# Write to database table
df.write.jdbc(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="users",
    mode="overwrite",
    properties={"user": "myuser", "password": "mypassword"}
)

Method chaining

# Chain multiple configuration methods
df.write \
    .format("parquet") \
    .mode("overwrite") \
    .option("compression", "snappy") \
    .partitionBy("year", "month") \
    .save("path/to/output")

Writing to tables

# Save as managed table
df.write.saveAsTable("my_table")

# Save as managed table with options
df.write \
    .mode("overwrite") \
    .format("parquet") \
    .partitionBy("year") \
    .saveAsTable("partitioned_table")

# Insert into existing table
df.write.insertInto("existing_table")

# Insert into existing table with overwrite
df.write.insertInto("existing_table", overwrite=True)