Read and write ORC files

Apache ORC is a columnar file format optimized for large-scale analytical workloads. Its columnar storage, built-in indexes, and statistics allow query engines to skip irrelevant data and read only the columns needed, making it significantly more efficient than row-based formats like CSV or JSON for read-heavy workloads. Azure Databricks supports ORC for both reading and writing with Apache Spark, including schema specification, partitioning, and compression.

Prerequisites

Azure Databricks requires no additional configuration to read or write ORC files. To ingest ORC files as a stream, use Auto Loader.
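
As a rough sketch of what streaming ORC ingestion with Auto Loader can look like, the helper below builds a streaming reader on the cloudFiles source. The function name stream_orc_files and its schema_location argument are illustrative assumptions rather than part of this page, and the notebook's spark session is passed in explicitly.

```python
def stream_orc_files(spark, source_path, schema_location):
    """Return a streaming DataFrame that incrementally ingests ORC files.

    Hypothetical helper: source_path and schema_location are placeholders
    you supply, for example paths under /Volumes/<catalog>/<schema>/<volume>/.
    """
    return (
        spark.readStream
        .format("cloudFiles")                                  # Auto Loader source
        .option("cloudFiles.format", "orc")                    # format of the incoming files
        .option("cloudFiles.schemaLocation", schema_location)  # where the inferred schema is tracked
        .load(source_path)
    )
```

The returned DataFrame is lazy; nothing is read until you start a query on it with writeStream.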

Options

Use the .option() and .options() methods of DataFrameReader and DataFrameWriter to configure ORC data sources. For a complete list of supported options, see DataFrameReader ORC options and DataFrameWriter ORC options.
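
To illustrate the difference between the two methods, the sketch below wraps an ORC read in a small helper that forwards any keyword arguments through .options(). The helper name read_orc is hypothetical, and mergeSchema is used only as an example of a reader option; check the linked option references for what your runtime supports.

```python
def read_orc(spark, path, **options):
    """Read ORC files at `path`, applying any reader options passed as keywords.

    Hypothetical helper. Calling read_orc(spark, path, mergeSchema="true") is
    intended to match spark.read.format("orc").option("mergeSchema", "true").load(path).
    """
    reader = spark.read.format("orc")
    if options:
        # .options() sets several options at once; .option() sets one at a time.
        reader = reader.options(**options)
    return reader.load(path)
```

Passing options as keywords keeps all reader settings for a call site in one place, which can be easier to review than a long chain of .option() calls.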

Set ORC compression

When writing ORC files to cloud storage, compression reduces storage costs and can improve query performance by reducing I/O.

Configure compression using the compression write option. The default is snappy.

Codec	Description
`none`	No compression.
`snappy`	Optimized for speed with moderate compression. Good default for most workloads.
`zlib`	Higher compression ratio than `snappy` at the cost of additional CPU time.
`lzo`	Fast decompression, lower compression ratio.

For example, write the Wanderbricks reviews to reviews_orc_compressed using zlib compression.

Python

df = spark.read.table("samples.wanderbricks.reviews")
df.write.format("orc").option("compression", "zlib").save("/Volumes/<catalog>/<schema>/<volume>/reviews_orc_compressed")

Scala

val df = spark.read.table("samples.wanderbricks.reviews")
df.write.format("orc").option("compression", "zlib").save("/Volumes/<catalog>/<schema>/<volume>/reviews_orc_compressed")

SQL

CREATE TABLE reviews_orc_compressed
USING ORC
OPTIONS (compression 'zlib')
AS SELECT * FROM samples.wanderbricks.reviews;
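
If the codec name comes from configuration or user input, it can help to validate it before starting a write. The following is a minimal sketch under the assumption that only the codecs listed in the table above should be accepted; your runtime may support additional codecs, and the names SUPPORTED_CODECS and write_orc_compressed are illustrative.

```python
# Codecs taken from the table on this page; this is an assumption, not an
# exhaustive list of what every runtime accepts.
SUPPORTED_CODECS = ("none", "snappy", "zlib", "lzo")


def write_orc_compressed(df, path, codec="snappy"):
    """Write `df` as ORC to `path` with the given codec and return the codec used."""
    codec = codec.lower()
    if codec not in SUPPORTED_CODECS:
        raise ValueError(
            f"Unsupported ORC codec {codec!r}; expected one of {SUPPORTED_CODECS}"
        )
    df.write.format("orc").option("compression", codec).save(path)
    return codec
```

Failing fast on an unknown codec name avoids launching a job that would only fail once the write begins.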

Usage

The following examples use the Wanderbricks sample dataset to demonstrate reading and writing ORC files using the Spark DataFrame API and SQL.

Read and write ORC files

Python

# Write wanderbricks reviews to ORC format
reviews = spark.read.table("samples.wanderbricks.reviews")
reviews.write.format("orc").save("/Volumes/<catalog>/<schema>/<volume>/reviews_orc")

# Read an ORC file into a DataFrame
df = spark.read.format("orc").load("/Volumes/<catalog>/<schema>/<volume>/reviews_orc")
display(df)

# Write with overwrite mode. Write from the source table rather than from df,
# because df reads from the same path that is being overwritten.
reviews.write.format("orc").mode("overwrite").save("/Volumes/<catalog>/<schema>/<volume>/reviews_orc")

Scala

// Write wanderbricks reviews to ORC format
val reviews = spark.read.table("samples.wanderbricks.reviews")
reviews.write.format("orc").save("/Volumes/<catalog>/<schema>/<volume>/reviews_orc")

// Read an ORC file into a DataFrame
val df = spark.read.format("orc").load("/Volumes/<catalog>/<schema>/<volume>/reviews_orc")
df.show()

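// Write with overwrite mode. Write from the source table rather than from df,
// because df reads from the same path that is being overwritten.
reviews.write.format("orc").mode("overwrite").save("/Volumes/<catalog>/<schema>/<volume>/reviews_orc")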

SQL

-- Write wanderbricks reviews to ORC format
CREATE TABLE reviews_orc
USING ORC
AS SELECT * FROM samples.wanderbricks.reviews;

SELECT * FROM reviews_orc;

Read ORC files using SQL

Use read_files to query ORC files directly from cloud storage using SQL without creating a table.

SELECT * FROM read_files(
  '/Volumes/<catalog>/<schema>/<volume>/reviews_orc',
  format => 'orc'
);

Specify a schema

Specify a schema when reading ORC files to avoid the overhead of schema inference. For example, define a schema with review_id, rating, and comment fields and read reviews_orc into a DataFrame.

Python

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("review_id", StringType(), True),
    StructField("rating", IntegerType(), True),
    StructField("comment", StringType(), True)
])

df = spark.read.format("orc").schema(schema).load("/Volumes/<catalog>/<schema>/<volume>/reviews_orc")
df.printSchema()
df.show()

Scala

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val schema = StructType(Array(
  StructField("review_id", StringType, nullable = true),
  StructField("rating", IntegerType, nullable = true),
  StructField("comment", StringType, nullable = true)
))

val df = spark.read.format("orc").schema(schema).load("/Volumes/<catalog>/<schema>/<volume>/reviews_orc")
df.printSchema()
df.show()

SQL

-- Create a table with an explicit schema from ORC files.
-- This reuses the name reviews_orc, so drop or rename any existing table with that name first.
CREATE TABLE reviews_orc (
  review_id STRING,
  rating INT,
  comment STRING
)
USING ORC
OPTIONS (path "/Volumes/<catalog>/<schema>/<volume>/reviews_orc");

SELECT * FROM reviews_orc;
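
The reader's schema() method also accepts a DDL-formatted string, which can be more compact than building a StructType. The sketch below assembles such a string from (name, type) pairs; the helper names ddl_schema and read_reviews and the REVIEW_FIELDS constant are hypothetical and exist only for illustration.

```python
# Field names and types mirror the schema used in the examples above.
REVIEW_FIELDS = [("review_id", "STRING"), ("rating", "INT"), ("comment", "STRING")]


def ddl_schema(fields):
    """Build a DDL schema string such as 'a STRING, b INT' from (name, type) pairs."""
    return ", ".join(f"{name} {dtype}" for name, dtype in fields)


def read_reviews(spark, path):
    """Read ORC files at `path` using the explicit review schema."""
    return spark.read.format("orc").schema(ddl_schema(REVIEW_FIELDS)).load(path)
```

Here ddl_schema(REVIEW_FIELDS) produces "review_id STRING, rating INT, comment STRING", which matches the column list in the SQL example.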

Write partitioned ORC files

Partition ORC output by columns that queries commonly filter on so that Spark can skip whole partition directories instead of scanning every file. For example, read samples.wanderbricks.bookings and write it to bookings_orc_partitioned, partitioned by year and month values derived from the check_in column.

Python

from pyspark.sql.functions import year, month

df = spark.read.table("samples.wanderbricks.bookings")
df_with_parts = df.withColumn("year", year("check_in")).withColumn("month", month("check_in"))
df_with_parts.write.format("orc").partitionBy("year", "month").save("/Volumes/<catalog>/<schema>/<volume>/bookings_orc_partitioned")

Scala

import org.apache.spark.sql.functions.{col, year, month}

val bookings = spark.read.table("samples.wanderbricks.bookings")
val bookingsWithParts = bookings.withColumn("year", year(col("check_in"))).withColumn("month", month(col("check_in")))
bookingsWithParts.write.format("orc").partitionBy("year", "month").save("/Volumes/<catalog>/<schema>/<volume>/bookings_orc_partitioned")

SQL

-- Write partitioned ORC files by year and month
CREATE TABLE bookings_orc_partitioned
USING ORC
PARTITIONED BY (year, month)
AS SELECT *, year(check_in) AS year, month(check_in) AS month
FROM samples.wanderbricks.bookings;
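
Partitioned writes lay files out in Hive-style directories such as year=2024/month=6 under the target path. As a hedged sketch of how that layout is used when reading, the helpers below build a partition directory path and read a single month back with a filter on the partition columns. The names partition_dir and read_bookings_for_month are hypothetical, and the example values are placeholders.

```python
def partition_dir(base_path, **partition_values):
    """Return the Hive-style directory for the given partition values.

    Keyword order must match the order passed to partitionBy,
    for example partition_dir(base, year=2024, month=6).
    """
    parts = [f"{column}={value}" for column, value in partition_values.items()]
    return "/".join([base_path.rstrip("/"), *parts])


def read_bookings_for_month(spark, base_path, year, month):
    """Read the partitioned dataset and keep one year/month.

    Filtering on the partition columns lets Spark prune directories
    that cannot match instead of scanning every file.
    """
    df = spark.read.format("orc").load(base_path)
    return df.where(f"year = {int(year)} AND month = {int(month)}")
```

Loading from the base path rather than from a single partition directory keeps the year and month columns in the resulting DataFrame.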

Additional resources

What is Delta Lake in Azure Databricks?: If you are migrating from a Hive or Hadoop environment using ORC, Delta Lake is the recommended Databricks-native format. It adds ACID transactions, schema enforcement, time travel, and optimized read performance on top of Parquet-based storage.
Read and write Parquet files: If your workload requires the broadest ecosystem compatibility outside of Databricks, Parquet is the most widely supported columnar format across query engines and cloud storage tools.

Last updated on 2026-06-15