แชร์ผ่าน


Work with ORC files

Apache ORC is a columnar file format that provides optimizations for speeding up queries. It is more efficient than CSV or JSON. Azure Databricks supports ORC for both reading and writing with Apache Spark. For more information, see the Apache Spark documentation on ORC Files.

Prerequisites

Azure Databricks does not require additional configuration to use ORC files. However, to stream ORC files, you need Auto Loader.

Configure and use ORC with the DataFrame API

Use the Apache Spark DataFrame API to read and write ORC files when you need full control over schema, partitioning, or write behavior.

Read and write options

Review the following Apache Spark reference articles for supported DataFrame API read and write options.

Read and write ORC files

For example, read data.orc into a DataFrame df and write it to orc_output.

Python

# Read an ORC file into a DataFrame
df = spark.read.format("orc").load("/tmp/data.orc")
df.show()

# Write a DataFrame to ORC format
df.write.format("orc").save("/tmp/orc_output")

# Write with overwrite mode
df.write.format("orc").mode("overwrite").save("/tmp/orc_output")

Scala

// Read an ORC file into a DataFrame
val df = spark.read.format("orc").load("/tmp/data.orc")
df.show()

// Write a DataFrame to ORC format
df.write.format("orc").save("/tmp/orc_output")

// Write with overwrite mode
df.write.format("orc").mode("overwrite").save("/tmp/orc_output")

SQL

-- Query ORC files directly
SELECT * FROM orc.`/tmp/data.orc`;

-- Create a table from ORC files
CREATE TABLE orc_table
USING ORC
OPTIONS (path "/tmp/data.orc");

SELECT * FROM orc_table;

Read ORC files with schema specification

Specify a schema when reading ORC files to avoid the overhead of schema inference. For example, define a schema with name, age, and city fields and read data.orc into a DataFrame df.

Python

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])

df = spark.read.format("orc").schema(schema).load("/tmp/data.orc")
df.printSchema()
df.show()

Scala

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val schema = StructType(Array(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true),
  StructField("city", StringType, nullable = true)
))

val df = spark.read.format("orc").schema(schema).load("/tmp/data.orc")
df.printSchema()
df.show()

SQL

-- Create a table with an explicit schema from ORC files
CREATE TABLE orc_table (
  name STRING,
  age INT,
  city STRING
)
USING ORC
OPTIONS (path "/tmp/data.orc");

SELECT * FROM orc_table;

Write partitioned ORC files

Write partitioned ORC files for optimized query performance on large datasets. For example, create a DataFrame df with year, month, name, and amount columns and write it to partitioned_orc partitioned by year and month.

Python

df = spark.createDataFrame(
    [
        (2023, 1, "Alice", 100),
        (2023, 1, "Bob", 200),
        (2023, 2, "Alice", 150),
        (2024, 1, "Alice", 300),
    ],
    ["year", "month", "name", "amount"]
)

# Write partitioned by year and month
df.write.format("orc").partitionBy("year", "month").save("/tmp/partitioned_orc")

Scala

val df = Seq(
  (2023, 1, "Alice", 100),
  (2023, 1, "Bob", 200),
  (2023, 2, "Alice", 150),
  (2024, 1, "Alice", 300)
).toDF("year", "month", "name", "amount")

// Write partitioned by year and month
df.write.format("orc").partitionBy("year", "month").save("/tmp/partitioned_orc")

SQL

-- Create a partitioned ORC table
CREATE TABLE partitioned_orc_table (
  name STRING,
  amount INT
)
USING ORC
PARTITIONED BY (year INT, month INT);

Read ORC files using SQL

Use read_files to query ORC files directly from cloud storage using SQL without creating a table. For example, query an ORC file stored in cloud storage using the path to the file and the orc format specifier.

SELECT * FROM read_files(
  's3://<bucket>/<path>/<file>.orc',
  format => 'orc'
)

Set ORC compression

Configure ORC compression using the compression option. Supported codecs include none, snappy, zlib, and lzo. For example, write df to compressed_orc using zlib compression, or to snappy_orc using snappy compression.

Python

# Write with zlib compression
df.write.format("orc").option("compression", "zlib").save("/tmp/compressed_orc")

# Write with snappy compression (default)
df.write.format("orc").option("compression", "snappy").save("/tmp/snappy_orc")

Scala

// Write with zlib compression
df.write.format("orc").option("compression", "zlib").save("/tmp/compressed_orc")

// Write with snappy compression (default)
df.write.format("orc").option("compression", "snappy").save("/tmp/snappy_orc")

SQL

-- Create an ORC table with zlib compression
CREATE TABLE compressed_orc_table (
  name STRING,
  age INT,
  city STRING
)
USING ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

-- Create an ORC table with snappy compression
CREATE TABLE snappy_orc_table (
  name STRING,
  age INT,
  city STRING
)
USING ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');