DataFrameReader class

Interface used to load a DataFrame from external storage systems (e.g. file systems, key-value stores, etc.).

Supports Spark Connect

Syntax

Use SparkSession.read to access this interface.

Methods

Method Description
format(source) Specifies the input data source format.
schema(schema) Specifies the input schema.
option(key, value) Adds an input option for the underlying data source.
options(**options) Adds input options for the underlying data source.
load(path, format, schema, **options) Loads data from a data source and returns it as a DataFrame.
json(path, schema, ...) Loads JSON files and returns the results as a DataFrame.
table(tableName) Returns the specified table as a DataFrame.
parquet(*paths, **options) Loads Parquet files, returning the result as a DataFrame.
text(paths, wholetext, lineSep, ...) Loads text files and returns a DataFrame whose schema starts with a string column named "value".
csv(path, schema, sep, encoding, ...) Loads a CSV file and returns the result as a DataFrame.
xml(path, rowTag, schema, ...) Loads an XML file and returns the result as a DataFrame.
excel(path, dataAddress, headerRows, ...) Loads Excel files, returning the result as a DataFrame.
orc(path, mergeSchema, pathGlobFilter, ...) Loads ORC files, returning the result as a DataFrame.
jdbc(url, table, column, lowerBound, upperBound, numPartitions, predicates, properties) Construct a DataFrame representing the database table named table accessible via JDBC URL url and connection properties.

Examples

Reading from different data sources

# Access DataFrameReader through SparkSession
spark.read

# Read JSON file
df = spark.read.json("path/to/file.json")

# Read CSV file with options
df = spark.read.option("header", "true").csv("path/to/file.csv")

# Read Parquet file
df = spark.read.parquet("path/to/file.parquet")

# Read from a table
df = spark.read.table("table_name")

Using format and load

# Specify format explicitly
df = spark.read.format("json").load("path/to/file.json")

# With options
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("path/to/file.csv")

Specifying schema

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define schema
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Read CSV with schema
df = spark.read.schema(schema).csv("path/to/file.csv")

# Read CSV with DDL-formatted string schema
df = spark.read.schema("name STRING, age INT").csv("path/to/file.csv")

Reading from JDBC

# Read from database table
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="users",
    properties={"user": "myuser", "password": "mypassword"}
)

# Read with partitioning for parallel loading
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="users",
    column="id",
    lowerBound=1,
    upperBound=1000,
    numPartitions=10,
    properties={"user": "myuser", "password": "mypassword"}
)

Method chaining

# Chain multiple configuration methods
df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("delimiter", ",") \
    .schema("name STRING, age INT") \
    .load("path/to/file.csv")