DataFrameReader 类

用于从外部存储系统（例如文件系统、键值存储等）加载数据帧的接口。

支持 Spark Connect

Syntax

用于 SparkSession.read 访问此接口。

方法

方法	说明
`format(source)`	指定输入数据源格式。
`schema(schema)`	指定输入架构。
`option(key, value)`	添加基础数据源的输入选项。
`options(**options)`	添加基础数据源的输入选项。
`load(path, format, schema, **options)`	从数据源加载数据，并将其作为数据帧返回。
`json(path, schema, ...)`	加载 JSON 文件并将结果作为数据帧返回。
`table(tableName)`	以数据帧的形式返回指定的表。
`parquet(paths, *options)`	加载 Parquet 文件，以数据帧的形式返回结果。
`text(paths, wholetext, lineSep, ...)`	加载文本文件并返回一个数据帧，其架构以名为“value”的字符串列开头。
`csv(path, schema, sep, encoding, ...)`	加载 CSV 文件，并将结果作为数据帧返回。
`xml(path, rowTag, schema, ...)`	加载 XML 文件并将结果作为数据帧返回。
`excel(path, dataAddress, headerRows, ...)`	加载 Excel 文件，以数据帧的形式返回结果。
`orc(path, mergeSchema, pathGlobFilter, ...)`	加载 ORC 文件，以数据帧的形式返回结果。
`jdbc(url, table, column, lowerBound, upperBound, numPartitions, predicates, properties)`	构造一个数据帧，表示可通过 JDBC URL URL 和连接属性访问的数据库表命名表。

示例

从不同的数据源读取

# Access DataFrameReader through SparkSession
spark.read

# Read JSON file
df = spark.read.json("path/to/file.json")

# Read CSV file with options
df = spark.read.option("header", "true").csv("path/to/file.csv")

# Read Parquet file
df = spark.read.parquet("path/to/file.parquet")

# Read from a table
df = spark.read.table("table_name")

使用格式和加载

# Specify format explicitly
df = spark.read.format("json").load("path/to/file.json")

# With options
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("path/to/file.csv")

指定架构

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define schema
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Read CSV with schema
df = spark.read.schema(schema).csv("path/to/file.csv")

# Read CSV with DDL-formatted string schema
df = spark.read.schema("name STRING, age INT").csv("path/to/file.csv")

从 JDBC 读取

# Read from database table
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="users",
    properties={"user": "myuser", "password": "mypassword"}
)

# Read with partitioning for parallel loading
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="users",
    column="id",
    lowerBound=1,
    upperBound=1000,
    numPartitions=10,
    properties={"user": "myuser", "password": "mypassword"}
)

方法链接

# Chain multiple configuration methods
df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("delimiter", ",") \
    .schema("name STRING, age INT") \
    .load("path/to/file.csv")

反馈

此页面是否有帮助？

Last updated on 2026-03-15

通过

DataFrameReader 类

Syntax

方法

示例

从不同的数据源读取

使用格式和加载

指定架构

从 JDBC 读取

方法链接

反馈

其他资源