DataFrameReader class

Interface for loading a DataFrame from external storage systems (for example, file systems and key-value stores).

Supports Spark Connect

Syntax

Use SparkSession.read to access this interface.

Methods

Method	Description
`format(source)`	Specifies the format of the input data source.
`schema(schema)`	Specifies the input schema.
`option(key, value)`	Adds an input option for the underlying data source.
`options(**options)`	Adds multiple input options for the underlying data source.
`load(path, format, schema, **options)`	Loads data from a data source and returns it as a DataFrame.
`json(path, schema, ...)`	Loads JSON files and returns the result as a DataFrame.
`table(tableName)`	Returns the specified table as a DataFrame.
`parquet(*paths, **options)`	Loads Parquet files and returns the result as a DataFrame.
`text(paths, wholetext, lineSep, ...)`	Loads text files and returns a DataFrame whose schema starts with a string column named "value".
`csv(path, schema, sep, encoding, ...)`	Loads a CSV file and returns the result as a DataFrame.
`xml(path, rowTag, schema, ...)`	Loads an XML file and returns the result as a DataFrame.
`excel(path, dataAddress, headerRows, ...)`	Loads Excel files and returns the result as a DataFrame.
`orc(path, mergeSchema, pathGlobFilter, ...)`	Loads ORC files and returns the result as a DataFrame.
`jdbc(url, table, column, lowerBound, upperBound, numPartitions, predicates, properties)`	Constructs a DataFrame representing the database table named `table`, accessed through the JDBC URL `url` and the connection `properties`.

Examples

Reading from different data sources

# Access DataFrameReader through SparkSession
spark.read

# Read JSON file
df = spark.read.json("path/to/file.json")

# Read CSV file with options
df = spark.read.option("header", "true").csv("path/to/file.csv")

# Read Parquet file
df = spark.read.parquet("path/to/file.parquet")

# Read from a table
df = spark.read.table("table_name")
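
The method table also lists `text` and `orc`, which the snippets above do not cover. The following is a minimal sketch, not a definitive recipe: the function names are hypothetical, the `spark` argument is assumed to be an existing SparkSession, and the glob pattern is only an illustration.

```python
def read_lines(spark, path):
    """Read text files: one row per line, in a string column named "value"."""
    return spark.read.text(path)


def read_whole_files(spark, path):
    """Read text files with wholetext=True: one row per file instead of per line."""
    return spark.read.text(path, wholetext=True)


def read_orc_files(spark, path):
    """Read ORC files, keeping only files whose names match the glob pattern."""
    return spark.read.orc(path, pathGlobFilter="*.orc")
```

Wrapping each read in a small function keeps the reader configuration in one place; each function returns a DataFrame when called with a live session.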

Using format and load

# Specify format explicitly
df = spark.read.format("json").load("path/to/file.json")

# With options
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("path/to/file.csv")
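
Because `options(**options)` and `load(path, format, schema, **options)` both accept keyword arguments, the same settings can be kept in a dictionary and reused. A minimal sketch, assuming `spark` is an existing SparkSession; the dictionary and function names are hypothetical.

```python
# Reader options kept in a plain dict so they can be reused across reads.
csv_options = {"header": "true", "inferSchema": "true"}


def load_csv_chained(spark, path):
    """Set the format and options on the reader, then call load()."""
    return spark.read.format("csv").options(**csv_options).load(path)


def load_csv_direct(spark, path):
    """Pass the format and the same options straight to load()."""
    return spark.read.load(path, format="csv", **csv_options)
```

Both functions are intended to configure the same read; which form to use is mostly a matter of style.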

Specifying the schema

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define schema
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Read CSV with schema
df = spark.read.schema(schema).csv("path/to/file.csv")

# Read CSV with DDL-formatted string schema
df = spark.read.schema("name STRING, age INT").csv("path/to/file.csv")
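
When column definitions already exist as data, the DDL-formatted string can be assembled rather than typed by hand. This is only a sketch: the `columns` list and the function name are hypothetical, `spark` is assumed to be an existing SparkSession, and column names containing spaces or special characters would need quoting that is not handled here.

```python
# Hypothetical column list: (column name, Spark SQL type name).
columns = [("name", "STRING"), ("age", "INT")]

# Join the pairs into a DDL-formatted schema string.
ddl_schema = ", ".join(f"{name} {dtype}" for name, dtype in columns)


def read_csv_with_schema(spark, path):
    """Read a CSV file using the generated DDL schema string."""
    return spark.read.schema(ddl_schema).csv(path)
```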

Reading from JDBC

# Read from database table
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="users",
    properties={"user": "myuser", "password": "mypassword"}
)

# Read with partitioning for parallel loading
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="users",
    column="id",
    lowerBound=1,
    upperBound=1000,
    numPartitions=10,
    properties={"user": "myuser", "password": "mypassword"}
)
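
The `predicates` parameter is an alternative to `column`, `lowerBound`, `upperBound`, and `numPartitions`: each predicate is a WHERE-clause expression that defines one partition. The sketch below builds range predicates in plain Python. The helper names, the `id` column, and the bounds are assumptions for illustration. Unlike the bounds in the partitioned read above, which set the partition stride and do not filter rows, predicates do filter: rows matching none of them (for example, rows with a NULL `id`) are not read.

```python
def range_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper) into contiguous, non-overlapping WHERE clauses."""
    span = upper - lower
    # Integer boundaries, including both ends of the range.
    bounds = [lower + span * i // num_partitions for i in range(num_partitions + 1)]
    return [
        f"{column} >= {lo} AND {column} < {hi}"
        for lo, hi in zip(bounds, bounds[1:])
    ]


# Four partitions covering ids 1 through 1000 (upper bound is exclusive).
predicates = range_predicates("id", 1, 1001, 4)


def read_users(spark, url, properties):
    """Read the users table with one partition per predicate."""
    return spark.read.jdbc(
        url=url, table="users", predicates=predicates, properties=properties
    )
```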

Method chaining

# Chain multiple configuration methods
df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("delimiter", ",") \
    .schema("name STRING, age INT") \
    .load("path/to/file.csv")
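
Chaining works because each configuration method returns the reader, so the reader can also be held in a variable and configured step by step, for instance when the options come from a dictionary. A minimal sketch under that assumption; `configure_reader` and `load_people` are hypothetical helpers, and `spark` is assumed to be an existing SparkSession.

```python
def configure_reader(reader, source_format, options, schema=None):
    """Apply a format, a dict of options, and an optional schema to a reader."""
    reader = reader.format(source_format)
    for key, value in options.items():
        reader = reader.option(key, value)
    if schema is not None:
        reader = reader.schema(schema)
    return reader


def load_people(spark, path):
    """Hypothetical caller: configure the reader, then load the file."""
    reader = configure_reader(
        spark.read,
        "csv",
        {"header": "true", "delimiter": ","},
        schema="name STRING, age INT",
    )
    return reader.load(path)
```

`inferSchema` is left out of this sketch because an explicit schema is supplied.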

Last updated on 2026-03-15