This page provides an overview of the reference documentation available for PySpark, a Python API for Spark. For more information about PySpark, see PySpark on Azure Databricks.
| Reference | Description |
|---|---|
| Core Classes | Main classes for working with PySpark SQL, including SparkSession and DataFrame fundamentals. |
| Spark Session | The entry point for reading data and executing SQL queries in PySpark applications (see the first sketch after this table). |
| Configuration | Runtime configuration options for Spark SQL, including execution and optimizer settings. For information on configuration that is only available on Databricks, see Set Spark configuration properties on Azure Databricks. |
| DataFrame | Distributed collection of data organized into named columns, similar to a table in a relational database. |
| Input/Output | Methods for reading data from and writing data to various file formats and data sources. |
| Column | Operations for working with DataFrame columns, including transformations and expressions. |
| Data Types | Available data types in PySpark SQL, including primitive types, complex types, and user-defined types. |
| Row | Represents a row of data in a DataFrame, providing access to individual field values. |
| Functions | Built-in functions for data manipulation, transformation, and aggregation operations. |
| Window | Window functions for performing calculations across a set of table rows related to the current row (see the window sketch after this table). |
| Grouping | Methods for grouping data and performing aggregation operations on grouped DataFrames. |
| Catalog | Interface for managing databases, tables, functions, and other catalog metadata. |
| Avro | Support for reading and writing data in Apache Avro format. |
| Observation | Collects metrics and observes DataFrames during query execution for monitoring and debugging. |
| UDF | User-defined functions for applying custom Python logic to DataFrame columns (see the UDF sketch after this table). |
| UDTF | User-defined table functions that return multiple rows for each input row. |
| VariantVal | Handles semi-structured data with flexible schema, supporting dynamic types and nested structures. |
| ProtoBuf | Support for serializing and deserializing data using Protocol Buffers format. |
| Python DataSource | APIs for implementing custom data sources to read from external systems. For information about custom data sources, see PySpark custom data sources. |
| Stateful Processor | Manages state across streaming batches for complex stateful operations in structured streaming. |
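For orientation, here is a minimal sketch that ties several of these entries together (Spark Session, DataFrame, Column, Functions, and Grouping). It assumes a local PySpark installation; on Azure Databricks, notebooks provide a preconfigured `spark` session, so the builder call is unnecessary there. The data and app name are illustrative, not part of any API.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Entry point: build (or reuse) a SparkSession.
spark = SparkSession.builder.appName("pyspark-sql-demo").getOrCreate()

# DataFrame: a distributed collection of rows organized into named columns.
df = spark.createDataFrame(
    [("Alice", "HR", 34), ("Bob", "Eng", 45), ("Cathy", "Eng", 29)],
    schema=["name", "dept", "age"],
)

# Column expressions and built-in functions compose into transformations.
adults = df.filter(F.col("age") > 30).withColumn("name_upper", F.upper("name"))

# Grouping: aggregate over grouped data.
adults.groupBy("dept").agg(F.avg("age").alias("avg_age")).show()
```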
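Unlike `groupBy`, window functions compute a value per row without collapsing the input. A hedged sketch of the Window entry, using hypothetical sales data:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

sales = spark.createDataFrame(
    [("east", "jan", 100), ("east", "feb", 150), ("west", "jan", 200)],
    schema=["region", "month", "amount"],
)

# Rank each row within its region by amount; every input row is kept.
w = Window.partitionBy("region").orderBy(F.desc("amount"))
sales.withColumn("rank", F.rank().over(w)).show()
```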
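User-defined functions apply custom Python logic per value when no built-in function fits; they are generally slower than built-ins because rows are serialized to a Python worker. A minimal sketch (the `greet` function is a hypothetical example, not part of the PySpark API):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Declare the return type so Spark can plan the query.
# `greet` is a hypothetical example function.
@F.udf(returnType=StringType())
def greet(name):
    return f"Hello, {name}!"

df = spark.createDataFrame([("Alice",), ("Bob",)], schema=["name"])
df.withColumn("greeting", greet(F.col("name"))).show()
```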