SparkSession

The entry point to programming Spark with the Dataset and DataFrame API. A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files.

Syntax

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Properties

Property - Description
version - The version of Spark on which this application is running.
conf - Runtime configuration interface for Spark.
catalog - Interface through which the user may create, drop, alter or query underlying databases, tables, functions, etc.
udf - Returns a UDFRegistration for UDF registration.
udtf - Returns a UDTFRegistration for UDTF registration.
dataSource - Returns a DataSourceRegistration for data source registration.
profile - Returns a Profile for performance/memory profiling.
sparkContext - Returns the underlying SparkContext. Classic mode only.
read - Returns a DataFrameReader that can be used to read data as a DataFrame.
readStream - Returns a DataStreamReader that can be used to read data streams as a streaming DataFrame.
streams - Returns a StreamingQueryManager that allows managing all active streaming queries.
tvf - Returns a TableValuedFunction for calling table-valued functions (TVFs).

Methods

Method - Description
createDataFrame(data, schema, samplingRatio, verifySchema) - Creates a DataFrame from an RDD, a list, a pandas DataFrame, a numpy ndarray, or a pyarrow Table.
sql(sqlQuery, args, **kwargs) - Returns a DataFrame representing the result of the given query.
table(tableName) - Returns the specified table as a DataFrame.
range(start, end, step, numPartitions) - Creates a DataFrame with a single LongType column named id, containing elements in a range.
newSession() - Returns a new SparkSession with separate SQLConf, registered temporary views, and UDFs, but shared SparkContext and table cache. Classic mode only.
getActiveSession() - Returns the active SparkSession for the current thread.
active() - Returns the active or default SparkSession for the current thread.
stop() - Stops the underlying SparkContext.
addArtifacts(*path, pyfile, archive, file) - Adds artifact(s) to the client session.
interruptAll() - Interrupts all operations of this session currently running on the server.
interruptTag(tag) - Interrupts all operations of this session with the given tag.
interruptOperation(op_id) - Interrupts an operation of this session with the given operationId.
addTag(tag) - Adds a tag to be assigned to all operations started by this thread in this session.
removeTag(tag) - Removes a tag previously added for operations started by this thread.
getTags() - Gets the tags currently set to be assigned to all operations started by this thread.
clearTags() - Clears the current thread's operation tags.

Builder

Method - Description
config(key, value) - Sets a config option. Options are automatically propagated to both SparkConf and SparkSession's own configuration.
master(master) - Sets the Spark master URL to connect to.
remote(url) - Sets the Spark remote URL to connect to via Spark Connect.
appName(name) - Sets a name for the application, which will be shown in the Spark web UI.
enableHiveSupport() - Enables Hive support, including connectivity to a persistent Hive metastore.
getOrCreate() - Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder.
create() - Creates a new SparkSession.

Examples

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
        .master("local")
        .appName("Word Count")
        .config("spark.some.config.option", "some-value")
        .getOrCreate()
)

spark.sql("SELECT * FROM range(10) WHERE id > 7").show()
+---+
| id|
+---+
|  8|
|  9|
+---+

spark.createDataFrame([('Alice', 1)], ['name', 'age']).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+

spark.range(1, 7, 2).show()
+---+
| id|
+---+
|  1|
|  3|
|  5|
+---+