The entry point to programming Spark with the Dataset and DataFrame API. A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files.
Syntax
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Properties
| Property | Description |
|---|---|
| version | The version of Spark on which this application is running. |
| conf | Runtime configuration interface for Spark. |
| catalog | Interface through which the user may create, drop, alter, or query underlying databases, tables, functions, etc. |
| udf | Returns a UDFRegistration for UDF registration. |
| udtf | Returns a UDTFRegistration for UDTF registration. |
| dataSource | Returns a DataSourceRegistration for data source registration. |
| profile | Returns a Profile for performance/memory profiling. |
| sparkContext | Returns the underlying SparkContext. Classic mode only. |
| read | Returns a DataFrameReader that can be used to read data as a DataFrame. |
| readStream | Returns a DataStreamReader that can be used to read data streams as a streaming DataFrame. |
| streams | Returns a StreamingQueryManager that allows managing all active streaming queries. |
| tvf | Returns a TableValuedFunction for calling table-valued functions (TVFs). |
Methods
| Method | Description |
|---|---|
| createDataFrame(data, schema, samplingRatio, verifySchema) | Creates a DataFrame from an RDD, a list, a pandas DataFrame, a numpy ndarray, or a pyarrow Table. |
| sql(sqlQuery, args, **kwargs) | Returns a DataFrame representing the result of the given query. |
| table(tableName) | Returns the specified table as a DataFrame. |
| range(start, end, step, numPartitions) | Creates a DataFrame with a single LongType column named id, containing elements in a range. |
| newSession() | Returns a new SparkSession with separate SQLConf, registered temporary views, and UDFs, but shared SparkContext and table cache. Classic mode only. |
| getActiveSession() | Returns the active SparkSession for the current thread. |
| active() | Returns the active or default SparkSession for the current thread. |
| stop() | Stops the underlying SparkContext. |
| addArtifacts(*path, pyfile, archive, file) | Adds artifact(s) to the client session. |
| interruptAll() | Interrupts all operations of this session currently running on the server. |
| interruptTag(tag) | Interrupts all operations of this session with the given tag. |
| interruptOperation(op_id) | Interrupts an operation of this session with the given operationId. |
| addTag(tag) | Adds a tag to be assigned to all operations started by this thread in this session. |
| removeTag(tag) | Removes a tag previously added for operations started by this thread. |
| getTags() | Gets the tags currently set to be assigned to all operations started by this thread. |
| clearTags() | Clears the current thread's operation tags. |
Builder
| Method | Description |
|---|---|
| config(key, value) | Sets a config option. Options are automatically propagated to both SparkConf and SparkSession's own configuration. |
| master(master) | Sets the Spark master URL to connect to. |
| remote(url) | Sets the Spark remote URL to connect via Spark Connect. |
| appName(name) | Sets a name for the application, which will be shown in the Spark web UI. |
| enableHiveSupport() | Enables Hive support, including connectivity to a persistent Hive metastore. |
| getOrCreate() | Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder. |
| create() | Creates a new SparkSession. |
Examples
spark = (
SparkSession.builder
.master("local")
.appName("Word Count")
.config("spark.some.config.option", "some-value")
.getOrCreate()
)
spark.sql("SELECT * FROM range(10) where id > 7").show()
+---+
| id|
+---+
| 8|
| 9|
+---+
spark.createDataFrame([('Alice', 1)], ['name', 'age']).show()
+-----+---+
| name|age|
+-----+---+
|Alice| 1|
+-----+---+
spark.range(1, 7, 2).show()
+---+
| id|
+---+
| 1|
| 3|
| 5|
+---+