Lưu ý
Cần có ủy quyền mới truy nhập được vào trang này. Bạn có thể thử đăng nhập hoặc thay đổi thư mục.
Cần có ủy quyền mới truy nhập được vào trang này. Bạn có thể thử thay đổi thư mục.
A distributed collection of data grouped into named columns.
A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession.
Important
A DataFrame should not be directly created using the constructor.
Supports Spark Connect
Properties
| Property | Description |
|---|---|
sparkSession |
Returns SparkSession that created this DataFrame. |
rdd |
Returns the content as an RDD of Row (Classic mode only). |
na |
Returns a DataFrameNaFunctions for handling missing values. |
stat |
Returns a DataFrameStatFunctions for statistic functions. |
write |
Interface for saving the content of the non-streaming DataFrame out into external storage. |
writeStream |
Interface for saving the content of the streaming DataFrame out into external storage. |
schema |
Returns the schema of this DataFrame as a StructType. |
dtypes |
Returns all column names and their data types as a list. |
columns |
Retrieves the names of all columns in the DataFrame as a list. |
storageLevel |
Get the DataFrame's current storage level. |
isStreaming |
Returns True if this DataFrame contains one or more sources that continuously return data as it arrives. |
executionInfo |
Returns a ExecutionInfo object after the query was executed. |
plot |
Returns a PySparkPlotAccessor for plotting functions. |
Methods
Data viewing and inspection
| Method | Description |
|---|---|
toJSON(use_unicode) |
Converts a DataFrame into a RDD of string or DataFrame. |
printSchema(level) |
Prints out the schema in the tree format. |
explain(extended, mode) |
Prints the (logical and physical) plans to the console for debugging purposes. |
show(n, truncate, vertical) |
Prints the first n rows of the DataFrame to the console. |
collect() |
Returns all the records in the DataFrame as a list of Row. |
toLocalIterator(prefetchPartitions) |
Returns an iterator that contains all of the rows in this DataFrame. |
take(num) |
Returns the first num rows as a list of Row. |
tail(num) |
Returns the last num rows as a list of Row. |
head(n) |
Returns the first n rows. |
first() |
Returns the first row as a Row. |
count() |
Returns the number of rows in this DataFrame. |
isEmpty() |
Checks if the DataFrame is empty and returns a boolean value. |
describe(*cols) |
Computes basic statistics for numeric and string columns. |
summary(*statistics) |
Computes specified statistics for numeric and string columns. |
Temporary views
| Method | Description |
|---|---|
createTempView(name) |
Creates a local temporary view with this DataFrame. |
createOrReplaceTempView(name) |
Creates or replaces a local temporary view with this DataFrame. |
createGlobalTempView(name) |
Creates a global temporary view with this DataFrame. |
createOrReplaceGlobalTempView(name) |
Creates or replaces a global temporary view using the given name. |
Selection and projection
| Method | Description |
|---|---|
select(*cols) |
Projects a set of expressions and returns a new DataFrame. |
selectExpr(*expr) |
Projects a set of SQL expressions and returns a new DataFrame. |
filter(condition) |
Filters rows using the given condition. |
where(condition) |
Alias for filter. |
drop(*cols) |
Returns a new DataFrame without specified columns. |
toDF(*cols) |
Returns a new DataFrame with new specified column names. |
withColumn(colName, col) |
Returns a new DataFrame by adding a column or replacing the existing column that has the same name. |
withColumns(*colsMap) |
Returns a new DataFrame by adding multiple columns or replacing the existing columns that have the same names. |
withColumnRenamed(existing, new) |
Returns a new DataFrame by renaming an existing column. |
withColumnsRenamed(colsMap) |
Returns a new DataFrame by renaming multiple columns. |
withMetadata(columnName, metadata) |
Returns a new DataFrame by updating an existing column with metadata. |
metadataColumn(colName) |
Selects a metadata column based on its logical column name and returns it as a Column. |
colRegex(colName) |
Selects column based on the column name specified as a regex and returns it as Column. |
Sorting and ordering
| Method | Description |
|---|---|
sort(*cols, **kwargs) |
Returns a new DataFrame sorted by the specified column(s). |
orderBy(*cols, **kwargs) |
Alias for sort. |
sortWithinPartitions(*cols, **kwargs) |
Returns a new DataFrame with each partition sorted by the specified column(s). |
Aggregation and grouping
| Method | Description |
|---|---|
groupBy(*cols) |
Groups the DataFrame by the specified columns so that aggregation can be performed on them. |
rollup(*cols) |
Create a multi-dimensional rollup for the current DataFrame using the specified columns. |
cube(*cols) |
Create a multi-dimensional cube for the current DataFrame using the specified columns. |
groupingSets(groupingSets, *cols) |
Create multi-dimensional aggregation for the current DataFrame using the specified grouping sets. |
agg(*exprs) |
Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()). |
observe(observation, *exprs) |
Define (named) metrics to observe on the DataFrame. |
Joins
| Method | Description |
|---|---|
join(other, on, how) |
Joins with another DataFrame, using the given join expression. |
crossJoin(other) |
Returns the cartesian product with another DataFrame. |
lateralJoin(other, on, how) |
Lateral joins with another DataFrame, using the given join expression. |
Set operations
| Method | Description |
|---|---|
union(other) |
Return a new DataFrame containing the union of rows in this and another DataFrame. |
unionByName(other, allowMissingColumns) |
Returns a new DataFrame containing union of rows in this and another DataFrame. |
intersect(other) |
Return a new DataFrame containing rows only in both this DataFrame and another DataFrame. |
intersectAll(other) |
Return a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates. |
subtract(other) |
Return a new DataFrame containing rows in this DataFrame but not in another DataFrame. |
exceptAll(other) |
Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates. |
Deduplication
| Method | Description |
|---|---|
distinct() |
Returns a new DataFrame containing the distinct rows in this DataFrame. |
dropDuplicates(subset) |
Return a new DataFrame with duplicate rows removed, optionally only considering certain columns. |
dropDuplicatesWithinWatermark(subset) |
Return a new DataFrame with duplicate rows removed, optionally only considering certain columns, within watermark. |
Sampling and splitting
| Method | Description |
|---|---|
sample(withReplacement, fraction, seed) |
Returns a sampled subset of this DataFrame. |
sampleBy(col, fractions, seed) |
Returns a stratified sample without replacement based on the fraction given on each stratum. |
randomSplit(weights, seed) |
Randomly splits this DataFrame with the provided weights. |
Partitioning
| Method | Description |
|---|---|
coalesce(numPartitions) |
Returns a new DataFrame that has exactly numPartitions partitions. |
repartition(numPartitions, *cols) |
Returns a new DataFrame partitioned by the given partitioning expressions. |
repartitionByRange(numPartitions, *cols) |
Returns a new DataFrame partitioned by the given partitioning expressions. |
repartitionById(numPartitions, partitionIdCol) |
Returns a new DataFrame partitioned by the given partition ID expression. |
Reshaping
| Method | Description |
|---|---|
unpivot(ids, values, variableColumnName, valueColumnName) |
Unpivot a DataFrame from wide format to long format. |
melt(ids, values, variableColumnName, valueColumnName) |
Alias for unpivot. |
transpose(indexColumn) |
Transposes a DataFrame such that the values in the specified index column become the new columns. |
Missing data handling
| Method | Description |
|---|---|
dropna(how, thresh, subset) |
Returns a new DataFrame omitting rows with null or NaN values. |
fillna(value, subset) |
Returns a new DataFrame which null values are filled with new value. |
replace(to_replace, value, subset) |
Returns a new DataFrame replacing a value with another value. |
Statistical functions
| Method | Description |
|---|---|
approxQuantile(col, probabilities, relativeError) |
Calculates the approximate quantiles of numerical columns of a DataFrame. |
corr(col1, col2, method) |
Calculates the correlation of two columns of a DataFrame as a double value. |
cov(col1, col2) |
Calculate the sample covariance for the given columns, specified by their names. |
crosstab(col1, col2) |
Computes a pair-wise frequency table of the given columns. |
freqItems(cols, support) |
Finding frequent items for columns, possibly with false positives. |
Schema operations
| Method | Description |
|---|---|
to(schema) |
Returns a new DataFrame where each row is reconciled to match the specified schema. |
alias(alias) |
Returns a new DataFrame with an alias set. |
Iteration
| Method | Description |
|---|---|
foreach(f) |
Applies the f function to all Row of this DataFrame. |
foreachPartition(f) |
Applies the f function to each partition of this DataFrame. |
Caching and persistence
| Method | Description |
|---|---|
cache() |
Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER). |
persist(storageLevel) |
Sets the storage level to persist the contents of the DataFrame across operations. |
unpersist(blocking) |
Marks the DataFrame as non-persistent, and remove all blocks for it from memory and disk. |
Checkpointing
| Method | Description |
|---|---|
checkpoint(eager) |
Returns a checkpointed version of this DataFrame. |
localCheckpoint(eager, storageLevel) |
Returns a locally checkpointed version of this DataFrame. |
Streaming operations
| Method | Description |
|---|---|
withWatermark(eventTime, delayThreshold) |
Defines an event time watermark for this DataFrame. |
Optimization hints
| Method | Description |
|---|---|
hint(name, *parameters) |
Specifies some hint on the current DataFrame. |
Limits and offsets
| Method | Description |
|---|---|
limit(num) |
Limits the result count to the number specified. |
offset(num) |
Returns a new DataFrame by skipping the first n rows. |
Advanced transformations
| Method | Description |
|---|---|
transform(func, *args, **kwargs) |
Returns a new DataFrame. Concise syntax for chaining custom transformations. |
Conversion methods
| Method | Description |
|---|---|
toPandas() |
Returns the contents of this DataFrame as Pandas pandas.DataFrame. |
toArrow() |
Returns the contents of this DataFrame as PyArrow pyarrow.Table. |
pandas_api(index_col) |
Converts the existing DataFrame into a pandas-on-Spark DataFrame. |
mapInPandas(func, schema, barrier, profile) |
Maps an iterator of batches in the current DataFrame using a Python native function. |
mapInArrow(func, schema, barrier, profile) |
Maps an iterator of batches in the current DataFrame using a Python native function that is performed on pyarrow.RecordBatch. |
Writing data
| Method | Description |
|---|---|
writeTo(table) |
Create a write configuration builder for v2 sources. |
mergeInto(table, condition) |
Merges a set of updates, insertions, and deletions based on a source table into a target table. |
DataFrame comparison
| Method | Description |
|---|---|
sameSemantics(other) |
Returns True when the logical query plans inside both DataFrames are equal. |
semanticHash() |
Returns a hash code of the logical query plan against this DataFrame. |
Metadata and file information
| Method | Description |
|---|---|
inputFiles() |
Returns a best-effort snapshot of the files that compose this DataFrame. |
Advanced SQL features
| Method | Description |
|---|---|
isLocal() |
Returns True if the collect and take methods can be run locally. |
asTable() |
Converts the DataFrame into a TableArg object, which can be used as a table argument in a TVF. |
scalar() |
Return a Column object for a SCALAR Subquery containing exactly one row and one column. |
exists() |
Return a Column object for an EXISTS Subquery. |
Examples
Basic DataFrame operations
# Create a DataFrame
people = spark.createDataFrame([
{"deptId": 1, "age": 40, "name": "Alice", "gender": "M", "salary": 50},
{"deptId": 1, "age": 50, "name": "Bob", "gender": "M", "salary": 100},
{"deptId": 2, "age": 60, "name": "Sue", "gender": "F", "salary": 150},
{"deptId": 3, "age": 20, "name": "Tom", "gender": "M", "salary": 200}
])
# Select columns
people.select("name", "age").show()
# Filter rows
people.filter(people.age > 30).show()
# Add a new column
people.withColumn("age_plus_10", people.age + 10).show()
Aggregation and grouping
# Group by and aggregate
people.groupBy("gender").agg({"salary": "avg", "age": "max"}).show()
# Multiple aggregations
from pyspark.sql import functions as F
people.groupBy("deptId").agg(
F.avg("salary").alias("avg_salary"),
F.max("age").alias("max_age")
).show()
Joins
# Create another DataFrame
department = spark.createDataFrame([
{"id": 1, "name": "PySpark"},
{"id": 2, "name": "ML"},
{"id": 3, "name": "Spark SQL"}
])
# Join DataFrames
people.join(department, people.deptId == department.id).show()
Complex transformations
# Chained operations
result = people.filter(people.age > 30) \\
.join(department, people.deptId == department.id) \\
.groupBy(department.name, "gender") \\
.agg({"salary": "avg", "age": "max"}) \\
.sort("max(age)")
result.show()