Computes aggregates and returns the result as a DataFrame.
The available aggregate functions can be:
- Built-in aggregation functions, such as `avg`, `max`, `min`, `sum`, `count`.
- Group aggregate pandas UDFs, created with `pyspark.sql.functions.pandas_udf`.
Syntax
agg(*exprs)
Parameters
| Parameter | Type | Description |
|---|---|---|
| exprs | dict or Column | A dict mapping from column name (string) to aggregate function name (string), or a list of aggregate Column expressions. |
Returns
DataFrame
Notes
Built-in aggregation functions and group aggregate pandas UDFs cannot be mixed in a single call to this function.
When exprs is a single dict, the key is the column to perform aggregation on and the value is the aggregate function. When exprs is a list of Column expressions, each expression specifies an aggregation to compute.
Examples
import pandas as pd
from pyspark.sql import functions as sf
df = spark.createDataFrame(
[(2, "Alice"), (3, "Alice"), (5, "Bob"), (10, "Bob")], ["age", "name"])
# Group-by name, and count each group.
df.groupBy(df.name).agg({"*": "count"}).sort("name").show()
# +-----+--------+
# | name|count(1)|
# +-----+--------+
# |Alice| 2|
# | Bob| 2|
# +-----+--------+
# Group-by name, and calculate the minimum age.
df.groupBy(df.name).agg(sf.min(df.age)).sort("name").show()
# +-----+--------+
# | name|min(age)|
# +-----+--------+
# |Alice| 2|
# | Bob| 5|
# +-----+--------+
# Same as above but uses a pandas UDF.
from pyspark.sql.functions import pandas_udf
@pandas_udf('int')
def min_udf(v: pd.Series) -> int:
    return v.min()
df.groupBy(df.name).agg(min_udf(df.age)).sort("name").show()
# +-----+------------+
# | name|min_udf(age)|
# +-----+------------+
# |Alice| 2|
# | Bob| 5|
# +-----+------------+