Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Returns the first value in a group. The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. The function is non-deterministic because its results depends on the order of the rows which may be non-deterministic after a shuffle.
Syntax
from pyspark.sql import functions as sf
sf.first(col, ignorenulls=False)
Parameters
| Parameter | Type | Description |
|---|---|---|
col |
pyspark.sql.Column or column name |
Column to fetch first value for. |
ignorenulls |
bool | If first value is null then look for first non-null value. False by default. |
Returns
pyspark.sql.Column: first value of the group.
Examples
from pyspark.sql import functions as sf
df = spark.createDataFrame([("Alice", 2), ("Bob", 5), ("Alice", None)], ("name", "age"))
df = df.orderBy(df.age)
df.groupby("name").agg(sf.first("age")).orderBy("name").show()
+-----+----------+
| name|first(age)|
+-----+----------+
|Alice| NULL|
| Bob| 5|
+-----+----------+
To ignore any null values, set ignorenulls to True:
df.groupby("name").agg(sf.first("age", ignorenulls=True)).orderBy("name").show()
+-----+----------+
| name|first(age)|
+-----+----------+
|Alice| 2|
| Bob| 5|
+-----+----------+