Returns the value from the col parameter that is associated with the maximum value from the ord parameter. This function is often used with groupBy() to find, within each group, the col value corresponding to the maximum ord value. The function is non-deterministic: when multiple rows share the same maximum value of ord, any one of their associated col values may be returned.
Syntax
from pyspark.sql import functions as sf
sf.max_by(col, ord)
Parameters
| Parameter | Type | Description |
|---|---|---|
| `col` | `pyspark.sql.Column` or column name | The column representing the values to be returned. This can be a Column instance or the column name as a string. |
| `ord` | `pyspark.sql.Column` or column name | The column to be maximized. This can be a Column instance or the column name as a string. |
Returns
`pyspark.sql.Column`: A column object representing the value from col that is associated with the maximum value from ord.
Examples
Example 1: Using max_by with groupBy
import pyspark.sql.functions as sf
df = spark.createDataFrame([
    ("Java", 2012, 20000), ("dotNET", 2012, 5000),
    ("dotNET", 2013, 48000), ("Java", 2013, 30000)],
    schema=("course", "year", "earnings"))
df.groupby("course").agg(sf.max_by("year", "earnings")).sort("course").show()
+------+----------------------+
|course|max_by(year, earnings)|
+------+----------------------+
|  Java|                  2013|
|dotNET|                  2013|
+------+----------------------+
Example 2: Using max_by with different data types
import pyspark.sql.functions as sf
df = spark.createDataFrame([
    ("Marketing", "Anna", 4), ("IT", "Bob", 2),
    ("IT", "Charlie", 3), ("Marketing", "David", 1)],
    schema=("department", "name", "years_in_dept"))
df.groupby("department").agg(
    sf.max_by("name", "years_in_dept")
).sort("department").show()
+----------+---------------------------+
|department|max_by(name, years_in_dept)|
+----------+---------------------------+
|        IT|                    Charlie|
| Marketing|                       Anna|
+----------+---------------------------+
Example 3: Using max_by where ord has multiple maximum values
import pyspark.sql.functions as sf
df = spark.createDataFrame([
    ("Consult", "Eva", 7), ("Finance", "Frank", 5),
    ("Finance", "George", 9), ("Consult", "Henry", 7)],
    schema=("department", "name", "years_in_dept"))
df.groupby("department").agg(
    sf.max_by("name", "years_in_dept")
).sort("department").show()
In the Consult group, Eva and Henry are tied at the maximum years_in_dept (7), so the Consult row may show either name on any given run; the Finance row is always George:
+----------+---------------------------+
|department|max_by(name, years_in_dept)|
+----------+---------------------------+
|   Consult|                      Henry|
|   Finance|                     George|
+----------+---------------------------+