Generates a session window given a timestamp-specifying column.
A session window is a dynamic window: its length varies according to the given inputs. The length of a session window is defined as "the timestamp of the latest input of the session + gap duration", so when new inputs are bound to the current session window, the end time of the session window is expanded accordingly.
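As an illustrative sketch of this expansion (assuming an active SparkSession named spark, as in the Examples section), a second event arriving within the gap extends the session end, while a later event starts a new session:

from pyspark.databricks.sql import functions as dbf
# Three events: the first two fall within the 5-second gap, the third does not.
df = spark.createDataFrame(
    [('2016-03-11 09:00:07', 1),
     ('2016-03-11 09:00:10', 1),   # within 5 seconds of the previous event: same session, end moves to 09:00:15
     ('2016-03-11 09:00:30', 1)],  # more than 5 seconds later: starts a new session
    ['dt', 'v'])
# Expect two sessions: [09:00:07, 09:00:15) with sum 2 and [09:00:30, 09:00:35) with sum 1.
df.groupBy(dbf.session_window('dt', '5 seconds')).agg(dbf.sum('v')).show(truncate=False)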
Windows support microsecond precision. Windows on the order of months are not supported.
For a streaming query, you may use the function current_timestamp to generate windows on processing time.
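A minimal streaming sketch of that approach (the rate source and the 30-second gap are illustrative assumptions; current_timestamp comes from pyspark.sql.functions):

from pyspark.databricks.sql import functions as dbf
from pyspark.sql import functions as F
# Illustrative streaming source; any streaming DataFrame works here.
events = spark.readStream.format('rate').option('rowsPerSecond', 5).load()
# Stamp each row with its processing time and group rows into 30-second sessions.
sessions = (events
    .withColumn('proc_time', F.current_timestamp())
    .withWatermark('proc_time', '1 minute')
    .groupBy(dbf.session_window('proc_time', '30 seconds'))
    .count())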
gapDuration is provided as a string, e.g. '1 second', '1 day 12 hours', '2 minutes'. Valid interval units are 'week', 'day', 'hour', 'minute', 'second', 'millisecond', and 'microsecond'.
It can also be a Column that evaluates to the gap duration dynamically based on the input row.
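For instance, a sketch of a dynamic gap built with when/otherwise from pyspark.sql.functions (the event types and durations are made up for illustration):

from pyspark.databricks.sql import functions as dbf
from pyspark.sql import functions as F
# Hypothetical data: the event type decides how long a session may stay idle.
df = spark.createDataFrame(
    [('2016-03-11 09:00:07', 'click', 1),
     ('2016-03-11 09:00:20', 'purchase', 1)],
    ['dt', 'event_type', 'v'])
# The gap duration is evaluated per row: purchases keep the session open longer.
gap = F.when(F.col('event_type') == 'purchase', '30 seconds').otherwise('10 seconds')
df.groupBy(dbf.session_window('dt', gap)).agg(dbf.sum('v')).show(truncate=False)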
The output column will be a struct called 'session_window' by default with the nested columns
'start' and 'end', where 'start' and 'end' will be of pyspark.sql.types.TimestampType.
For the corresponding Databricks SQL function, see session_window grouping expression.
Syntax
from pyspark.databricks.sql import functions as dbf
dbf.session_window(timeColumn=<timeColumn>, gapDuration=<gapDuration>)
Parameters
| Parameter | Type | Description |
|---|---|---|
| timeColumn | pyspark.sql.Column or str | The column name or column to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType. |
| gapDuration | pyspark.sql.Column or literal string | A Python string literal or column specifying the timeout of the session. It can be a static value, e.g. '10 minutes' or '1 second', or an expression/UDF that specifies the gap duration dynamically based on the input row. |
Returns
pyspark.sql.Column: the column for computed results.
Examples
from pyspark.databricks.sql import functions as dbf
# A single event grouped into a 5-second session window keyed on the 'dt' column.
df = spark.createDataFrame([('2016-03-11 09:00:07', 1)], ['dt', 'v'])
df2 = df.groupBy(dbf.session_window('dt', '5 seconds')).agg(dbf.sum('v'))
# Show the computed session and the schema of the 'session_window' struct column.
df2.show(truncate=False)
df2.printSchema()
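Building on the example above, the nested 'start' and 'end' fields can be selected directly from the 'session_window' struct:

# Pull the session bounds out of the 'session_window' struct column.
df2.select(
    df2.session_window.start.alias('session_start'),
    df2.session_window.end.alias('session_end'),
).show(truncate=False)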