Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Functionality for statistical functions with a DataFrame.
Supports Spark Connect
Syntax
DataFrame.stat
Methods
| Method | Description |
|---|---|
approxQuantile(col, probabilities, relativeError) |
Calculates the approximate quantiles of numerical columns of a DataFrame. |
corr(col1, col2, method) |
Calculates the correlation of two columns as a double value. Currently only supports the Pearson Correlation Coefficient. |
cov(col1, col2) |
Calculates the sample covariance for the given columns as a double value. |
crosstab(col1, col2) |
Computes a pair-wise frequency table of the given columns. |
freqItems(cols, support) |
Finds frequent items for columns, possibly with false positives. |
sampleBy(col, fractions, seed) |
Returns a stratified sample without replacement based on the fraction given on each stratum. |
Examples
Approximate quantiles
data = [(1,), (2,), (3,), (4,), (5,)]
df = spark.createDataFrame(data, ["values"])
df.stat.approxQuantile("values", [0.0, 0.5, 1.0], 0.05)
[1.0, 3.0, 5.0]
Correlation
df = spark.createDataFrame([(1, 12), (10, 1), (19, 8)], ["c1", "c2"])
df.stat.corr("c1", "c2")
-0.3592106040535498
Covariance
df = spark.createDataFrame([(1, 12), (10, 1), (19, 8)], ["c1", "c2"])
df.stat.cov("c1", "c2")
-18.0
Cross tabulation
df = spark.createDataFrame([(1, 11), (1, 11), (3, 10), (4, 8), (4, 8)], ["c1", "c2"])
df.stat.crosstab("c1", "c2").sort("c1_c2").show()
+-----+---+---+---+
|c1_c2| 10| 11| 8|
+-----+---+---+---+
| 1| 0| 2| 0|
| 3| 1| 0| 0|
| 4| 0| 0| 2|
+-----+---+---+---+
Frequent items
from pyspark.sql import functions as sf
df = spark.createDataFrame([(1, 11), (1, 11), (3, 10), (4, 8), (4, 8)], ["c1", "c2"])
df2 = df.stat.freqItems(["c1", "c2"])
df2.select([sf.sort_array(c).alias(c) for c in df2.columns]).show()
+------------+------------+
|c1_freqItems|c2_freqItems|
+------------+------------+
| [1, 3, 4]| [8, 10, 11]|
+------------+------------+
Stratified sample
from pyspark.sql import functions as sf
dataset = spark.range(0, 100, 1, 5).select((sf.col("id") % 3).alias("key"))
dataset.stat.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=0).groupBy("key").count().orderBy("key").show()
+---+-----+
|key|count|
+---+-----+
| 0| 4|
| 1| 9|
+---+-----+