DataFrame Class

Definition

A distributed collection of data organized into named columns.

C#: public sealed class DataFrame
F#: type DataFrame = class
VB: Public NotInheritable Class DataFrame

Inheritance
Object → DataFrame

Properties

Item[String]

Selects a column based on the column name.

Methods

Agg(Column, Column[])

Aggregates on the entire DataFrame without groups.
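
For example, a minimal sketch of an ungrouped aggregation (the DataFrame df and its "age" column are assumptions):

    // Assumes: using Microsoft.Spark.Sql;
    //          using static Microsoft.Spark.Sql.Functions;
    DataFrame stats = df.Agg(Min(Col("age")), Max(Col("age")));
    stats.Show();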

Alias(String)

Returns a new DataFrame with an alias set. Same as As().

As(String)

Returns a new DataFrame with an alias set.

Cache()

Persist this DataFrame with the default storage level MEMORY_AND_DISK.
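
A sketch of the typical usage, caching once and reusing across several actions (df is an assumed existing DataFrame):

    df.Cache();
    long total = df.Count();   // first action materializes the cache
    df.Show();                 // subsequent actions read from MEMORY_AND_DISK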

Checkpoint(Boolean)

Returns a checkpointed version of this DataFrame.

Coalesce(Int32)

Returns a new DataFrame that has exactly numPartitions partitions when fewer partitions are requested. If a larger number of partitions is requested, the DataFrame stays at its current number of partitions.

Col(String)

Selects a column based on the column name.
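
The Item[String] indexer and Col(String) both return a Column that can be used in expressions. A sketch (the "age" column is an assumption):

    Column byIndexer = df["age"];
    Column byName = df.Col("age");
    df.Select(byIndexer.Plus(1)).Show();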

Collect()

Returns an array that contains all rows in this DataFrame.

ColRegex(String)

Selects a column based on the column name specified as a regex.

Columns()

Returns all column names.

Count()

Returns the number of rows in the DataFrame.

CreateGlobalTempView(String)

Creates a global temporary view using the given name. The lifetime of this temporary view is tied to this Spark application.

CreateOrReplaceGlobalTempView(String)

Creates or replaces a global temporary view using the given name. The lifetime of this temporary view is tied to this Spark application.

CreateOrReplaceTempView(String)

Creates or replaces a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that created this DataFrame.
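
A common pattern is registering a view and querying it with SQL. A sketch (the view name "people" and an existing SparkSession spark are assumptions):

    df.CreateOrReplaceTempView("people");
    DataFrame adults = spark.Sql("SELECT name FROM people WHERE age > 21");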

CreateTempView(String)

Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that created this DataFrame.

CrossJoin(DataFrame)

Explicit Cartesian join with another DataFrame.

Cube(Column[])

Create a multi-dimensional cube for the current DataFrame using the specified columns.

Cube(String, String[])

Create a multi-dimensional cube for the current DataFrame using the specified columns.

Describe(String[])

Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.
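
A sketch (the column names are assumptions):

    df.Describe("age", "salary").Show();   // count, mean, stddev, min, max
    df.Describe().Show();                  // all numeric and string columns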

Distinct()

Returns a new DataFrame that contains only the unique rows from this DataFrame. This is an alias for DropDuplicates().

Drop(Column)

Returns a new DataFrame with a column dropped. This is a no-op if the DataFrame doesn't have a column with an equivalent expression.

Drop(String[])

Returns a new DataFrame with columns dropped. This is a no-op if the schema doesn't contain the column name(s).

DropDuplicates()

Returns a new DataFrame that contains only the unique rows from this DataFrame. This is an alias for Distinct().

DropDuplicates(String, String[])

Returns a new DataFrame with duplicate rows removed, considering only the subset of columns.
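
A sketch of both overloads (the "name" column is an assumption):

    DataFrame unique = df.DropDuplicates();         // considers all columns
    DataFrame byName = df.DropDuplicates("name");   // considers only "name"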

DTypes()

Returns all column names and their data types as an IEnumerable of Tuples.

Except(DataFrame)

Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame.

ExceptAll(DataFrame)

Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving the duplicates.

Explain(Boolean)

Prints the plans (logical and physical) to the console for debugging purposes.

Explain(String)

Prints the plans (logical and physical) with a format specified by a given explain mode.

Filter(Column)

Filters rows using the given condition.

Filter(String)

Filters rows using the given SQL expression.
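
A sketch of both overloads (the "age" column is an assumption):

    DataFrame adults = df.Filter(df["age"].Gt(21));   // Column condition
    DataFrame adults2 = df.Filter("age > 21");        // equivalent SQL expression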

First()

Returns the first row. Alias for Head().

GroupBy(Column[])

Groups the DataFrame using the specified columns, so we can run aggregation on them.

GroupBy(String, String[])

Groups the DataFrame using the specified columns.
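
A sketch of a grouped aggregation (the "dept" and "salary" columns are assumptions):

    // Assumes: using static Microsoft.Spark.Sql.Functions;
    DataFrame avgByDept = df
        .GroupBy("dept")
        .Agg(Avg(Col("salary")).Alias("avg_salary"));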

Head()

Returns the first row.

Head(Int32)

Returns the first n rows.

Hint(String, Object[])

Specifies some hint on the current DataFrame.

Intersect(DataFrame)

Returns a new DataFrame containing rows only in both this DataFrame and another DataFrame.

IntersectAll(DataFrame)

Returns a new DataFrame containing rows only in both this DataFrame and another DataFrame while preserving the duplicates.

IsEmpty()

Returns true if this DataFrame is empty.

IsLocal()

Returns true if the Collect() and Take() methods can be run locally without any Spark executors.

IsStreaming()

Returns true if this DataFrame contains one or more sources that continuously return data as it arrives.

Join(DataFrame, Column, String)

Join with another DataFrame, using the given join expression.

Join(DataFrame, IEnumerable<String>, String)

Equi-join with another DataFrame using the given columns. A cross join with a predicate is specified as an inner join. If you would explicitly like to perform a cross join use the crossJoin method.

Join(DataFrame, String)

Inner equi-join with another DataFrame using the given column.

Join(DataFrame)

Join with another DataFrame.
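
A sketch of the common overloads (left, right, and the "id" column are assumptions):

    DataFrame inner = left.Join(right, "id");   // inner equi-join on "id"
    DataFrame outer = left.Join(
        right, left["id"].EqualTo(right["id"]), "left_outer");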

Limit(Int32)

Returns a new DataFrame by taking the first n rows.

LocalCheckpoint(Boolean)

Returns a locally checkpointed version of this DataFrame.

Na()

Returns a DataFrameNaFunctions for working with missing data.

Observe(String, Column, Column[])

Define (named) metrics to observe on the DataFrame. This method returns an 'observed' DataFrame that returns the same result as the input, with the following guarantees:

  1. It will compute the defined aggregates (metrics) on all the data that is flowing through the DataFrame at that point.
  2. It will report the value of the defined aggregate columns as soon as we reach a completion point. A completion point is either the end of a query (batch mode) or the end of a streaming epoch. The value of the aggregates only reflects the data processed since the previous completion point.

Please note that continuous execution is currently not supported.
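
A sketch for a batch query (the metric names are assumptions):

    // Assumes: using static Microsoft.Spark.Sql.Functions;
    DataFrame observed = df.Observe("row_metrics", Count(Lit(1)).Alias("rows"));
    observed.Collect();   // the metric is reported once the query completes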

OrderBy(Column[])

Returns a new DataFrame sorted by the given expressions.

OrderBy(String, String[])

Returns a new DataFrame sorted by the given expressions.

Persist()

Persist this DataFrame with the default storage level MEMORY_AND_DISK.

Persist(StorageLevel)

Persist this DataFrame with the given storage level.

PrintSchema()

Prints the schema to the console in a nice tree format.

PrintSchema(Int32)

Prints the schema up to the given level to the console in a nice tree format.

RandomSplit(Double[], Nullable<Int64>)

Randomly splits this DataFrame with the provided weights.
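
A sketch of a train/test split (the weights and seed are arbitrary):

    DataFrame[] splits = df.RandomSplit(new[] { 0.8, 0.2 }, 42);
    DataFrame train = splits[0];
    DataFrame test = splits[1];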

Repartition(Column[])

Returns a new DataFrame partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions.

Repartition(Int32, Column[])

Returns a new DataFrame partitioned by the given partitioning expressions into numPartitions. The resulting DataFrame is hash partitioned.

Repartition(Int32)

Returns a new DataFrame that has exactly numPartitions partitions.

RepartitionByRange(Column[])

Returns a new DataFrame partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions. The resulting DataFrame is range partitioned.

RepartitionByRange(Int32, Column[])

Returns a new DataFrame partitioned by the given partitioning expressions into numPartitions. The resulting DataFrame is range partitioned.
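
A sketch contrasting Repartition with Coalesce (the partition counts and the "id" column are arbitrary):

    DataFrame wide = df.Repartition(200);             // full shuffle to 200 partitions
    DataFrame narrow = df.Coalesce(10);               // reduces partitions without a full shuffle
    DataFrame byKey = df.Repartition(50, df["id"]);   // hash partitioned by "id"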

Rollup(Column[])

Create a multi-dimensional rollup for the current DataFrame using the specified columns.

Rollup(String, String[])

Create a multi-dimensional rollup for the current DataFrame using the specified columns.

Sample(Double, Boolean, Nullable<Int64>)

Returns a new DataFrame by sampling a fraction of rows (without replacement), using a user-supplied seed.

Schema()

Returns the schema associated with this DataFrame.

Select(Column[])

Selects a set of column-based expressions.

Select(String, String[])

Selects a set of columns. This is a variant of Select() that can only select existing columns using column names (i.e. cannot construct expressions).

SelectExpr(String[])

Selects a set of SQL expressions. This is a variant of Select() that accepts SQL expressions.
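
A sketch of the three projection styles (the column names are assumptions):

    DataFrame byName = df.Select("name", "age");
    DataFrame byExpr = df.Select(df["age"].Plus(1).Alias("age1"));
    DataFrame bySql = df.SelectExpr("age + 1 AS age1", "upper(name)");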

Show(Int32, Int32, Boolean)

Displays rows of the DataFrame in tabular form.

Sort(Column[])

Returns a new DataFrame sorted by the given expressions.

Sort(String, String[])

Returns a new DataFrame sorted by the given columns, all in ascending order.
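
A sketch (the column names are assumptions):

    // Assumes: using static Microsoft.Spark.Sql.Functions;
    DataFrame byAgeDesc = df.Sort(Col("age").Desc());
    DataFrame byNameAsc = df.OrderBy("name");   // ascending by default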

SortWithinPartitions(Column[])

Returns a new DataFrame with each partition sorted by the given expressions.

SortWithinPartitions(String, String[])

Returns a new DataFrame with each partition sorted by the given expressions.

Stat()

Returns a DataFrameStatFunctions for working with statistic functions.

StorageLevel()

Gets the DataFrame's current StorageLevel.

Summary(String[])

Computes specified statistics for numeric and string columns.

Tail(Int32)

Returns the last n rows in the DataFrame.

Take(Int32)

Returns the first n rows in the DataFrame.

ToDF()

Converts this strongly typed collection of data to a generic DataFrame.

ToDF(String[])

Converts this strongly typed collection of data to a generic DataFrame with columns renamed.

ToJSON()

Returns the content of the DataFrame as a DataFrame of JSON strings.

ToLocalIterator()

Returns an iterator that contains all of the rows in this DataFrame. The iterator will consume as much memory as the largest partition in this DataFrame.

ToLocalIterator(Boolean)

Returns an iterator that contains all of the rows in this DataFrame. The iterator will consume as much memory as the largest partition in this DataFrame. With prefetch it may consume up to the memory of the 2 largest partitions.

Transform(Func<DataFrame,DataFrame>)

Concise syntax for chaining custom transformations.
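
A sketch with a hypothetical helper (the "age" column is an assumption):

    // Hypothetical transformation that adds a derived column.
    static DataFrame WithAgeBucket(DataFrame d) =>
        d.WithColumn("age_bucket", d["age"].Divide(10));

    DataFrame result = df.Transform(WithAgeBucket);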

Union(DataFrame)

Returns a new DataFrame containing union of rows in this DataFrame and another DataFrame.

UnionByName(DataFrame)

Returns a new DataFrame containing union of rows in this DataFrame and another DataFrame, resolving columns by name.
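
A sketch contrasting the two unions (df1 and df2 are assumed to have compatible schemas):

    DataFrame byPosition = df1.Union(df2);      // resolves columns by position
    DataFrame byName = df1.UnionByName(df2);    // resolves columns by name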

Unpersist(Boolean)

Mark the DataFrame as non-persistent, and remove all blocks for it from memory and disk.

Where(Column)

Filters rows using the given condition. This is an alias for Filter().

Where(String)

Filters rows using the given SQL expression. This is an alias for Filter().

WithColumn(String, Column)

Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
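
A sketch (the "salary" column is an assumption):

    DataFrame withBonus = df.WithColumn("bonus", df["salary"].Multiply(0.1));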

WithColumnRenamed(String, String)

Returns a new DataFrame with a column renamed. This is a no-op if the schema doesn't contain existingName.

WithWatermark(String, String)

Defines an event time watermark for this DataFrame. A watermark tracks a point in time before which we assume no more late data is going to arrive.

Write()

Interface for saving the content of the non-streaming DataFrame out into external storage.
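
A sketch of writing to Parquet (the output path is hypothetical):

    df.Write()
        .Mode("overwrite")
        .Parquet("/tmp/people.parquet");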

WriteStream()

Interface for saving the content of the streaming DataFrame out into external storage.

WriteTo(String)

Create a write configuration builder for v2 sources.
