Pandas API on Spark

Note

This feature is available on clusters that run Databricks Runtime 10.0 (unsupported) and above. For clusters that run Databricks Runtime 9.1 LTS and below, use Koalas instead.

Commonly used by data scientists, pandas is a Python package that provides easy-to-use data structures and data analysis tools for the Python programming language. However, pandas does not scale out to big data. Pandas API on Spark fills this gap by providing pandas-equivalent APIs that work on Apache Spark. Pandas API on Spark is useful not only for pandas users but also for PySpark users, because it supports many tasks that are difficult to do with PySpark, such as plotting data directly from a PySpark DataFrame.

Requirements

Pandas API on Spark is available beginning in Apache Spark 3.2, which is included in Databricks Runtime 10.0 (unsupported) and above. Import it with the following statement:

import pyspark.pandas as ps

Notebook

The following notebook shows how to migrate from pandas to pandas API on Spark.

pandas to pandas API on Spark notebook


Resources