Pandas API on Spark
Commonly used by data scientists, pandas is a Python package that provides easy-to-use data structures and data analysis tools for the Python programming language. However, pandas does not scale out to big data. Pandas API on Spark fills this gap by providing pandas equivalent APIs that work on Apache Spark. Pandas API on Spark is useful not only for pandas users but also PySpark users, because pandas API on Spark supports many tasks that are difficult to do with PySpark, for example plotting data directly from a PySpark DataFrame.
Pandas API on Spark is available beginning in Apache Spark 3.2 (which is included beginning in Databricks Runtime 10.0 (Unsupported)) by using the following
import pyspark.pandas as ps
The following notebook shows how to migrate from pandas to pandas API on Spark.
pandas to pandas API on Spark notebook
Submit and view feedback for