Load data using Petastorm
This article describes how to use Petastorm convert data from Apache Spark to TensorFlow or PyTorch. It also provides an example showing how to use Petastorm to prepare data for ML.
Note
The petastorm
package is deprecated. Mosaic Streaming is the recommended replacement for loading large datasets from cloud storage.
Petastorm is an open source data access library. It enables single-node or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format and datasets that are already loaded as Apache Spark DataFrames. Petastorm supports popular Python-based machine learning (ML) frameworks such as TensorFlow, PyTorch, and PySpark. For more information about Petastorm, see the Petastorm API documentation.
The Petastorm Spark converter API simplifies data conversion from Spark to TensorFlow or PyTorch. The input Spark DataFrame is first materialized in Parquet format and then loaded as a tf.data.Dataset
or torch.utils.data.DataLoader
.
See the Spark Dataset Converter API section in the Petastorm API documentation.
The recommended workflow is:
- Use Apache Spark to load and optionally preprocess data.
- Use the Petastorm
spark_dataset_converter
method to convert data from a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader. - Feed data into a DL framework for training or inference.
The Petastorm Spark converter caches the input Spark DataFrame in Parquet format in a user-specified cache directory location. The cache directory must be a DBFS path starting with file:///dbfs/
, for example, file:///dbfs/tmp/foo/
which refers to the same location as dbfs:/tmp/foo/
. You can configure the cache directory in two ways:
In the cluster Spark config add the line:
petastorm.spark.converter.parentCacheDirUrl file:///dbfs/...
In your notebook, call
spark.conf.set()
:from petastorm.spark import SparkDatasetConverter, make_spark_converter spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'file:///dbfs/...')
You can either explicitly delete the cache after using it by calling converter.delete()
or manage the cache implicitly by configuring the lifecycle rules in your object storage.
Databricks supports DL training in three scenarios:
- Single-node training
- Distributed hyperparameter tuning
- Distributed training
For end-to-end examples, see the following notebooks:
This method is less preferred than the Petastorm Spark converter API.
The recommended workflow is:
- Use Apache Spark to load and optionally preprocess data.
- Save data in Parquet format into a DBFS path that has a companion DBFS mount.
- Load data in Petastorm format via the DBFS mount point.
- Use data in a DL framework for training or inference.
See example notebook for an end-to-end example.
This example notebook demonstrates the following workflow on Databricks:
- Load data using Spark.
- Convert the Spark DataFrame to a TensorFlow Dataset using Petastorm.
- Feed the data into a single-node TensorFlow model for training.
- Feed the data into a distributed hyperparameter tuning function.
- Feed the data into a distributed TensorFlow model for training.
This example notebook demonstrates the following workflow on Databricks:
- Load data using Spark.
- Convert the Spark DataFrame to a PyTorch DataLoader using Petastorm.
- Feed the data into a single-node PyTorch model for training.
- Feed the data into a distributed hyperparameter tuning function.
- Feed the data into a distributed PyTorch model for training.
This example notebook shows you the following workflow on Databricks:
- Use Spark to load and preprocess data.
- Save data using Parquet under
dbfs:/ml
. - Load data using Petastorm via the optimized FUSE mount
file:/dbfs/ml
. - Feed data into a deep learning framework for training or inference.