Prepare data for distributed training

This article describes methods for preparing data for distributed training.

For very large datasets that do not fit in memory, use streaming approaches:

PyTorch IterableDataset for custom streaming logic.
Hugging Face datasets with streaming for datasets hosted on the Hub or in volumes.
Ray Data for distributed batch data processing.
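
Whichever streaming approach you choose, each training process should read a disjoint slice of the stream so that no sample is processed twice in an epoch. The following is a minimal, framework-independent sketch of that sharding arithmetic. The function name `shard_stream` and its parameters are hypothetical and chosen for illustration. In a PyTorch IterableDataset, the rank and worker values would typically come from `torch.distributed.get_rank()` and `torch.utils.data.get_worker_info()`.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_stream(
    stream: Iterable[T],
    rank: int,
    world_size: int,
    worker_id: int = 0,
    num_workers: int = 1,
) -> Iterator[T]:
    """Yield a strided slice of the stream for one (rank, worker) pair.

    Hypothetical helper: every process and data-loading worker gets a
    unique shard index and reads every `total_shards`-th item from it.
    """
    total_shards = world_size * num_workers
    shard_index = rank * num_workers + worker_id
    # islice reads lazily, so the full stream is never held in memory.
    return islice(stream, shard_index, None, total_shards)


# Example: 10 records split across 2 processes with 2 workers each.
records = range(10)
shards = [
    list(shard_stream(records, rank=r, world_size=2, worker_id=w, num_workers=2))
    for r in range(2)
    for w in range(2)
]
```

Strided sharding keeps the logic simple, but every worker still iterates over the whole stream and skips the items it does not own. When the source is already split into many files, assigning whole files to workers is usually cheaper.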

TFRecord

You can also use TFRecord format as the data source for distributed deep learning. TFRecord format is a simple record-oriented binary format that many TensorFlow applications use for training data.
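
To make the record-oriented layout concrete, the following sketch writes and reads the TFRecord framing in pure Python. It assumes the layout described in the TensorFlow documentation: a little-endian 64-bit length, a masked CRC-32C of the length, the payload bytes, and a masked CRC-32C of the payload. The helper names (`crc32c`, `masked_crc`, `write_record`, `read_records`) are hypothetical and for illustration only. In practice you would use `tf.io.TFRecordWriter` and `tf.data.TFRecordDataset`, and payloads are usually serialized `tf.train.Example` messages rather than raw strings.

```python
import io
import struct

_MASK_DELTA = 0xA282EAD8


def _build_crc32c_table():
    # Table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC_TABLE = _build_crc32c_table()


def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    # Rotate right by 15 bits and add a constant, all in 32-bit arithmetic.
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + _MASK_DELTA) & 0xFFFFFFFF


def write_record(out, payload: bytes) -> None:
    header = struct.pack("<Q", len(payload))
    out.write(header)
    out.write(struct.pack("<I", masked_crc(header)))
    out.write(payload)
    out.write(struct.pack("<I", masked_crc(payload)))


def read_records(src):
    while True:
        header = src.read(8)
        if not header:
            return  # clean end of file
        (length,) = struct.unpack("<Q", header)
        (length_crc,) = struct.unpack("<I", src.read(4))
        if length_crc != masked_crc(header):
            raise ValueError("corrupt record length")
        payload = src.read(length)
        (payload_crc,) = struct.unpack("<I", src.read(4))
        if payload_crc != masked_crc(payload):
            raise ValueError("corrupt record payload")
        yield payload


# Round trip two records through an in-memory file.
buf = io.BytesIO()
for item in (b"first", b"second record"):
    write_record(buf, item)
buf.seek(0)
decoded = list(read_records(buf))
```

Because every record carries its own length and checksums, a reader can stream through a file sequentially without an index. This is what makes the format convenient for large training datasets.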

tf.data.TFRecordDataset is the TensorFlow dataset class that reads records from one or more TFRecord files. For more details about how to consume TFRecord data, see the TensorFlow guide Consuming TFRecord data.

The following article describes and illustrates the recommended ways to save your data to TFRecord files and load TFRecord files:

Save Apache Spark DataFrames as TFRecord files

Last updated on 2026-06-01
