Maximizing Efficiency: Loading Tabular Data with PyTorch DataLoader and BatchSampler

Gabriel Tardochi Salles
2 min read · Jun 30, 2024


Learn how to prepare dynamically shuffled micro-batches from DataFrames correctly

Photo by Zeynep Sümer on Unsplash

In the world of machine learning, handling data efficiently is crucial, especially in deep learning, where data volumes and computational demands are larger. Yet while PyTorch is widely used for deep learning, comparatively little guidance is available on handling tabular data.

Therefore, this article aims to provide best practices for working with tabular data in PyTorch, specifically focusing on creating dynamic data batches from DataFrames already loaded in memory. This is particularly useful for tasks such as time series forecasting or building autoencoders for dimensionality reduction.

Dataset, TensorDataset, and DataLoader

Let’s get straight to the point. Anyone with some experience in PyTorch would construct the Dataset/DataLoader pair in the following manner:

Torch dataloading with Dataset and DataLoader. Code by author.
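A minimal sketch of this pattern, assuming an in-memory DataFrame with a label column (the names and shapes here are illustrative):

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset


class DataFrameDataset(Dataset):
    """Wraps an in-memory DataFrame, returning one (features, target) pair per index."""

    def __init__(self, df: pd.DataFrame, target_col: str):
        self.features = torch.tensor(
            df.drop(columns=[target_col]).values, dtype=torch.float32
        )
        self.target = torch.tensor(df[target_col].values, dtype=torch.float32)

    def __len__(self) -> int:
        return len(self.features)

    def __getitem__(self, idx):
        # Called once per sample; the DataLoader collates samples into batches.
        return self.features[idx], self.target[idx]


# Illustrative toy DataFrame with two features and a binary label.
df = pd.DataFrame(
    {"feat_a": range(1_000), "feat_b": range(1_000), "label": [0.0, 1.0] * 500}
)
dataset = DataFrameDataset(df, target_col="label")
loader = DataLoader(dataset, batch_size=64, shuffle=True)
```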

Alternatively, for those who prefer abstractions, the equivalent would be:

Torch dataloading with TensorDataset and DataLoader. Code by author.
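In code, that could look like this (again a sketch; the random tensors stand in for a DataFrame’s values):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.rand(1_000, 16)  # 1,000 rows, 16 feature columns
target = torch.rand(1_000)

# TensorDataset indexes all given tensors along their first dimension,
# so dataset[i] returns (features[i], target[i]) with no custom class.
dataset = TensorDataset(features, target)
loader = DataLoader(dataset, batch_size=64, shuffle=True)
```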

In both approaches above, samples are drawn individually and at random, index by index, and the work of accumulating them into batches is left to a collate function, which by default is torch.utils.data.default_collate().

BatchSampler

In the most common use cases, accessing samples individually is usually fine. When working with image classification, for example, the DataLoader’s workers should read each image into memory and pre-process it individually, so that the resulting tensors can be grouped into batches afterward.

In cases where the data is already loaded into memory, however, it’s best to change the sampling method of the DataLoader so that a list of random indices is passed to the Dataset’s __getitem__ method instead of one individual index at a time. The following adjustment to the code accomplishes this:

Torch dataloading with TensorDataset, DataLoader, and BatchSampler with a RandomSampler. Code by author.
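A sketch of the adjusted setup (sizes are illustrative): the BatchSampler is passed through the sampler argument, with batch_size=None to disable automatic batching, so the DataLoader calls dataset[list_of_indices] once per batch:

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, RandomSampler, TensorDataset

features = torch.rand(1_000, 16)
target = torch.rand(1_000)
dataset = TensorDataset(features, target)

# BatchSampler wraps a RandomSampler so that each draw yields a whole
# list of shuffled indices. Passing it via `sampler` (not `batch_sampler`)
# together with batch_size=None means TensorDataset receives the full
# index list and slices out batch tensors in one fancy-indexing operation.
loader = DataLoader(
    dataset,
    sampler=BatchSampler(RandomSampler(dataset), batch_size=64, drop_last=False),
    batch_size=None,
)

for batch_features, batch_target in loader:
    pass  # each iteration already yields (<=64, 16) and (<=64,) tensors
```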

By doing this, we take advantage of the fact that all the data is already in memory to build the batches directly without needing a separate grouping step.

A Quick Benchmark

We can build a simple benchmark to compare the efficiency of each strategy:
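A sketch of such a benchmark (the dataset size and batch size are assumptions; it times one epoch of pure data loading for each strategy):

```python
import time

import torch
from torch.utils.data import BatchSampler, DataLoader, RandomSampler, TensorDataset

# Illustrative sizes: 100k rows, 64 features, batches of 64.
features = torch.rand(100_000, 64)
target = torch.rand(100_000)
dataset = TensorDataset(features, target)

loaders = {
    "TensorDataset": DataLoader(dataset, batch_size=64, shuffle=True),
    "TensorDataset with BatchSampler": DataLoader(
        dataset,
        sampler=BatchSampler(RandomSampler(dataset), batch_size=64, drop_last=False),
        batch_size=None,
    ),
}

for name, loader in loaders.items():
    start = time.perf_counter()
    for batch_features, batch_target in loader:
        pass  # no training step; we measure data loading only
    print(f"{name}: {time.perf_counter() - start:.2f} seconds per epoch")
```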

On my machine, the results are:

  • TensorDataset: ~3.07 seconds per epoch
  • TensorDataset with BatchSampler: ~0.24 seconds per epoch

Conclusion

Using TensorDataset with a BatchSampler resulted in a speedup of more than 10x on this benchmark. The exact gain will vary depending on the number of features, the batch size, and the hardware used.

It’s important to note that the DataLoader can mask some of these inefficiencies through prefetching, which lets its workers proactively prepare upcoming batches on the CPU before they are requested during training.

While the bottleneck during training generally lies in the GPU, any optimization in data preparation becomes valuable when training scales to terabytes of data and large neural networks.


Written by Gabriel Tardochi Salles

ML Engineer sharing practical insights and tutorials on data and AI. LinkedIn: www.linkedin.com/in/gabrieltardochisalles