Maximizing Efficiency: Loading Tabular Data with PyTorch DataLoader and BatchSampler
Learn how to correctly prepare dynamically shuffled mini-batches from DataFrames

In machine learning, handling data efficiently is crucial, especially in deep learning, where large volumes of data and substantial computational resources are required. Yet while PyTorch is widely used for deep learning, relatively little guidance is available on handling tabular data.
Therefore, this article aims to provide best practices for working with tabular data in PyTorch, focusing specifically on creating dynamic data batches from DataFrames already loaded in memory. This is particularly useful for tasks such as time series forecasting or building autoencoders for dimensionality reduction.
Dataset, TensorDataset, and DataLoader
Let’s get straight to the point. Those who have had some experience with PyTorch would construct the Dataset/DataLoader in the following manner:
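(A minimal sketch of that approach, assuming a hypothetical DataFrame `df` with feature columns and a target column `"y"`; all names and shapes here are illustrative.)

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class DataFrameDataset(Dataset):
    """Wraps a DataFrame and returns one (features, target) pair per index."""

    def __init__(self, df: pd.DataFrame, target_col: str):
        self.features = torch.tensor(
            df.drop(columns=[target_col]).values, dtype=torch.float32
        )
        self.target = torch.tensor(df[target_col].values, dtype=torch.float32)

    def __len__(self):
        return len(self.target)

    def __getitem__(self, idx):
        # One sample at a time; batching is left to the DataLoader.
        return self.features[idx], self.target[idx]

# Illustrative data: 10 feature columns plus a target column "y".
df = pd.DataFrame(
    np.random.rand(10_000, 11),
    columns=[f"feature_{i}" for i in range(10)] + ["y"],
)
dataset = DataFrameDataset(df, target_col="y")
loader = DataLoader(dataset, batch_size=64, shuffle=True)
```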
Alternatively, for those who prefer abstractions, the equivalent would be:
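(Again a sketch, reusing the same hypothetical `df` from above.)

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.tensor(df.drop(columns=["y"]).values, dtype=torch.float32)
target = torch.tensor(df["y"].values, dtype=torch.float32)

dataset = TensorDataset(features, target)
loader = DataLoader(dataset, batch_size=64, shuffle=True)
```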
In both approaches, samples are drawn individually and at random, index by index, and the work of accumulating them into batches is left to a collate function, which by default is torch.utils.data.default_collate().
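To make this concrete, here is a small illustration of what the default collate step does with individually fetched samples (the shapes assume the hypothetical dataset above):

```python
from torch.utils.data import default_collate

# 64 individual (x, y) pairs, fetched one index at a time
samples = [dataset[i] for i in range(64)]

# default_collate stacks them into batched tensors:
# x_batch has shape (64, 10), y_batch has shape (64,)
x_batch, y_batch = default_collate(samples)
```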
BatchSampler
In many common use cases, accessing samples individually is perfectly fine. In image classification, for example, the DataLoader’s workers read each image into memory and pre-process it individually, so that the resulting tensors can be grouped into batches afterward.
In cases where the data is already loaded into memory, however, it’s better to change how the DataLoader samples. This can be achieved by passing a list of random indices to the Dataset’s __getitem__ method instead of a single index at a time. The following adjustment to the code accomplishes this:
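A sketch of the batched-sampling version follows, reusing the tensors from above. Passing a BatchSampler as the sampler makes the DataLoader hand the Dataset a whole list of indices at once, and batch_size=None disables automatic batching, since each fetched item is already a complete batch:

```python
from torch.utils.data import BatchSampler, DataLoader, RandomSampler, TensorDataset

dataset = TensorDataset(features, target)  # same tensors as before
loader = DataLoader(
    dataset,
    sampler=BatchSampler(
        RandomSampler(dataset), batch_size=64, drop_last=False
    ),
    batch_size=None,  # the sampler already yields complete batches of indices
)
```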
By doing this, we take advantage of the fact that all the data is already in memory to build the batches directly without needing a separate grouping step.
A Quick Benchmark
We can build a simple benchmark to compare the efficiency of each strategy:
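(A sketch of such a benchmark; the data shape, batch size, and timing method are arbitrary assumptions, so absolute numbers will differ from the ones reported below.)

```python
import time
import torch
from torch.utils.data import BatchSampler, DataLoader, RandomSampler, TensorDataset

# Illustrative synthetic data: 100,000 samples with 50 features each.
features = torch.rand(100_000, 50)
target = torch.rand(100_000)
dataset = TensorDataset(features, target)

per_sample_loader = DataLoader(dataset, batch_size=64, shuffle=True)
batched_loader = DataLoader(
    dataset,
    sampler=BatchSampler(RandomSampler(dataset), batch_size=64, drop_last=False),
    batch_size=None,
)

def time_one_epoch(loader, name):
    start = time.perf_counter()
    for x, y in loader:
        pass  # iterate only; we are measuring data preparation, not training
    print(f"{name}: {time.perf_counter() - start:.2f} seconds per epoch")

time_one_epoch(per_sample_loader, "TensorDataset")
time_one_epoch(batched_loader, "TensorDataset with BatchSampler")
```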
On my machine, the results are:
- TensorDataset: ~3.07 seconds per epoch
- TensorDataset with BatchSampler: ~0.24 seconds per epoch
Conclusion
Using TensorDataset with BatchSampler yielded a speedup of roughly 12x in this benchmark. The exact gain will vary with the number of features, the batch size, and the hardware used.
It’s important to note that the DataLoader can mask some of these inefficiencies through prefetching: its workers proactively prepare upcoming batches on the CPU even before they are requested during training.
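For instance, prefetching is controlled by the num_workers and prefetch_factor arguments. The values below are illustrative, reusing the setup from the benchmark (note that on some platforms, worker processes require the usual `if __name__ == "__main__"` guard):

```python
loader = DataLoader(
    dataset,
    sampler=BatchSampler(RandomSampler(dataset), batch_size=64, drop_last=False),
    batch_size=None,
    num_workers=2,      # prepare batches in background worker processes
    prefetch_factor=2,  # each worker keeps up to 2 batches ready in advance
)
```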
While the bottleneck during training generally lies in the GPU, any optimization of data preparation becomes valuable once training scales to terabytes of data and large neural networks.