Dataset (docs, code) provides a Python iterator for samples and their labels. DataLoader (docs, code) is a specialized collection that wraps a Dataset with support for various DL-related capabilities.

Dataset

You can implement a Dataset by subclassing it and implementing __getitem__ and __len__. Subclassing also gives you a concatenation implementation (the + operator) for free.
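A minimal sketch of such a map-style Dataset; SquaresDataset is a hypothetical example class, not part of PyTorch:

```python
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Hypothetical dataset: sample i paired with the label i*i."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, idx):
        # Return a (sample, label) pair for the given index.
        return idx, idx * idx

    def __len__(self):
        return self.n

ds = SquaresDataset(4)
print(len(ds))   # 4
print(ds[3])     # (3, 9)

# Concatenation comes for free via Dataset.__add__:
both = SquaresDataset(2) + SquaresDataset(3)
print(len(both))  # 5
```

The + operator returns a ConcatDataset, which indexes transparently across both underlying datasets.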

In general, though, PyTorch will be iterating through your Dataset to feed a DataLoader, discussed below. So it can be more efficient to also supply __iter__ and __getitems__ (note the plural), which fetches a whole batch of indices in a single call.
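A sketch of what __getitems__ looks like; BatchedDataset is a hypothetical example, and the payoff here is simply that one call services many indices (for instance, one read against a backing store instead of one per sample):

```python
from torch.utils.data import Dataset

class BatchedDataset(Dataset):
    """Hypothetical dataset exposing both per-index and batched fetches."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Per-sample fetch, always required for a map-style Dataset.
        return self.data[idx]

    def __getitems__(self, indices):
        # Batched fetch: one call returns samples for a list of indices.
        return [self.data[i] for i in indices]

ds = BatchedDataset([10, 20, 30, 40])
print(ds[1])                      # 20
print(ds.__getitems__([0, 2]))    # [10, 30]
```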

If you are using a distributed training environment, you pretty much need to supply __iter__ so that you can ensure the data is sharded properly across processes. This is discussed in the source code linked above.
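One way this sharding can look, as a minimal sketch: an IterableDataset whose __iter__ yields only the samples belonging to the current process. ShardedCounter, and the rank/world_size constructor arguments, are hypothetical; in a real setup these would come from the distributed runtime rather than being passed in by hand:

```python
from torch.utils.data import IterableDataset

class ShardedCounter(IterableDataset):
    """Hypothetical dataset: each process yields only its own stride of 0..n-1."""

    def __init__(self, n, rank, world_size):
        self.n = n
        self.rank = rank
        self.world_size = world_size

    def __iter__(self):
        # Stride over the index space so shards are disjoint and cover everything.
        return iter(range(self.rank, self.n, self.world_size))

# Simulate two processes splitting ten samples:
print(list(ShardedCounter(10, rank=0, world_size=2)))  # [0, 2, 4, 6, 8]
print(list(ShardedCounter(10, rank=1, world_size=2)))  # [1, 3, 5, 7, 9]
```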

When practicing or prototyping, it’s often convenient to use pre-built datasets. PyTorch supplies Dataset objects for many standard vision, NLP, and audio tasks.

DataLoader

DataLoader is a specialized Python collection. It’s a fairly heavyweight class, providing functionality for batching, shuffling, buffering, multiprocessing, and many other pieces that let PyTorch operate at scale. It expects a Dataset as its first argument, though thanks to duck typing, you can get away with passing in any old Iterable most of the time.
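A small sketch of the basic batching behavior, using PyTorch's built-in TensorDataset as the underlying Dataset (the specific sizes here are arbitrary, chosen just for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Eight samples, each a 1-element float vector, with integer labels.
xs = torch.arange(8).float().unsqueeze(1)
ys = torch.arange(8)

# batch_size=4 groups the eight samples into two batches.
loader = DataLoader(TensorDataset(xs, ys), batch_size=4, shuffle=False)

batches = list(loader)
print(len(batches))          # 2
print(batches[0][0].shape)   # torch.Size([4, 1])
print(batches[0][1])         # tensor([0, 1, 2, 3])
```

Setting shuffle=True instead would reorder the samples each epoch, and num_workers would move loading into subprocesses; both are plain keyword arguments on the same constructor.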