In our example script, we have the following code for importing and preparing our dataset:

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

batch_size = 64

# Pad each 28x28 image to 32x32, then convert the PIL image to a tensor
transform = transforms.Compose([
    transforms.Pad(2),
    transforms.ToTensor(),
])

training_data = datasets.MNIST(
    root="mnist",
    train=True,
    download=True,
    transform=transform,
)

test_data = datasets.MNIST(
    root="mnist",
    train=False,
    download=True,
    transform=transform,
)

train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

The two key classes are Dataset and DataLoader. Dataset is essentially an interface providing low-level, indexed access to individual samples. DataLoader wraps a Dataset into a higher-level iterable that handles batching, shuffling, and parallel loading of the data for you.
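
To make the distinction concrete, here is a minimal sketch using the objects defined above (the printed shapes reflect the Pad(2) transform and the batch size of 64):

# Dataset: indexed access to one (image, label) pair at a time
image, label = training_data[0]
print(image.shape)   # torch.Size([1, 32, 32])

# DataLoader: iteration yields whole batches
images, labels = next(iter(train_dataloader))
print(images.shape)  # torch.Size([64, 1, 32, 32])
print(labels.shape)  # torch.Size([64])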

The highly regrettable argument names for datasets.MNIST ultimately just say that we should download the data and then apply our transform object. The transform, in turn, pads each 28x28 PIL image out to 32x32 (two pixels on every side) and then converts it into a PyTorch tensor.
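
If you want to see the transform's effect directly, here is a quick sketch (it assumes the dataset was already downloaded by the code above):

# Load the dataset with no transform; samples come back as PIL images
raw = datasets.MNIST(root="mnist", train=True, download=True)
pil_image, label = raw[0]
print(pil_image.size)              # (28, 28) - the original MNIST image
print(transform(pil_image).shape)  # torch.Size([1, 32, 32]) - padded tensor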

In particular, the train argument has nothing to do with actually performing training; it selects whether the pre-designated training split or the pre-designated test split is loaded.
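
You can confirm which split you received by checking each dataset's length; MNIST's pre-designated splits contain 60,000 training images and 10,000 test images:

print(len(training_data))  # 60000
print(len(test_data))      # 10000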

Notice that you must choose your batch size when you create your DataLoader.
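
If you later want a different batch size, say for evaluation, you create a second DataLoader over the same Dataset. A sketch (the name eval_dataloader is just for illustration):

# A second loader over the same underlying Dataset, with a larger batch
eval_dataloader = DataLoader(test_data, batch_size=256)

# The batch count follows from the batch size: ceil(10000 / 256) = 40
print(len(eval_dataloader))  # 40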