PyTorch claims to be “a replacement for NumPy to use the power of GPUs and other accelerators.” The docs certainly make a fuss about how most code that uses NumPy arrays can handle PyTorch tensors just fine.
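That interchangeability claim mostly holds for plain array math. As a quick sanity check (my own toy function, not anything from the PyTorch docs), the same code runs on both kinds of object:

import numpy as np
import torch

def normalize(x):
    # Written with NumPy in mind: elementwise ops plus reductions.
    return (x - x.mean()) / x.std()

print(normalize(np.array([1.0, 2.0, 3.0])))      # works on a NumPy array
print(normalize(torch.tensor([1.0, 2.0, 3.0])))  # works on a PyTorch tensor too

(The outputs differ slightly because torch.std defaults to the unbiased estimator, while np.std does not.)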

So I kept looking at code samples like the following, and being really puzzled as to how optimizer.step() knew what changes to make:

import torch.nn as nn
import torch.optim as optim

nn_model = MyClass()   # some nn.Module subclass, defined elsewhere
criterion = nn.MSELoss()
optimizer = optim.SGD(nn_model.parameters(), lr=0.2, momentum=0.9)

for epoch in range(num_epochs):
    y_pred = nn_model(X)           # forward pass
    loss = criterion(y_pred, y)    # how far off were we?

    optimizer.zero_grad()          # clear gradients from the previous iteration
    loss.backward()                # compute new gradients
    optimizer.step()               # update the model's parameters

A NumPy array is a value object: it stores state, and that’s about it. So it seemed like we should have to explicitly tell the optimizer about the loss (or vice versa) at some point, but we don’t. The closest thing to an interaction is that the optimizer is constructed from nn_model.parameters(), and the loss is computed from the model’s prediction y_pred. So what’s happening?

Turns out that y_pred has a property, grad_fn, that links back to the last operation that produced or modified this tensor. (PyTorch tensors are mutable.) This operation is one entry in a “tape” of operations that the tensor has undergone during the forward pass. The “tape” is really a DAG, in which a given tensor is connected, via the operations applied to it, to all of the tensors it has interacted with.
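You can see this by poking at a tiny stand-in model (the nn.Linear below is my own minimal example, not the MyClass from above):

import torch
import torch.nn as nn

model = nn.Linear(3, 1)        # stand-in for nn_model
X = torch.randn(4, 3)
y = torch.randn(4, 1)

y_pred = model(X)
loss = nn.MSELoss()(y_pred, y)

print(y_pred.grad_fn)   # e.g. <AddmmBackward0 object at 0x...>, the linear layer's op
print(loss.grad_fn)     # e.g. <MseLossBackward0 object at 0x...>, the loss computation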

When we call loss.backward(), PyTorch walks that tape backwards from loss and accumulates a gradient into the .grad attribute of every parameter (more precisely, every leaf tensor with requires_grad=True) that, by a chain of interactions, contributed to it, which includes all the parameters of nn_model. This machinery is called Autograd. All optimizer.step() then has to do is read the .grad of each parameter it was handed and update it.
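As a rough mental model (this sgd_step is my own sketch of the core idea, not the real optim.SGD, and it ignores momentum, weight decay, and so on), the update step boils down to:

import torch

def sgd_step(parameters, lr):
    # backward() has already left a gradient on each parameter's .grad,
    # so all the "optimizer" needs is the parameters themselves.
    with torch.no_grad():            # the update itself shouldn't be recorded on the tape
        for p in parameters:
            if p.grad is not None:
                p -= lr * p.grad     # nudge the parameter against its gradient

That’s why the training loop never has to introduce the optimizer to the loss: backward() leaves the gradients sitting on the parameter tensors, and the optimizer already holds references to those same tensors.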

Claude AI suggested this example:

import torch

a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
b = a * 2       # elementwise: [2., 4., 6.]
c = b.mean()    # scalar: 4.

Creating a with requires_grad=True tells autograd to track every operation involving it. Because b and c are computed from a, they are automatically pulled into the graph as well (each gets its own grad_fn).

It describes the state of these tensors’ graphs as follows:

  1. Tensor a:
    • a is a leaf tensor (created directly, not as a result of an operation).
    • It has requires_grad=True.
    • Its graph consists only of itself as it’s the starting point.
  2. Tensor b:
    • b is created from the operation a * 2.
    • Its graph includes:
      • The multiplication operation (represented by a MulBackward grad_fn)
      • A connection back to a
  3. Tensor c:
    • c is created from the operation b.mean().
    • Its graph includes:
      • The mean operation (represented by a MeanBackward grad_fn)
      • The entire graph of b, which in turn includes a

It also offered the following ASCII visualization:

a (leaf) <--- MulBackward <--- MeanBackward --- c
                |
                b
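You can confirm this picture directly; grad_fn.next_functions is how each backward node links to the nodes for its inputs:

print(a.grad_fn)                  # None: a is a leaf, no operation produced it
print(b.grad_fn)                  # e.g. <MulBackward0 object at 0x...>
print(c.grad_fn)                  # e.g. <MeanBackward0 object at 0x...>
print(c.grad_fn.next_functions)   # e.g. ((<MulBackward0 object at 0x...>, 0),), the link back to b's node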

Now let’s add an interaction between a and c:

d = a + c  # New operation combining a and c

No cycle is formed, because the new operation doesn’t modify a or c; it creates a fresh tensor d whose node simply points back at both of them. The graph now looks like:

a (leaf) <--- MulBackward <--- MeanBackward --- c
    ^            |               ^
    |            b               |
    |                            |
    +--------------------------- AddBackward --- d
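To watch gradients flow along both paths at once, we can finish the example with a backward pass (backward() needs a scalar, so we sum d first):

d.sum().backward()
print(a.grad)   # tensor([3., 3., 3.]): 1 from the direct a -> d path,
                # plus 2 via a -> b -> c -> d, since c = mean(2a)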