I thought I had somehow forgotten fundamental ideas from vector calculus, because I kept seeing “gradients” written with partial derivative notation. For example, the following appears in the Wikipedia page about residual networks (emphasis mine):
…the gradient computation of a shallower layer, $\frac{\partial \mathcal{L}}{\partial x_\ell}$, always has a later term $\frac{\partial \mathcal{L}}{\partial x_L}$ that is directly added. Even if the gradients of the $F(x_i)$ terms are small, the total gradient $\frac{\partial \mathcal{L}}{\partial x_\ell}$ resists vanishing thanks to the added term $\frac{\partial \mathcal{L}}{\partial x_L}$.
I double-checked, and I’m not going crazy. A partial derivative is scalar-valued. The gradient of a function with respect to a vector (or, more generally, a tensor) is the vector (tensor) of partial derivatives with respect to each of its components.
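To spell out the distinction for a scalar-valued function $f : \mathbb{R}^n \to \mathbb{R}$ (my own generic example, not notation from the quoted article): each partial derivative is a single number, and the gradient stacks all $n$ of them into a vector.

$$
\frac{\partial f}{\partial x_i}(\mathbf{x}) \in \mathbb{R},
\qquad
\nabla f(\mathbf{x}) =
\begin{pmatrix}
\dfrac{\partial f}{\partial x_1}(\mathbf{x}) \\
\vdots \\
\dfrac{\partial f}{\partial x_n}(\mathbf{x})
\end{pmatrix}
\in \mathbb{R}^n .
$$

Under that definition, an expression like $\frac{\partial \mathcal{L}}{\partial x_\ell}$, where $x_\ell$ is an entire layer’s activation vector, is really the gradient $\nabla_{x_\ell} \mathcal{L}$ dressed up in partial derivative notation.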
Unfortunately, this notational abuse was introduced at some point early in the history of neural networks, and it has persisted.