We have three weight matrices: the weights from input to hidden state; the weights from hidden state to hidden state; and the weights from hidden state to output. We’re going to update them by some flavor of gradient descent as usual.
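To make that concrete, here is a minimal sketch of those three matrices in NumPy. The names (`W_xh`, `W_hh`, `W_hy`), the sizes, and the plain SGD update are illustrative assumptions, not anything the text fixes.

```python
import numpy as np

# Illustrative sizes; nothing here depends on these particular numbers.
input_size, hidden_size, output_size = 8, 16, 4
rng = np.random.default_rng(0)

W_xh = rng.normal(0, 0.1, (hidden_size, input_size))   # input  -> hidden
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))  # hidden -> hidden
W_hy = rng.normal(0, 0.1, (output_size, hidden_size))  # hidden -> output

def sgd_step(param, grad, lr=0.01):
    """Plain gradient descent; any flavor of optimizer would slot in here."""
    param -= lr * grad
    return param
```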

The tricky thing to understand is how we get the gradient in the first place. The key is to realize that, once you’ve actually run an input through the RNN, there is a definite number of “layers.” If we assume that only the final hidden state is used for the output, then (in retrospect) the function as executed looks like a fully connected feedforward network, at least as far as backpropagation is concerned. The one constraint is that the weights at every unrolled layer are the same shared matrices, so they must still all match after our parameter update, and that update needs to reflect what happened at every layer.
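As a sketch of that unrolled view (using the same illustrative names and tanh hidden units as an assumed activation), the forward pass below caches every hidden state; once the sequence length is fixed, it is just a stack of identical layers feeding a single output at the end.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy):
    """Unroll the RNN over a sequence xs (a list of input vectors).

    Once the sequence length T is fixed, this is a T-layer feedforward
    pass with the same (tied) weights at every layer; only the final
    hidden state feeds the output.
    """
    h = np.zeros(W_hh.shape[0])
    hs = [h]                          # cache every hidden state for backprop
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h)
        hs.append(h)
    y = W_hy @ hs[-1]                 # output from the final hidden state only
    return y, hs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W_xh = rng.normal(0, 0.1, (16, 8))
    W_hh = rng.normal(0, 0.1, (16, 16))
    W_hy = rng.normal(0, 0.1, (4, 16))
    xs = [rng.normal(size=8) for _ in range(5)]   # a length-5 sequence
    y, hs = rnn_forward(xs, W_xh, W_hh, W_hy)
    print(y.shape, len(hs))           # (4,) and 6: initial state plus one per step
```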

The clues are in the constraint. If the weights must all match and the update must reflect every layer, we can just do the obvious thing and add up each layer’s contribution to the gradient (the change in error with respect to the weights). And indeed, that is literally all there is to implementing backpropagation through time: We calculate the gradient of the loss between the prediction and the target at the output, then backpropagate that gradient into the final hidden layer. From there we proceed as normal, calculating the gradient of the error with respect to the outputs and inputs of each layer. External to this, we accumulate the gradient value for each weight across all the layers. We use this externally accumulated gradient to update the weights exactly as usual.
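Putting it together, here is a self-contained sketch of one backpropagation-through-time step under the same assumptions (tanh hidden units, a squared-error loss, plain SGD, only the final hidden state feeding the output): the weight gradients are accumulated across every unrolled layer and applied once.

```python
import numpy as np

def bptt_step(xs, target, W_xh, W_hh, W_hy, lr=0.01):
    """One training step: forward pass, backprop through time, single update.

    Assumes tanh hidden units and a squared-error loss (illustrative choices).
    Gradients for the tied weights are accumulated across all time steps
    and applied once at the end.
    """
    # ----- forward pass, caching every hidden state -----
    h = np.zeros(W_hh.shape[0])
    hs = [h]
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h)
        hs.append(h)
    y = W_hy @ hs[-1]

    # ----- gradient of the loss at the output -----
    dy = y - target                        # dL/dy for L = 0.5 * ||y - target||^2
    dW_hy = np.outer(dy, hs[-1])
    dh = W_hy.T @ dy                       # gradient flowing into the final hidden state

    # ----- backpropagate through the unrolled "layers" -----
    dW_xh = np.zeros_like(W_xh)            # externally accumulated gradients
    dW_hh = np.zeros_like(W_hh)
    for t in reversed(range(len(xs))):
        dpre = dh * (1.0 - hs[t + 1] ** 2)     # through tanh at step t
        dW_xh += np.outer(dpre, xs[t])         # accumulate, don't overwrite
        dW_hh += np.outer(dpre, hs[t])
        dh = W_hh.T @ dpre                     # pass the gradient to step t-1

    # ----- one update using the accumulated gradients -----
    W_xh -= lr * dW_xh
    W_hh -= lr * dW_hh
    W_hy -= lr * dW_hy
    return 0.5 * float(dy @ dy)                # report the loss
```

Because the weights are shared across the unrolled layers, summing the per-layer contributions is exactly the gradient of the loss with respect to the shared matrices, so a single update keeps all the “copies” identical.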