We have three weight matrices: the weights from the input to the hidden state, the recurrent weights from the hidden state back to itself, and the weights from the hidden state to the output.
The tricky thing to understand is how we get the gradient in the first place. The key is to realize that, once you’ve actually run an input through the RNN, there is a definite number of “layers”: one per time step. If we assume that only the final hidden state is used for the output, then, in retrospect, the function as executed looks like a fully connected feedforward network, at least as far as backpropagation is concerned (see the sketch below). The constraint is that the shared weights must still all match at the end of our parameter update, and that the update must reflect the contribution of every layer.
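To make the unrolling concrete, here is a minimal sketch, not taken from the text above, of running a fixed-length input through a vanilla tanh RNN in NumPy; the matrix names, sizes, and random initialization are illustrative assumptions. Once the loop has executed, the computation is simply a stack of T layers that happen to share the same weights.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_hid = 5, 3, 4                           # sequence length, input size, hidden size

W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))  # hidden-to-hidden (recurrent) weights
b_h = np.zeros(n_hid)

xs = rng.normal(size=(T, n_in))                    # one input vector per time step
hs = [np.zeros(n_hid)]                             # hs[0] is the initial hidden state

for t in range(T):                                 # one "layer" per time step, same weights each time
    hs.append(np.tanh(W_xh @ xs[t] + W_hh @ hs[-1] + b_h))

# hs[-1] is the final hidden state that would feed the output layer
```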
The clues are in the constraint. If we need the weights to match, and we need them to reflect all the layers, we can do the obvious thing and add up the gradient (the change in the error with respect to the weights) at every layer. And indeed, that is literally all there is to implementing backpropagation through time: we calculate the gradient of the loss between the prediction and the target at the output step, then backpropagate that gradient to each earlier layer in turn, summing each weight matrix’s gradient contributions as we go.
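Putting the two halves together, here is a hedged end-to-end sketch of backpropagation through time for the case described above, where only the final hidden state feeds the output. The squared-error loss and all of the variable names are assumptions made for illustration; the essential point is that the shared matrices accumulate a gradient contribution from every time step.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_hid, n_out = 5, 3, 4, 2

W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))    # input-to-hidden
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))   # hidden-to-hidden
W_hy = rng.normal(scale=0.1, size=(n_out, n_hid))   # hidden-to-output
b_h, b_y = np.zeros(n_hid), np.zeros(n_out)

xs = rng.normal(size=(T, n_in))                     # input sequence
target = rng.normal(size=n_out)                     # assumed regression target

# Forward pass: unroll into T shared-weight layers, keeping every hidden state.
hs = [np.zeros(n_hid)]
for t in range(T):
    hs.append(np.tanh(W_xh @ xs[t] + W_hh @ hs[-1] + b_h))
y = W_hy @ hs[-1] + b_y                             # output from the final state only
loss = 0.5 * np.sum((y - target) ** 2)              # assumed squared-error loss

# Backward pass: sum each weight matrix's gradient over all time steps.
dW_xh, dW_hh, dW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
db_h, db_y = np.zeros_like(b_h), np.zeros_like(b_y)

dy = y - target                                     # dL/dy at the output
dW_hy += np.outer(dy, hs[-1])
db_y += dy
dh = W_hy.T @ dy                                    # gradient flowing into the final hidden state

for t in reversed(range(T)):                        # walk back through the unrolled layers
    da = dh * (1.0 - hs[t + 1] ** 2)                # back through the tanh at step t
    dW_xh += np.outer(da, xs[t])                    # same matrices every step, so gradients add
    dW_hh += np.outer(da, hs[t])
    db_h += da
    dh = W_hh.T @ da                                # send the gradient one step further back
```

A single update of W_xh, W_hh, and W_hy using these summed gradients is what keeps the weights identical across all of the unrolled layers while still reflecting every one of them.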