• The batch size
  • The number of attention heads
  • The length of the sequence
  • The dimension of the inputs and outputs to each block in the model
  • The context vector length for each attention head (these dimensions are made concrete in the sketch below)
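
To make these dimensions concrete, here is a minimal sketch of how they appear as tensor shapes in a multi-head attention computation. It assumes PyTorch, and the specific values (batch size 8, 12 heads, sequence length 1024, model dimension 768) are hypothetical examples chosen only for illustration:

```python
import torch

# Hypothetical example values for the dimensions listed above.
batch_size = 8                    # the batch size
num_heads = 12                    # the number of attention heads
seq_len = 1024                    # the length of the sequence
d_model = 768                     # input/output dimension of each block
head_dim = d_model // num_heads   # context vector length per head (64 here)

# A batch of token embeddings entering an attention block.
x = torch.randn(batch_size, seq_len, d_model)

# Project to queries, keys, and values, then split into heads.
qkv_proj = torch.nn.Linear(d_model, 3 * d_model)
q, k, v = qkv_proj(x).chunk(3, dim=-1)
q = q.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
k = k.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
v = v.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)

# Attention weights have shape (batch_size, num_heads, seq_len, seq_len).
attn = (q @ k.transpose(-2, -1)) / head_dim**0.5
attn = attn.softmax(dim=-1)

# Per-head context vectors: (batch_size, num_heads, seq_len, head_dim).
context = attn @ v

# Merging the heads restores the shape (batch_size, seq_len, d_model).
out = context.transpose(1, 2).reshape(batch_size, seq_len, d_model)
print(out.shape)  # torch.Size([8, 1024, 768])
```

Note how the per-head context vector length is just the block dimension divided by the number of heads, so splitting into heads and merging them back leaves the input and output shapes of the block unchanged.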