Causal language modeling (CLM) refers to any language modeling process that predicts future tokens based only on the tokens that precede them. As far as I know, the only learning objective associated with CLM is next-word prediction. There are two classes of models capable of CLM: autoregressive recurrent networks and transformer models based only on the decoder stack, such as the GPT series.
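Concretely, that next-word objective is just cross-entropy on tokens shifted by one position: the logits produced at position t are scored against the token at position t + 1. A minimal sketch, assuming PyTorch (the function name and tensor shapes are illustrative, not tied to any particular model):

```python
import torch.nn.functional as F

def next_token_loss(logits, input_ids):
    """Cross-entropy loss for next-word prediction.

    logits:    (batch, seq_len, vocab_size) scores from any causal LM.
    input_ids: (batch, seq_len) token ids that were fed to the model.
    Position t is trained to predict the token at position t + 1, so the
    last logit is dropped and the labels are shifted left by one.
    """
    shift_logits = logits[:, :-1, :]   # predictions for positions 0 .. T-2
    shift_labels = input_ids[:, 1:]    # the "next" tokens 1 .. T-1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```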
- Recurrent networks: Out of the box, a recurrent network will tend to use the entirety of the input sequence to predict subsequent positions. To achieve true causal modeling, the network is fed one token at a time and trained to predict the token that follows; during training, the ground-truth token, rather than the model's own prediction, is fed in at each step, a procedure known as teacher forcing. A sketch of one such training step follows this list.
- Transformer decoders: Thanks to masked self-attention, decoder-only transformers are causal by default; the mask prevents each position from attending to any later one (see the attention sketch after this list).
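For the recurrent case, teacher forcing amounts to feeding the whole ground-truth sequence through the network and scoring every position against the token that follows it. A minimal sketch, assuming PyTorch; the class and function names, the GRU, and the layer sizes are illustrative choices, not a reference implementation:

```python
import torch.nn as nn
import torch.nn.functional as F

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, input_ids):
        # The GRU reads left to right, so the hidden state at position t
        # only depends on tokens 0 .. t.
        hidden_states, _ = self.rnn(self.embed(input_ids))
        return self.head(hidden_states)  # (batch, seq_len, vocab_size)

def training_step(model, optimizer, input_ids):
    # Teacher forcing: the ground-truth tokens are used as inputs at every
    # step (not the model's own predictions), and each position is trained
    # to predict the token that follows it.
    logits = model(input_ids)
    loss = F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At generation time the same network is instead fed its own previous prediction at each step, which is where the one-token-at-a-time behaviour actually shows up.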
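For the decoder case, the masking is just a strictly upper-triangular mask applied to the attention scores before the softmax, so position t receives zero weight from every position after t. A minimal single-head sketch, again assuming PyTorch (the function name and shapes are illustrative):

```python
import torch

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal (look-ahead) mask.

    q, k, v: (batch, seq_len, head_dim). The mask is what makes a
    decoder-only transformer causal by construction.
    """
    seq_len, head_dim = q.size(1), q.size(-1)
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5   # (batch, seq, seq)
    # Strictly-future positions are set to -inf so they get zero
    # probability after the softmax.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```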