ELMo, BERT, and GPT are all major language models. In their time, each was considered a large language model because it had many more parameters than, say, the earliest attention-based models. Compared to modern LLMs, however, they are all relatively small.

Of these, ELMo and BERT are the most directly comparable, as both are bi-directional models that create contextual embeddings. These embeddings are typically fed into a final layer that performs a specific task, such as classification. The principal difference is architectural: ELMo is a deep bidirectional LSTM, whereas BERT is a Transformer encoder. In practice, ELMo is typically used as a frozen feature extractor, while BERT is usually fine-tuned end-to-end along with its task-specific layer, which has made it extremely adaptable.
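To make the distinction concrete, here is a minimal sketch of the two usage patterns, assuming PyTorch and the Hugging Face `transformers` package, with `bert-base-uncased` standing in for the contextual encoder: a frozen encoder feeding a trainable task head (the typical ELMo recipe) versus fine-tuning the encoder jointly with the head (the typical BERT recipe). The classifier head and the `FINE_TUNE` flag are illustrative, not part of either model.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

# Task-specific layer: a simple binary classifier over the [CLS] embedding.
classifier = nn.Linear(encoder.config.hidden_size, 2)

# False -> ELMo-style frozen features; True -> BERT-style end-to-end fine-tuning.
FINE_TUNE = False
if not FINE_TUNE:
    for param in encoder.parameters():
        param.requires_grad = False

# Only parameters that still require gradients are handed to the optimizer.
trainable = [p for p in list(encoder.parameters()) + list(classifier.parameters())
             if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)

# One forward pass: contextual embeddings in, task prediction out.
batch = tokenizer(["a tiny example sentence"], return_tensors="pt")
outputs = encoder(**batch)
cls_embedding = outputs.last_hidden_state[:, 0]  # [CLS] token representation
logits = classifier(cls_embedding)               # task-specific prediction
```

With `FINE_TUNE = False` only the small classifier is updated during training; flipping the flag updates the full encoder as well, which is where BERT's adaptability comes from.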

GPT is a Transformer decoder, and is primarily effective for causal language modeling. Its generative character makes it more recognizably similar to the original Vaswani Transformer, but using only a decoder gives it a radically different structural prior: the Vaswani model assumes that the source and the target represent the same thing in different encodings, whereas GPT assumes that the target follows causally from the source (and is in the same encoding), with each token predicted from the tokens before it.
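That causal prior is expressed entirely through the attention mask. The sketch below, in plain PyTorch with hypothetical tensor sizes, shows the upper-triangular mask that makes decoder-only self-attention autoregressive: position t can attend only to positions 0..t, so the "target" is just the continuation of the "source" within one sequence.

```python
import torch

seq_len, d_model = 5, 8
x = torch.randn(1, seq_len, d_model)   # one sequence of token embeddings

# Self-attention: queries, keys, and values all come from the same sequence.
q = k = v = x
scores = q @ k.transpose(-2, -1) / d_model ** 0.5

# Causal mask: strictly upper-triangular entries (future positions) are blocked.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))

weights = scores.softmax(dim=-1)        # row t attends only to positions 0..t
context = weights @ v                   # representation used to predict token t+1
```

An encoder-decoder model would add a second, unmasked cross-attention step from target to source; the decoder-only design drops that distinction and treats everything as one left-to-right stream.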