• Tokenization = splitting up of words into sub-units (“tokens”)
    • e.g., tokenization → token + ization
  • Essential pre-processing for most NLP tasks
  • Knowledge-based tokenizers began to give way to statistical approaches in the 2010s
  • State of the art typically uses fine-tuned BERT variants
  • Byte-pair encoding (BPE) is neither statistical nor knowledge-based, but it is widely used (a worked merge example follows this list)
    • e.g., OpenAI uses byte-pair encoding for its ChatGPT APIs (usage sketch after the list)
    • Cheaper and faster than LM-based tokenization
    • The rationale is that the downstream model itself will learn the deeper relationships between tokens
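
Not on the original slide, but as a rough illustration of how BPE builds its vocabulary, the Python sketch below implements the standard greedy merge loop: count adjacent symbol pairs, merge the most frequent pair, repeat. The tiny corpus counts and the merge budget are made-up values purely for demonstration.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite the vocabulary so the chosen pair becomes a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy word frequencies, pre-split into characters plus an end-of-word marker.
vocab = {
    "t o k e n </w>": 5,
    "t o k e n i z a t i o n </w>": 3,
    "t o k e n s </w>": 2,
}

for step in range(6):                    # merge budget chosen arbitrarily for the demo
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # greedily pick the most frequent pair
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best[0]} + {best[1]} -> {''.join(best)}")

print(vocab)                             # "token" has become a single symbol in every word
```

The merge list learned this way is simply replayed on new text at inference time, which is what makes BPE cheap and fast compared with running a learned model to tokenize.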
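For the OpenAI example, a minimal usage sketch assuming the open-source tiktoken package is installed (`pip install tiktoken`); cl100k_base is the BPE vocabulary used by the GPT-3.5/GPT-4 chat models.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # BPE vocabulary used by OpenAI chat models
ids = enc.encode("tokenization")             # text -> list of integer token ids
pieces = [enc.decode([i]) for i in ids]      # decode each id back to its text piece
print(ids, pieces)
assert enc.decode(ids) == "tokenization"     # encoding/decoding round-trips exactly
```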