- Tokenization = splitting words up into sub-units (“tokens”)
- e.g., tokenization ⇒ token + ization
- Essential pre-processing for most NLP tasks
- Knowledge-based tokenizers began to give way to statistical approaches in the 2010s
- State-of-the-art (LM-based) tokenization typically uses fine-tuned BERT variants
- Byte-pair encoding (BPE) is neither statistical nor knowledge-based, but is widely used (a toy sketch and a usage example follow this list)
- e.g., OpenAI uses byte-pair encoding for ChatGPT APIs
- Cheaper and faster than LM-based tokenization
- The rationale is that the downstream model will itself learn the deeper relationships between these tokens
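For concreteness, here is a minimal sketch of the classic BPE learning loop in the spirit of Sennrich et al.'s subword approach: start from single characters and repeatedly merge the most frequent adjacent pair of symbols. The tiny corpus, the `</w>` end-of-word marker, and the helper names are illustrative choices, not any particular library's API.

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs across a (word-as-symbol-tuple -> frequency) vocabulary."""
    counts = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def apply_merge(pair, vocab):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Start from single characters, with an end-of-word marker so merges
    # never cross word boundaries.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)  # greedily pick the most frequent pair
        vocab = apply_merge(best, vocab)
        merges.append(best)
    return merges

if __name__ == "__main__":
    corpus = "tokenization normalization organization token tokens"
    for a, b in learn_bpe(corpus, num_merges=12):
        print(a, "+", b)
```

Applying the learned merges to new text is just a greedy table lookup, which is why BPE is cheaper and faster than running a language model to decide token boundaries.

For OpenAI's own tokenization, its open-source `tiktoken` library exposes the BPE vocabularies used by its APIs. A brief usage sketch follows; the choice of the `cl100k_base` encoding is an assumption here, and the exact split of "tokenization" depends on the vocabulary loaded.

```python
import tiktoken

# Load one of OpenAI's published BPE vocabularies (assumed choice: cl100k_base).
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("tokenization")
print(ids)                             # integer token ids
print([enc.decode([i]) for i in ids])  # sub-word strings, e.g. something like ['token', 'ization']
```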