- Tokenization = splitting words up into sub-units (“tokens”)
- e.g., tokenization ⇒ token + ization
- Essential pre-processing for most NLP tasks
- Knowledge-based tokenizers began to give way to statistical approaches in the 2010s
- State-of-the-art (LM-based) tokenization typically uses fine-tuned BERT variants
- Byte-pair encoding (BPE) is neither statistical nor knowledge-based, but is widely used (a toy sketch and a usage example follow this list)
- e.g., OpenAI uses byte-pair encoding for ChatGPT APIs
- Cheaper and faster than LM-based tokenization
- The rationale is that the downstream model will itself learn the deeper relationships between these tokens
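For concreteness, here is a minimal sketch of the classic BPE learning loop in the spirit of Sennrich et al.'s subword approach: start from single characters and repeatedly merge the most frequent adjacent pair of symbols. The tiny corpus, the `</w>` end-of-word marker, and the helper names are illustrative choices, not any particular library's API.

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs across a (word-as-symbol-tuple -> frequency) vocabulary."""
    counts = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def apply_merge(pair, vocab):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Start from single characters, with an end-of-word marker so merges
    # never cross word boundaries.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)  # greedily pick the most frequent pair
        vocab = apply_merge(best, vocab)
        merges.append(best)
    return merges

if __name__ == "__main__":
    corpus = "tokenization normalization organization token tokens"
    for a, b in learn_bpe(corpus, num_merges=12):
        print(a, "+", b)
```

Applying the learned merges to new text is just a greedy table lookup, which is why BPE is cheaper and faster than running a language model to decide token boundaries.

For OpenAI's own tokenization, its open-source `tiktoken` library exposes the BPE vocabularies used by its APIs. A brief usage sketch follows; the choice of the `cl100k_base` encoding is an assumption here, and the exact split of "tokenization" depends on the vocabulary loaded.

```python
import tiktoken

# Load one of OpenAI's published BPE vocabularies (assumed choice: cl100k_base).
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("tokenization")
print(ids)                             # integer token ids
print([enc.decode([i]) for i in ids])  # sub-word strings, e.g. something like ['token', 'ization']
```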