Stemmers and lemmatizers are both tools for reducing a word to a base form, so that all of its variants can be recognized as the “same” concept. The difference is that a stemmer applies a fixed set of rules that may mutilate certain words, while a lemmatizer tries to transform the word in a way that respects its underlying meaning. For example, the word caring might be stemmed as car, but it would be lemmatized as care.
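To make the contrast concrete, here is a minimal sketch using NLTK’s Lancaster stemmer and WordNet lemmatizer. It assumes NLTK is installed and the WordNet data has been downloaded, and the exact outputs depend on which stemmer and lemmatizer you choose.

```python
import nltk
from nltk.stem import LancasterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data

stemmer = LancasterStemmer()
lemmatizer = WordNetLemmatizer()

word = "caring"

# The stemmer strips suffixes by rule, which can leave a truncated stem.
print(stemmer.stem(word))                   # car

# The lemmatizer maps the word to a dictionary headword; it needs to be
# told the part of speech ("v" for verb) to resolve this form correctly.
print(lemmatizer.lemmatize(word, pos="v"))  # care
```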

Until the 2010s, stemmers tended to predominate: they were much cheaper to run, and they handled novel or unusual words better than early lemmatizers, which depended on dictionary lookups. In the Python ecosystem, NLTK was the go-to library for stemmers. Over time, improved statistical methods (especially those based on deep learning) allowed lemmatizers to close the gap on unusual words, and falling computational costs made them feasible for most applications. With the advent of Transformer models that could be fine-tuned for lemmatization, stemmers became largely obsolete for most practitioners.

State-of-the-art lemmatizers are typically accessed through spaCy or the Hugging Face transformers library. As of 2024, the best-performing lemmatizers are all built on the Transformer architecture, with many of them using variants of BERT.
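As a sketch of the spaCy route, the snippet below lemmatizes a short sentence with the small English pipeline (en_core_web_sm); a Transformer-based pipeline such as en_core_web_trf can be swapped in, and the exact lemmas depend on the model used.

```python
import spacy

# Assumes the small English pipeline has been installed with:
#   python -m spacy download en_core_web_sm
# A Transformer-based pipeline (e.g. en_core_web_trf) is loaded the same way.
nlp = spacy.load("en_core_web_sm")

doc = nlp("The children were caring for the cats.")
for token in doc:
    print(f"{token.text:10} -> {token.lemma_}")
# Typical output includes: children -> child, were -> be, caring -> care, cats -> cat
```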