Suppose we have 100M items in a database, and we want to create a search engine. The user runs a query. They get 10 items on the first page. We have a number of options for establishing an ideal ranking of those items, and we could, in principle, extend that to the first few pages of results. But how would we establish a definitive ranking over the whole collection?
In short, we wouldn’t. There’s simply no feasible way to establish a definitive ground-truth ranking for most large search problems, especially as “relevant” is (to some extent) often in the eye of the beholder. So what is a relevance engineer to do?
The basic strategy is to create a multi-pass ranker, and to worry about relevance for each pass separately. The first pass selects some reasonably large number of candidates, such as 1,000. Very often, these early candidates are selected primarily according to content, and the pass relies on cheap, efficient heuristics, such as a defeatist nearest-neighbor search over relatively low-dimensional embeddings. Given ground-truth binary labels of relevance, such a first-pass ranker can be cheaply evaluated with metrics such as mean average precision.
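As a rough sketch of what such a first pass and its offline evaluation might look like, the snippet below uses a brute-force cosine-similarity scan as a stand-in for a real approximate nearest-neighbor index, plus a mean-average-precision helper. The function and variable names are illustrative, not tied to any particular system.

```python
import numpy as np

def first_pass_candidates(query_emb, item_embs, k=1000):
    """Select top-k candidates by cosine similarity.

    A brute-force stand-in for the cheap first pass described above;
    in production this would be an approximate nearest-neighbor index
    rather than a full scan over 100M items.
    """
    q = query_emb / np.linalg.norm(query_emb)
    items = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    scores = items @ q
    # argpartition finds the top-k without sorting the whole array.
    top_k = np.argpartition(-scores, min(k, scores.size - 1))[:k]
    return top_k[np.argsort(-scores[top_k])]

def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query, given binary relevance labels."""
    relevant = set(relevant_ids)
    hits, score = 0, 0.0
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant:
            hits += 1
            score += hits / rank
    return score / max(len(relevant), 1)

def mean_average_precision(runs):
    """MAP over (ranked_ids, relevant_ids) pairs, one per query."""
    return float(np.mean([average_precision(r, rel) for r, rel in runs]))
```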
These candidates can then be more carefully re-ranked in one or more additional passes. For large datasets, there are usually at least two more: the initial candidates are filtered in some relatively inexpensive way before an expensive algorithm is applied to a final candidate set. Given the small size of that final set, even a computationally expensive model becomes affordable per query.
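A minimal sketch of such a cascade is shown below; `cheap_score` and `expensive_score` are hypothetical placeholders for whatever scoring functions a given system uses, say a small feature-based model versus a large neural re-ranker.

```python
def rerank_cascade(query, candidate_ids,
                   cheap_score, expensive_score,
                   keep_after_filter=100, final_k=10):
    """Apply two additional passes over the first-pass candidates."""
    # Pass 2: an inexpensive filter narrows ~1,000 candidates to ~100.
    filtered = sorted(candidate_ids,
                      key=lambda i: cheap_score(query, i),
                      reverse=True)[:keep_after_filter]
    # Pass 3: the expensive model ranks only the small surviving set.
    final = sorted(filtered,
                   key=lambda i: expensive_score(query, i),
                   reverse=True)[:final_k]
    return final
```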
All this introduces considerable indirection. In practice, therefore, large-scale retrieval systems tend to require ongoing experimentation, such as multi-armed bandits, to assess how well they are meeting online (business) metrics. This is, of course, in addition to monitoring for concept drift.
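To make the online-experimentation piece concrete, here is a minimal epsilon-greedy bandit for routing traffic between ranker variants, assuming some observed per-query reward such as a click or conversion. This is only a sketch of the idea, not a recommendation of any particular bandit algorithm or metric.

```python
import random

class EpsilonGreedyBandit:
    """Minimal epsilon-greedy bandit for comparing ranker variants online.

    Each "arm" is a ranking configuration; the reward is whatever online
    metric the business cares about (clicks, conversions, dwell time).
    """
    def __init__(self, arms, epsilon=0.1):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {a: 0 for a in self.arms}
        self.rewards = {a: 0.0 for a in self.arms}

    def select_arm(self):
        # Explore with probability epsilon, otherwise exploit the best mean.
        if random.random() < self.epsilon:
            return random.choice(self.arms)
        return max(self.arms,
                   key=lambda a: self.rewards[a] / max(self.counts[a], 1))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.rewards[arm] += reward

# Usage: route each query to bandit.select_arm(), serve that ranker's
# results, then call bandit.update(arm, observed_reward).
```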