At the coldest of starts
At the very start of a retrieval-based application, search and recommendations will work basically the same way: lacking any information on user behavior, we can only depend on item similarity. This is pure content-based filtering.
Even if we know a lot about our items, we’ll probably want to choose a very simple scoring strategy so that we can launch an MVP and begin collecting user data. If we have a very small number of items, even a boolean predicate search may be enough.
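To make that concrete, here is a minimal sketch of what boolean predicate search can look like at this stage; the catalog and its fields (category, in_stock, title) are hypothetical.

```python
# A minimal sketch of boolean predicate "search" over a tiny catalog.
# The items and their fields are hypothetical stand-ins.
items = [
    {"id": 1, "title": "Red running shoes", "category": "shoes", "in_stock": True},
    {"id": 2, "title": "Blue trail shoes",  "category": "shoes", "in_stock": False},
    {"id": 3, "title": "Wool hiking socks", "category": "socks", "in_stock": True},
]

def search(category=None, in_stock=None, keyword=None):
    """Return items matching every predicate that was supplied."""
    results = items
    if category is not None:
        results = [i for i in results if i["category"] == category]
    if in_stock is not None:
        results = [i for i in results if i["in_stock"] == in_stock]
    if keyword is not None:
        results = [i for i in results if keyword.lower() in i["title"].lower()]
    return results

print(search(category="shoes", in_stock=True))  # -> the red running shoes only
```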
Typically, though, we will depend on some kind of scoring. Inverted indices are our friends here: they let us rule out a large number of candidates before applying our scoring strategy to the remainder. For text, this will likely mean a full-featured search engine like Elasticsearch or Solr. For images, we will likely reach for an off-the-shelf embedding model and then use an approximate nearest neighbor (ANN) library like FAISS.
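As a rough sketch of the second case, the snippet below indexes precomputed embeddings with FAISS and queries for nearest neighbors. The dimensionality and the random stand-in vectors are assumptions; in a real system they would come from your embedding model.

```python
# A minimal sketch of nearest-neighbor retrieval with FAISS over precomputed
# embeddings (e.g. from an off-the-shelf image model).
import numpy as np
import faiss  # pip install faiss-cpu

d = 512                                                          # assumed embedding dimension
item_embeddings = np.random.rand(10_000, d).astype("float32")    # stand-in data

index = faiss.IndexFlatL2(d)          # exact search; swap in an ANN index at larger scale
index.add(item_embeddings)

query = np.random.rand(1, d).astype("float32")
distances, item_ids = index.search(query, 10)                    # 10 nearest neighbors
print(item_ids[0])
```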
The major cloud providers all offer managed search and vector retrieval services, and these are an excellent choice at an early stage: they let you focus on your scoring strategy and avoid spinning your wheels on system administration. At small scale, the operating cost of these services is negligible (or, in many cases, free).
Refining our content-based filtering approach
After we have our MVP, we can begin to refine our content-based filtering approach. If we have additional information about our items, such as annotations or metadata, we can start to incorporate this into our retrieval system. At the early stages, it probably makes sense to incorporate this information with predicates. Eventually, we might start to experiment with more sophisticated scoring strategies, such as ensemble models or neural networks, and include this information as engineered features.
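For example, metadata can be folded into retrieval as non-scoring filter predicates alongside the scored text match. The sketch below builds an Elasticsearch-style bool query; the field names (title, category, price) are hypothetical.

```python
# A sketch of folding item metadata into retrieval as filter predicates,
# expressed as an Elasticsearch bool query. Field names are assumptions.
query = {
    "bool": {
        # scored full-text match on the user's query string
        "must": [{"match": {"title": "trail running shoes"}}],
        # non-scoring predicates on metadata/annotations
        "filter": [
            {"term": {"category": "shoes"}},
            {"range": {"price": {"lte": 120}}},
        ],
    }
}
# e.g. sent via the elasticsearch-py client: es.search(index="products", query=query)
```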
We don’t know much about our users’ behavior yet, though, so we will want to introduce these changes in a systematic way. In particular, we will want to identify our most meaningful business metrics, then run experiments to optimize for them. There are many strategies for experimentation, the most straightforward of which is A/B testing. Startups should focus on these simple approaches, especially if they don’t yet have a lot of traffic.
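As a sketch of the simplest possible readout, a two-arm A/B test on a conversion-style metric can be checked with a two-proportion z-test; the counts below are invented, and statsmodels is assumed to be available.

```python
# A minimal sketch of reading out a two-arm A/B test on a conversion-style
# metric. The counts are made up for illustration.
from statsmodels.stats.proportion import proportions_ztest

conversions = [310, 355]          # arm A, arm B
sessions    = [10_000, 10_000]

z_stat, p_value = proportions_ztest(conversions, sessions)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the difference is unlikely to be noise; with low
# traffic, expect to wait a while before a test reaches significance.
```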
If we are building an internal or academic tool, such as a search engine, we will be happy with a really good content-based filtering model. If we are running a business, though, our goal is to launch a successful offering. And this means focusing on what customers actually want, rather than what we know about the products. Do they really want to look at something similar to what they’re looking at now? Maybe not! And so once we are collecting meaningful data, we want to introduce collaborative filtering as soon as possible.
Early collaborative filtering
Again, we want to start off simple with collaborative filtering. A great option is matrix factorization. There are four big advantages to this approach. First, it doesn’t require any feature engineering. Second, it actually gives us features, since the resulting embeddings can be clustered, analyzed, or used as inputs to more complex models. Third, it works better on modest datasets than more complex models do. Finally, it’s really easy to implement and train.
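To show just how simple it can be, here is a minimal sketch of matrix factorization with plain numpy and SGD over a toy implicit-feedback matrix; the dimensions, learning rate, and regularization are arbitrary placeholders.

```python
# A minimal sketch of matrix factorization with numpy and SGD.
# R is a toy user x item matrix; 1 = observed interaction, 0 = unknown.
import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(0, 2, size=(50, 200)).astype(float)   # toy interaction matrix
n_users, n_items = R.shape
k = 16                                                  # embedding dimension

U = 0.1 * rng.standard_normal((n_users, k))             # user embeddings
V = 0.1 * rng.standard_normal((n_items, k))             # item embeddings
lr, reg = 0.05, 0.01

for epoch in range(20):
    # Train on observed interactions only (a real implicit-feedback setup
    # would also sample negatives).
    for u, i in zip(*R.nonzero()):
        err = R[u, i] - U[u] @ V[i]
        u_old = U[u].copy()
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * u_old - reg * V[i])

scores = U @ V.T                                         # predicted affinity
top_items_for_user_0 = np.argsort(-scores[0])[:10]
print(top_items_for_user_0)
```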
We don’t want to replace our content-based filtering system wholesale, especially not at first. We are likely to have much richer data about the products than about user interactions, which are inherently sparse. Here’s where our search and recommendations systems will start to diverge. A tried-and-true approach is to keep search content-based while displaying collaborative-filtering recommendations on product and landing pages. These could be user-to-item or item-to-item (or an interleaving of both). Either can be efficiently served by loading the item embeddings into a vector index and then querying for nearest neighbors.
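Here is a sketch of that serving pattern with FAISS as the vector index. In practice item_vecs would be the item embeddings (V) learned by the factorization above; a random stand-in keeps the snippet self-contained.

```python
# A sketch of item-to-item recommendations served from a vector index.
import numpy as np
import faiss

item_vecs = np.random.rand(200, 16).astype("float32")   # stand-in for learned embeddings
faiss.normalize_L2(item_vecs)                            # cosine similarity via inner product

index = faiss.IndexFlatIP(item_vecs.shape[1])
index.add(item_vecs)

# "Customers who viewed item 42 also liked..."
_, neighbors = index.search(item_vecs[42:43], 11)
print(neighbors[0][1:])                                  # drop the query item itself
```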
Multi-stage collaborative filtering
Let’s be honest: most projects don’t reach the point where they can justify multi-stage ranking. But if we have gathered all the low-hanging fruit from our content-based search engine and our collaborative recommendations, we can start to look at combining them. Likewise, if we now have such a diverse and varied pool of candidates that linear recommendation engines simply can’t cut it, we may be forced to reach for expensive algorithms. Alternatively, we might be introducing “promoted results” (ads) into our search or recommendation systems. The list goes on.
At this stage, it’s hard to describe a general playbook; every situation will be different. But the basic idea is that you get a lot of candidates cheaply, you filter them in some way, and then you apply a more expensive scoring strategy. You might do this more than once; you might inject candidates from another source. Your candidates might come from multiple first-pass rankers. You have a bewildering number of options. And you’re going to drown if you don’t act systematically.
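To make the shape of that pipeline concrete, here is a schematic sketch; every function in it is a trivial stand-in for what a real system would use (ANN lookups and popularity lists for recall, business-rule filters, a learned heavy ranker).

```python
# A schematic sketch of a multi-stage pipeline. All three helpers are
# hypothetical stand-ins, kept trivial so the snippet runs on its own.
import random

CATALOG = list(range(1_000))

def cheap_candidates(user_id, n=300):
    # stand-in for ANN lookups, popularity lists, co-visitation, etc.
    random.seed(user_id)
    return random.sample(CATALOG, n)

def passes_filters(user_id, item_id):
    # stand-in for business rules: stock, already-purchased, region, etc.
    return item_id % 7 != 0

def heavy_score(user_id, item_id):
    # stand-in for an expensive learned ranker (GBDT, neural net, ...)
    return ((user_id * 31 + item_id) % 1000) / 1000.0

def recommend(user_id, k=20):
    candidates = cheap_candidates(user_id)                               # stage 1: cheap recall
    candidates = [i for i in candidates if passes_filters(user_id, i)]   # stage 2: filtering
    ranked = sorted(candidates, key=lambda i: heavy_score(user_id, i), reverse=True)
    return ranked[:k]                                                    # stage 3: expensive scoring

print(recommend(user_id=7))
```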
Continuous experimentation
So this is where continuous experimentation comes in. A/B tests are very powerful and can carry you a long way. But with each new option, your search space multiplies. By the time you’re using a multi-stage search or recommendation engine, you have an astronomical (and possibly infinite) number of experiments that you can run. Experiments are risky, and you want to make sure you’re getting the best possible results. So if you have enough traffic to support it, it’s time to consider strategies for continuous experimentation. Examples include:
- Multi-armed bandits for maximizing the number of times you use the better of two (or more) options. MABs try each option with a certain probability, shifting the distribution in response to observed outcomes. (A minimal sketch follows this list.)
- Contextual bandits for choosing among a fixed set of options when side information (the “context,” such as user or query features) is available. The learned policy gradually shifts toward whichever option performs best for each context, a little like gradient descent.
- Bayesian optimization for exploring arbitrarily complex search spaces. (This is an advanced option that can easily backfire if you don’t have the right staffing for it.)
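To ground the first of these, here is a minimal sketch of a Bernoulli multi-armed bandit using Thompson sampling to choose between two hypothetical rankers; the click-through rates are invented for the simulation.

```python
# A minimal sketch of a Bernoulli multi-armed bandit with Thompson sampling,
# choosing between two ranking strategies based on simulated click-through.
import random

true_ctr = {"ranker_a": 0.040, "ranker_b": 0.055}     # unknown in real life
alpha = {arm: 1 for arm in true_ctr}                  # Beta prior: successes + 1
beta  = {arm: 1 for arm in true_ctr}                  # Beta prior: failures  + 1

for _ in range(50_000):
    # Sample a plausible CTR for each arm and serve the best-looking one.
    arm = max(true_ctr, key=lambda a: random.betavariate(alpha[a], beta[a]))
    clicked = random.random() < true_ctr[arm]
    alpha[arm] += clicked
    beta[arm]  += not clicked

print({arm: alpha[arm] / (alpha[arm] + beta[arm]) for arm in true_ctr})
# Traffic concentrates on ranker_b as evidence accumulates.
```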