ML-SDI ch. 2
The authors posit a scenario where the reader is asked to create a visual search system similar to that of Pinterest. It is framed as a pure image similarity ranking task:
- The system is unimodal.
- There is no personalization.
- The model does not exploit metadata.
- There is no content moderation.
Their high-level approach is to compute image embeddings using a CNN and then identify similar images using a vector database for ANN. They will train the CNN using an unspecified contrastive loss (probably triplet). To construct the training data, they propose to use data augmentation to create an image that is by definition similar to the source image, and then choose other images at random to serve as negatives.
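As I read it, their training-data recipe is roughly the following. This is my own minimal sketch, with made-up augmentations and a plain random-negative policy, since the book doesn't pin those down:

```python
import random
import torchvision.transforms as T
from PIL import Image

# Augmentations that produce a positive that is "similar by construction".
# The specific transforms here are my guess; the book leaves them unspecified.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.6, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.3, contrast=0.3),
])

def make_triplet(anchor_path, all_paths):
    """Anchor = original image, positive = augmented copy of the anchor,
    negative = a random other image from the catalog."""
    anchor = Image.open(anchor_path).convert("RGB")
    positive = augment(anchor)
    negative_path = random.choice([p for p in all_paths if p != anchor_path])
    negative = Image.open(negative_path).convert("RGB")
    return anchor, positive, negative
```

Feed those triplets through the CNN and something like torch.nn.TripletMarginLoss on the resulting embeddings, and you have roughly the training loop they describe.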
My thoughts
Summary
I can see why they would take this approach as an educational aid, but they do the reader a disservice by suggesting that this would be a good answer in a system design interview. What they describe is a large lift, and the embedding model will generalize much less than an off-the-shelf pre-trained model. Interviewers are hiring you to help them make money.
Specifics
First of all, it seems like a stretch to say that this task bears much resemblance to Pinterest's search engine. Pinterest surfaces similar images based on signals of personal preference, especially with respect to style and palette. Pinterest's visual search looks more like a collaborative filtering task. You'll probably want a two-tower model so that you can encode targeted features, rather than leaving it to gradient descent to rediscover them from scratch; there's a lot to know about taste, style, and demographics, after all. And then there's re-ranking for diversity, so that the user doesn't get bored, stays aware of trends, and so on.
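To make the two-tower idea concrete, here's a minimal sketch in PyTorch. The feature choices and dimensions are invented for illustration; I have no idea what Pinterest actually uses:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    """Affinity = dot product of a user embedding and an item embedding.
    Each tower can be fed deliberately chosen features (taste, style, palette,
    demographics) rather than hoping gradient descent rediscovers them."""
    def __init__(self, user_feat_dim=64, item_feat_dim=512, embed_dim=128):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_feat_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))
        self.item_tower = nn.Sequential(
            nn.Linear(item_feat_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))

    def forward(self, user_features, item_features):
        u = F.normalize(self.user_tower(user_features), dim=-1)
        v = F.normalize(self.item_tower(item_features), dim=-1)
        return (u * v).sum(dim=-1)  # cosine-style affinity score
```

The nice property is that the item tower can be precomputed offline and dropped into the ANN index, which is what makes the architecture practical at retrieval time.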
But let’s suspend disbelief and take the task at face value. We have to find similar images based on nothing but pixels. Doing ANN over embeddings does seem like the obvious answer; no objections there. The part I don’t love is the way the training examples are constructed. I expect rather poor generalization if we use nothing but data augmentation. It’s going to be hard for the model to discover anything abstract, so the results are going to resemble the query only geometrically. Search an image of a striped rug, and you’re just as likely to get back a radiator.
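For what it's worth, the retrieval side really is only a few lines. A minimal sketch with FAISS, where the index type and dimensions are my own arbitrary choices:

```python
import numpy as np
import faiss

d = 512  # embedding dimension, whatever the encoder happens to output
catalog = np.random.rand(10_000, d).astype("float32")  # stand-in for real embeddings
faiss.normalize_L2(catalog)  # normalize so inner product == cosine similarity

index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)  # 32 = HNSW graph degree
index.add(catalog)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, neighbor_ids = index.search(query, 10)  # top-10 most similar images
```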
You’re going to get vastly better results out of the box by using a pre-trained vision embedding model, even if it isn’t fine-tuned for similarity ranking. ViT comes to mind, though it will be considerably more expensive than a CNN; you might get decent results just using a pre-trained ResNet as a feature extractor, and I suspect there are other nice pre-trained CNNs as well. In a business setting, you put a re-ranker on top of the retrieved candidates, and you’ve probably got something good enough for an MVP.
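Getting embeddings out of an off-the-shelf backbone is also only a few lines. Here's a sketch using torchvision's pre-trained ResNet-50 with the classification head lopped off; any reasonably modern pre-trained encoder would slot in the same way:

```python
import torch
import torchvision.models as models
from PIL import Image

# Pre-trained ResNet-50 minus the final classification layer,
# so the output is the 2048-d pooled feature vector.
weights = models.ResNet50_Weights.DEFAULT
backbone = torch.nn.Sequential(*list(models.resnet50(weights=weights).children())[:-1])
backbone.eval()

preprocess = weights.transforms()  # the normalization the weights were trained with

@torch.no_grad()
def embed(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return backbone(img).flatten(1)  # shape (1, 2048), ready for the ANN index
```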
Now let’s say we want to improve on this. The instructions said that the model should use only pixel data, but they didn’t impose restrictions on how we build our training data. On the contrary, the authors explicitly permit the use of user interaction data to construct the training set, so we should assume all the other sources of data are in play for training-set construction as well. If this company has any sort of tagging, you could probably make some cheap progress by using items with shared tags as positive examples and items without shared tags as negatives, and then fine-tuning your pre-trained model on those pairs.
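Pair mining from tags can be as simple as the following. The item_tags mapping here is hypothetical, just to show the shape of the idea:

```python
import random

# Hypothetical catalog metadata: item id -> set of tags.
item_tags = {
    "img_001": {"rug", "striped", "livingroom"},
    "img_002": {"rug", "persian"},
    "img_003": {"radiator", "steel"},
    "img_004": {"rug", "striped"},
}

def sample_pair(anchor_id):
    """Positive = shares at least one tag with the anchor; negative = shares none."""
    anchor = item_tags[anchor_id]
    positives = [i for i, tags in item_tags.items() if i != anchor_id and tags & anchor]
    negatives = [i for i, tags in item_tags.items() if i != anchor_id and not tags & anchor]
    return random.choice(positives), random.choice(negatives)

print(sample_pair("img_001"))  # e.g. ('img_004', 'img_003')
```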
nDCG is a reasonable offline metric to start with, but since the labels come from our own augmentations, it mostly tells us how well the model learned those augmentations; it’s somewhat circular. It’s not going to tell us much about what users will think of our “relevance” scores. Until we get feedback from users in the form of clicks on our search results, we won’t have much to go on; online metrics are what we really need to lean on.
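For reference, nDCG itself is cheap to compute; a minimal sketch, with the relevance grades coming from whatever judgment source you trust:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a list of relevance grades in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances, k=10):
    """DCG of the system's ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

print(ndcg([3, 2, 0, 1], k=3))  # grades in the order the system ranked the results
```

The circularity problem is in where those grades come from, not in the formula.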