ML-SDI ch. 4

The authors set up a scenario in which you must return a list of videos, ranked by relevance to a text query. There is no personalization, and no other modalities are involved. The videos have text metadata. The authors appear to assume that issues like trends and seasonality don’t matter. (These could still be accounted for even in the absence of per-user personalization.)

Their approach appears to be to “fuse” the results of two separate retrieval paths:

  1. They will align a video embedding model and a text embedding model using contrastive learning, then build an approximate nearest neighbor (ANN) index over the video embeddings. At query time, they will embed the query and find its nearest neighbors (see the sketch after this list).

  2. They will use old-school NLP techniques (bag-of-words or TF-IDF) to produce a decontextualized vector representing something. As far as I can tell, what gets vectorized is never specified, so let’s assume it’s the video’s text description. They then load this into Elasticsearch to do something with it; let’s assume it’s to match the descriptions against the search query.
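
To make the embedding arm concrete, here is roughly what I think item 1 describes. The aligned encoders are stand-ins (I just fake their outputs with random vectors), and FAISS is my choice of library, not necessarily the book’s:

```python
import numpy as np
import faiss

d, n_videos, k = 256, 10_000, 10

# Stand-in for "run every video through the aligned video encoder".
video_embs = np.random.rand(n_videos, d).astype("float32")
faiss.normalize_L2(video_embs)                  # so inner product equals cosine similarity

index = faiss.IndexFlatIP(d)                    # exact search; use e.g. IndexHNSWFlat for true ANN at scale
index.add(video_embs)

# Stand-in for "run the query through the aligned text encoder".
query_emb = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query_emb)

scores, video_ids = index.search(query_emb, k)  # top-k candidate videos for the query
```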

I assume that “fusion” is some unstated form of ensembling; a sketch of one common option, reciprocal rank fusion, follows below. After we have chosen candidates via this “fusion” strategy, we re-rank according to some business rules, then evaluate using mean reciprocal rank (MRR).
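
Since the fusion step is unspecified, here is my guess at what such an ensemble might look like in practice, not anything the book says: reciprocal rank fusion (RRF) merges the two ranked lists using only ranks, so the ANN scores and the Elasticsearch scores never have to be calibrated against each other.

```python
# RRF: each list contributes 1 / (k + rank) for every document it returns.
def reciprocal_rank_fusion(ranked_lists, k=60):
    """ranked_lists: iterable of lists of doc ids, best first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

ann_arm = ["v12", "v7", "v33", "v2"]    # hypothetical ids from the embedding arm
text_arm = ["v7", "v90", "v12", "v5"]   # hypothetical ids from the Elasticsearch arm
fused = reciprocal_rank_fusion([ann_arm, text_arm])
```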

My take

I don’t really know where to start. Obviously, doing ANN and full-text search and then “fusing” them involves handoffs across many components, which raises concerns about tail latency and the multiplicative risk of network faults. Then there’s the fact that “fusion” isn’t defined here at all, nor is what they’re doing with Elasticsearch. And then there’s the fact that techniques like TF-IDF leave all the context on the cutting room floor (toy example below), and that full-text search has largely moved to the comparison of embeddings anyway.
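
Here’s the “cutting room floor” point as a toy example: a bag-of-words/TF-IDF representation throws away word order entirely, so two strings with opposite meanings get identical vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["dog bites man", "man bites dog"]
vecs = TfidfVectorizer().fit_transform(docs).toarray()
print(np.array_equal(vecs[0], vecs[1]))   # True: same vector, opposite meaning
```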

No, the biggest thing is that full-text search has fucking nothing to do with this. I cannot think of a single thing the Elasticsearch arm of their ensemble adds to the mix. And if this really were the only way they could think of to make use of the video description, I don’t know which part of this sad mess would be the most concerning.

Conceptually, this is a relatively straightforward problem with a relatively straightforward solution:

  1. Use a multi-modal LLM that supports both text and video, like OpenFlamingo, to embed the videos and their captions together.
  2. Use a transformer that produces text embeddings, such as Sentence-BERT (SBERT), to embed the search queries.
  3. Use contrastive learning to align these embeddings. You can bolt an equal-dimension projection layer onto each frozen backbone, or fine-tune end to end, according to your budget (see the sketch after this list).
  4. Proceed as in the embedding arm of their solution.
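
A minimal sketch of step 3, assuming frozen encoders with small trainable projection heads and a CLIP-style symmetric InfoNCE loss over in-batch negatives. The dimensions are made up, and the frozen-encoder outputs are faked with random tensors:

```python
import torch
import torch.nn.functional as F

video_dim, text_dim, shared_dim, temperature = 1024, 768, 256, 0.07

video_proj = torch.nn.Linear(video_dim, shared_dim)   # trainable head on the frozen video encoder
text_proj = torch.nn.Linear(text_dim, shared_dim)     # trainable head on the frozen text encoder

def contrastive_loss(video_feats, text_feats):
    """video_feats/text_feats: frozen-encoder outputs for N matched (video, caption/query) pairs."""
    v = F.normalize(video_proj(video_feats), dim=-1)
    t = F.normalize(text_proj(text_feats), dim=-1)
    logits = v @ t.T / temperature                    # N x N similarity matrix
    targets = torch.arange(len(logits))               # the diagonal entries are the true pairs
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Fake a batch of frozen-encoder outputs just to show the call.
loss = contrastive_loss(torch.randn(32, video_dim), torch.randn(32, text_dim))
loss.backward()   # only the two projection heads receive gradients in this sketch
```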

If someone suggested the book’s design during an interview, I’m pretty sure they would not get a job. But hey, evaluation using MRR seems fine!
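
And MRR is easy enough to state precisely: for each query, take the reciprocal of the rank of the first relevant result (zero if none shows up), then average over queries. A quick sketch with hypothetical ids:

```python
def mean_reciprocal_rank(ranked_results, relevant_sets):
    total = 0.0
    for ranking, relevant in zip(ranked_results, relevant_sets):
        rr = 0.0
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_results)

# Three hypothetical queries: first relevant hit at rank 2, rank 1, and never.
print(mean_reciprocal_rank(
    [["v1", "v2"], ["v9", "v3"], ["v4", "v8"]],
    [{"v2"}, {"v9"}, {"v7"}],
))  # (1/2 + 1/1 + 0) / 3 = 0.5
```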