ML-SDI ch. 9
The described task is to provide recommendations for a vacation rental website based on the user’s recent actions. The text refers to this as “similar listings,” but goes on to say that the goal is to make recommendations designed to maximize booking probability.
At any rate, they end up framing this “similar listings” task as a session-based recommendation task. This is a significant departure from the initial setup, but let’s suspend disbelief for the moment.
The approach they propose is to learn embeddings for each listing based on co-occurrence within users’ browsing histories. They then index these embeddings in a vector database and do approximate nearest neighbor (ANN) search followed by re-ranking.
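To make the serving side concrete, here’s a minimal sketch of that pipeline, assuming the listing embeddings already exist. The library choice (faiss), the sizes, and the `rerank_score` stub are all my assumptions; the book doesn’t specify any of them.

```python
import numpy as np
import faiss  # assumed ANN library; the book doesn't name one

d = 64  # embedding dimension (illustrative)
listing_vecs = np.random.rand(10_000, d).astype("float32")  # stand-in for learned listing embeddings
faiss.normalize_L2(listing_vecs)

index = faiss.IndexFlatIP(d)  # inner product on unit vectors = cosine similarity
index.add(listing_vecs)

def rerank_score(listing_id: int) -> float:
    # Placeholder for a downstream re-ranker (e.g. a model scoring booking probability).
    return float(listing_id % 7)

def recommend(query_vec: np.ndarray, k: int = 100) -> list[int]:
    """Retrieve k ANN candidates for a session/listing embedding, then re-rank them."""
    q = query_vec.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    _, candidate_ids = index.search(q, k)
    return sorted(candidate_ids[0].tolist(), key=rerank_score, reverse=True)
```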
They notice that there are constraints on the eligible candidates and so add additional loss terms. The overall process by which they train this thing is kind of hand-wavy, but it basically sounds like this: compute a distance between the central listing and a positive listing, and between the central listing and various negative listings chosen by different criteria; compute a pairwise loss for each pair; and sum over these losses to get the total loss for the example.
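For concreteness (my notation, not the book’s): if $v_c$ is the central listing’s embedding, $p$ the positive listing, and $N$ the union of the negative sets drawn under their various criteria, a loss with that shape looks like

$$
\mathcal{L} = \ell\big(\sigma(s(v_c, v_p)),\, 1\big) + \sum_{n \in N} \ell\big(\sigma(s(v_c, v_n)),\, 0\big),
$$

where $s(\cdot,\cdot)$ is the similarity (or negated distance) between two embeddings, $\sigma$ is the sigmoid, and $\ell$ is binary cross-entropy against the “these were/weren’t near each other” label.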
This isn’t a “similar listings” problem
Mere “listing similarity” refers to the characteristics of the listing itself, not user behavior with respect to it. However, in the feature engineering section, they state:
[T]he model only utilizes users’ browsing history during training. Other information is not used, such as listing price, user’s age, etc.
There are a few interesting ways to set this up as an ML problem, but none of them are “listing similarity.” Once again, this book does the undiscerning reader a disservice: if an interviewer asked you for a similarity search, and you gave them a recommendation engine, you would not get the job.
The approach itself is okay, but poorly explained
Again, the task they perform is not the task they tee up in the problem statement. But if we take the task performed as a given, the approach turns out to be reasonable, if a little old-school.
The way they are framing it is to take as input a vector representing the sum of all one-hot-encoded elements in a context window, then attempt to predict the central item. This is basically how the CBOW variant of word2vec works. Consequently, they use the same loss strategy as word2vec:
- Compute the distance between two items.
- Apply a sigmoid transformation to turn it into a “probability” that they are near one another.
- Use cross-entropy loss to compare this to a binary label of “nearness.”
As with word2vec, they employ negative sampling to produce a tractable number of training examples with reasonable class balance.
This is a nice setup (for the task they actually perform) because it naturally takes in a set of recent items and returns a listing embedding. But without establishing that they’re basically doing word2vec on something other than words, it all looks very obscure.
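Spelled out in code, it’s much less mysterious. Here’s a minimal sketch of that training step in PyTorch; the sizes are invented and the dot product stands in for whatever distance they actually use.

```python
import torch
import torch.nn.functional as F

# Sizes are invented; the book doesn't pin them down.
num_listings, dim = 50_000, 64
emb_in = torch.nn.Embedding(num_listings, dim)   # embeddings used for the context listings
emb_out = torch.nn.Embedding(num_listings, dim)  # embeddings used for the central ("output") listing

def cbow_style_loss(context_ids, center_id, negative_ids):
    """One training example: the listings around the center predict the central listing.

    context_ids:  LongTensor (num_context,) of listing ids in the window
    center_id:    LongTensor scalar, the listing to predict (nearness label = 1)
    negative_ids: LongTensor (num_neg,) of sampled listings (nearness label = 0)
    """
    # "Sum of all one-hot-encoded elements in a context window" is just a
    # sum/mean over the context listings' embeddings.
    context_vec = emb_in(context_ids).mean(dim=0)                  # (dim,)

    # Dot product stands in for their distance/similarity measure.
    pos_score = emb_out(center_id) @ context_vec                   # scalar
    neg_scores = emb_out(negative_ids) @ context_vec               # (num_neg,)

    # Sigmoid turns each score into a "probability of nearness"; cross-entropy
    # compares it to the binary nearness label.
    pos_loss = F.binary_cross_entropy_with_logits(pos_score, torch.tensor(1.0))
    neg_loss = F.binary_cross_entropy_with_logits(
        neg_scores, torch.zeros_like(neg_scores), reduction="sum"
    )
    return pos_loss + neg_loss

loss = cbow_style_loss(
    context_ids=torch.tensor([12, 845, 3031]),
    center_id=torch.tensor(7),
    negative_ids=torch.randint(0, num_listings, (5,)),
)
loss.backward()
```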
They’re throwing away data
That said, these clicks come in as a sequence, and summing up one-hot vectors throws that ordering away. And what do we want out the other end? Another set of listings. A sequence of items in, a set of items out: that’s exactly the shape of problem a sequence model like a transformer handles well, and it can go straight from the click sequence to scored listings without a separate retrieval stage.
Even though the transformer will be slower at inference time, the system will be faster overall because we don’t need to ship data from the embedding model to the ANN index to the re-ranker. Network hops are the biggest pain point in that pipeline, and this gets rid of all of them except the round trip with the user.
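Here’s a rough sketch of what that single model could look like: a small transformer over the ordered click sequence whose output head scores every listing directly. All of the names and sizes are mine; this is the alternative I’m suggesting, not anything from the book.

```python
import torch
import torch.nn as nn

class SessionTransformer(nn.Module):
    """One model from click sequence to listing scores, replacing model -> ANN -> re-ranker."""

    def __init__(self, num_listings: int = 50_000, dim: int = 64, max_len: int = 50):
        super().__init__()
        self.item_emb = nn.Embedding(num_listings, dim)
        self.pos_emb = nn.Embedding(max_len, dim)  # keeps the click order the bag-of-items setup throws away
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_listings)   # scores every listing in one shot

    def forward(self, click_ids: torch.Tensor) -> torch.Tensor:
        # click_ids: (batch, seq_len) listing ids in the order the user clicked them
        positions = torch.arange(click_ids.size(1), device=click_ids.device)
        x = self.item_emb(click_ids) + self.pos_emb(positions)
        h = self.encoder(x)                  # (batch, seq_len, dim)
        return self.head(h[:, -1])           # (batch, num_listings) scores for the next listing

model = SessionTransformer()
scores = model(torch.tensor([[12, 845, 3031]]))  # one session of three clicks
top_k = scores.topk(10).indices                  # the recommendations, with no ANN or re-rank hop
```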
But let’s say you really want this to be a ranking problem
Since we apparently get to take the instructions and chuck them out the window in favor of what the authors think is cool, let’s say this had to be a ranking problem. We still don’t need to ignore order.
Instead, we can use an RNN (or one of its variants) to model the sequence itself, then tack a d-dimensional embedding layer onto the end. We use a contrastive or triplet loss to pull the session embedding and the embeddings of positive listings closer together, and to push the embeddings of negative listings further away.
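A minimal sketch of that setup, with PyTorch’s built-in triplet loss standing in for whichever contrastive objective you’d actually pick (names and sizes are mine):

```python
import torch
import torch.nn as nn

num_listings, dim, hidden = 50_000, 64, 128

item_emb = nn.Embedding(num_listings, dim)
gru = nn.GRU(input_size=dim, hidden_size=hidden, batch_first=True)
proj = nn.Linear(hidden, dim)            # the d-dimensional embedding layer tacked onto the end
triplet = nn.TripletMarginLoss(margin=1.0)

def embed_session(click_ids: torch.Tensor) -> torch.Tensor:
    """click_ids: (batch, seq_len) ordered listing ids -> (batch, dim) session embedding."""
    _, h_n = gru(item_emb(click_ids))    # h_n: (1, batch, hidden), the final hidden state
    return proj(h_n.squeeze(0))

# One training step: the session embedding is the anchor; pull the positive listing's
# embedding toward it and push a sampled negative's embedding away.
session = embed_session(torch.tensor([[12, 845, 3031]]))
pos_vec = item_emb(torch.tensor([7]))      # e.g. the listing the user clicked or booked next
neg_vec = item_emb(torch.tensor([9_001]))  # a sampled negative
loss = triplet(session, pos_vec, neg_vec)
loss.backward()
```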
This gives us back our ranking problem, so we can (sort of) claim we’re still solving some kind of “similarity” problem. Unfortunately, we keep all the same sources of latency as before, and now we’ve also got an RNN that’s slow and hard to parallelize.
If latency proves to be an issue, we could trigger an async (near-real-time) inference task as soon as the user clicks on a new listing, so that by the time they get to the recommendations, the task has already completed.
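Sketched with asyncio, and with an in-memory dict and a sleep standing in for a real cache and a real inference service, the pattern is just “start the work on click, await it on page load”:

```python
import asyncio

pending: dict[str, asyncio.Task] = {}  # stand-in for a real cache keyed by user/session

async def run_model(user_id: str, listing_id: int) -> list[int]:
    await asyncio.sleep(0.2)                       # pretend this is the slow sequence model
    return [listing_id + i for i in range(1, 11)]  # dummy recommendations

def on_click(user_id: str, listing_id: int) -> None:
    # Fire-and-forget: start scoring as soon as the click event lands.
    pending[user_id] = asyncio.create_task(run_model(user_id, listing_id))

async def get_recommendations(user_id: str) -> list[int]:
    # By the time the user scrolls to the carousel, this is usually already done.
    task = pending.get(user_id)
    return await task if task else []

async def demo() -> None:
    on_click("u123", 845)
    await asyncio.sleep(0.5)  # the user reads the listing page in the meantime
    print(await get_recommendations("u123"))

asyncio.run(demo())
```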