Base question

Consider an ML design interview question where they ask you to create a song recommendation system, as with Spotify. Neglect cold start.

Approaches

Two solutions that come to mind are matrix factorization and two-tower. In both cases, you create a vector index of song embeddings. To recommend songs for a user, you use the user’s embedding as a query to fetch the nearest neighbors from the song index.

Two-towers solution

With two-tower, you train separate embedding models for users and songs, and you align them using a contrastive loss function. The input to the user model is a set of engineered features describing everything we know about the user, and likewise for the song. The resulting embeddings are in the same latent space, because you have minimized the distance between matching pairs.

Matrix factorization

With matrix factorization, your conceptual starting point is a matrix whose rows are users and whose columns are songs; the entries represent the user’s interactions with that song. This implies that you have a numerical representation for any possible type of interaction with the song. You are trying to decompose this matrix into two matrices that project users and songs respectively into a shared latent space. Our framing for a direct solution is singular value decomposition.

In practice, since we have a large number of users and a large number of songs, so directly using SVD is intractable. Instead, we initialize our two projection matrices randomly and then learn them through stochastic gradient descent with triplet loss, using paired positive (did interact) and negative (did not interact) examples.

Having done this, we embed all the songs and all the users.

Comparing the approaches

Matrix factorization requires far less feature engineering, but is completely phenomenological. Additionally, it has a structural prior that latent information about the user has a linear relationship to their song preferences. Two-towers incorporates both collaborative and content filtering into a single step, but it requires much more upfront work. In practice, MF could be used to discover latent features about users and songs, which could then be used to guide feature engineering for two-towers.