One long-standing strategy for collaborative filtering is to create a sparse matrix $R \in \mathbb{R}^{m \times n}$, where $m$ is the number of users and $n$ is the number of items. The entries in $R$ represent some measure of interaction (views, explicit ratings, etc.). One then attempts to decompose $R$ into matrices $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$, where $k$ is the number of latent factors, such that

$$R \approx U V^\top.$$

Such an approach encodes an assumption that interaction patterns can be adequately expressed as a linear combination of latent factors.

Finding the factor matrices

Numerical approximation approach

For relatively small $m$ and $n$, it may be efficient to perform a truncated singular value decomposition such that

$$R \approx U \Sigma V^\top,$$

where $\Sigma$ is a diagonal matrix of singular values that has been truncated to the $k$ largest.
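
As a rough sketch, assuming the interactions are held in a SciPy sparse matrix (the toy data below is purely illustrative), a truncated SVD can be computed with scipy.sparse.linalg.svds:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy interaction matrix: 4 users x 5 items (illustrative values only).
R = csr_matrix(np.array([
    [1.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 0.0],
]))

k = 2                              # number of latent factors to keep
U, sigma, Vt = svds(R, k=k)        # truncated SVD: R ~ U @ diag(sigma) @ Vt
R_hat = U @ np.diag(sigma) @ Vt    # low-rank reconstruction of R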

Optimization approach

For larger $m$ and/or $n$, it becomes preferable to learn $U$ and $V$ using stochastic gradient descent. Note that this is not the same as training a model that takes user interaction data as an input and returns item scores as an output, known as neural collaborative filtering. Rather, this involves training two embedding matrices, then using the matrices directly.

In this case, we start with interaction data for $m$ users. At training time, for user $i$ and item $j$, we obtain the user embedding $u_i \in \mathbb{R}^k$ and the product embedding $v_j \in \mathbb{R}^k$. The probability of interaction is the dot product of the two:

$$\hat{p}_{ij} = u_i \cdot v_j.$$

In PyTorch, this would typically be done using nn.Embedding layers:

import torch
from torch import nn

class MatrixFactorization(nn.Module):
    def __init__(self, num_users, num_items, num_factors):
        super().__init__()
        # One learnable k-dimensional embedding per user and per item.
        self.user_factors = nn.Embedding(num_users, num_factors)
        self.item_factors = nn.Embedding(num_items, num_factors)

    def forward(self, user, item):
        # Look up embeddings for a batch of (user, item) index pairs
        # and return their dot products as interaction scores.
        user_embedding = self.user_factors(user)
        item_embedding = self.item_factors(item)
        return (user_embedding * item_embedding).sum(1)
 

The training data would consist of an equal number of sampled positive and negative interaction examples; i.e., cases where the user did or did not interact with the product.
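
A minimal training loop might look like the following: a sketch, assuming the MatrixFactorization class above, where user_ids, item_ids, and labels are hypothetical tensors holding the sampled positive and negative examples, and the dot product is treated as a logit for binary cross-entropy.

model = MatrixFactorization(num_users=1000, num_items=500, num_factors=32)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()  # interprets the dot product as a logit

for epoch in range(10):
    # user_ids, item_ids: LongTensors of indices into the embedding tables;
    # labels: 1.0 for observed interactions, 0.0 for sampled negatives.
    logits = model(user_ids, item_ids)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()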

Making recommendations for known users

The forward method gives you a pointwise interaction. If we now wish to recommend products for a known user $i$, we can project the user’s latent representation $u_i$ into the product space to obtain a vector of product probabilities:

$$\hat{p}_i = V u_i.$$

The resulting product probabilities can be ranked to produce a candidate set for recommendation.
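
With the trained model above, this amounts to multiplying the item embedding matrix by the user’s embedding. A sketch, where user_id is a hypothetical index for a known user:

with torch.no_grad():
    u = model.user_factors(torch.tensor(user_id))  # user embedding, shape (k,)
    scores = model.item_factors.weight @ u         # V u, one score per item
    top_items = torch.topk(scores, 10).indices     # top-10 candidate set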

Cold-start recommendation

Often, we must make recommendations for unseen users. New users have no latent representation, so instead we treat the user as a vector of interactions. In this “warm-start” case (a new user for whom some interaction data exists), only the product matrix $V$ is typically used to make recommendations.

Let’s say you have a new user for whom you have some interaction data. The user’s interaction data is represented as an $n$-dimensional vector $r$, where again $n$ is the number of items. We first project the user’s interaction data into the latent space using the product matrix $V$:

$$\hat{u} = V^\top r.$$

Having done so, we can now project this embedding back into the product space, resulting in recommendations for each product:

$$\hat{p} = V \hat{u}.$$

So given a vector of user interactions $r$ for a new user, the final calculation is

$$\hat{p} = V V^\top r.$$
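
In code, reusing the trained item embedding matrix as $V$; a sketch where r is a hypothetical length-$n$ float tensor of the new user’s interactions:

with torch.no_grad():
    V = model.item_factors.weight  # item embedding matrix, shape (num_items, k)
    u_new = V.T @ r                # project interactions into the latent space
    scores = V @ u_new             # V V^T r: one score per item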

In a completely cold-start scenario, where no user interaction data exists, one can average over the set of known user embeddings to obtain a mean embedding $\bar{u}$ and use this vector as if it were a known user’s embedding.
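
A corresponding sketch, averaging the trained user embeddings to stand in for the unknown user:

with torch.no_grad():
    u_mean = model.user_factors.weight.mean(dim=0)  # average known-user embedding
    scores = model.item_factors.weight @ u_mean     # score every item for the average user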