Background
Comparison with matrix factorization (MF)
Although it sounds like “neural collaborative filtering” should refer to any technique involving a neural network, it actually refers specifically to a variation on the matrix factorization approach to CF. Recall that MF involved finding embedding matrices P (one row per user) and Q (one row per item) whose product approximates the interaction matrix: R ≈ PQᵀ.
When making recommendations for a known user, their latent representation (their row of P) is multiplied against the item embedding matrix, which produces a score for every item at once.
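As a concrete reminder, here is a minimal sketch of MF scoring in numpy. The dimensions, the matrix names P and Q, and the random initialization are purely illustrative.

```python
import numpy as np

# Illustrative sizes: 10,000 users, 50,000 items, 64 latent factors.
n_users, n_items, k = 10_000, 50_000, 64

# Stand-ins for learned embedding matrices (randomly initialized here).
P = np.random.randn(n_users, k)   # user embeddings, one row per user
Q = np.random.randn(n_items, k)   # item embeddings, one row per item

# Scoring every item for one known user is a single matrix-vector product.
user_id = 42
scores = P[user_id] @ Q.T          # shape: (n_items,)

# Optionally squash the scores into interaction probabilities.
probs = 1.0 / (1.0 + np.exp(-scores))
```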
Use case
NCF retains the structural prior that interaction probabilities can be expressed as a defined function of learned user and item representations, which themselves summarize previous interaction histories. However, it introduces the possibility that this function is nonlinear.
Recall, though, that while MF is trained on pairwise interaction probabilities between a particular user and a particular product, at inference time the expression above can produce an interaction probability for every single product using only two matrix multiplications.
NCF also produces pairwise interaction probabilities. But unlike MF, you can’t then turn around and produce probabilities for every product at once; each (user, item) pair requires its own forward pass through the network. These passes can be batched, of course, but it’s still far more computation than MF.
In practice, therefore, NCF is typically used as a second-pass ranking method after a more efficient method (like MF) has produced a shortlist of candidates.
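To make that two-pass pattern concrete, here is one way the pipeline could be wired together. The function name, the shortlist size, and the assumption that the NCF model accepts batches of user and item ids and returns one logit per pair (as in the model sketched below) are illustrative choices, not a prescribed interface.

```python
import numpy as np
import torch

def rerank_with_ncf(user_id, P, Q, ncf_model, shortlist_size=500):
    """First pass: MF scores every item cheaply. Second pass: NCF re-scores a shortlist."""
    # First pass: score all items from the factorized embeddings.
    mf_scores = P[user_id] @ Q.T                          # shape: (n_items,)
    candidates = np.argsort(-mf_scores)[:shortlist_size]  # top-N item ids

    # Second pass: score each shortlisted item with the (costlier) NCF model.
    user_ids = torch.full((len(candidates),), user_id, dtype=torch.long)
    item_ids = torch.as_tensor(candidates, dtype=torch.long)
    with torch.no_grad():
        logits = ncf_model(user_ids, item_ids)            # shape: (shortlist_size,)

    # Reorder the shortlist by NCF score, best first.
    order = torch.argsort(logits, descending=True).numpy()
    return candidates[order]
```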
Implementation pattern
Pointwise ranking (most common)
NCF is most commonly used as a pointwise ranking algorithm:
- The model takes some combination of one user representation and one item representation as input.
- It passes this input through some neural architecture.
- It returns a probability of the user interacting with this product.
The most straightforward implementation is as follows (a sketch in code appears after this list):
- Pass the sparse user and item vectors into learnable embedding matrices.
- Concatenate the resulting embeddings.
- Pass the concatenated embedding through a multi-layer perceptron.
- Terminate the network in a 1-dimensional probability task head.
- Convert the resulting logit to a probability using the sigmoid function.
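A minimal sketch of that architecture in PyTorch might look like the following; the embedding dimension, hidden-layer sizes, and class name are illustrative assumptions rather than canonical choices.

```python
import torch
import torch.nn as nn

class NCF(nn.Module):
    """Pointwise NCF: embed user and item, concatenate, score with an MLP."""

    def __init__(self, n_users, n_items, emb_dim=32, hidden_dims=(64, 32)):
        super().__init__()
        # Learnable embedding tables; equivalent to multiplying the sparse
        # one-hot user/item vectors by embedding matrices.
        self.user_emb = nn.Embedding(n_users, emb_dim)
        self.item_emb = nn.Embedding(n_items, emb_dim)

        layers, in_dim = [], 2 * emb_dim
        for h in hidden_dims:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers.append(nn.Linear(in_dim, 1))  # 1-dimensional task head (outputs a logit)
        self.mlp = nn.Sequential(*layers)

    def forward(self, user_ids, item_ids):
        x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
        return self.mlp(x).squeeze(-1)  # logits; apply a sigmoid for probabilities
```

At inference time, `torch.sigmoid(model(user_ids, item_ids))` converts the logits into interaction probabilities.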
In this paradigm, the model is typically trained with negative sampling: observed interactions serve as positive examples, and randomly sampled (user, item) pairs serve as negatives. The loss being minimized is binary cross-entropy.
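A sketch of what one training step could look like in this regime, assuming the simplest possible negative-sampling strategy: items drawn uniformly at random, which may occasionally collide with true positives. The function name and the 4:1 negative ratio are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, user_ids, pos_item_ids, n_items, neg_ratio=4):
    """One gradient step: observed interactions are positives; random items are negatives."""
    # Sample `neg_ratio` uniformly random items per positive as assumed negatives.
    neg_user_ids = user_ids.repeat_interleave(neg_ratio)
    neg_item_ids = torch.randint(0, n_items, (len(neg_user_ids),))

    users = torch.cat([user_ids, neg_user_ids])
    items = torch.cat([pos_item_ids, neg_item_ids])
    labels = torch.cat([torch.ones(len(user_ids)), torch.zeros(len(neg_user_ids))])

    # Binary cross-entropy on the logits.
    logits = model(users, items)
    loss = F.binary_cross_entropy_with_logits(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```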
Pairwise ranking
The sample space of an NCF training dataset is the Cartesian product of all users and all items, and the observed interactions cover only a tiny fraction of it; in other words, it is very sparse. That sparsity can create issues with convergence and/or accuracy, particularly if users and their interactions are concentrated in “islands” of this space (as one might expect for, e.g., general-audience e-commerce sites).
In this situation, it can be helpful to instead choose a ranking strategy that focuses on the relative ranking of relevant and irrelevant products, i.e., pairwise ranking. This is accomplished by choosing two items for each training example: one positive and one negative. We then use a contrastive-style loss function that enforces a separation between the model’s outputs for the positive and negative examples. The two common choices (sketched in code after this list) are:
- Triplet loss. Arguably the most intuitive, triplet loss interprets the model’s output as a distance between the user and the item, and trains the model to enforce some margin between the positive and negative examples.
- Hinge loss. Very similar to triplet loss, except it treats the model’s output as a probability rather than a distance. I find this a little harder to think about, but it probably shouldn’t make any practical difference compared with triplet loss.
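A sketch of what a pairwise training step with a margin-based loss could look like; the function name and margin value are illustrative. Reading the model’s outputs as distances (triplet-style) or as relevance scores (hinge-style) mainly changes the sign convention inside the loss.

```python
import torch
import torch.nn.functional as F

def pairwise_step(model, optimizer, user_ids, pos_item_ids, neg_item_ids, margin=1.0):
    """One pairwise step: push the positive item's score above the negative's by a margin."""
    pos_scores = model(user_ids, pos_item_ids)   # higher = more relevant
    neg_scores = model(user_ids, neg_item_ids)

    # Hinge-style margin ranking loss: max(0, margin - (pos - neg)).
    # A triplet-style loss is the same idea with the outputs read as distances:
    # max(0, margin + d_pos - d_neg).
    target = torch.ones_like(pos_scores)
    loss = F.margin_ranking_loss(pos_scores, neg_scores, target, margin=margin)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```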
The original “contrastive loss” is designed primarily for learning embeddings. It could still be used here, and the output would still be a measure of relevance, but doing so would add the complexity of a pairwise approach without addressing the sparsity that motivated it.
At inference time, we’d pass in each candidate item separately, just as with pointwise ranking; the resulting distances or probabilities would represent only a relative ranking of the candidates.