Update 2024-09-14

No less than Yann LeCun pours cold water on this idea: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6922344/

Model concepts for Carbon Cloud

David Bruce Borenstein, PhD September 7, 2024

Consider a corpus of about 50,000 full-text documents that must be classified into a hierarchical taxonomy of 2,000 categories. Currently, the documents are embedded using the OpenAI embeddings API. My understanding is that these embeddings are then classified by a shallow fully-connected neural network that terminates in a 1x2000 softmax layer. This approach achieves a validation accuracy of 83%.

The taxonomy itself represents a strong prior belief, so incorporating it into the model could potentially improve accuracy. Along these lines, it may be beneficial to use an evaluation metric that is sensitive to proximity in the hierarchy.

To this end, I propose the use of a sequential softmax classifier that minimizes hierarchical cross-entropy loss. Both the new model and the baseline model should be evaluated on this same loss function as well as on the existing class-wise accuracy.

At the end of this proposal, I briefly discuss class embeddings, a promising alternative that occurred to me after I finished writing this. It may actually be even better.

Sequential softmax classifier

Neural network classifiers typically terminate in a single fully connected (FC) layer whose outputs correspond to raw log-odds (logits) for each class. The softmax function then converts these logits into probabilities that sum to 1. The FC layer coupled with the softmax transformation is often called a softmax layer.

In a hierarchical taxonomy, the probabilities of the classes beneath each node likewise sum to 1, conditional on that node. If the taxonomy is relatively flat (many children per node), then it may be beneficial to employ numerous softmax layers. In this scheme, the layer evaluated next is conditional on the decision from the preceding layer. Such a scheme encodes the same assumptions as the taxonomy itself.

The taxonomy under consideration has cases where examples are classified to interior nodes. For this reason, later layers must include a class for each ancestor node, so that the classifier can stop at an interior node rather than being forced to descend to a leaf.
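A minimal sketch of this scheme, assuming a PyTorch implementation and a two-level taxonomy; the branch names, class counts, and layer sizes below are invented for illustration, and the 1536-dimensional input matches a typical OpenAI embedding size.

```python
import torch
import torch.nn as nn

class SequentialSoftmax(nn.Module):
    """Two-level sketch: a top-level head chooses a branch, and each branch
    has its own head over its children plus a "stop at this interior node" class."""

    def __init__(self, embed_dim: int, children_per_branch: dict):
        super().__init__()
        self.branches = list(children_per_branch)
        self.top = nn.Linear(embed_dim, len(self.branches))
        self.branch_heads = nn.ModuleDict({
            branch: nn.Linear(embed_dim, n_children + 1)   # +1 for the interior node itself
            for branch, n_children in children_per_branch.items()
        })

    def forward(self, x):
        log_p_branch = torch.log_softmax(self.top(x), dim=-1)
        # Joint log-probability of each (branch, child-or-stop) outcome,
        # conditioning each branch head on the top-level decision.
        return {
            branch: log_p_branch[:, i : i + 1]
                    + torch.log_softmax(self.branch_heads[branch](x), dim=-1)
            for i, branch in enumerate(self.branches)
        }

# Toy usage: 1536-dim document embeddings, three invented branches.
model = SequentialSoftmax(1536, {"energy": 12, "agriculture": 30, "transport": 8})
outputs = model(torch.randn(4, 1536))
```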

If the tree is relatively deep (few children per node), then lower layers are unlikely to have enough training data for a fully realized sequential softmax scheme. See “coping with data sparsity” below.

Hierarchical cross-entropy (HCE) loss

For multi-class classification, cross-entropy loss is defined in terms of the true label and the predicted probabilities as

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{c=1}^{C} y_c \log \hat{p}_c$$

where $y_c$ is 1 for the true class and 0 otherwise, and $\hat{p}_c$ is the predicted probability of class $c$.

Hierarchical cross-entropy loss introduces an additional penalty for the distance between the true label and the highest-predicted label. Hence the hierarchical cross-entropy loss can be defined as

$$\mathcal{L}_{\mathrm{HCE}} = \mathcal{L}_{\mathrm{CE}} + \lambda \, d(y, \hat{y})$$

where $\hat{y} = \arg\max_c \hat{p}_c$ is the highest-predicted label, $d$ is a distance metric over the taxonomy (discussed next), and $\lambda$ is a weight controlling the strength of the distance penalty.
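A sketch of this loss, again assuming PyTorch; the weight `lam` (the λ above) and the precomputed `dist_matrix` are placeholders. Because the argmax in the definition above is not differentiable, the sketch penalizes the expected distance under the predicted distribution, a common trainable surrogate.

```python
import torch
import torch.nn.functional as F

def hce_loss(logits, targets, dist_matrix, lam=0.5):
    """Cross-entropy plus a lambda-weighted taxonomic distance penalty.

    dist_matrix[i, j] holds the precomputed taxonomy distance between classes
    i and j (one candidate metric is sketched in the next subsection)."""
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=-1)                      # (batch, n_classes)
    expected_dist = (probs * dist_matrix[targets]).sum(dim=-1).mean()
    return ce + lam * expected_dist

# Toy usage: 5 classes and a random symmetric distance matrix.
dist = torch.rand(5, 5)
dist = (dist + dist.T) / 2
dist.fill_diagonal_(0)
loss = hce_loss(torch.randn(8, 5), torch.randint(0, 5, (8,)), dist)
```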

Distance metric for HCE loss

Given the discrete nature of taxonomical distance, we should prefer a distance metric based on an absolute difference. The Manhattan distance is one obvious choice, and is appropriate if siblings are ordered in the taxonomy. If siblings are unordered, we should prefer a metric based on path length. Perhaps the most promising metric is the number of steps separating the true and predicted labels from their most recent common ancestor.
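One possible implementation of that last metric, using networkx (my choice, not part of the original proposal). The edge list and class names are invented; here the distance sums the steps from each of the two labels up to their most recent common ancestor.

```python
import networkx as nx

def taxonomy_distances(edges, root):
    """Pairwise distance = steps from each node up to the most recent common
    ancestor, summed over both nodes (an unordered, path-based metric)."""
    tree = nx.DiGraph(edges)                      # parent -> child edges
    depth = nx.shortest_path_length(tree, root)
    lineage = {n: set(nx.ancestors(tree, n)) | {n} for n in tree}
    dist = {}
    for a in tree:
        for b in tree:
            lca_depth = max(depth[x] for x in lineage[a] & lineage[b])
            dist[a, b] = (depth[a] - lca_depth) + (depth[b] - lca_depth)
    return dist

# Toy check: two leaves under the same parent are 2 apart;
# a leaf and its own parent are 1 apart.
edges = [("root", "energy"), ("root", "agriculture"),
         ("energy", "solar"), ("energy", "wind")]
d = taxonomy_distances(edges, "root")
assert d["solar", "wind"] == 2 and d["solar", "energy"] == 1
```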

Coping with data sparsity

Textual data is extremely sparse in the space of all possible feature combinations. Embeddings improve this, but remain relatively sparse. Additionally, successive layers of the sequential softmax network have exponentially fewer training examples. Hence at more than 2-3 levels, it seems unlikely that the classifier will have sufficient training data. Even at shallower depth, additional/alternative measures may be required.

HCE loss without sequential softmax

While sequential softmax encodes a structural prior about the taxonomy, the HCE loss still encodes a quantitative prior. Hence using the HCE loss with the existing model is likely to improve performance, even in the absence of any other change. Note, however, that depending on the chosen value of the distance weight parameter $\lambda$, accuracy per se may decrease.

Weight sharing

If, at a given level of the taxonomy, certain categories are similar, then it may be helpful to encode hidden layers that are shared between these similar classes. These hidden layers could potentially extract features that are relevant to all included classes, but not to other classes. Obviously, these hidden layers would still need to feed into distinct softmax layers.
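A weight-sharing sketch under the same PyTorch assumption; branch names and layer sizes are placeholders. The key point is that `similar_block` is a single module instance reused by the two similar branches, so its weights are shared, while each branch keeps its own output head.

```python
import torch
import torch.nn as nn

shared_trunk = nn.Sequential(nn.Linear(1536, 512), nn.ReLU())   # shared by all branches
similar_block = nn.Sequential(nn.Linear(512, 128), nn.ReLU())   # shared only by similar branches

heads = nn.ModuleDict({
    "energy":    nn.Sequential(similar_block, nn.Linear(128, 12)),
    "transport": nn.Sequential(similar_block, nn.Linear(128, 8)),   # same instance -> shared weights
    "finance":   nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 20)),
})

x = torch.randn(4, 1536)
features = shared_trunk(x)
energy_logits = heads["energy"](features)   # distinct softmax head per branch
```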

Sequential fine-tuning with class-dependent resampling

Depending on the distribution of data within the taxonomy, it may be helpful to show some training examples from rare classes more than once in each epoch, optionally with transformations (data augmentation) to reduce overfitting. This would be particularly beneficial if an entire branch of the taxonomy is rare, but elements within that branch are relatively balanced.
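One way to implement this resampling, assuming a PyTorch data pipeline: inverse-frequency weights fed to `WeightedRandomSampler`, so rare classes are drawn more often within an epoch. The tensors below are random stand-ins for the real embeddings and labels.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

embeddings = torch.randn(50_000, 1536)            # placeholder document embeddings
labels = torch.randint(0, 2_000, (50_000,))       # placeholder class labels

# Inverse-frequency weight per example: rare classes are oversampled.
class_counts = np.bincount(labels.numpy(), minlength=2_000)
weights = 1.0 / np.maximum(class_counts[labels.numpy()], 1)
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

loader = DataLoader(TensorDataset(embeddings, labels), batch_size=256, sampler=sampler)
```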

To reduce bias in the overall model, one could consider a sequential fine-tuning approach. In this scheme, the entire model is first trained on the full dataset. Then all but the last shared layer are frozen, and the path-specific layers are trained further on examples specific to their class. The obvious downside is that resampling still distorts the class ratios, and sequential fine-tuning only partially compensates for this.
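A sketch of the freezing step, again assuming PyTorch; the layer sizes and the split between "shared trunk" and "path-specific head" are placeholders.

```python
import torch
import torch.nn as nn

trunk = nn.Sequential(nn.Linear(1536, 512), nn.ReLU(),    # shared layers
                      nn.Linear(512, 128), nn.ReLU())
path_head = nn.Linear(128, 31)                             # e.g. 30 children + the interior node

# Freeze everything except the last shared layer (trunk[2]) and the path-specific head.
for p in trunk[0].parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    list(trunk[2].parameters()) + list(path_head.parameters()), lr=1e-4)
```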

Class embeddings

One variant idea is to first train a class embedding model, and then use these embeddings as the target for the main model. That is, we train two completely separate models. The first learns embeddings for each of the classes in the hierarchy, and the second learns to predict an embedding vector in this lower-dimensional space for each of the training examples. This has the benefit of reducing the cardinality of the prediction target without limiting the data available to the main network.

As a tree is a kind of graph, any algorithm for learning graph embeddings could potentially be relevant for the first model. One classic example is node2vec. Manifold learning techniques such as t-SNE or non-network methods like hierarchical clustering may also be effective. As for the second model, an ordinary fully connected network would suffice.
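A minimal, node2vec-style sketch of the first model: uniform random walks over the taxonomy tree fed to gensim's skip-gram Word2Vec. This stands in for a full node2vec implementation; the edges and every hyperparameter are invented for illustration.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

# Placeholder taxonomy edges, treated as an undirected graph for walking.
edges = [("root", "energy"), ("energy", "solar"), ("energy", "wind"),
         ("root", "agriculture"), ("agriculture", "dairy")]
graph = nx.Graph(edges)

def random_walk(g, start, length=10):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(list(g.neighbors(walk[-1]))))
    return walk

# Many short walks per node, then skip-gram embeddings over the walk "sentences".
walks = [random_walk(graph, node) for node in graph for _ in range(50)]
w2v = Word2Vec(sentences=walks, vector_size=16, window=3, min_count=1, sg=1)
class_vectors = {node: w2v.wv[node] for node in graph}
```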

At inference time, the predicted embedding would be compared to the embeddings for all of the classes, with the predicted class being the one closest to it in this lower-dimensional space.
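A sketch of the inference step, assuming PyTorch and cosine similarity as the notion of "closest"; the class-embedding matrix and the regressor below are random placeholders.

```python
import torch
import torch.nn.functional as F

n_classes, class_dim = 2_000, 16
class_matrix = F.normalize(torch.randn(n_classes, class_dim), dim=-1)  # from the first model

regressor = torch.nn.Sequential(                 # document embedding -> class-embedding space
    torch.nn.Linear(1536, 256), torch.nn.ReLU(), torch.nn.Linear(256, class_dim))

doc_embedding = torch.randn(1, 1536)
pred = F.normalize(regressor(doc_embedding), dim=-1)
predicted_class = (pred @ class_matrix.T).argmax(dim=-1)  # index of the nearest class
```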

Decision trees: only superficially relevant

My first thought upon hearing this problem was to use decision trees or tree-based ensembles, since decision trees appear to share a structure with the target. However, the hierarchy is essentially a form of constraint. Unless the training data conforms closely to the hierarchy (which full text likely does not), using a decision tree does not give the model any new way to discover this constraint. The class embeddings approach, discussed above, is much more likely to do so.