When working with a pretrained neural network model, the workflow is different from (and much simpler than) training a model from scratch. To that end, here is a simpler sequence of steps based on Karpathy’s recipe. Note that this assumes a commercial setting where we do not have the luxury of, as Karpathy says, “squeezing out the juice”: once the model is working well enough to be better than nothing, we have to validate our market assumptions with an experiment. Note, though, that “better than nothing” is often a rather high bar!

TODO DATA AUGMENTATION / CLASS IMBALANCES

1. Thoroughly explore dataset

  • Data cleanliness issues
  • Class imbalances
  • Collinearity / obvious dependencies
  • Low-hanging feature engineering fruit…
    • …and make note of harder things that might help
  • How would a person go about making predictions?
    • If practical, actually make predictions by hand to establish an (approximate) upper performance bound
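A minimal pandas sketch of the first two checks (the DataFrame, column names, and label values here are synthetic stand-ins for the real dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset: an imbalanced label and a
# near-collinear feature pair.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feat_a": rng.normal(size=1000),
    "label": rng.choice(["cat", "dog", "bird"], size=1000, p=[0.7, 0.2, 0.1]),
})
df["feat_b"] = 2 * df["feat_a"] + rng.normal(scale=0.01, size=1000)

# Class imbalance: relative frequency of each label.
class_freq = df["label"].value_counts(normalize=True)
print(class_freq)

# Collinearity: absolute pairwise correlations between numeric features;
# anything close to 1.0 off the diagonal deserves a closer look.
corr = df[["feat_a", "feat_b"]].corr().abs()
print(corr)
```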

2. Verify correctness of inputs

  • Vision models: visualize input examples
  • Sequences: eyeball sequence elements
  • Embeddings: visualize using t-SNE or PCA
  • etc.
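For the embeddings case, a numpy-only PCA projection is enough to get coordinates for a quick scatter plot (the embeddings below are random stand-ins; in practice you would color the points by label):

```python
import numpy as np

def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project embeddings to 2-D via PCA for a quick sanity scatter plot."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD rows of vt are principal directions, ordered by explained variance;
    # keep the top two components.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64))  # stand-in for real embeddings
coords = pca_2d(emb)
print(coords.shape)  # (200, 2)
```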

3. Establish baselines

  • Fix random seed (for repeatability)
  • Mode (for classification), median/mean (for regression): must beat this
  • Confirm initial loss conforms to expectations
  • If applicable, establish another baseline after most obvious forms of data augmentation
  • Optional — may help to initialize the biases in the task head to match high-level expectations
    • Classification: set the final-layer bias to the log of the class priors, so the initial predicted distribution matches the label distribution
    • Regression: set the final-layer bias to the mean (or median) of the targets
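The mode baseline and the matching bias initialization for a classification head can both be sketched in numpy (the labels here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.choice(3, size=1000, p=[0.7, 0.2, 0.1])  # imbalanced toy labels

# Trivial baseline: always predict the most common class.
counts = np.bincount(labels)
mode_acc = counts.max() / counts.sum()
print(f"majority-class accuracy: {mode_acc:.3f}")

# Bias init for a softmax head: log class priors, so the untrained
# network's output distribution matches the label distribution and the
# initial cross-entropy is approximately the entropy of the labels.
priors = counts / counts.sum()
bias = np.log(priors)
expected_initial_loss = -(priors * np.log(priors)).sum()
print(f"expected initial loss: {expected_initial_loss:.3f}")
```

This also gives a concrete number to check the “initial loss conforms to expectations” bullet against.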

4. Overfit a single batch

  • Fit until network has very low training loss (possibly approaching zero)
  • Visualize prediction dynamics over time
    • Tweak learning rate if predictions are jumping around
  • Look at any examples it learns very slowly
  • Try a deliberately dumb loss function to make sure you understand how the gradients behave
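The core of this check, sketched with a small PyTorch head on one fixed batch (shapes, hyperparameters, and step count are illustrative):

```python
import torch

torch.manual_seed(0)
# One fixed batch; a healthy model + optimizer should drive its loss to ~0.
x = torch.randn(32, 16)
y = torch.randint(0, 4, (32,))

head = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4)
)
opt = torch.optim.Adam(head.parameters(), lr=1e-2)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(head(x), y)
    loss.backward()
    opt.step()

print(f"final training loss: {loss.item():.4f}")  # should be near zero
```

If the loss plateaus well above zero here, there is a bug somewhere (data pipeline, loss, or optimizer) that no amount of later tuning will fix.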

5. Overfit the frozen model

At this stage, it makes sense to keep the original pretrained model frozen. That limits the search space of our optimization problem, though the options (the task head’s architecture, the input features, the optimizer and its learning rate) are still considerable.

At each iteration, we can examine the cases the model is failing at to figure out what new features to build. The goal is to push the training loss as low as we can possibly get it, without worrying one little bit about validation loss. We’ll get to that.
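The freezing pattern itself is simple in PyTorch (the two-layer “backbone” here is a stand-in for a real pretrained model, e.g. from torchvision):

```python
import torch

torch.manual_seed(0)
# Stand-in for a pretrained backbone; in practice a real pretrained model.
backbone = torch.nn.Sequential(torch.nn.Linear(32, 128), torch.nn.ReLU())
head = torch.nn.Linear(128, 4)

# Freeze the backbone: no gradients, and eval() so that any batch-norm /
# dropout layers inside it behave deterministically.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

# Only the head's parameters reach the optimizer.
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(8, 32)
with torch.no_grad():
    feats = backbone(x)  # features can even be precomputed and cached
logits = head(feats)
print(logits.shape)  # (8, 4)
```

Since the backbone never changes at this stage, precomputing its features once over the whole dataset can make iteration dramatically faster.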

6. Regularize the frozen model

Now we try to help this model to generalize by adding some regularization. Again, our options are fairly limited, but we do have a couple of tools at our disposal:

  • Batch and layer normalization between the frozen model output and the task head (TODO figure out when we’d prefer each)
  • Dropout at the input to the task head
  • Weight decay (equivalent to L2 regularization under plain SGD, though not under adaptive optimizers like Adam) on the task head’s weights

At this point, we can also introduce a learning rate schedule and early stopping. By the end of this step, we should feel that the frozen model has reached the point of diminishing returns for further tweaking.
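Put together, these tools might look like the following PyTorch sketch (the feature width, dropout rate, schedule, and the validation-loss sequence are all illustrative):

```python
import torch

head = torch.nn.Sequential(
    torch.nn.LayerNorm(128),   # normalize the frozen model's features
    torch.nn.Dropout(p=0.5),   # dropout at the input to the task head
    torch.nn.Linear(128, 4),
)
# AdamW applies decoupled weight decay to the head's parameters.
opt = torch.optim.AdamW(head.parameters(), lr=1e-3, weight_decay=1e-2)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)

# Minimal early stopping: stop once validation loss hasn't improved for
# `patience` consecutive evaluations (losses below are illustrative).
best, patience, bad = float("inf"), 3, 0
for val_loss in [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]:
    if val_loss < best:
        best, bad = val_loss, 0
    else:
        bad += 1
        if bad >= patience:
            break
print(f"stopped with best validation loss {best:.2f}")
```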

7. Fine-tune the foundation model

Unfreeze the foundation model and fine-tune end to end with a much lower learning rate than before. Other than adjusting learning rate schedules and early stopping, there isn’t much left to tune at this stage.
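In PyTorch the “much lower learning rate” is conveniently expressed with per-parameter-group learning rates, so the pretrained weights get nudged while the head keeps its original rate (modules and values here are illustrative):

```python
import torch

backbone = torch.nn.Sequential(torch.nn.Linear(32, 128), torch.nn.ReLU())
head = torch.nn.Linear(128, 4)

# Unfreeze everything...
for p in backbone.parameters():
    p.requires_grad = True

# ...but give the pretrained weights a much smaller learning rate than
# the head, so fine-tuning nudges them rather than destroying them.
opt = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])
print([g["lr"] for g in opt.param_groups])  # [1e-05, 0.001]
```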