Concept drift is change over time in the joint probability distribution of a model’s inputs and ground truth targets. Like Code rot, it is nearly universal; sleepy models in seemingly boring domains are not exempt. Even models trained on the classic iris dataset would probably perform differently on new data due to climate change!
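
To pin that down a bit (the notation here is mine, not lifted from any particular source): drift means the joint distribution of features X and targets Y at one time no longer matches the joint distribution some horizon later, and factoring that joint distribution cleanly separates the first two causes listed below.

```latex
% Concept drift: the joint distribution of features X and targets Y
% at time t no longer matches the distribution a horizon \Delta later.
\[
  P_{t}(X, Y) \neq P_{t + \Delta}(X, Y)
\]

% Factoring the joint distribution separates two of the causes below:
% a shift in P(X) is a change in the feature distribution, while a shift
% in P(Y \mid X) is a change in outcome for the same features.
\[
  P(X, Y) = P(Y \mid X)\, P(X)
\]
```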

Below, I attempt to catalog some of the dimensions along which concept drift can vary, and some of the ways we can keep an eye out for it.

See also Comparing distributions.

Causes

Change in feature distribution

Change in outcome for same features

Change in upstream data processing

Change in context

Temporal pattern

Sudden discontinuity

Cyclical deviation

Gradual deviation

Changes in variance

Manifestations

In order of badness, from “not bad” to “very bad.”

Your alerting mechanism activates

This is obviously the ideal case: you detect the problem and fix it.
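
As a rough illustration of what such an alerting mechanism might look like: a minimal sketch that compares each production feature against a reference sample from training time with a two-sample Kolmogorov-Smirnov test. The function name, the dict-of-arrays layout, the alpha threshold, and the page_on_call hook are placeholders I’m inventing for the example; ks_2samp is real, everything else is yours to swap out.

```python
from scipy.stats import ks_2samp  # two-sample Kolmogorov-Smirnov test


def feature_drift_alerts(reference, production, alpha=0.01):
    """Flag features whose production values look different from training.

    reference, production: dicts mapping feature name -> 1-D array of values.
    alpha: significance threshold; arbitrary placeholder, tune to your paging tolerance.
    """
    drifted = []
    for name, ref_values in reference.items():
        stat, p_value = ks_2samp(ref_values, production[name])
        if p_value < alpha:
            drifted.append((name, stat, p_value))
    return drifted


# Hypothetical usage: run on a schedule and page if anything comes back.
# alerts = feature_drift_alerts(training_sample, last_24h_sample)
# if alerts:
#     page_on_call(alerts)  # stand-in for your actual alerting hook
```

One caveat: on large production samples a p-value will flag tiny, harmless shifts, so in practice it often works better to alert on the test statistic itself (or on something like the population stability index sketched further down).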

Error rate spikes

An upstream software change is usually responsible for this. Your model has started getting requests that it doesn’t know how to handle and it freaks out. The error spike will get you paged. Unfortunately, this page will come after hours, because someone stayed late getting all their tests to pass and running their full integration suite, and you were way too far down the line for them to know their actions would affect you. (I’m looking at you, front-end developer who just modified some input box’s validation rules.) However, this is actually a not-terrible situation: your code failed early and loudly, and the worst you got was a few hours of degraded user experience. It’s honestly pretty hard to prevent this scenario outright; failing fast and loud is about the best outcome you can hope for, so consider it a success.
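
If you want to nudge this failure mode even further toward “fail early and loudly,” one cheap option is to validate incoming requests against the feature schema the model was trained on before the model ever sees them. A minimal sketch; the field names and bounds are invented for illustration:

```python
# Hypothetical request schema: field names and bounds are invented for
# illustration; substitute whatever your model was actually trained on.
EXPECTED_FIELDS = {
    "age": (0, 130),                 # (min, max) for each numeric field
    "account_age_days": (0, 36500),
}


def validate_request(payload: dict) -> dict:
    """Raise immediately on inputs the model was never trained to handle."""
    for field, (lo, hi) in EXPECTED_FIELDS.items():
        if field not in payload:
            raise ValueError(f"missing field: {field}")
        value = payload[field]
        if not isinstance(value, (int, float)) or not (lo <= value <= hi):
            raise ValueError(f"malformed or out-of-range {field}: {value!r}")
    return payload
```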

Business metrics deteriorate

This, unfortunately, is what usually happens: the model just stops doing its job as well as it used to, and you get taken to task. (Or worse, you don’t, and the business just suffers in some preventable way.) Monitoring these business metrics can only help so much, since drop-offs in business outcomes can have many possible causes. Your best bet is to have monitoring in place for the relevant distributions. This, of course, requires you to think of the particular failure mode in advance, and unfortunately, failure modes often become obvious only in retrospect. Using explainable models can help, but it can be costly (or impossible) to create explainable models that perform as well as incomprehensible shoggoths.
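
As one concrete example of monitoring a relevant distribution: the population stability index over the model’s output scores, since the score distribution often shifts before the business metric visibly sags. This is a sketch under my own assumptions: the reference sample comes from validation time, the bin count is a common default, and the 0.1 / 0.25 thresholds are industry folklore rather than anything principled.

```python
import numpy as np


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a reference score sample and a production score sample.

    Rule of thumb (convention, not law): < 0.1 stable, 0.1-0.25 worth a look,
    > 0.25 investigate.
    """
    # Bin edges come from the reference sample so both samples are bucketed the same way.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the reference range

    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # eps avoids log(0) when a bucket is empty in one of the samples.
    expected_pct = np.clip(expected_pct, eps, None)
    actual_pct = np.clip(actual_pct, eps, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))


# Hypothetical usage: compare last week's scores against validation-time scores.
# psi = population_stability_index(validation_scores, last_week_scores)
# if psi > 0.25:
#     open_incident("score distribution drifted")  # stand-in for your incident hook
```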

Humiliating crisis

Someone comes up with a new form of abuse that your fraud detection model doesn’t know about. Someone makes a well-meaning change in your system prompt, causing your diffusion model to create images of Asian-American female Nazis. An economic shift causes a protected group to score higher for default risk, resulting in a discrimination lawsuit. These things will get you fired, and possibly canceled. If you work in anything resembling a sensitive domain—and that’s pretty much all consumer-facing ML—you need to observe the shit out of your models. Your career depends on it.

Sources