Layer norm is more common in NLP applications, while batch norm is more common in CV. Why?
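
Before going through the reasons, it helps to recall which axes each method normalizes over. Below is a minimal sketch, assuming PyTorch; the (batch, seq_len, d_model) shape for a batch of token embeddings is purely illustrative.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 8, 32, 64
x = torch.randn(batch, seq_len, d_model)

# Layer norm: statistics are computed over the feature dimension of each
# token independently; nothing is shared across the batch.
layer_norm = nn.LayerNorm(d_model)
y_ln = layer_norm(x)                                   # (8, 32, 64)

# Batch norm over features: BatchNorm1d expects (batch, features, length),
# and its statistics are shared across the batch and sequence positions.
batch_norm = nn.BatchNorm1d(d_model)
y_bn = batch_norm(x.transpose(1, 2)).transpose(1, 2)   # (8, 32, 64)
```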

Batch size: NLP typically uses small batches, so the per-feature statistics that batch norm estimates from each batch are noisy and vary considerably from one batch to the next. By comparison, CV often utilizes large batches, which give much more stable estimates.
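
A quick way to see this effect (a sketch assuming PyTorch; the stand-in dataset, batch sizes, and feature count are arbitrary): the per-feature mean that batch norm relies on fluctuates far more across small batches than across large ones.

```python
import torch

torch.manual_seed(0)
data = torch.randn(10_000, 64)           # stand-in dataset with 64 features

small_means = torch.stack([data[torch.randperm(len(data))[:4]].mean(dim=0)
                           for _ in range(100)])
large_means = torch.stack([data[torch.randperm(len(data))[:256]].mean(dim=0)
                           for _ in range(100)])

# Spread of the estimated per-feature mean across repeated batches:
print(small_means.std(dim=0).mean())     # roughly 0.5: noisy estimates
print(large_means.std(dim=0).mean())     # roughly 0.06: far more stable
```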

Variable input size: Text (and sequences in general) can vary in length, so the sequences in a batch are typically padded to a common length. When normalizing across multiple sequences in a batch, it is unclear which positions should contribute to the statistics and what normalization constant to use, making NLP data somewhat incompatible with the assumptions of batch norm.
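
As a sketch of the issue (assuming PyTorch and naive zero-padding, with illustrative shapes): batch norm's shared statistics absorb the padded positions, whereas layer norm only ever looks at one token's features at a time.

```python
import torch
import torch.nn as nn

d_model = 16
lengths = [3, 7, 12]                          # variable sequence lengths
max_len = max(lengths)
x = torch.zeros(len(lengths), max_len, d_model)
for i, L in enumerate(lengths):
    x[i, :L] = torch.randn(L, d_model)        # real tokens; the rest is padding

bn = nn.BatchNorm1d(d_model)
y_bn = bn(x.transpose(1, 2)).transpose(1, 2)  # means/variances include padding

ln = nn.LayerNorm(d_model)
y_ln = ln(x)                                  # each token normalized on its own
```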

Consistency between train and inference: Batch norm normalizes with batch statistics during training but with running statistics accumulated over training at inference time. For NLP, differences in sequence length and batch composition between training and inference mean those running statistics may match the test-time inputs poorly. This is much less of a problem for CV, where inputs have a fixed size.
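
A small sketch of this behavior, assuming PyTorch: batch norm changes behavior between training and eval mode, while layer norm computes the same per-example statistics in both.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(16)
ln = nn.LayerNorm(16)
x = torch.randn(4, 16)

bn.train(); out_train = bn(x)                 # uses this batch's statistics
bn.eval();  out_eval = bn(x)                  # uses accumulated running statistics
print(torch.allclose(out_train, out_eval))    # False, in general

ln.train(); ln_train = ln(x)
ln.eval();  ln_eval = ln(x)
print(torch.allclose(ln_train, ln_eval))      # True
```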

Spatial structure: In images, feature values are highly correlated with those that are spatially proximal; in text, by contrast, the most strongly correlated elements of a sequence can be far apart. Because batch norm in CV computes its statistics per channel across both the batch and the spatial dimensions, it can help accentuate these local spatial correlations, which benefits CV generalization greatly. By comparison, the benefit to even a fixed-length text sequence is more limited.

Recurrence: With recurrent models in particular, it is difficult even to define how one would compute batch norm across all of the hidden states in a single layer: the number of timesteps differs across sequences, and separate running statistics would be needed for each timestep. By comparison, layer norm is very well defined, since it normalizes each hidden state on its own.
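
A sketch of how layer norm slots into a recurrent cell (assuming PyTorch; this simple tanh RNN cell is purely illustrative, not a standard library module): the same normalization is applied to the hidden state at every timestep, with statistics that depend only on that state.

```python
import torch
import torch.nn as nn

class LayerNormRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.ih = nn.Linear(input_size, hidden_size, bias=False)
        self.hh = nn.Linear(hidden_size, hidden_size, bias=False)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x_t, h):
        # Normalize the pre-activation of the new hidden state; the statistics
        # depend only on this example at this timestep.
        return torch.tanh(self.norm(self.ih(x_t) + self.hh(h)))

cell = LayerNormRNNCell(input_size=8, hidden_size=32)
x = torch.randn(4, 10, 8)                 # (batch, time, features)
h = torch.zeros(4, 32)
for t in range(x.size(1)):
    h = cell(x[:, t], h)                  # same normalization at every step
```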