There is usually a special token in the vocabulary, <unknown>, reserved for unknown tokens. When a transformer encounters a completely novel token, the token is replaced with <unknown>, one-hot encoded as such, and embedded exactly like every other token in the input sequence.
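A minimal sketch of how this lookup might work (the token name <unknown>, the toy vocabulary, and the `embed` helper are illustrative, not from any particular library):

```python
import numpy as np

# Toy vocabulary; <unknown> is an ordinary entry with its own row in the
# embedding matrix. Any out-of-vocabulary token maps to its index.
vocab = {"<unknown>": 0, "the": 1, "cat": 2, "sat": 3}
embedding_dim = 4
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), embedding_dim))

def embed(token: str) -> np.ndarray:
    """Look up a token's embedding, falling back to <unknown> for novel tokens."""
    index = vocab.get(token, vocab["<unknown>"])
    one_hot = np.zeros(len(vocab))
    one_hot[index] = 1.0
    # A one-hot vector times the embedding matrix is just a row lookup.
    return one_hot @ embeddings

print(embed("cat"))      # in-vocabulary: gets its own row
print(embed("florble"))  # novel token: gets the <unknown> row
```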

During training, some of the attention heads are likely to learn representations that rely more on context and less on the input vector itself. At inference time, these heads can impute an interpretation of the novel token from its surrounding context, much as a human reader does.
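A toy illustration of why this can work, using contrived weights rather than anything trained: in standard scaled dot-product attention, each position's output is a weighted mix of the value vectors at all positions. If a head's query for the <unknown> position happens to match the keys of its neighbors, the output at that position is built almost entirely from context:

```python
import numpy as np

def attention(Q, K, V):
    """Standard scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Three positions: 0 and 2 are context tokens, 1 is the <unknown> token.
# These Q/K/V values are hand-picked so position 1's query matches the
# context keys; a trained head could learn projections with this effect.
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
Q = np.array([[0.0, 1.0], [5.0, 0.0], [0.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 0.0], [0.0, 10.0]])

out, weights = attention(Q, K, V)
print(weights[1])  # ~[0.49, 0.01, 0.49]: nearly all mass on the context positions
print(out[1])      # ~[4.9, 4.9]: a blend of the neighbors' value vectors
```

The point of the sketch is only the mechanism: because the output row for the <unknown> position is a convex combination of value vectors, a head that places little weight on that position's own value effectively reconstructs a meaning for it from context.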