To perform statistical inference on sequences, text must first be encoded as numbers. The typical strategy is to treat the vocabulary's unique words as categorical variables, assigning each distinct word its own numerical label. The labels are then one-hot encoded, just as one would with any set of mutually exclusive categories.
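A minimal sketch of this mapping in Python, using an illustrative sentence and vocabulary (nothing here comes from a real dataset):

```python
import numpy as np

sentence = "the cat sat on the mat".split()
vocab = sorted(set(sentence))                     # unique words as categories
word_to_id = {w: i for i, w in enumerate(vocab)}  # one label per distinct word

# Map each word to its categorical label...
ids = [word_to_id[w] for w in sentence]

# ...then one-hot encode: each row is all zeros except a 1 at the
# word's index, as with mutually exclusive categories.
one_hot = np.eye(len(vocab))[ids]
print(one_hot.shape)  # (6, 5): six words, five vocabulary entries
```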
An additional refinement comes from the recognition that word meaning is often carried by sub-word chunks. For example, “walking” encodes both a base verb (“walk”) and an indication that the verb’s action is ongoing (“ing”). To capture these subordinate meanings, the word is first broken into “tokens” using a tokenizer.
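As a sketch of the mechanics, here is a toy greedy longest-match tokenizer over a hypothetical sub-word vocabulary. Real tokenizers (BPE, WordPiece, and the like) learn their vocabularies from data, but the splitting idea is similar in spirit:

```python
# Hypothetical sub-word vocabulary, chosen purely for illustration.
SUBWORDS = {"walk", "ing", "talk", "ed", "run"}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest matching sub-word chunks."""
    tokens, start = [], 0
    while start < len(word):
        # Try the longest remaining prefix first.
        for end in range(len(word), start, -1):
            if word[start:end] in SUBWORDS:
                tokens.append(word[start:end])
                start = end
                break
        else:
            # No sub-word matches: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens

print(tokenize("walking"))  # ['walk', 'ing']
print(tokenize("talked"))   # ['talk', 'ed']
```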
A sequence, of course, is not a set of independent labels; its meaning depends greatly on the ordering of its elements. The sequence is therefore encoded not as a single vector but as an ordered series of vectors, usually represented as a rank-2 tensor, or a rank-3 tensor in the case of batches of sequences.
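Continuing the one-hot example, a sketch of the resulting tensor shapes, with arbitrary token ids and a small vocabulary chosen purely for illustration:

```python
import numpy as np

vocab_size = 5

# One tokenized sequence: each element becomes a one-hot row vector,
# and stacking the rows in order yields a rank-2 tensor.
token_ids = [0, 3, 1, 3]
sequence = np.eye(vocab_size)[token_ids]
print(sequence.shape)  # (4, 5): (seq_len, vocab_size)

# A batch of equal-length sequences adds a leading batch dimension,
# yielding a rank-3 tensor.
batch_ids = [[0, 3, 1, 3], [2, 4, 0, 1]]
batch = np.eye(vocab_size)[batch_ids]
print(batch.shape)  # (2, 4, 5): (batch, seq_len, vocab_size)
```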