A two-tower neural network is not really a single neural network. Rather, it is a pair of embedding models that are aligned via a contrastive loss function such that resulting pairs of embeddings can be compared by a distance metric. It is widely used for collaborative filtering, where the embedding models are usually a stack of fully connected layers. However, it is also used in other contexts. For example, CLIP employs a text transformer and a vision transformer to align text to images.