Mixture-of-Experts (MoE) models involve propagating input to several different “expert” learners and then combining their outputs. MoE is generally distinguished from sequential ensembles because, in many cases, only a subset of the experts is activated for a given inference. In deep learning applications of MoE, experts are usually activated by an upstream “gating” model that learns to select the optimal learners for a given input. These gating layers may be arranged hierarchically.
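
The sketch below illustrates this routing pattern, assuming a PyTorch-style setup. The class and parameter names (`SimpleMoE`, `num_experts`, `top_k`) are illustrative rather than drawn from any particular library: a gating layer scores all experts for each input, and only the top-k experts are actually run and combined.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    """Minimal top-k gated mixture-of-experts layer (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each "expert" is an independent feed-forward sub-network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # The gating model scores every expert for each input.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model). Score experts, keep only the top-k per input.
        scores = self.gate(x)                                  # (batch, num_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)  # (batch, top_k)
        weights = F.softmax(top_scores, dim=-1)                # combination weights over selected experts

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = top_idx[:, slot]                 # which expert each input uses in this slot
            w = weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # Only the selected experts run, and only on their share of the batch.
                    out[mask] += w[mask] * expert(x[mask])
        return out

moe = SimpleMoE(d_model=64, d_hidden=256)
y = moe(torch.randn(4, 64))  # each input is processed by only 2 of the 8 experts
```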

The primary benefit of MoE is that it allows highly specialized sub-models to be used within a larger model without incurring the computational cost of running them on irrelevant inputs.
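
To make that saving concrete, here is a back-of-the-envelope comparison using the illustrative `SimpleMoE` configuration above (8 experts, top-2 routing, d_model=64, d_hidden=256); the numbers describe only that toy sketch:

```python
# One expert's parameters: two linear layers with biases.
expert_params = 64 * 256 + 256 + 256 * 64 + 64  # 33,088
total_stored = 8 * expert_params                # parameters the layer stores: 264,704
active_per_input = 2 * expert_params            # parameters actually used per input: 66,176
print(total_stored / active_per_input)          # 4.0x fewer expert FLOPs than running all experts
```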