In a transformer model, a “position-wise feedforward network” is a two-layer fully connected network with a ReLU between the layers:

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

where W_1 projects each position from dimension d_model up to an inner dimension d_ff, and W_2 projects it back down to d_model.
A position-wise feedforward transformation is applied identically to each position in the sequence. Because it is applied with the same weights at each position, it introduces a form of weight sharing across the positions of the sequence.
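To make the weight sharing concrete, here is a small sketch (toy sizes of my own choosing, not from the source) showing that a PyTorch `nn.Linear` applied to a `(batch, seq, d_model)` tensor uses the same weights at every position:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(8, 8)     # toy d_model = 8
x = torch.randn(2, 5, 8)     # (batch, seq_len, d_model)

# nn.Linear acts on the last dimension, so one set of weights is
# applied independently at every position in the sequence.
batched = linear(x)
per_position = torch.stack([linear(x[:, t]) for t in range(5)], dim=1)
assert torch.allclose(batched, per_position, atol=1e-6)
```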
Recall that the attention output at each position is an expected value (an attention-weighted average) of the features at each position in the sequence. The position-wise feedforward network comes after self-attention, i.e., it uses these expected values as inputs. Hence it can be seen as a mechanism for learning nonlinear relationships between the expected semantics at each position. The authors compare this to a pointwise convolution, implicitly drawing an analogy between the features in the context vector and channels in a convolutional neural network.
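The pointwise-convolution analogy can be checked directly. The sketch below (toy sizes, my own assumption) copies the weights of a position-wise linear layer into a `Conv1d` with kernel size 1 and confirms the two compute the same thing:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32
linear = nn.Linear(d_model, d_ff)

# A kernel-size-1 (pointwise) convolution with the same weights computes
# exactly the position-wise linear transform, treating features as channels.
conv = nn.Conv1d(d_model, d_ff, kernel_size=1)
conv.weight.data = linear.weight.data.unsqueeze(-1)  # (d_ff, d_model, 1)
conv.bias.data = linear.bias.data

x = torch.randn(2, 5, d_model)                       # (batch, seq, d_model)
out_linear = linear(x)                               # (batch, seq, d_ff)
# Conv1d expects (batch, channels, seq), so transpose in and out.
out_conv = conv(x.transpose(1, 2)).transpose(1, 2)
assert torch.allclose(out_linear, out_conv, atol=1e-5)
```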
Like basically every other part of the transformer, the position-wise feedforward network employs dropout for regularization.
In The Annotated Transformer, the position-wise feedforward network is implemented exactly as you’d expect:
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)  # project d_model -> d_ff
        self.w_2 = nn.Linear(d_ff, d_model)  # project d_ff -> d_model
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # ReLU between the two linear layers; dropout on the inner activations
        return self.w_2(self.dropout(self.w_1(x).relu()))
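A quick shape check (restating the class so the snippet runs on its own; d_model=512 and d_ff=2048 are the paper’s default sizes):

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))

ffn = PositionwiseFeedForward(d_model=512, d_ff=2048).eval()  # eval() disables dropout
x = torch.randn(2, 10, 512)       # (batch, seq_len, d_model)
out = ffn(x)
assert out.shape == (2, 10, 512)  # each position is mapped back to d_model
```

Note that the input and output shapes match, which is what lets this block be stacked with residual connections around it.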