In a transformer model, a “position-wise feedforward network” is a two-layer feedforward neural network with a single (ReLU) nonlinearity between the layers:

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$

where $x$ is a row vector. This is inconsistent with conventional notation, but it makes the formula a bit easier to read (and is how it is given in Vaswani, et al.).

A position-wise feedforward transformation is applied identically to each position in the sequence. Because the same weights are used at every position, it introduces a form of weight sharing across the sequence.
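As a quick illustration of that weight sharing (my own sketch, not code from the paper; the sizes are arbitrary), applying one `nn.Linear` to a whole sequence at once gives the same result as applying it to each position separately:

```python
import torch
import torch.nn as nn

# Illustrative sizes: one Linear layer applied to a (seq_len, d_model)
# tensor reuses the same weights at every position of the sequence.
torch.manual_seed(0)
seq_len, d_model = 5, 8
layer = nn.Linear(d_model, d_model)
x = torch.randn(seq_len, d_model)

whole = layer(x)  # all positions in one call
per_position = torch.stack([layer(x[i]) for i in range(seq_len)])

# The two computations agree: the weights are shared across positions.
print(torch.allclose(whole, per_position, atol=1e-6))
```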

Recall that the attention vectors represent the expected value of the features at each position in the sequence. The position-wise feedforward network comes after self-attention, so it takes these expected values as inputs. It can therefore be seen as a mechanism for learning nonlinear relationships among the expected semantics of each position. The authors compare it to a pointwise convolution, implicitly drawing an analogy between the features in the context vector and the channels in a convolutional neural network.
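To make the pointwise-convolution analogy concrete, here is a small check (my construction, with illustrative sizes): a `Conv1d` with kernel size 1, loaded with a linear layer's weights, computes exactly the same position-wise map, with features playing the role of channels:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff, seq_len = 8, 16, 5

linear = nn.Linear(d_model, d_ff)
conv = nn.Conv1d(d_model, d_ff, kernel_size=1)

# Copy the linear weights into the conv so both compute the same map.
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))  # (d_ff, d_model, 1)
    conv.bias.copy_(linear.bias)

x = torch.randn(1, seq_len, d_model)                # (batch, seq, features)
out_linear = linear(x)                              # (1, seq_len, d_ff)
out_conv = conv(x.transpose(1, 2)).transpose(1, 2)  # Conv1d wants channels-first

print(torch.allclose(out_linear, out_conv, atol=1e-6))
```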

Like basically every other part of the transformer, the position-wise feedforward network employs dropout for regularization.

In The Annotated Transformer, the position-wise feedforward network is implemented exactly as you’d expect:

import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)   # expand: d_model -> d_ff
        self.w_2 = nn.Linear(d_ff, d_model)   # project back: d_ff -> d_model
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # ReLU between the two layers; dropout on the hidden activations
        return self.w_2(self.dropout(self.w_1(x).relu()))
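A quick shape check, using the paper's sizes ($d_{\text{model}} = 512$, $d_{\text{ff}} = 2048$); the class is repeated here only so the snippet runs standalone:

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):  # repeated from above
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))

ffn = PositionwiseFeedForward(d_model=512, d_ff=2048).eval()  # eval() disables dropout
x = torch.randn(2, 10, 512)  # (batch, seq_len, d_model)
out = ffn(x)
print(out.shape)  # torch.Size([2, 10, 512]): d_model in, d_model out
```

Note that the hidden dimension `d_ff` is internal to the block: each position enters and leaves with `d_model` features, so the module can be dropped into the residual stream unchanged.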