From Chaudhury, et al. (2024)

New highlights added June 25, 2024 at 2:45 PM

  • for large values of $w$, the parameterized sigmoid is virtually indistinguishable from the Heaviside step function (View Highlight)
  • the derivative (gradient) of the function near x = 0 is much higher for tanh than for sigmoid. Stronger gradients mean faster convergence, as the weight updates happen in larger steps (View Highlight)
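    • Note: Both claims are checked numerically in the short sketch after this list.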
  • Why do extra layers help? The primary reason is the extra nonlinearity. (View Highlight)
  • Such a layer is also known as a fully connected layer. (View Highlight)
    • Note: Linear layers are synonymous with fully connected layers. Note, however, that this has no implication about the activation function of the neurons. This is because the activation function is often modeled as a separate “layer.”
  • The weight of the connection from the $k$th neuron in layer $(l-1)$ to the $j$th neuron in layer $l$ is denoted $w^{(l)}_{jk}$. (View Highlight)
  • (View Highlight)
    • Note: Interpretation of the superscripts and subscripts used in describing neural network connections
  • (View Highlight)
    • Note: In this weight matrix $W^{(l)}$, the rows correspond to target neurons in layer $l$, and the columns correspond to source neurons in layer $l-1$. The row index is $j$ and the column index is $k$. We write an entry as $w^{(l)}_{jk}$.
  • The final output of an MLP with fully connected (aka linear) layers $0 \cdots L$ on input $x$ can be obtained by repeated application of this equation (View Highlight)
    • Note: A neural network can be understood as the composition of the linear combination and activation functions for each layer. (A minimal forward-pass sketch appears after this list.)
  • The complicated equation 8.9 is never explicitly evaluated. Instead, we evaluate the outputs of successive layers, one layer at a time, as per equation 8.10. Every layer can be evaluated by taking the previous layer’s output as input. No other input is necessary. (View Highlight)
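A quick numeric check of the two activation-function highlights above, sketched in Python/NumPy (the sample points and values of $w$ are my own choices, not the book's): as $w$ grows, $\sigma(wx)$ approaches the Heaviside step, and the derivative at $x = 0$ is $1$ for tanh versus $0.25$ for the sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sample points away from x = 0, where the step function jumps.
x = np.array([-1.0, -0.5, -0.1, 0.1, 0.5, 1.0])
step = np.heaviside(x, 1.0)

# As w grows, sigmoid(w * x) squashes toward 0/1 and approaches the step function.
for w in (1.0, 10.0, 100.0):
    gap = np.abs(sigmoid(w * x) - step).max()
    print(f"w = {w:6.1f}   max |sigmoid(wx) - step(x)| = {gap:.5f}")

# Gradients at x = 0: sigmoid'(0) = 0.25, tanh'(0) = 1.
# The larger gradient near 0 is why tanh tends to produce bigger weight updates there.
print("sigmoid'(0) =", sigmoid(0.0) * (1.0 - sigmoid(0.0)))
print("tanh'(0)    =", 1.0 - np.tanh(0.0) ** 2)
```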
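The indexing convention $w^{(l)}_{jk}$ and the layer-by-layer forward evaluation can be made concrete with a small sketch. The layer sizes, sigmoid activation, and variable names here are my own illustrative choices, not the book's code; the recursion itself is the one the last highlight describes (each layer is evaluated from the previous layer's output alone).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer sizes from input layer 0 to output layer L (values chosen arbitrarily here).
sizes = [3, 4, 4, 2]

# One weight matrix per layer l = 1..L. The matrix for layer l has shape (n_l, n_{l-1}):
# row index j is the target neuron in layer l, column index k is the source neuron in
# layer l-1, so its (j, k) entry is w^{(l)}_{jk}.
W = [rng.standard_normal((sizes[l], sizes[l - 1])) for l in range(1, len(sizes))]
b = [rng.standard_normal(sizes[l]) for l in range(1, len(sizes))]

def forward(x):
    """Evaluate the MLP one layer at a time: each layer needs only the previous layer's output."""
    a = x
    for Wl, bl in zip(W, b):
        z = Wl @ a + bl      # linear combination (the "fully connected" part)
        a = sigmoid(z)       # activation, often modeled as its own "layer"
    return a

print(forward(rng.standard_normal(sizes[0])))
```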

New highlights added June 25, 2024 at 4:45 PM

  • In general, if a loss, denoted $\mathbb{L}$, is expressed as a function of the parameters, such as $\mathbb{L}(w, b)$, then the change in the parameters that optimally takes us toward lower loss is yielded by the gradient of the loss with respect to the parameters, $\nabla_{w,b}\,\mathbb{L}(w, b)$. (View Highlight)
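    • Note: A minimal gradient-descent sketch of this update appears in code after this list.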
  • the loss (View Highlight)
    • Note: uh, you mean “the MSE loss”
  • (View Highlight)
    • Note: I find this unhelpful, as it just makes the equations below harder to read (since I have to keep going back and checking what $\delta^{(l)}$ means). I think the idea is that, in implementing this, you need to keep track of $\frac{\partial L}{\partial z^{(l)}}$, and maybe people usually call this “deltas.” That said, it’s not really an “auxiliary variable” in the sense that we usually use them: as a substitution that allows us to complete a mathematical analysis that otherwise would run into trouble during intermediate steps. (The delta bookkeeping is sketched in code after this list.)
  • L (View Highlight)
  • we can iteratively evaluate the outputs of successive layers using only the outputs from the previous layer (View Highlight)
  • A similar trick can be applied to evaluate the auxiliary variables, except we go backward. (View Highlight)
  • Z (View Highlight)
    • Note: Where did this come from? There’s no Z operation in PyTorch!
  • the loss (View Highlight)
    • Note: Again, the insistence on MSE as “the loss” makes me wonder how worthwhile it is to read this carefully. The analysis also depends explicitly on the sigmoid activation function, but at least he’s upfront about that.
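To make the gradient highlight concrete, here is a minimal gradient-descent sketch on an MSE loss $\mathbb{L}(w, b)$ for a single linear neuron. The data, learning rate, and model are illustrative assumptions of mine, not taken from the book; the point is only that each update moves the parameters along $-\nabla_{w,b}\,\mathbb{L}(w, b)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data for a single linear neuron y_hat = w . x + b (my choice, purely illustrative).
X = rng.standard_normal((32, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3

w, b, lr = np.zeros(3), 0.0, 0.1

for step in range(200):
    err = X @ w + b - y
    loss = np.mean(err ** 2)              # MSE loss L(w, b)
    grad_w = 2.0 * X.T @ err / len(y)     # dL/dw
    grad_b = 2.0 * np.mean(err)           # dL/db
    w -= lr * grad_w                      # step against the gradient -> lower loss
    b -= lr * grad_b

print(loss, w, b)
```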
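And a sketch of the backward pass the later highlights describe: the "auxiliary variable" is the per-layer delta $\delta^{(l)} = \partial \mathbb{L} / \partial z^{(l)}$, evaluated from the output layer back toward the input, mirroring the forward recursion. Sigmoid activations and a sum-of-squares loss are assumed (as in the book's analysis); the layer sizes and array names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

sizes = [3, 4, 2]                          # illustrative layer sizes, not from the book
W = [rng.standard_normal((sizes[l], sizes[l - 1])) for l in range(1, len(sizes))]
b = [rng.standard_normal(sizes[l]) for l in range(1, len(sizes))]

x = rng.standard_normal(sizes[0])          # one input
y = rng.standard_normal(sizes[-1])         # one target

# Forward pass, keeping every activation: for sigmoid, sigma'(z) = a * (1 - a),
# so the stored activations are all the backward pass needs.
a = [x]
for Wl, bl in zip(W, b):
    a.append(sigmoid(Wl @ a[-1] + bl))

# Backward pass: delta^{(l)} = dL/dz^{(l)}, evaluated from the output layer backward.
# With sum-of-squares loss L = ||a^{(L)} - y||^2:
delta = 2.0 * (a[-1] - y) * a[-1] * (1.0 - a[-1])
grads_W, grads_b = [None] * len(W), [None] * len(W)
for l in reversed(range(len(W))):
    grads_W[l] = np.outer(delta, a[l])     # dL/dW^{(l)} = delta^{(l)} (a^{(l-1)})^T
    grads_b[l] = delta                     # dL/db^{(l)} = delta^{(l)}
    if l > 0:
        # Each delta is computed from the next layer's delta alone -- the backward
        # mirror of the forward "one layer at a time" evaluation.
        delta = (W[l].T @ delta) * a[l] * (1.0 - a[l])

print([g.shape for g in grads_W])          # [(4, 3), (2, 4)]
```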