Neural networks, especially deep ones, are extremely overparametrized. As a result, so long as the structural priors bear any resemblance at all to the target function, stochastic gradient descent will push the network towards the correct solution. Even with a poorly matched model, the solution will generally be better than random guessing, and will often be better than simply predicting the statistical mode of the data, as the sketch below illustrates.
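To make that claim concrete, here is a minimal toy sketch (my own illustration, not drawn from any paper): a one-hidden-layer network with far more parameters than the problem requires, trained by minibatch SGD on a synthetic, class-imbalanced task, and compared against the always-predict-the-mode baseline. The task, network width, initialization scales, and hyperparameters are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-class problem: the label depends on a nonlinear feature,
# with class imbalance so "predict the mode" is a nontrivial baseline.
n = 2000
X = rng.normal(size=(n, 2))
y = (np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] > 0.3).astype(float)  # roughly 40% positive

# Deliberately overparametrized: 512 hidden units for a 2-D input.
h = 512
W1 = rng.normal(scale=0.5, size=(2, h))
b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=h)
b2 = 0.0

def forward(X):
    a = np.maximum(X @ W1 + b1, 0.0)                     # ReLU hidden layer
    return a, 1.0 / (1.0 + np.exp(-(a @ W2 + b2)))       # sigmoid output

lr = 0.1
for step in range(2000):
    idx = rng.choice(n, size=64, replace=False)          # minibatch SGD
    xb, yb = X[idx], y[idx]
    a, p = forward(xb)
    g = (p - yb) / len(idx)                              # grad of cross-entropy w.r.t. logits
    W2 -= lr * (a.T @ g)
    b2 -= lr * g.sum()
    ga = np.outer(g, W2) * (a > 0)                       # backprop through the ReLU
    W1 -= lr * (xb.T @ ga)
    b1 -= lr * ga.sum(axis=0)

_, p = forward(X)
net_acc = ((p > 0.5) == y).mean()
mode_acc = max(y.mean(), 1 - y.mean())                   # accuracy of always predicting the mode
print(f"net accuracy: {net_acc:.3f}, mode baseline: {mode_acc:.3f}")
```

On this toy problem the overparametrized net typically lands well above the mode baseline, despite the crude architecture and entirely untuned hyperparameters; that gap, achieved with so little deliberate design, is the phenomenon at issue.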
It should not be surprising, then, to find that network architectures stand on vague and casual justifications, even in papers. In fact, some of the most famous papers of all, such as the Vaswani et al. (2017) transformer, give mostly phenomenological descriptions of what their models accomplish, rather than the rigorous theoretical grounding expected in most quantitative sciences.