I’m intrigued by the idea of different solutions arising from different training algorithms. In the context of large language models, how do researchers choose between different optimization approaches, and what trade-offs do they consider?

This thread started as an email exchange between Libin and me.

I conveyed my questions and confusion imprecisely. To be more specific and true to myself: I am interested in concretely extrapolating mathematical and mechanistic understanding of machine learning models to complex large language and multimodal models in the direction of the nature of the data and the

upside examples

Knowing how theory applies to neural network architecture and engineering lets us do some cool stuff on many fronts.

efficiency by reducing complexity of attention

  • linformer: self-attention mechanism can be approximated by a low-rank matrix
    • decompose the original scaled dot-product attention into multiple smaller attentions through linear projections, such that the combination of these operations forms a low-rank factorization of the original attention
  • linearized attention: express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from $O(N^2)$ to $O(N)$, where $N$ is the sequence length.
  • performer: approximate the softmax attention kernel with positive random features (FAVOR+), giving time and space complexity linear in the sequence length.
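
To make the associativity trick in the linearized attention bullet concrete, here is a minimal NumPy sketch. It is not softmax attention: it assumes the similarity is a dot product of feature maps, with `elu(x) + 1` as one common choice of positive feature map. The function names, shapes (`N = 128`, `d = 16`), and the non-causal setting are all assumptions made for illustration. Both functions compute the same output; only the grouping of the matrix products differs.

```python
import numpy as np


def feature_map(x):
    """elu(x) + 1, a positive feature map (an assumed choice for this sketch)."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def quadratic_attention(Q, K, V):
    """Kernelized attention that materializes the full N x N similarity matrix."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    scores = phi_q @ phi_k.T                       # (N, N): quadratic in N
    return (scores @ V) / scores.sum(axis=1, keepdims=True)


def linear_attention(Q, K, V):
    """Same output, regrouped as phi(Q) (phi(K)^T V): no N x N matrix is built."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv = phi_k.T @ V                               # (d, d_v): one pass over N
    z = phi_k.sum(axis=0)                          # (d,): normalizer statistics
    return (phi_q @ kv) / (phi_q @ z)[:, None]     # linear in N


rng = np.random.default_rng(0)
N, d = 128, 16                                     # hypothetical sizes
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

out_quadratic = quadratic_attention(Q, K, V)
out_linear = linear_attention(Q, K, V)
max_abs_diff = float(np.abs(out_quadratic - out_linear).max())
```

The two outputs agree up to floating-point error, so the saving comes purely from the order of multiplication: the first version costs on the order of N² · d operations, the second on the order of N · d².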