I’m intrigued by the idea of different solutions arising from different training algorithms. In the context of large language models, how do researchers choose between different optimization approaches, and what trade-offs do they consider?

This thread started from an email exchange between Libin and me.

I conveyed my questions and confusion incorrectly. To be more specific and true to myself: I am interested in concretely extrapolating mathematical and mechanistic understanding of machine learning models to complex large language and multimodal models, in the direction of the nature of the data and the

upside examples

Knowing how theory applies to neural network architecture and engineering lets us do some cool stuff on many fronts.

efficiency by reducing complexity of attention

Linformer: the self-attention mechanism can be approximated by a low-rank matrix
- decompose the original scaled dot-product attention into multiple smaller attentions through linear projections, such that the combination of these operations forms a low-rank factorization of the original attention
Linearized attention: express self-attention as a linear dot-product of kernel feature maps and use the associativity of matrix products to reduce the complexity from $O(N^{2})$ to $O(N)$ in the sequence length $N$.
Performer: approximate the softmax attention kernel with random feature maps, giving time and space complexity linear in the sequence length.
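
To make the associativity trick and the low-rank idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not any paper's reference implementation: the attention is non-causal, the feature map `elu(x) + 1` is one common choice for linearized attention, and the projection matrices `E` and `F` are random stand-ins for the projections a Linformer-style model would learn. All function names are made up for this note.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def feature_map(x):
    # elu(x) + 1: strictly positive, so the normalizer below is never zero.
    # (Assumed feature map; other positive maps would work as well.)
    return np.where(x > 0, x + 1.0, np.exp(x))


def kernel_attention_quadratic(Q, K, V):
    # Builds the full (N, N) similarity matrix: O(N^2) time and memory.
    scores = feature_map(Q) @ feature_map(K).T
    return (scores @ V) / scores.sum(axis=1, keepdims=True)


def kernel_attention_linear(Q, K, V):
    # Same result, but computes phi(K)^T V first (associativity),
    # so no (N, N) matrix is ever formed: O(N) in sequence length.
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv = phi_k.T @ V              # (d, d_v) summary of keys and values
    z = phi_k.sum(axis=0)         # (d,) summary used for normalization
    return (phi_q @ kv) / (phi_q @ z)[:, None]


def linformer_style_attention(Q, K, V, E, F):
    # Project the length-N key/value sequences down to k rows, then run
    # ordinary softmax attention against the k projected rows: O(N * k).
    d = Q.shape[-1]
    K_proj = E @ K                # (k, d)
    V_proj = F @ V                # (k, d_v)
    weights = softmax(Q @ K_proj.T / np.sqrt(d))  # (N, k)
    return weights @ V_proj


rng = np.random.default_rng(0)
N, d, k = 128, 16, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
# Random stand-ins for learned projections (assumption for illustration).
E, F = (rng.standard_normal((k, N)) / np.sqrt(N) for _ in range(2))

out_quadratic = kernel_attention_quadratic(Q, K, V)
out_linear = kernel_attention_linear(Q, K, V)
out_linformer = linformer_style_attention(Q, K, V, E, F)

# The two kernel-attention orderings agree up to floating-point error.
assert np.allclose(out_quadratic, out_linear)
```

The quadratic and linear versions compute the same kernel attention; only the order of the matrix products differs. The Linformer-style function, by contrast, is an approximation of softmax attention, so its output is not expected to match full attention exactly.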

