Misc. Notes

These are things I already knew from interacting with GPT-3 and ChatGPT back in Nov. 2022.

  • Training autoregressive language models amounts to maximizing the probability of the next token given the preceding context;
  • Base models trained only with the autoregressive objective aren’t very helpful on their own: completions may be coherent language, but not necessarily what users actually want;
  • Fine-tuning is the stage of training the model on specific input-output pairs so that it learns to generate outputs the way users want;
    • we do this by applying chat templates, where special tokens delimit inputs and outputs (see the sketch after this list);
  • Pretraining vs. fine-tuning:
    • pretraining is not limited to a particular domain; it is used to let the model learn language and knowledge in general;
    • fine-tuning is used to make the model do well on a specific domain or task (question answering, etc.)
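
To make the first bullets concrete, here is a minimal sketch (assuming a Hugging Face-style tokenizer and model; the Qwen checkpoint is just a placeholder, not something from my notes) of formatting one input-output pair with a chat template and computing the next-token cross-entropy loss:

```python
# Sketch only: format one (input, output) pair with a chat template and compute
# the next-token cross-entropy loss. The checkpoint name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# The chat template inserts special tokens (e.g. <|im_start|>) between input and output.
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(text, return_tensors="pt")

# labels = input_ids makes the model predict each token from the ones before it;
# the loss is the average negative log-likelihood of those tokens.
# (In practice the prompt tokens are often masked so only the response is learned.)
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)
```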

Seemingly Obvious Lessons

  1. No need to jump straight to fine-tuning
    1. This feels intuitive! Even just from knowing about chain-of-thought, a reasonable intuition is that there is a lot of work to be done on the downstream input side: we can call it data science, even.
    2. Working on evaluating LLM fine-tuning also makes this obvious. When I fine-tuned a medical language model,
    3. I’m not surprised this needs to be said, but it feels good to have my intuitions and understandings verified by more experienced people in the field.
  2. We need consistent templating between training and inference (see the sketch after this list).
    1. Also kinda obvious, although I have not tried fine-tuning the same LLM sequentially on datasets with two different templates.
    2. One weird thought: template tokens such as <|im_start|>, which persist across samples, can effectively act as a regularizer during optimization; with a bit of spatial imagination, they guide the model into a certain basin (or plateau, depending on your ascent/descent taste) that corresponds to being conversational/helpful/etc.
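
A minimal sketch of what consistent templating looks like in practice, again assuming a Hugging Face-style tokenizer (placeholder checkpoint): the same chat template is applied at training and at inference, with add_generation_prompt only at inference so generation continues from the assistant tag the model saw throughout training.

```python
# Sketch only: use the *same* chat template at training time and at inference time.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # placeholder

# Training time: the full conversation, template tokens wrapping both input and output.
train_text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize this report."},
     {"role": "assistant", "content": "The report says ..."}],
    tokenize=False,
)

# Inference time: same template, but stop after the prompt so the model continues
# from the assistant tag it has seen on every training sample.
infer_text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize this report."}],
    tokenize=False,
    add_generation_prompt=True,
)

print(train_text)
print(infer_text)
```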

Real examples

Unconventional inputs

What could this mean? It is natural to think of text generation tasks as taking text in and producing text out. So, when Dan showed an example that wanted to use an LLM to predict the value of a shipped item based on an 80-char description, I thought “instead of fine-tuning, isn’t this where people train linear regression heads?”, and not 5s later I realized I didn’t know what a linear head (or linear probe, as a mental detour) is.

What are Linear Probes?
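
My after-the-fact understanding (my own gloss, not from the course): a linear probe keeps the base model frozen and trains only a simple linear model on top of its hidden representations, so the “head” is ordinary regression or classification. A minimal sketch, assuming a Hugging Face-style model and made-up descriptions/prices:

```python
# Sketch only: frozen LM features + a linear regression head (a "linear probe").
# The checkpoint name and the data are placeholders.
import torch
from sklearn.linear_model import Ridge
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")  # placeholder
model = AutoModel.from_pretrained("Qwen/Qwen2.5-0.5B").eval()   # frozen: no gradient updates

descriptions = ["vintage oak desk, minor scratches", "box of 50 ballpoint pens"]  # made up
prices = [120.0, 8.5]                                                             # made up

features = []
with torch.no_grad():
    for text in descriptions:
        inputs = tokenizer(text, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state      # (1, seq_len, hidden_dim)
        features.append(hidden.mean(dim=1).squeeze(0))  # mean-pool into one vector per text

# The "probe": an ordinary linear model trained on the frozen features.
probe = Ridge().fit(torch.stack(features).numpy(), prices)
```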

Back to the problem

The problems Dan et al. observed were:

  • responses were round numbers (10, 15, 20, etc.), and the model was not great at getting approximately right values. This indicates an inappropriate loss function:
    • This also makes intuitive sense! The LLM’s loss is assigned based on token likelihood, which is not necessarily associated with the correct numeric value; even if it were (which it won’t be), token-level loss gives no partial credit for being numerically close (a toy illustration follows this list).
  • The training data contained a lot of small values, many of which were also erroneously small: this leads to an imbalanced training dataset, a classic data science problem. When working on tabular prediction problems, we would balance the classes in the training data with scikit-learn; it is only natural that the same concern becomes even more relevant in overparametrized regimes such as LLMs (see the resampling sketch below).
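
A toy illustration of the first point (made-up numbers, no actual model involved): token-level loss gives no partial credit for being numerically close, while the shipping use case cares exactly about numeric closeness.

```python
# Toy only: contrast "were the emitted tokens right?" with "how far off was the number?".
target = 20
for guess in (19, 25, 900):
    tokens_correct = str(guess) == str(target)  # roughly what a token-level objective rewards
    numeric_error = abs(guess - target)         # what the use case actually cares about
    print(f"guess={guess:>3}  tokens_correct={tokens_correct}  |error|={numeric_error}")
```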
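
And for the second point, a minimal sketch of the classic rebalancing move using sklearn.utils.resample; the table and the $50 threshold are assumptions for illustration, not the course data.

```python
# Sketch only: upsample the under-represented "expensive" rows so the model
# isn't rewarded for always guessing small values. Data and threshold are made up.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "description": ["pen", "notebook", "desk", "laptop", "sticker", "mug"],
    "value": [2, 5, 150, 900, 1, 8],
})

cheap = df[df["value"] < 50]
expensive = df[df["value"] >= 50]

# Upsample the minority group to the size of the majority group.
expensive_up = resample(expensive, replace=True, n_samples=len(cheap), random_state=0)
balanced = pd.concat([cheap, expensive_up]).sample(frac=1, random_state=0)  # shuffle
print(balanced["value"].describe())
```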