TODO

  • clearly define “alignment” relative to the existing alignment literature (medium)

Motivations

I often have conversations with the strongest chat models (ChatGPT, Claude, or Gemini) that make me go “can’t you just say xxx is xxx, which is xxx?”, that kind of concise answer. To that end, I have tried adding custom prompts where I can (ChatGPT lol) that say “be terse and concise, no yapping,” and sure enough it works to a certain extent, but the outcome is still left to chance. And I want to leave as little to chance as possible.

  • the alternative to doing this would be copy-pasting “be direct and technical” every time, but that’s not exactly good UX, right?
  • plus, it would be awesome if the model knew when to be elaborate and when to be concise… I think this means aligning one model to some black-box “preference” is fundamentally a solution built on a false premise: your helpfulness isn’t my helpfulness!
  • this is the driving reason I believe in individual alignment, not a general one: I simply do not believe in having some entity tell me via RLHF that “you’re supposed to converse this way.”
    • knowing how hard it is to push back against this pattern, I still decided this will be the most important thing to work on in the coming years.

Furthermore, having a model adopt a relatively narrow conversational style is analogous to giving it (or letting it acquire through training) a personality, contrary to the modern LLM alignment approach of shipping a dehumanized assistant. Didn’t Pi aim for exactly what I thought would be nice to have? They flopped hard; maybe they wouldn’t have if I had interned there… but then again I didn’t get to write this out of my latent space until 2024.

Related Work

Paper 3 has profound implications for providing an artificial system with arbitrarily real “internal monologues,” which could be an efficient way of distilling personality traits into a model.

AlpaGasus prompts a strong API LLM, i.e., ChatGPT, to produce a score for each (instruction, input, response) triplet. …bruh, not a good start to this direction.

Experiment

We can do something with the Capybara dataset by LDJ. The dataset is a great starting point because it contains long, high-quality conversational turns. In reality though, we don’t all talk like that. So, on the principle that a deep-learning-based model needs to be trained on this kind of data, it also needs to be trained not to output it in real inference situations: that is, if we keep building human-centered systems, which I intend to do.
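
Just to ground this, a minimal loading sketch: the Hugging Face dataset id LDJnr/Capybara and the conversation/input/output field names are my assumptions about the schema, so double-check them against the actual dataset card.

```python
# Minimal sketch: load Capybara and eyeball conversation lengths.
# Assumptions: the dataset lives at "LDJnr/Capybara" on the Hugging Face Hub and
# each row has a "conversation" list of {"input", "output"} turns.
from datasets import load_dataset

ds = load_dataset("LDJnr/Capybara", split="train")

def n_chars(example) -> int:
    return sum(
        len(turn.get("input", "")) + len(turn.get("output", ""))
        for turn in example["conversation"]
    )

lengths = sorted(n_chars(ex) for ex in ds)
print(f"{len(ds)} conversations, median length ≈ {lengths[len(lengths) // 2]} chars")
```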

The high-level idea is to ask a capable but lobotomized model to rewrite each conversation into a shorter representation or conclusive output of the longer conversation; we also state in the prompt that the new, shorter dialogue will be used to help the model internalize the longer thoughts as inner monologues. A rough sketch follows the bullet below.

  • yes, I am doing that by prompting. I am tired of merry-go-rounding “but prompting is just betting on the model’s capability” (which I always tell myself anyway). I AM MAKING THE BET!
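
A minimal sketch of that rewriting call, assuming an OpenAI-compatible chat API; the model id and the prompt wording here are placeholders for illustration, not the actual prompt.

```python
# Sketch: compress one long Capybara conversation into a short exchange,
# framing the long original as the model's future "inner monologue".
# Assumptions: an OpenAI-compatible API; model id and prompt text are placeholders.
from openai import OpenAI

client = OpenAI()

REWRITE_PROMPT = (
    "Rewrite the following conversation as a much shorter dialogue that keeps "
    "only the conclusive, directly useful content. The original, longer version "
    "will serve as the model's inner monologue during training, so the rewrite "
    "should read like the concise final reply a person actually wants to see."
)

def rewrite_conversation(conversation_text: str, model: str = "gpt-4o-mini") -> str:
    """Return a shortened rewrite of one conversation."""
    response = client.chat.completions.create(
        model=model,  # placeholder model id
        messages=[
            {"role": "system", "content": REWRITE_PROMPT},
            {"role": "user", "content": conversation_text},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content
```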

This means we should no longer distill only the longest 2,000 conversations!
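
To make that concrete, here is one way to sample across the length distribution instead of taking a top-2,000-by-length cut; the bucket count and field names are illustrative assumptions.

```python
# Sketch: sample uniformly across length buckets instead of keeping only the
# 2,000 longest conversations. Field names ("conversation", "input", "output")
# are assumptions about the Capybara schema.
import random

def conv_length(example) -> int:
    return sum(
        len(turn.get("input", "")) + len(turn.get("output", ""))
        for turn in example["conversation"]
    )

def sample_across_lengths(examples, n_total=2000, n_buckets=10, seed=0):
    rng = random.Random(seed)
    ranked = sorted(examples, key=conv_length)
    per_bucket = n_total // n_buckets
    picked = []
    for i in range(n_buckets):
        lo = i * len(ranked) // n_buckets
        hi = (i + 1) * len(ranked) // n_buckets
        bucket = ranked[lo:hi]
        picked.extend(rng.sample(bucket, min(per_bucket, len(bucket))))
    return picked
```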