*relevant papers and projects at the end

Coming into AI headfirst from the scaling laws, I immediately had the thought that in order to break past the threat of “running out of data”, we have to start making progress on data synthesis.

There are two fronts to building a good neural-network-based local assistant where synthetic data will materially accelerate progress:

  • cross-modality
  • cross-task

Tackling each of these requires different areas of expertise.

Side note: I feel like anyone saying “synthetic data cannot …” has fundamentally forgotten that chain-of-thought is a thing, or, for that matter, any other form of prompt engineering.

Two implicit benefits of synthetic data

On synthetic data, Hanchi Sun said it can

  1. expand the solution manifold of inverse problems, many of which are NP-hard
  2. inject inductive bias through data augmentation heuristics.
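To make point 2 concrete before turning to point 1: every augmentation heuristic is a statement about what the model should be invariant to. A minimal sketch, assuming a torchvision-style image pipeline; the specific transforms are illustrative, not prescriptive:

```python
# Each augmentation heuristic encodes an assumption (an inductive bias) about
# which transformations leave the label unchanged.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),  # assumes left-right symmetry is label-preserving
    transforms.RandomResizedCrop(224),       # assumes the object of interest survives cropping/rescaling
    transforms.ColorJitter(brightness=0.2),  # assumes the label is invariant to lighting changes
    transforms.ToTensor(),
])

# Applying `augment` to raw images yields synthetic training examples whose
# variation tells the model what it is allowed to ignore.
```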

Specifically about point 1: using a robust, well-defined, and liberal synthetic data pipeline, we can bring value to previously unusable raw data that could not fit into conventional learning systems.

  • For example, suppose we want to incept a system that can tell us what we did (a description of our actions) based on a sequence of screenshots. We can increase a model’s capacity to do this by synthesizing such data with existing capable models: first generate a description of each image, then combine the per-image situation descriptions into one prompt and let a model generate (describe, if you want to anthropomorphize it) what the user did; see the sketch after this list.
  • The pipeline involves a good image-to-text model and a very good text-to-text model, as we can just consider the actions to be described in text!
  • This differs from several current software tools and systems that capture screenshots at discrete time points and use them as context.
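Here is a minimal sketch of that two-stage pipeline. The helpers `caption_image` and `complete` are hypothetical stand-ins for whichever image-to-text and text-to-text models you have access to; only the structure of the pipeline is the point:

```python
from typing import List

def caption_image(image_path: str) -> str:
    """Hypothetical wrapper around an image-to-text model (e.g., a captioning VLM)."""
    raise NotImplementedError("plug in your image-to-text model here")

def complete(prompt: str) -> str:
    """Hypothetical wrapper around a text-to-text model."""
    raise NotImplementedError("plug in your text-to-text model here")

def describe_user_actions(screenshots: List[str]) -> str:
    # Stage 1: image -> text, one situation description per screenshot.
    descriptions = [caption_image(path) for path in screenshots]

    # Stage 2: text -> text, combine the per-image descriptions into one prompt
    # and ask the model to narrate what the user did across the sequence.
    numbered = "\n".join(f"{i + 1}. {d}" for i, d in enumerate(descriptions))
    prompt = (
        "Below are descriptions of consecutive screenshots from a user's screen.\n"
        f"{numbered}\n"
        "Describe, step by step, what the user did across this sequence."
    )
    return complete(prompt)

# Each (screenshots, describe_user_actions(screenshots)) pair becomes a synthetic
# training example for a model that maps screenshot sequences to action descriptions.
```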

Another thought is that data synthesis that feeds into generative models helps the model learn to navigate more areas of the optimization landscape. Getting or curating high-quality synthetic data is a search problem (as much a search problem as it is a “learning problem”, as Prof. Percy Liang has said), because we are practically activating as wide an area as the model allows. One objection to this view is that some of these areas are actually negatives: but then we can negatively sample those regions for RLAIF datasets, on which I’m aware there are already abundant publications!
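A minimal sketch of that last step, assuming a judge (human rule, heuristic, or AI feedback) that flags the negative samples; the (prompt, chosen, rejected) schema is the common format for preference-tuning datasets, not any specific library’s API:

```python
from typing import Callable, Dict, List

def build_preference_pairs(
    prompt: str,
    samples: List[str],
    is_negative: Callable[[str], bool],  # judge flagging samples from "bad" regions
) -> List[Dict[str, str]]:
    """Pair good and bad synthetic generations for the same prompt into
    (prompt, chosen, rejected) records for preference tuning (e.g., RLAIF/DPO-style)."""
    good = [s for s in samples if not is_negative(s)]
    bad = [s for s in samples if is_negative(s)]
    return [
        {"prompt": prompt, "chosen": g, "rejected": b}
        for g in good
        for b in bad
    ]
```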

Related Work and Repositories

  • Self-Rewarding LMs: https://github.com/lucidrains/self-rewarding-lm-pytorch
  • Online-RLHF: https://github.com/RLHFlow/Online-RLHF
  • Self-Align: https://github.com/bigcode-project/starcoder2-self-align
  • Cosmopedia: https://github.com/huggingface/cosmopedia
  • Synthesizing higher-quality pretraining data by paraphrasing: https://arxiv.org/pdf/2401.16380