*Relevant papers and projects at the end.*
Coming into AI headfirst from scaling laws, I immediately had the thought that in order to break the threat of “running out of data”, we have to start making progress on data synthesis.
- However, is the threat of “running out of data” real, and is it warranted? Yes and yes, but my belief right after reading “chinchilla’s wild implications” was materially different from my current belief about data synthesis and scaling laws.
There are two fronts of building a good neural-network-based local assistant on which synthetic data will materially accelerate progress:
- cross-modality
- cross-task

Tackling each of them requires different areas of expertise.
Side note: I feel like anyone saying “synthetic data cannot …” has fundamentally forgotten that chain of thought is a thing, or for that matter any other prompt engineering technique.
Two implicit benefits of synthetic data
On synthetic data, Hanchi Sun said it can:
- expand the solution manifold of inverse problems, many of which are NP-hard
- inject inductive bias through data augmentation heuristics (a toy sketch of this follows below).
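A toy sketch of point 2, under assumptions of my own choosing (that letter case and surrounding whitespace are label-preserving for the task at hand): each augmentation heuristic is effectively a statement about which transformations the model should be invariant to.

```python
import random
from typing import List, Tuple

def augment_with_heuristics(
    examples: List[Tuple[str, str]], seed: int = 0
) -> List[Tuple[str, str]]:
    """Each heuristic encodes an invariance we *assume* the task has:
    here, letter case and surrounding whitespace do not change the label."""
    rng = random.Random(seed)
    augmented = list(examples)
    for text, label in examples:
        augmented.append((text.lower(), label))      # case invariance
        augmented.append((text.upper(), label))
        pad = " " * rng.randint(1, 3)
        augmented.append((pad + text + pad, label))  # whitespace invariance
    return augmented
```

The heuristics themselves are placeholders; the point is that whatever transformations you choose here is the inductive bias you inject.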
Specifically about point 1: with a robust, well-defined, and liberal synthetic data pipeline, we can bring value to previously unusable raw data that could not fit into conventional learning systems.
- For example, if we want to build a system that can tell us what we did (a description of our actions) based on a sequence of screenshots, we can increase a model’s capacity to do so by synthesizing such data with existing capable models: first generate a description of each screenshot, then combine the per-image descriptions into one prompt and let a model generate (describe, if you want to anthropomorphize it) what the user did.
- The pipeline involves a good image→text model and a very good text→text model, since the actions can simply be described in text (see the sketch after this list).
- This differs from several current software systems that capture screenshots at discrete time points and use them directly as context.
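A minimal sketch of that two-stage pipeline, with the image→text and text→text models left as injected callables; the function names and interfaces here are placeholders of my own, not any particular API.

```python
from typing import Callable, List

def synthesize_action_description(
    screenshots: List[bytes],
    caption_model: Callable[[bytes], str],   # hypothetical image -> text model wrapper
    text_model: Callable[[str], str],        # hypothetical text -> text model wrapper
) -> str:
    """Two-stage synthesis: caption each screenshot, then summarize the sequence."""
    # Stage 1: image -> text. Describe what is on screen in each frame.
    captions = [caption_model(img) for img in screenshots]

    # Stage 2: text -> text. Combine the per-frame descriptions into one prompt
    # and ask the model to infer what the user did across the sequence.
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(captions))
    prompt = (
        "Below are descriptions of consecutive screenshots of a user's screen.\n"
        f"{numbered}\n\n"
        "Describe, step by step, what the user did during this sequence."
    )
    return text_model(prompt)

# The resulting (screenshot sequence, action description) pairs become
# synthetic training data for a smaller local assistant.
```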
Another thought: data synthesis that feeds into generative models helps the model learn to navigate more of the optimization landscape. Getting or curating high-quality synthetic data is a search problem (as much a search problem as a “learning problem”, as Prof. Percy Liang put it), because we are practically activating as wide an area as the model allows. One objection is that some of these areas are actually negatives; but then we can use them as negative samples for RLAIF datasets, on which I’m aware there are already abundant publications.
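A minimal sketch of that idea, assuming a hypothetical candidate sampler and a hypothetical AI judge (neither tied to any particular library): rank the synthetic candidates per prompt and keep the worst one as the “rejected” side of a preference pair.

```python
from typing import Callable, Dict, List

def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str], List[str]],  # hypothetical sampler: several candidates per prompt
    judge: Callable[[str, str], float],    # hypothetical AI judge: scores (prompt, response)
) -> List[Dict[str, str]]:
    """Turn the 'bad regions' reached during synthesis into RLAIF preference data."""
    pairs = []
    for prompt in prompts:
        candidates = generate(prompt)
        if len(candidates) < 2:
            continue
        scored = sorted(candidates, key=lambda r: judge(prompt, r))
        # Lowest-scored candidate becomes the rejected (negative) example,
        # highest-scored becomes the chosen one -- the usual preference-pair format.
        pairs.append({"prompt": prompt, "chosen": scored[-1], "rejected": scored[0]})
    return pairs
```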
Related Work and Repositories
- Self-Rewarding LMs: https://github.com/lucidrains/self-rewarding-lm-pytorch
- Online-RLHF: https://github.com/RLHFlow/Online-RLHF
- Self-Align (StarCoder2): https://github.com/bigcode-project/starcoder2-self-align
- Cosmopedia: https://github.com/huggingface/cosmopedia
- Synthesizing higher-quality pretraining data by paraphrasing: https://arxiv.org/pdf/2401.16380