tags:
- "#project-ideas"
- in-progress
I should be able to at least tell myself “I know RLHF,” and right now I cannot, because I haven’t trained a small model with PPO and DPO.
Implementing both algorithms in JAX should be a reasonable downstream project of taking CSE 257 at UCSD.
Objectives
- Implement an efficient RLHF training pipeline for fine-tuning large language models.
- Address the challenges of high computational cost, long training times, and the need for multiple models during PPO (Proximal Policy Optimization) training.
- Explore techniques to align the reward model with human knowledge and the base model.
- Optimize the data format and preprocessing for PPO training.
Steps
- Data Preparation:
- Collect a diverse dataset of human-generated text and corresponding feedback/ratings.
- Preprocess the data, handling text cleaning, tokenization, and formatting suitable for RLHF training.
- Develop efficient data loading and batching techniques to handle large-scale datasets (a minimal preference-pair batching sketch follows this list).
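A minimal sketch of what the batching could look like, assuming the feedback arrives as (prompt, chosen, rejected) text triples and some tokenizer with an `encode` method; `make_batch`, `PAD_ID`, and `MAX_LEN` are placeholders I'm inventing here, not anything from an existing library.

```python
import numpy as np

PAD_ID = 0      # assumed pad token id
MAX_LEN = 512   # assumed context length

def pad(ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Truncate/pad a token id list to a fixed length."""
    ids = list(ids)[:max_len]
    return np.array(ids + [pad_id] * (max_len - len(ids)), dtype=np.int32)

def make_batch(triples, tokenizer):
    """Turn (prompt, chosen, rejected) triples into fixed-shape id arrays
    plus attention masks, ready to hand to a JAX model."""
    chosen, rejected = [], []
    for prompt, good, bad in triples:
        chosen.append(pad(tokenizer.encode(prompt + good)))
        rejected.append(pad(tokenizer.encode(prompt + bad)))
    chosen, rejected = np.stack(chosen), np.stack(rejected)
    return {
        "chosen_ids": chosen,
        "chosen_mask": (chosen != PAD_ID).astype(np.int32),
        "rejected_ids": rejected,
        "rejected_mask": (rejected != PAD_ID).astype(np.int32),
    }
```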
- Reward Model Training:
- Train a reward model using supervised learning on the human feedback data (pairwise-loss sketch after this list).
- Experiment with different architectures (e.g., transformer-based) and techniques to align the reward model with human preferences.
- Evaluate the reward model’s performance and calibrate it to provide meaningful rewards.
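For the training objective itself, a minimal sketch of the Bradley-Terry pairwise loss in JAX, assuming a `reward_fn(params, ids, mask)` that returns one scalar score per sequence; `reward_fn` stands in for whatever transformer head I end up writing.

```python
import jax
import jax.numpy as jnp

def pairwise_loss(params, reward_fn, batch):
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    r_chosen = reward_fn(params, batch["chosen_ids"], batch["chosen_mask"])
    r_rejected = reward_fn(params, batch["rejected_ids"], batch["rejected_mask"])
    loss = -jax.nn.log_sigmoid(r_chosen - r_rejected)
    # pairwise accuracy doubles as a cheap sanity/calibration check
    acc = jnp.mean((r_chosen > r_rejected).astype(jnp.float32))
    return jnp.mean(loss), acc

# gradients: jax.value_and_grad(pairwise_loss, has_aux=True)(params, reward_fn, batch)
```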
- Efficient PPO Training:
- Implement a PPO training loop for fine-tuning the base language model using the reward model (loss sketch after this list).
- Explore techniques to reduce the computational cost and training time, such as:
- Gradient accumulation and mixed-precision training to utilize GPU memory efficiently.
- Parallel training across multiple GPUs or distributed training across multiple machines.
- Techniques like KL-divergence-based early stopping, to stop updating once the policy drifts too far from the reference model.
- Investigate ways to reduce the number of models needed during PPO training, such as parameter sharing or model distillation.
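A sketch of the core policy loss, i.e. the clipped PPO surrogate with a per-token KL penalty against the frozen reference model, in JAX. The shapes, `logprobs_fn`, and the hyperparameters are my own assumptions; the value head, GAE, and the option of folding the KL into the reward before computing advantages are left out for brevity. The returned mean KL is what the early-stopping check above would watch.

```python
import jax.numpy as jnp

CLIP_EPS = 0.2  # standard PPO clipping range
KL_COEF = 0.1   # weight on the KL penalty to the reference model

def ppo_policy_loss(params, logprobs_fn, batch):
    """batch: response ids/mask, old & reference per-token log-probs, advantages."""
    new_logp = logprobs_fn(params, batch["ids"], batch["mask"])    # [B, T]
    ratio = jnp.exp(new_logp - batch["old_logp"])                  # pi_new / pi_old

    adv = batch["advantages"]                                      # [B, T]
    clipped = jnp.clip(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS) * adv
    pg_loss = -jnp.minimum(ratio * adv, clipped)

    # approximate per-token KL(policy || reference) on the sampled tokens
    kl = new_logp - batch["ref_logp"]

    mask = batch["mask"]
    loss = jnp.sum((pg_loss + KL_COEF * kl) * mask) / jnp.sum(mask)
    mean_kl = jnp.sum(kl * mask) / jnp.sum(mask)
    return loss, mean_kl
```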
- Efficient DPO Training:
- Implement a DPO training loop as a comparison to PPO (loss sketch after this list).
- Also check IPO and KTO; this is a good reminder to sort my paper folders, or at least extract the RLHF papers. I mean, Bai and Ouyang are already there.
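The DPO objective is compact enough to write down directly. A sketch in JAX following Rafailov et al., where the inputs are summed response log-probs under the current policy and the frozen reference model; how those are computed is up to my model code, and β is just a typical value, not a tuned one.

```python
import jax
import jax.numpy as jnp

BETA = 0.1  # DPO temperature; 0.1-0.5 is the commonly cited range

def dpo_loss(policy_chosen_lp, policy_rejected_lp, ref_chosen_lp, ref_rejected_lp):
    """-log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_margin = policy_chosen_lp - ref_chosen_lp        # implicit reward of chosen
    rejected_margin = policy_rejected_lp - ref_rejected_lp  # implicit reward of rejected
    return -jnp.mean(jax.nn.log_sigmoid(BETA * (chosen_margin - rejected_margin)))
```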
- Iterative Refinement:
- Implement a feedback loop to collect additional human feedback on the fine-tuned model’s outputs.
- Use this feedback to further refine the reward model and iterate on the PPO training process.
- Develop metrics and evaluation strategies to assess the model’s alignment with human preferences and its performance on downstream tasks (one cheap drift metric is sketched below).
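One cheap metric to track between refinement rounds is how far the fine-tuned policy has drifted from the base model on a held-out prompt set, alongside the reward-model score. A sketch, assuming I already have logits from both models for the same sampled tokens; everything here is a placeholder, not a fixed design.

```python
import jax
import jax.numpy as jnp

def kl_drift(policy_logits, base_logits, mask):
    """Mean per-token KL(policy || base) over non-padding positions."""
    p_logp = jax.nn.log_softmax(policy_logits, axis=-1)
    b_logp = jax.nn.log_softmax(base_logits, axis=-1)
    kl = jnp.sum(jnp.exp(p_logp) * (p_logp - b_logp), axis=-1)  # [B, T]
    return jnp.sum(kl * mask) / jnp.sum(mask)
```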
Interlude:
- This is a good place to actually reconcile with Constitutional AI. It is ironic, but intuitive and actually quite reasonable, that a safety company was the one to pioneer AI feedback. But then, Paul Christiano’s work indirectly led to ChatGPT, so who am I kidding.
- Deployment and Inference Optimization:
- Optimize the fine-tuned model for efficient inference, considering aspects like model compression, quantization, and parallel inference (int8 quantization sketch after this list).
- Develop an API or user interface to allow easy interaction with the trained model.
- Conduct thorough testing and evaluation of the deployed model to ensure its robustness and performance.
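For the quantization piece, a sketch of the textbook symmetric per-channel int8 scheme for weight matrices, not tied to any particular serving stack; activations stay in float here, and the real savings would come from a fused int8 kernel rather than this dequantize-on-the-fly version.

```python
import jax.numpy as jnp

def quantize_int8(w):
    """Quantize a [in, out] weight matrix per output channel."""
    scale = jnp.maximum(jnp.max(jnp.abs(w), axis=0), 1e-8) / 127.0   # [out]
    q = jnp.clip(jnp.round(w / scale), -127, 127).astype(jnp.int8)
    return q, scale

def int8_matmul(x, q, scale):
    """Dequantize on the fly: x @ w ≈ (x @ q) * scale."""
    return (x @ q.astype(jnp.float32)) * scale
```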
- Documentation and Presentation:
- Document the project extensively, including the methodology, experiments, results, and lessons learned.