tags:

  • “#project-ideas”
  • in-progress

I should be able to at least tell myself “I know RLHF,” and right now I cannot, because I haven’t trained a small model with PPO and DPO.

Implementing both algorithms in JAX should be a reasonable downstream of taking CSE 257 at UCSD.

Objectives

  1. Implement an efficient RLHF training pipeline for fine-tuning large language models.
  2. Address the challenges of high computational cost, long training times, and the need for multiple models during PPO (Proximal Policy Optimization) training.
  3. Explore techniques to align the reward model with human preferences while keeping it consistent with the base model.
  4. Optimize the data format and preprocessing for PPO training.

Steps

  1. Data Preparation:

    • Collect a diverse dataset of human-generated text and corresponding feedback/ratings.
    • Preprocess the data, handling text cleaning, tokenization, and formatting suitable for RLHF training.
    • Develop efficient data loading and batching techniques to handle large-scale datasets (a minimal batching sketch follows after this list).
  2. Reward Model Training:

    • Train a reward model using supervised learning on the human feedback data (see the pairwise-loss sketch after this list).
    • Experiment with different architectures (e.g., transformer-based) and techniques to align the reward model with human preferences.
    • Evaluate the reward model’s performance and calibrate it to provide meaningful rewards.
  3. Efficient PPO Training:

    • Implement a PPO training loop for fine-tuning the base language model using the reward model (see the clipped-objective sketch after this list).
    • Explore techniques to reduce the computational cost and training time, such as:
      • Gradient accumulation and mixed-precision training to utilize GPU memory efficiently.
      • Parallel training across multiple GPUs or distributed training across multiple machines.
      • Techniques like KL-divergence-based early stopping to speed up convergence.
    • Investigate ways to reduce the number of models needed during PPO training, such as parameter sharing or model distillation.
  4. Efficient DPO Training:

    • Implement a DPO training loop as a comparison to PPO (see the DPO loss sketch after this list).
    • Also check IPO/KTO; this is a good reminder for me to sort the paper folders, or at least extract the RLHF papers. I mean, the Bai and Ouyang papers are there.
  5. Iterative Refinement:

    • Implement a feedback loop to collect additional human feedback on the fine-tuned model’s outputs.
    • Use this feedback to further refine the reward model and iterate on the PPO training process.
    • Develop metrics and evaluation strategies to assess the model’s alignment with human preferences and its performance on downstream tasks.
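
Sketch for step 1: one way preference pairs could be padded and batched. The whitespace toy_tokenize and the (prompt, chosen, rejected) triples are placeholders, not a real tokenizer or dataset.

```python
# Minimal batching sketch for preference data, assuming a toy whitespace
# "tokenizer"; a real pipeline would use a trained tokenizer instead.
import numpy as np

def toy_tokenize(text, vocab):
    # Map whitespace tokens to ids, adding unseen tokens to the vocab.
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

def pad_batch(sequences, pad_id=0):
    # Right-pad variable-length sequences into a rectangular array plus mask.
    max_len = max(len(s) for s in sequences)
    ids = np.full((len(sequences), max_len), pad_id, dtype=np.int32)
    mask = np.zeros((len(sequences), max_len), dtype=np.float32)
    for i, seq in enumerate(sequences):
        ids[i, : len(seq)] = seq
        mask[i, : len(seq)] = 1.0
    return ids, mask

# Hypothetical preference triples: (prompt, chosen response, rejected response).
data = [
    ("explain PPO", "PPO clips the policy ratio to stay near the old policy", "PPO is a fruit"),
    ("what is DPO", "DPO optimizes preferences directly without a reward model", "no idea"),
]
vocab = {"<pad>": 0}
chosen = [toy_tokenize(p + " " + c, vocab) for p, c, _ in data]
rejected = [toy_tokenize(p + " " + r, vocab) for p, _, r in data]
chosen_ids, chosen_mask = pad_batch(chosen)
rejected_ids, rejected_mask = pad_batch(rejected)
print(chosen_ids.shape, rejected_ids.shape)
```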
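
Sketch for step 2: the pairwise (Bradley-Terry) loss that reward models are commonly trained with. The reward_fn here is a made-up linear scorer standing in for a transformer; shapes, parameters, and batches are all illustrative.

```python
# Minimal JAX sketch of a pairwise reward-model loss.
import jax
import jax.numpy as jnp

def reward_fn(params, ids, mask):
    # Toy scorer: embed token ids, mean-pool over valid positions, project to a scalar.
    emb = params["embed"][ids]                                   # (batch, len, dim)
    pooled = (emb * mask[..., None]).sum(1) / mask.sum(1, keepdims=True)
    return pooled @ params["w"]                                  # (batch,)

def pairwise_loss(params, chosen, rejected):
    # Bradley-Terry objective: -log sigmoid(r_chosen - r_rejected), batch-averaged.
    r_c = reward_fn(params, chosen["ids"], chosen["mask"])
    r_r = reward_fn(params, rejected["ids"], rejected["mask"])
    return -jnp.mean(jax.nn.log_sigmoid(r_c - r_r))

key = jax.random.PRNGKey(0)
params = {
    "embed": jax.random.normal(key, (1000, 16)),  # toy vocab of 1000, dim 16
    "w": jax.random.normal(key, (16,)),
}
chosen = {"ids": jnp.ones((4, 8), dtype=jnp.int32), "mask": jnp.ones((4, 8))}
rejected = {"ids": jnp.full((4, 8), 2, dtype=jnp.int32), "mask": jnp.ones((4, 8))}
loss, grads = jax.value_and_grad(pairwise_loss)(params, chosen, rejected)
print(loss)
```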
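
Sketch for step 3: the clipped PPO surrogate with a per-token KL penalty against a frozen reference policy. Many RLHF implementations fold the KL term into the reward instead; here it is added to the loss for brevity, and all inputs and coefficients (clip_eps, kl_coef) are placeholders.

```python
# Minimal sketch of the clipped PPO objective with a KL penalty, over per-token
# log-probs. A real loop would produce these from the policy, the value head,
# and the reward model.
import jax.numpy as jnp

def ppo_loss(logp_new, logp_old, logp_ref, advantages, mask,
             clip_eps=0.2, kl_coef=0.1):
    # Probability ratio between the current and the behavior (rollout) policy.
    ratio = jnp.exp(logp_new - logp_old)
    # Clipped surrogate objective (we minimize the negative of it).
    unclipped = ratio * advantages
    clipped = jnp.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    pg_loss = -jnp.minimum(unclipped, clipped)
    # Simple per-token KL penalty keeping the policy near the frozen reference.
    kl_penalty = kl_coef * (logp_new - logp_ref)
    per_token = (pg_loss + kl_penalty) * mask
    return per_token.sum() / mask.sum()

# Toy shapes: batch of 2 responses, 5 tokens each.
B, T = 2, 5
loss = ppo_loss(
    logp_new=jnp.full((B, T), -1.0),
    logp_old=jnp.full((B, T), -1.1),
    logp_ref=jnp.full((B, T), -1.2),
    advantages=jnp.ones((B, T)),
    mask=jnp.ones((B, T)),
)
print(loss)
```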
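
Sketch for step 4: the DPO loss from Rafailov et al., written over per-sequence log-probabilities. The numbers and beta below are toy values.

```python
# Minimal sketch of the DPO loss:
# -log sigmoid(beta * ((logp_c - logp_ref_c) - (logp_r - logp_ref_r))).
import jax
import jax.numpy as jnp

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # Implicit reward margins relative to the frozen reference model.
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_logratio - rejected_logratio)
    return -jnp.mean(jax.nn.log_sigmoid(logits))

# Toy per-sequence log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(
    policy_chosen_lp=jnp.array([-10.0, -12.0, -9.0, -11.0]),
    policy_rejected_lp=jnp.array([-11.0, -13.0, -10.0, -12.0]),
    ref_chosen_lp=jnp.array([-10.5, -12.5, -9.5, -11.5]),
    ref_rejected_lp=jnp.array([-10.8, -12.8, -9.8, -11.8]),
)
print(loss)
```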

Interlude:

  • This is a good place to actually reconcile with Constitutional AI. It is ironic but intuitive, and actually quite reasonable, that a safety company pioneered AI feedback. But then, Paul Christiano’s work indirectly led to ChatGPT, so who am I kidding.
  6. Deployment and Inference Optimization:

    • Optimize the fine-tuned model for efficient inference, considering aspects like model compression, quantization, and parallel inference (see the int8 quantization sketch at the end of this section).
    • Develop an API or user interface to allow easy interaction with the trained model.
    • Conduct thorough testing and evaluation of the deployed model to ensure its robustness and performance.
  7. Documentation and Presentation:

    • Document the project extensively, including the methodology, experiments, results, and lessons learned.
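
Sketch for the deployment step: symmetric per-channel int8 weight quantization with on-the-fly dequantization. A real deployment would lean on a proper quantization or serving library rather than this hand-rolled version; the weights and inputs below are toy arrays.

```python
# Minimal sketch of symmetric per-channel int8 weight quantization in JAX.
import jax.numpy as jnp

def quantize_int8(w):
    # Per-output-channel symmetric scale so that max |w| maps to 127.
    scale = jnp.max(jnp.abs(w), axis=0, keepdims=True) / 127.0
    q = jnp.clip(jnp.round(w / scale), -127, 127).astype(jnp.int8)
    return q, scale

def int8_matmul(x, q, scale):
    # Dequantize on the fly; a real kernel would use an integer matmul instead.
    return x @ (q.astype(jnp.float32) * scale)

w = jnp.linspace(-1.0, 1.0, 12).reshape(4, 3)   # toy weight matrix
x = jnp.ones((2, 4))                            # toy activations
q, scale = quantize_int8(w)
print(jnp.max(jnp.abs(int8_matmul(x, q, scale) - x @ w)))  # quantization error
```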