A curated list of papers on reinforcement learning for video generation
Updated Oct 10, 2025
[ICLR 2024] SemiReward: A General Reward Model for Semi-supervised Learning
A comprehensive collection of work on learning from rewards in the post-training and test-time scaling of LLMs, covering both reward models and learning strategies across training, inference, and post-inference stages.
A curated list of awesome resources on reward construction for AI agents. This repository covers cutting-edge research and practical guides on defining and collecting rewards to build more intelligent and aligned AI agents.
An official implementation of "SPARK: Synergistic Policy And Reward Co-Evolving Framework"
A fuzzy reward model paired with GRPO to improve VLMs' crowd-counting abilities.
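GRPO's core step, normalizing each sampled response's reward against its group, is compact enough to sketch. A minimal PyTorch illustration (the reward values are illustrative stand-ins, not taken from the repository):

```python
# Minimal sketch of GRPO's group-relative advantage, assuming one fuzzy
# reward in [0, 1] per sampled response. Values are illustrative.
import torch

# Rewards for G responses sampled for the same prompt.
group_rewards = torch.tensor([0.9, 0.4, 0.7, 0.1])

# GRPO replaces a learned critic by normalizing each reward
# against its own group's mean and standard deviation.
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
print(advantages)  # higher-than-average rewards get positive advantage
```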
This repository contains the lab work for the Coursera course "Generative AI with Large Language Models".
Official PyTorch Implementation for the "RewardSDS: Aligning Score Distillation via Reward-Weighted Sampling" paper!
Code for ICML 2025 paper "GRAM: A Generative Foundation Reward Model for Reward Generalization"
ACL'25: Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch
Developing an LLM response-ranking reward model using RLHF, except the preference feedback comes from GPT-3.5 instead of humans.
A curated list of research papers, models, and resources related to R1-style reasoning models following DeepSeek-R1's breakthrough in January 2025.
This project implements a multi-module reward model training system designed to improve agent performance on complex multi-turn tasks through further RL fine-tuning.
This project implements a mini LLM alignment pipeline using Reinforcement Learning from Human Feedback (RLHF). It includes training a reward model from human-annotated preference data, fine-tuning the language model via policy optimization, and performing ablation studies to evaluate robustness, fairness, and alignment trade-offs.
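The reward-model stage of such a pipeline typically minimizes a pairwise Bradley-Terry loss over preference pairs. A minimal, self-contained PyTorch sketch, with encoder outputs replaced by random stand-ins (the module and dimensions are hypothetical, not this project's code):

```python
# Pairwise (Bradley-Terry) reward-model loss: the preferred response
# should receive a higher scalar score than the rejected one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Maps a sequence embedding to a scalar reward (hypothetical)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.score(h).squeeze(-1)  # shape: (batch,)

head = RewardHead(768)

# Stand-ins for pooled encoder states of chosen / rejected responses.
h_chosen, h_rejected = torch.randn(8, 768), torch.randn(8, 768)

# loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch.
loss = -F.logsigmoid(head(h_chosen) - head(h_rejected)).mean()
loss.backward()
```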
Compose, train and test fast LLM routers
A reward model to evaluate machine translations, focusing on English-to-Spanish sentence pairs, with applications in natural language processing (NLP), translation quality assessment, and multilingual content adaptation
Fine-tuning FLAN-T5 with PPO and PEFT to generate less toxic text summaries. This notebook leverages Meta AI's hate speech reward model and utilizes RLHF techniques for improved safety.
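In that setup the classifier's "nothate" logit typically serves as the scalar PPO reward. A hedged sketch using Meta's released hate-speech checkpoint; the label lookup is written defensively in case the label name differs, and the rest is an illustrative assumption rather than the notebook's exact code:

```python
# Scoring text with a hate-speech classifier as an RLHF reward signal.
# The checkpoint is Meta's Dynabench round-4 model; everything else is
# an illustrative assumption.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "facebook/roberta-hate-speech-dynabench-r4-target"
tok = AutoTokenizer.from_pretrained(ckpt)
clf = AutoModelForSequenceClassification.from_pretrained(ckpt)

text = "This summary is neutral and informative."
with torch.no_grad():
    logits = clf(**tok(text, return_tensors="pt")).logits

# Use the 'nothate' logit as the reward; fall back to index 0 if the
# label name differs in this checkpoint's config.
idx = clf.config.label2id.get("nothate", 0)
reward = logits[0, idx].item()
```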