Recent work reports striking but counter-intuitive LLM behaviors (e.g., one-shot training rivals full-dataset performance, noisy rewards suffice, and negative-only samples beat sophisticated reward-based methods), yet the conditions enabling or breaking these claims remain opaque. To investigate them, we pose three research questions:
- How Critical is Reward Signal?
- Is One-shot Enough for RL to Work?
- Does RL Work with Only Negative Samples?
By answering these questions, we demonstrate that their validity hinges on a single factor: strong Model-Task Alignment, measured by the pretrained model’s pass@k on the target task. Across architectures and domains, we show that standard RL remains universally robust, whereas the surprising results above emerge only when alignment is already high; when it is low, the same techniques collapse while standard RL still succeeds.
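For reference, pass@k is commonly computed with the unbiased estimator of Chen et al. (2021): given n sampled completions per problem of which c are correct, it estimates the probability that at least one of k draws is correct. The sketch below is a minimal illustration of that estimator; the exact evaluation protocol used in our paper may differ.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of them correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Illustrative numbers only: 64 samples per problem, 16 correct, k = 8.
print(f"pass@8 = {pass_at_k(n=64, c=16, k=8):.3f}")
```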
You need to install verl first; we recommend following the official tutorial. After that, you can run the following command to install the other required dependencies:
pip install -r requirements.txt
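Once installation finishes, you can quickly confirm that the environment is usable by checking that the key packages import cleanly. The sketch below checks verl (required by this repository) and torch, which is assumed here as a dependency of verl; adjust the list to your setup.

```python
import importlib

# verl is required by this repo; torch is assumed as a dependency of verl.
for pkg in ("verl", "torch"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg} {getattr(mod, '__version__', '(version unknown)')} imported successfully")
    except ImportError as exc:
        print(f"FAILED to import {pkg}: {exc}")
```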
[Optional] You need to apply for access to the Llama 3.1 model parameters from here, which may take some time.
For logical tasks, we use SynLogic-Easy as the training data; you can download the complete dataset here. We provide tools to convert it into the format supported by verl:
python src/logic_data_process.py --sub_set easy --local_dir /your/data/path
For math tasks, we use DeepScaleR as the training set; run the following command to perform the format conversion:
python src/math_data_process.py --file_path /your/data/path/to/deepscaler.json
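After either conversion, it is worth sanity-checking the output before launching training. The snippet below is a minimal sketch that assumes the converters write verl-style parquet files to your output directory; the path and column names are placeholders, so adjust them to whatever your run actually produced.

```python
import pandas as pd

# Placeholder path: point this at the file produced by the conversion script.
df = pd.read_parquet("/your/data/path/train.parquet")

print(df.shape)             # number of examples and columns
print(df.columns.tolist())  # fields expected by verl, e.g. prompt / reward info
print(df.iloc[0])           # inspect one converted example end to end
```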
Our study covers three research questions and involves a large number of experiments. We provide scripts to reproduce them in the scripts directory, which is structured as follows:
scripts/
├── RQ1/
│   ├── qwen-math-ground_truth.sh
│   ├── qwen-logic-ground_truth.sh
│   ├── llama-math-ground-truth.sh
│   └── ...
├── RQ2/
│   ├── qwen-math.sh
│   ├── qwen-logic.sh
│   ├── llama-math.sh
│   └── llama-logic.sh
└── RQ3/
    ├── qwen-math-nsr.sh
    ├── qwen-math-psr.sh
    ├── qwen-logic-nsr.sh
    └── ...
You can examine the individual script files for more details. To run them successfully, you may need to adjust certain parts of the code, for example by providing the path to your own dataset.
If you find this repository helpful, feel free to cite our paper:
@misc{wu2025miragemethodmodeltaskalignment,
      title={Mirage or Method? How Model-Task Alignment Induces Divergent RL Conclusions},
      author={Haoze Wu and Cheng Wang and Wenshuo Zhao and Junxian He},
      year={2025},
      eprint={2508.21188},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.21188},
}
Our experiments are built on top of verl. In particular, we would like to thank SynLogic for providing the invaluable data and verifiers for the logical tasks.