Official repository of the ICML 2025 paper “Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination”
Updated Jun 30, 2025 - Python
Replication package for PROBE-SWE: a dynamic benchmark to generate, validate, and analyze data-induced cognitive biases in general-purpose AI (GPAI) on typical software-engineering dilemmas.