A comprehensive review of code-domain benchmarks for LLM research.
- 🔥🔥 [2025-09-22] Featured Benchmarks:
  - 🔥LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering from Salesforce AI Research
  - 🔥CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects from Ant Group
- 🔥🔥 [2025-08-29] Featured Benchmarks:
  - 🔥A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code from Tencent
  - 🔥GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging from UCAS
- 🔥🔥 [2025-08-22] Featured Benchmarks:
  - 🔥TRACY: Benchmarking Execution Efficiency of LLM-Based Code Translation from Peking University
  - 🔥BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models from University of Science and Technology of China
- 🔥🔥 [2025-08-16] Featured Benchmarks:
  - 🔥AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators from Hunyuan Team, Tencent
  - 🔥Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes from Beihang University
  - 🔥STEPWISE-CODEX-Bench: Evaluating Complex Multi-Function Comprehension and Fine-Grained Execution Reasoning from ByteDance
- 🔥🔥 [2025-07-23] Featured Benchmarks:
  - 🔥SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories? from Xi'an Jiaotong University and TikTok
  - 🔥CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks from ASUS Intelligent Cloud Services
  - 🔥Multilingual Multimodal Software Developer for Code Generation from Beihang University
  - 🔥CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance from Amazon Web Services
  - 🔥SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks from SberAI
  - 🔥IFEvalCode: Controlled Code Generation from Beihang University
  - 🔥Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security from Government Technology Agency
  - 🔥MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts? from University of Illinois Urbana-Champaign
  - 🔥Turning the Tide: Repository-based Code Reflection from Beihang University
- 🔥🔥 [2025-07-13] Featured Benchmarks:
  - 🔥CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks from Purdue University
  - 🔥ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation from Tencent Hunyuan Team
  - 🔥CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark from Shanghai Jiao Tong University
  - 🔥Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs from Provable Responsible AI and Data Analytics (PRADA) Lab
  - 🔥Model Editing for LLMs4Code: How Far are We? from National University of Defense Technology
  - 🔥VeriBench: Benchmarking Large Language Models for Verilog Code Generation and Design Synthesis from Indian Institute of Technology Gandhinagar
  - 🔥ResBench: Benchmarking LLM-Generated FPGA Designs with Resource Awareness from Imperial College London, United Kingdom
  - 🔥Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation from Chinese Academy of Sciences
- [2025-04-18] We added GitHub Stars for each benchmark.
- [2025-04-13] We added Code Security & Robustness benchmarks.
- [2025-04-06] We added Code Hallucination benchmarks.
- [2025-03-29] We have crawled all the articles related to code benchmarks from the past five years.
- [2025-03-17] We added Code Version (Version-specific code generation) benchmarks.
- [2025-03-16] A thorough review of code domain benchmarks for LLM research has been released.
- Code Completion & Code Generation
- Code Efficiency
- CodeFix & Bug-Fix
- Code Reasoning & Understanding
- Code Hallucination
- Data Science
- Text2SQL
- MultiModal Code Tasks
- Code Security & Robustness
- Code Translation
- Code Version
- Multi & Other Dimension
- Industry Code Generation
- Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents from Xi'an Jiaotong University
- Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks from Zhejiang University
- A Survey on Large Language Model Benchmarks from Shenzhen Key Laboratory for High Performance Data Mining
Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
---|---|---|---|---|
HALLUCODE | Exploring and Evaluating Hallucinations in LLM-Powered Code Generation | Arxiv 2024/04 | | |
Collu-Bench | Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code | Arxiv 2024/10 | | 🤗Dataset |
CodeHalu | CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification | AAAI 2025 | Github | 🤗Dataset |
APIHulBench | Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware | FSE 2025 | Github | |
THINK | THINK: Tackling API Hallucinations in LLMs via Injecting Knowledge | SANER 2025 | Github | 🤗Dataset |
Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
---|---|---|---|---|
DS-1000 | DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation | ICML 2023 | Github | 🤗Dataset 🏠HomePage |
ARCADE | Natural Language to Code Generation in Interactive Data Science Notebooks | ACL 2023 | Github | Dataset |
DA-Code | DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models | EMNLP 2024 | Github | 🤗Dataset 🌐Website |
MatPlotBench | MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization | ACL 2024 Findings | Github | 🤗Dataset |
DataSciBench | DataSciBench: An LLM Agent Benchmark for Data Science | Arxiv 2025/02 | Github | |
DSBench | DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? | ICLR 2025 | Github | 🤗Dataset |
DSCodeBench | DS-Bench: A Realistic Benchmark for Data Science Code Generation | Arxiv 2025/05 | Github | |