systematic-errors

Here is 1 public repository matching this topic...

naholav / claude_4_sonnet_math_evaluation

Comprehensive evaluation of Claude 4 Sonnet's mathematical assessment capabilities: 500 original problems revealing JSON-induced errors and systematic patterns in LLM evaluation tasks. The model reaches 100% accuracy when judging incorrect answers but only 84.3% when judging correct ones, which the research attributes to premature decision-making in the JSON structure.

nlp benchmark machine-learning research artificial-intelligence dataset nlp-machine-learning evaluation-metrics cognitive-dissonance mathematical-reasoning llm-evaluation ai-assessment claude-4-sonnet json-bias systematic-errors

Updated Jul 7, 2025
HTML
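
The description above reports accuracy separately for answers that were actually correct and answers that were actually incorrect. A minimal sketch of that kind of per-class breakdown follows. The function name `accuracy_by_truth`, the record layout, and the toy data are all assumptions made for illustration; none of it is taken from the repository.

```python
from collections import defaultdict


def accuracy_by_truth(records):
    """Group grading records by ground truth and return the grader's accuracy per group.

    Each record is a (truth, verdict) pair of booleans:
    truth   = the answer was actually correct
    verdict = the grader judged the answer to be correct
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for truth, verdict in records:
        totals[truth] += 1
        if verdict == truth:
            hits[truth] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical toy data (NOT the repository's results):
# four correct answers, one of them wrongly rejected, and two incorrect answers.
toy_records = [
    (True, True), (True, True), (True, True), (True, False),
    (False, False), (False, False),
]

per_class = accuracy_by_truth(toy_records)
print(per_class)
```

Splitting the score this way is what makes an asymmetry like the one in the description visible: a single overall accuracy figure would average the two groups together and hide which kind of answer the grader gets wrong.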
