There are a few directions for OpenEvolve to take:
- Train LLMs to express uncertainty (or indeterminacy) more clearly, and don't penalize them outright for abstaining (an analogy is how SAT MCQs penalize guessing but not blank answers; see the scoring sketch after this list), OR better yet...
- Train LLMs to give situational/conditional answers, and build adaptive benchmarks around how those answers match up against ground truth. This would require a lot of work, since few benchmarks are this flexible.
- Train LLMs in agentic environments to stop overthinking, favor exploratory analysis, and ask tougher questions
- Discover effective toolkits that push LLM agents toward more investigative action, rather than overthinking and acting slowly
- Accelerate and remix the "sleep time" and "Intuitor" training methods, so that the target performance requires less compute, or the same compute yields better performance
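As a rough illustration of the first two points, here is a minimal sketch of an abstention-aware scoring rule. `score_answer` and the `UNSURE` abstain token are hypothetical names, and the SAT-style penalty is just one possible choice, not a proposal for any specific benchmark:

```python
# Minimal sketch (hypothetical names: `score_answer`, the `UNSURE` abstain token):
# reward correct answers, give a neutral score for an explicit abstention, and
# apply a SAT-style guessing penalty only to wrong answers, so that expressing
# uncertainty is never the worst-scoring option.
def score_answer(prediction: str, ground_truth: str,
                 n_choices: int = 4, abstain_token: str = "UNSURE") -> float:
    if prediction == abstain_token:
        return 0.0                       # uncertainty is not punished
    if prediction == ground_truth:
        return 1.0                       # full credit for a correct answer
    return -1.0 / (n_choices - 1)        # random guessing has expected value ~0


if __name__ == "__main__":
    print(score_answer("B", "B"))        # 1.0
    print(score_answer("UNSURE", "B"))   # 0.0
    print(score_answer("C", "B"))        # -0.333...
```

A conditional-answer variant would score each branch of an answer against the ground truth for the condition it names, but that needs benchmarks that actually record those conditions.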
This is the counterpart of the whole "free-range research" goal, since we need to make sure it does not become ineffective and waste compute hammering away at the same mistakes. Here are some coding agent analogues to these problems: #197
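As one concrete analogue, here is a sketch (a hypothetical `FailureMemo` class, not an OpenEvolve API) of de-duplicating candidate edits that have already failed, so an evolutionary/agentic loop doesn't re-evaluate the same mistake:

```python
import hashlib

# Hypothetical sketch: remember candidate edits that already failed so the loop
# can skip re-evaluating them instead of burning compute on the same mistake.
class FailureMemo:
    def __init__(self) -> None:
        self._failed: dict[str, str] = {}   # diff hash -> error signature

    @staticmethod
    def _key(candidate_diff: str) -> str:
        return hashlib.sha256(candidate_diff.encode()).hexdigest()

    def already_failed(self, candidate_diff: str) -> bool:
        return self._key(candidate_diff) in self._failed

    def record_failure(self, candidate_diff: str, error_signature: str) -> None:
        self._failed[self._key(candidate_diff)] = error_signature
```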
Tools that implement Socratic dialogue could also help, such that a smaller LLM with a different role acts as a control valve for the main LLM, spotting common issues of overthinking or improper reasoning: https://github.com/im-knots/the-academy
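A minimal sketch of what such a critic loop could look like, assuming a generic `call_llm` chat helper (a placeholder, not the-academy's API); the prompts and protocol are illustrative only:

```python
# Illustrative sketch only: `call_llm` is a stand-in for whatever chat-completion
# client the project uses; the "Socratic critic" protocol here is an assumption.
def call_llm(model: str, messages: list[dict]) -> str:
    raise NotImplementedError("wire this to your chat-completion client")

CRITIC_PROMPT = (
    "You are a Socratic critic. Read the assistant's reasoning below and ask one "
    "pointed question if you see overthinking (circling, redundant re-derivation) "
    "or a logical gap. Reply 'PROCEED' if the reasoning is sound."
)

def socratic_check(main_model: str, critic_model: str, task: str, max_rounds: int = 3) -> str:
    draft = call_llm(main_model, [{"role": "user", "content": task}])
    for _ in range(max_rounds):
        verdict = call_llm(critic_model, [
            {"role": "system", "content": CRITIC_PROMPT},
            {"role": "user", "content": draft},
        ])
        if verdict.strip().upper().startswith("PROCEED"):
            break  # the valve opens: stop deliberating and commit to the answer
        # otherwise feed the critic's question back and ask for a shorter, revised answer
        draft = call_llm(main_model, [
            {"role": "user", "content": f"{task}\n\nA reviewer asks: {verdict}\n"
                                        "Answer the question briefly and commit to an action."},
        ])
    return draft
```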
Overthinking has been known to be an issue for reasoning models, so some have hacked around it by creating a "self-braking" mechanism, which feels like a bit of a bodge when the model is trained on a biased form of GRPO: https://github.com/ZJU-REAL/Self-Braking-Tuning
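Not the Self-Braking-Tuning recipe itself, but a minimal sketch of what "biasing" GRPO against long traces can look like: fold a length penalty into each sampled trace's reward before the group-relative normalization. The function name and coefficient are illustrative assumptions.

```python
import numpy as np

# Minimal illustration, NOT the Self-Braking-Tuning recipe: bias GRPO-style
# training against overlong reasoning by penalizing trace length in the reward
# before computing group-relative advantages.
def biased_group_advantages(rewards, trace_lengths, length_coef=0.1):
    r = np.asarray(rewards, dtype=float)
    lens = np.asarray(trace_lengths, dtype=float)
    # penalize traces that are longer than the group average (z-scored length)
    penalized = r - length_coef * (lens - lens.mean()) / (lens.std() + 1e-8)
    # standard GRPO-style normalization: advantage relative to the sampled group
    return (penalized - penalized.mean()) / (penalized.std() + 1e-8)


if __name__ == "__main__":
    # two correct traces (one short, one very long) and one wrong trace
    print(biased_group_advantages(rewards=[1.0, 1.0, 0.0],
                                  trace_lengths=[200, 1200, 300]))
```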
And sometimes even "spurious rewards" are enough for LLMs to self-correct, which has led people to build implicit reward systems that resemble mental simulation, and passive rewiring that resembles daydreaming: https://github.com/ruixin31/Spurious_Rewards https://github.com/sunblaze-ucb/Intuitor https://github.com/letta-ai/sleep-time-compute
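For the implicit-reward side, here is a rough sketch loosely in the spirit of Intuitor's self-certainty signal: score a generation by how far its next-token distributions sit from uniform, with no external ground truth. The function name and exact formulation are assumptions, not the repo's implementation.

```python
import torch
import torch.nn.functional as F

# Sketch of a confidence-style implicit reward: average KL divergence between the
# model's next-token distributions and the uniform distribution over the vocab.
# Higher means the model is more certain about its own generation.
def self_certainty_reward(logits: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab_size) for the generated tokens only."""
    log_probs = F.log_softmax(logits, dim=-1)
    vocab = logits.size(-1)
    # KL(p || uniform) = sum_i p_i * log(p_i * V) = sum_i p_i * log p_i + log V
    kl_from_uniform = (log_probs.exp() * log_probs).sum(-1) + torch.log(
        torch.tensor(float(vocab))
    )
    return kl_from_uniform.mean()
```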
Cross-reference tianyi-lab/MiP-Overthinking#2