In the Era of Auto Research, 47 Open-Ended Tasks Form the Must-Test Benchmark for Agent Capabilities

May 13, 2026

If we place AI into a real-world engineering scenario with no standard answers, can it still survive? For a long time, AI Agents have seemed omnipotent — yet most are merely retrieving answers from existing knowledge bases.

The real engineering world is unforgiving: the stability of underwater robots, the lithium plating limits of power batteries, noise control in quantum circuits. None of these problems has a perfect score; there is only continuous optimization that edges ever closer to theoretical limits.

Recently, Naver Labs released Frontier-Eng Bench, an Agent benchmark that officially strips AI of its label as a mere "exam-taking machine".

Instead of having AI tackle outdated coding problems, the research team gave it a complete engineering closed loop: propose a solution, connect to a simulator, process error reports, adjust parameters, and rerun the simulation. Faced with 47 hardcore, interdisciplinary tasks, AI must perform like a senior engineer — finding the optimal solution within the "impossible trinity" of power consumption, safety, and performance. This is more than just a test set; it is a preview of Agent "evolution". When AI begins to learn self-correction through feedback, the Auto Research era — where "humans set goals and AI iterates 24/7" — may be closer than we think.

AI Starts Doing "Real Hard Work"

Previous large language models were more like high-achieving students. You throw them a question, and they "retrieve memories" from massive training data, then piece together a seemingly reasonable answer. In this mode, large models are essentially playing a word-chain game, predicting the next plausible word, rather than solving real-world problems.

Frontier-Eng Bench changes that. AI is now tasked with genuine engineering optimization: first propose a solution, then connect to a simulator to run experiments, obtain feedback and error reports, modify parameters and code, and continue rerunning until performance improves. In this closed-loop system, AI's role undergoes a qualitative shift. Want a more stable underwater robot? AI must automatically tune the controller. Want to push a robotic arm's speed a little further? AI has to run the simulation itself. In this sense, AI has moved beyond pure semantic understanding and begun making continuous optimizations based on real environmental feedback — just like a professional engineer.
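The propose, simulate, read feedback, revise cycle described above can be sketched in a few lines. This is only an illustrative toy, not the benchmark's actual harness: `simulate` is a hypothetical stand-in for a real simulator, the single `gain` parameter is an assumed controller setting, and a random perturbation takes the place of an Agent's proposal step.

```python
import random

def simulate(gain):
    """Toy stand-in for a simulator (hypothetical): score a controller gain."""
    if gain <= 0:
        # Mimics an error report coming back from the simulator.
        raise ValueError("gain must be positive")
    return -(gain - 2.0) ** 2  # best possible score is 0.0 at gain == 2.0

def optimize(initial_gain, rounds, seed=0):
    """Closed loop: propose, run, read feedback, keep only improvements."""
    rng = random.Random(seed)
    best_gain = initial_gain
    best_score = simulate(best_gain)
    history = [best_score]  # best score seen after each round
    for _ in range(rounds):
        candidate = best_gain + rng.uniform(-0.5, 0.5)  # propose a revision
        try:
            score = simulate(candidate)  # run the experiment
        except ValueError:
            history.append(best_score)  # failed run: discard the candidate
            continue
        if score > best_score:  # feedback says this is better, so keep it
            best_gain, best_score = candidate, score
        history.append(best_score)
    return best_gain, best_score, history
```

Calling `optimize(0.5, 200)` returns the best gain found, its score, and a history that never decreases, which is the "sustained improvement" signal the benchmark is said to measure.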

The most fascinating aspect of Frontier-Eng Bench is that it does not test whether an AI gets the answer right or wrong, but rather whether it is capable of sustained improvement. Real-world engineering optimization has never been a multiple-choice quiz with a single standard answer.

Take battery fast charging as an example. The goal sounds simple — charge as quickly as possible — yet reality is far more complex. Under strict constraints such as preventing overheating, avoiding overvoltage spikes, limiting accelerated battery degradation, and suppressing lithium plating, the AI must precisely strike an optimal performance balance.
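One way to picture this kind of task is as a constrained objective: a charging profile only earns a score if every safety limit holds, and among feasible profiles the faster one wins. The sketch below is an assumption made for illustration, not the benchmark's scoring rule; the field names and the numeric limits are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ChargeResult:
    minutes_to_full: float      # charging time reported by the simulator
    peak_temp_c: float          # highest cell temperature reached
    peak_voltage_v: float       # highest cell voltage reached
    capacity_fade_pct: float    # estimated degradation caused by this charge
    plating_detected: bool      # whether lithium plating occurred

# Hypothetical limits, chosen only for this example.
MAX_TEMP_C = 45.0
MAX_VOLTAGE_V = 4.2
MAX_FADE_PCT = 0.05

def violations(result):
    """List every constraint that the charging run broke."""
    broken = []
    if result.peak_temp_c > MAX_TEMP_C:
        broken.append("overheating")
    if result.peak_voltage_v > MAX_VOLTAGE_V:
        broken.append("overvoltage")
    if result.capacity_fade_pct > MAX_FADE_PCT:
        broken.append("degradation")
    if result.plating_detected:
        broken.append("lithium plating")
    return broken

def score(result):
    """Higher is better; infeasible runs get no score at all."""
    if violations(result):
        return None
    return 1.0 / result.minutes_to_full
```

Under these assumed limits, an aggressive profile that charges in 12 minutes but overheats scores nothing, while a 25-minute profile that respects every limit scores 0.04. Pushing the charging time down without tripping a constraint is where the iteration happens.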

This means AI cannot pass the benchmark through trick-based "exam cramming". Instead, it must demonstrate the endurance for continuous evolution through long-term feedback loops.

So can AI perform long-term optimization in real-world scenarios? The results show GPT-4.5 delivers the most consistent overall performance — yet all models still have a long way to go before fully mastering this benchmark.

Auto Research Has Entered the Era of Iterative Optimization

The research team puts forward a thought-provoking viewpoint: fundamentally, all advanced intelligence relies on long-term feedback closed loops.

The reason AlphaGo defeated Lee Sedol lies in the massive underlying simulations and real-time feedback behind every move — not rote memorization of fixed game records. The same applies to genuine scientific research. Top-tier laboratories do not rely on occasional bursts of inspiration; they continuously formulate hypotheses, run experiments, analyze results, revise solutions, and iterate again.

Engineering optimization follows the same logic. Anyone can deliver the first version; the real challenge lies in achieving that final 1% performance leap.

The significance of Frontier-Eng Bench is profound: for the first time, it systematically evaluates AI's iterative optimization capability, and surfaces two almost ruthless laws of AI evolution.

The First Law: The Further You Iterate, the Harder It Gets to Improve

The paper finds that both the frequency and magnitude of Agent improvements follow a power-law decay:

Improvement frequency ∝ 1 / iteration rounds
Improvement magnitude ∝ 1 / number of improvements

Simply put: gains are largest and fastest in the early rounds, then become increasingly difficult and marginal as iterations proceed. This mirrors real-world R&D. The first few iterations can quickly pick off plenty of low-hanging fruit, but as a design nears its performance limits, squeezing out even tiny extra gains demands enormous effort.
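To see what a 1/t decay implies, it helps to add up the terms. The sketch below takes the two proportionalities literally and assumes a proportionality constant of 1, which the article does not specify: the chance of improving at round t is taken to be 1/t, and the k-th improvement is taken to be worth 1/k of the first one. Both sums are harmonic numbers, which grow only logarithmically.

```python
import math

def harmonic(n):
    """n-th harmonic number: 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

def expected_improvements(rounds, c=1.0):
    """Expected number of improving rounds if P(improve at round t) = min(1, c/t).

    The constant c is an assumption; the source only states proportionality.
    """
    return sum(min(1.0, c / t) for t in range(1, rounds + 1))

def cumulative_gain(num_improvements, first_gain=1.0):
    """Total gain if the k-th improvement has magnitude first_gain / k."""
    return first_gain * harmonic(num_improvements)

# Going from 100 to 1000 rounds adds roughly ln(10) expected improvements.
extra = expected_improvements(1000) - expected_improvements(100)
```

Under these assumptions, ten times more iterations buys only about two more improvements, and each of those is smaller than the last.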

The Second Law: Breadth Helps, but Depth Is Indispensable

Running multiple parallel branches can help avoid getting stuck, yet with a fixed budget, every additional parallel chain inevitably reduces depth. Many engineering breakthroughs come only through sustained accumulation and constant revision — structural leaps that cannot be achieved merely by trying more random paths.
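The trade-off is simple arithmetic. Assuming the budget is counted in simulator runs and split evenly across chains (an assumption made here for illustration), the depth each chain can reach is the budget divided by the number of chains:

```python
def depth_per_chain(total_budget, num_chains):
    """Iterations available to each chain when the budget is split evenly."""
    if num_chains < 1 or num_chains > total_budget:
        raise ValueError("num_chains must be between 1 and total_budget")
    return total_budget // num_chains

def allocations(total_budget):
    """All (chains, depth) splits that use the whole budget exactly."""
    return [
        (chains, total_budget // chains)
        for chains in range(1, total_budget + 1)
        if total_budget % chains == 0
    ]
```

With a budget of 12 runs, `allocations(12)` lists every even split, from one chain of depth 12 to twelve chains of depth 1. Doubling the breadth always halves the depth.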

This points directly to the development direction of next-generation Agents: no longer models that spit out a one-off answer, but systems capable of continuous iteration and self-evolution within long-range feedback loops.

The Era of the AI Engineer Is Arriving

The far-reaching significance of this research is that it has begun to outline an AI system that closely approximates real-world engineering development cycles. Just imagine AI connected to industrial software, simulation environments, CAD systems, chip design tools, and scientific computing platforms — a fundamental transformation in productivity paradigms is imminent.

Future laboratories will likely adopt a new division of labor: human researchers set the overall directions and goals — cutting a component's energy consumption by 30%, lowering GPU occupancy during model inference, boosting the stability of robot control, or pushing quantum circuit fidelity ever closer to theoretical limits.

AI, meanwhile, takes charge of grinding out the implementation paths — iterating relentlessly around these predefined objectives. It automatically runs simulations and experiments, reads feedback from verifiers and simulators, revises designs, and carries out ongoing optimization around the clock.

This evolutionary logic frees AI from being merely an auxiliary tool. It begins solving complex system problems just like a real engineering team — tirelessly and indefinitely.

The question raised by Frontier-Eng Bench is straightforward: now that AI is beginning to learn long-term optimization, how far is it from true engineering intelligence?
