Reinike AI
Research Paper

SU-01: Scaling Simple AI Recipes to Achieve Gold-Medal Olympiad Reasoning

From Math Competitions to Business Logic: The Rise of Olympiad-Level AI

For years, the International Mathematical Olympiad (IMO) has served as a gold-standard test of AI reasoning. Its problems require more than pattern matching; they demand "long-horizon reasoning": the ability to plan many steps ahead, verify intermediate claims, and self-correct when an initial path fails. Traditionally, reaching this level required massive models or highly specialized systems. New research from a collaborative team that includes the Shanghai AI Laboratory introduces SU-01, a model that reaches gold-medal standards using a surprisingly simple, unified scaling recipe.

The Recipe for High-Level Reasoning

The researchers moved away from building narrow, specialized solvers. Instead, they focused on a "specializable-generalist" approach. They took a 30-billion-parameter backbone, relatively compact by modern standards, and applied a three-stage pipeline: one stage of supervised training followed by two stages of reinforcement learning. First, they used Supervised Fine-Tuning (SFT) with a "reverse-perplexity curriculum." This forced the model to learn the most challenging and unfamiliar proof patterns first, instilling rigor and discipline in how it approaches problems. By focusing on proof-search and self-checking behaviors early on, the model learned not just to guess an answer, but to build a logical argument.
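As a rough illustration of the curriculum idea, the sketch below orders training examples by perplexity, highest first. It assumes that "reverse-perplexity" means sorting by the base model's perplexity on each example in descending order, which is one reading of the description above and not a confirmed detail of the SU-01 pipeline. The example records and the names perplexity and reverse_perplexity_order are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one example: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def reverse_perplexity_order(examples):
    """Return examples sorted so the least familiar (highest-perplexity) come first.

    Assumption: each example carries the base model's per-token log-probabilities.
    """
    return sorted(
        examples,
        key=lambda ex: perplexity(ex["token_logprobs"]),
        reverse=True,
    )

# Hypothetical toy data: log-probabilities a base model might assign to each token.
examples = [
    {"id": "easy", "token_logprobs": [-0.1, -0.2, -0.1]},
    {"id": "hard", "token_logprobs": [-2.0, -1.5, -2.5]},
    {"id": "medium", "token_logprobs": [-0.8, -1.0, -0.9]},
]

ordered = [ex["id"] for ex in reverse_perplexity_order(examples)]
print(ordered)  # ['hard', 'medium', 'easy']
```

In a real setup the log-probabilities would come from a forward pass of the backbone over each proof, and the sorted list would determine the order in which SFT batches are drawn.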

Scaling Through Two-Stage Reinforcement Learning

Once the foundation of rigor was set, the team applied a two-stage Reinforcement Learning (RL) process. The first stage, "Coarse RL," focused on verifiable rewards—simply put, did the model get the right answer? This boosted the model’s basic problem-solving efficiency. The second stage, "Refined RL," shifted the focus from the final answer to the quality of the proof itself. By rewarding the model for the clarity and logical soundness of its steps, the researchers trained the AI to act more like a human expert, capable of critiquing its own work and repairing errors mid-stream.
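The difference between the two reward signals can be sketched as follows. This is a minimal illustration, not the paper's reward design: the exact-match check, the per-step soundness scores, and the choice to take the weakest step as the proof-level reward are all assumptions made for the example, and the function names are hypothetical.

```python
def coarse_reward(predicted_answer, reference_answer):
    """Stage 1 (assumed form): verifiable reward, 1.0 only if the final answer matches."""
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0

def refined_reward(step_scores):
    """Stage 2 (assumed form): proof-quality reward from per-step soundness scores in [0, 1].

    Uses the minimum score, on the assumption that a proof is only as sound
    as its weakest step. An empty proof earns no reward.
    """
    if not step_scores:
        return 0.0
    return min(step_scores)

# Hypothetical rollout: correct final answer, but one weak step in the proof.
rollout = {"answer": " 42 ", "step_scores": [0.9, 0.4, 1.0]}

print(coarse_reward(rollout["answer"], "42"))   # 1.0
print(refined_reward(rollout["step_scores"]))   # 0.4
```

The toy rollout shows why the second stage matters: an answer-only reward gives full credit, while a proof-level reward penalizes the weak step.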

Gold-Medal Performance and Real-World Impact

The reported results are strong. SU-01 reached 70.2% on IMO-ProofBench, matching or exceeding the performance of much larger commercial systems such as Gemini 3.1 Pro Thinking. More impressively, in simulated competition settings, it met the gold-medal thresholds for the IMO 2025 and the USAMO 2026. On the USAMO specifically, it matched the highest reported human score among 340 elite competitors. This suggests that high-level expertise can be instilled in smaller, more efficient models through the right training methodology.

What This Means for the Enterprise

The implications of SU-01 extend far beyond the classroom. A model that can sustain reasoning for over 100,000 tokens while continually verifying its own logic is valuable for industries that require high precision. In fields like legal discovery, complex financial modeling, and scientific research, the "self-correction" behavior seen in SU-01 could reduce the risk of hallucinations. It points to a future where businesses deploy compact, cost-effective models that do not just provide answers but also give auditable, rigorous justifications for their decisions, bringing the precision of an Olympiad gold medalist to everyday business logic.