Key Takeaways
- The Shift: We have moved from "System 1" thinking (fast, intuitive pattern matching) to "System 2" thinking (slow, deliberate reasoning) in AI models.
- Test-Time Compute: The new scaling law isn't just about training data size anymore; it's about how much compute the model spends during answer generation.
- The Players: OpenAI's o1 and Google's Gemini 1.5 Pro are leading this charge, but with fundamentally different architectural approaches.
Introduction
For the last five years, the recipe for better AI was simple: Make it bigger.
More parameters, more training data, more GPUs. This brute-force approach gave us GPT-4 and Claude 3. But in 2025, the "scaling laws" (the mathematical curves that predicted how much smarter AI would get with more data) started to flatten. We were hitting a wall. The models were hallucination machines, excellent at sounding confident but terrible at logic.
Then came the "Reasoning Revolution."
It wasn't about making the model bigger; it was about changing how it answers. Instead of blurting out the first statistically probable token, new models like OpenAI's o1 and Google's Gemini 1.5 Pro were trained to pause, "think" (generate hidden chains of thought), and self-correct before presenting a final answer.
The Scaling Law Cliff
To understand why this shift happened now, we have to look at the "Scaling Law Cliff" of 2024.
For years, the Kaplan Scaling Laws (named after Jared Kaplan's 2020 paper) held true: Performance ∝ (Dataset Size)^α * (Compute)^β. Basically, if you doubled the data and the GPUs, you got a predictable drop in error rates.
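To see why that exponent math produces diminishing returns, here is a toy sketch in Python. The exponent value is purely illustrative (real fits vary by model family and this is not a coefficient from Kaplan's paper), but the flattening it produces is the point.

```python
# Toy illustration of a Kaplan-style power law. The exponent is
# illustrative only, not a value fitted to any real model family.
def relative_error(compute: float, alpha: float = 0.05) -> float:
    """Error falls as compute^(-alpha); compute=1 is normalized to error=1."""
    return compute ** (-alpha)

for doublings in range(11):
    c = 2 ** doublings
    print(f"compute x{c:>4}: relative error {relative_error(c):.3f}")
# Each doubling of compute shaves only ~3-4% off the error,
# so the curve flattens long before the error gets near zero.
```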
By late 2024, this curve started to flatten.
- Data Scarcity: We ran out of high-quality internet. Companies had already scraped Wikipedia, Reddit, arXiv, and every digitized book in existence. The remaining data (synthetic data or low-quality social media) produced "model collapse," where AI trained on AI-generated slop became dumber.
- Diminishing Returns: It was costing $1 billion to train a model that was only 2% better than the previous version. The economics were breaking.
The industry faced an existential crisis: If we can't make the models smarter by making them bigger, how do we proceed?
The answer was to stop optimizing pre-training (cramming knowledge in) and start optimizing post-training (reasoning about that knowledge). It turns out, small models that "think" for 10 seconds can outperform massive models that answer instantly.
The Technical Shift: From Training to Inference
To understand why this is a breakthrough, we have to talk about compute.
Pre-Training vs. Inference
Traditionally, the overwhelming majority of the computational cost happened during pre-training (teaching the model). Inference (answering your question) was cheap and fast.
The new "System 2" models flip this logic. They introduce the concept of "Test-Time Compute."
- Old Way: Question -> [Instant Statistical Guess] -> Answer
- New Way: Question -> [Reasoning Step 1] -> [Critique] -> [Reasoning Step 2] -> [Verification] -> Answer
OpenAI's o1, for example, generates thousands of internal "thoughts" to solve a hard math problem. It might try an approach, realize it's a dead end, backtrack, and try again, all before showing you a single word. This internal monologue (Chain of Thought) allows the model to "reason" its way through problems that stump standard LLMs.
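OpenAI has not published how o1's hidden reasoning works internally, so the sketch below shows one well-documented way to spend test-time compute instead: sample several independent chains of thought and keep the answer they agree on (often called self-consistency). The `sample_chain_of_thought` helper is hypothetical and stands in for a real model call.

```python
from collections import Counter

# Hypothetical helper: asks a model to reason step by step and returns
# (reasoning_text, final_answer). Stands in for a real API call.
def sample_chain_of_thought(question: str) -> tuple[str, str]:
    raise NotImplementedError("wire this to the model of your choice")

def answer_with_test_time_compute(question: str, n_samples: int = 16) -> str:
    """Spend extra inference compute by sampling many reasoning chains,
    then return the most common final answer (self-consistency voting)."""
    answers = [sample_chain_of_thought(question)[1] for _ in range(n_samples)]
    # A majority vote over independent chains is usually more reliable
    # than trusting any single chain of thought.
    return Counter(answers).most_common(1)[0][0]
```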
Inside the Black Box: How "Thinking" Tokens Work
When you ask o1 a question, it doesn't just sit there silently. Under the hood, it is generating "Hidden Chain of Thought" (CoT) tokens.
- Generation: The model breaks the problem down. "Step 1: Define variables to clarify the user's intent."
- Critique: It reviews its own step. "Wait, the user mentioned Python, but my logic assumes C++. I need to adjust."
- Refinement: It re-generates the step. "Revised Step 1: Use Python's asyncio library."
These tokens are hidden from the user (mostly to prevent competitors from stealing the reasoning data), but they are real computation.
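Nothing in that generate-critique-refine loop is proprietary; you can approximate it around any capable model. A minimal sketch, assuming a hypothetical `llm` function that wraps your model API:

```python
# Hypothetical wrapper around any capable model API: prompt in, text out.
def llm(prompt: str) -> str:
    raise NotImplementedError("replace with a real model call")

def reason_then_answer(question: str, max_rounds: int = 3) -> str:
    """Draft, critique, and refine before answering. The intermediate
    drafts play the role of hidden reasoning tokens."""
    draft = llm(f"Think step by step and draft an answer.\nQuestion: {question}")
    for _ in range(max_rounds):
        critique = llm(
            f"Question: {question}\nDraft: {draft}\n"
            "List any mistakes or unstated assumptions, or reply OK if the draft is sound."
        )
        if critique.strip().upper().startswith("OK"):
            break  # the model is satisfied with its own reasoning
        draft = llm(
            f"Question: {question}\nDraft: {draft}\nCritique: {critique}\n"
            "Rewrite the draft to fix these issues."
        )
    # Only the final, refined answer is shown to the user.
    return llm(f"Question: {question}\nReasoning: {draft}\nGive only the final answer.")
```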
Crucially, Reinforcement Learning (RL) is used to train this behavior. During training, the model is given a hard math problem. If it gets the right answer, the entire chain of thought that led there is rewarded. If it fails, the chain is penalized. Over billions of cycles, the model "learns" which patterns of thinking (breaking down problems, checking for edge cases) lead to correct answers.
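Here is a heavily simplified sketch of that outcome-based reward signal, assuming a dataset of problems with known answers and a hypothetical `sample_solution` rollout helper. The actual policy update (and OpenAI's specific recipe) is deliberately omitted.

```python
# Hypothetical rollout from the model being trained: returns a chain of
# thought plus the final answer it produced for one training problem.
def sample_solution(problem: str) -> tuple[str, str]:
    raise NotImplementedError("replace with a rollout from the current policy")

def collect_reasoning_rewards(dataset: list[tuple[str, str]]) -> list[dict]:
    """Score whole chains of thought by outcome: 1.0 if the final answer
    matches the known answer, 0.0 otherwise. A policy-gradient step (omitted
    here) would then reinforce rewarded chains and suppress penalized ones."""
    rollouts = []
    for problem, known_answer in dataset:
        chain, answer = sample_solution(problem)
        reward = 1.0 if answer.strip() == known_answer.strip() else 0.0
        rollouts.append({"problem": problem, "chain": chain, "reward": reward})
    return rollouts
```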
Case Studies: o1 vs. Gemini 1.5 Pro
While OpenAI kicked off the "reasoning" hype train, Google has been playing a different game involving massive context.
OpenAI o1: The Deep Thinker
- Strategy: Reinforcement Learning for Reasoning. The model was rewarded during training specifically for generating correct chains of thought.
- Strength: Math, Coding, and Logic. It dominates benchmarks like the AIME (math competition) because it behaves like a human mathematician, checking its work step-by-step.
- Weakness: It's slow. That "thinking time" creates latency that is usually unacceptable for a chatbot, though necessary for a complex coding task.
Google Gemini 1.5 Pro: The Context King
- Strategy: Extensive Context Window (2 Million+ tokens) + In-Context Learning.
- Strength: Information Synthesis. Gemini 1.5 Pro doesn't just "think" deep; it "reads" wide. It can load an entire codebase, a movie, or a library of books into its short-term memory and reason across that specific data (a minimal client sketch follows after this list).
- The Nuance: Recent research shows that while Gemini excels at retrieval and synthesis, it sometimes struggles with the "pure logic" puzzles that o1 crushes, unless explicitly prompted to use Chain of Thought (CoT).
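For a sense of what "reading wide" looks like from the developer's side, here is a minimal sketch using the google-generativeai Python client. Treat the method names, model string, and the `whole_codebase.txt` file as a point-in-time example rather than a definitive recipe, and check Google's current docs before reusing it.

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

# Drop a large artifact (here: an entire codebase flattened into one text
# file) straight into the context window instead of chunking or fine-tuning.
with open("whole_codebase.txt", encoding="utf-8") as f:
    codebase = f.read()

response = model.generate_content([
    codebase,
    "Trace every call path that can reach the payment module and flag any "
    "path that skips the auth check.",
])
print(response.text)
```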
The Cost of Thinking
This revolution comes with a price tag.
In the GPT-4 era, pricing was simple: Input tokens (cheap) + Output tokens (expensive). With reasoning models, there is a new hidden cost: Reasoning Tokens.
- Standard Prompt: "What is the capital of France?" -> 5 tokens output. Cost: Micro-pennies.
- Reasoning Prompt: "Plan a logistics route for 50 trucks avoiding tolls." -> The model might generate 5,000 "thinking" tokens to solve the puzzle, but only output 100 tokens of the final plan.
You pay for the thinking. This shifts the economics of AI. It makes simple tasks (chatbots, summaries) more expensive if you use the wrong model, but it makes impossible tasks (coding complex apps, legal discovery) possible for the first time.
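A back-of-envelope sketch of that math. The per-token prices below are placeholders, not any vendor's actual price list; the point is that hidden reasoning tokens are usually billed like output tokens.

```python
# Placeholder prices in dollars per million tokens -- not any vendor's real
# price list. Reasoning tokens are typically billed at the output rate.
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 12.00

def request_cost(input_tokens: int, reasoning_tokens: int, output_tokens: int) -> float:
    billed_output = reasoning_tokens + output_tokens  # you pay for the thinking
    return (input_tokens * PRICE_PER_M_INPUT + billed_output * PRICE_PER_M_OUTPUT) / 1_000_000

# "What is the capital of France?" -- trivial prompt, no reasoning needed.
print(f"${request_cost(10, 0, 5):.6f}")         # ~$0.000090
# "Plan a logistics route for 50 trucks..." -- 5,000 hidden thinking tokens.
print(f"${request_cost(200, 5_000, 100):.6f}")  # ~$0.061800
```

Run it and the reasoning prompt comes out hundreds of times more expensive than the trivial one, even though the visible answer is only twenty times longer.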
We are seeing a bifurcation in the market (a minimal routing sketch follows after this list):
- Fast Models (GPT-4o mini, Gemini Flash): Cheap, instant, "System 1" thinkers for UI interaction.
- Deep Models (o1, Claude 3.5 Opus): Expensive, slow, "System 2" thinkers for high-value work.
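In application code, this bifurcation usually shows up as a routing decision. A minimal, hypothetical sketch: `call_fast_model` and `call_deep_model` stand in for whichever cheap and reasoning-grade models you use, and the triage heuristic is deliberately crude.

```python
# Hypothetical wrappers for a cheap "System 1" model and an expensive
# "System 2" reasoning model. Replace with real API calls.
def call_fast_model(prompt: str) -> str:
    raise NotImplementedError

def call_deep_model(prompt: str) -> str:
    raise NotImplementedError

# Deliberately crude triage heuristic; production routers often use a
# small classifier model instead.
HARD_TASK_HINTS = ("prove", "debug", "refactor", "audit", "plan", "optimize")

def route(prompt: str) -> str:
    """Keep chatty, low-stakes prompts on the fast tier and reserve the
    slow, expensive reasoning tier for work that needs deliberation."""
    looks_hard = len(prompt) > 500 or any(hint in prompt.lower() for hint in HARD_TASK_HINTS)
    return call_deep_model(prompt) if looks_hard else call_fast_model(prompt)
```

The design choice is pure cost control: send everything to the deep tier and the bill explodes; send everything to the fast tier and the hard tasks quietly fail.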
Breaking the 5-Minute Wall
Why does this matter to you?
Because purely generative AI (the "System 1" type) had a ceiling. It could never be fully trusted to write code or offer medical advice because it didn't know when it was wrong. It was just guessing.
Reasoning models unlock the 5-Minute Task, and eventually the 5-Hour Task.
- 2023: "Write a Python function to calculate Fibonacci numbers." (Success)
- 2025: "Write a Python backend for this specific app architecture, audit it for security flaws, and write the test suite." (Possible with o1/Gemini)
By spending more compute at inference time, we are trading speed for reliability. We are moving from AI that relies on memorization to AI that relies on deduction.
The Future: Agentic Workflows
This is the precursor to true Agents. An agent that takes an action (like browsing the web or writing a file) needs to reason about the consequence of that action before doing it.
Standard LLMs were too impulsive for this. They would delete a file because the probability distribution said so. Reasoning models, with their ability to âthinkâ and self-correct, are the missing brain cells needed for safe, autonomous agents that can work for hours without supervision.
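A compressed sketch of what "reasoning before acting" can look like in an agent loop. Everything here is hypothetical scaffolding (the `reasoner` call, the tool set, the plan format); the point is that the model is asked to predict consequences, and irreversible steps are held back.

```python
# Hypothetical reasoning-model call that returns a structured plan:
# {"action": ..., "argument": ..., "expected_consequence": ..., "is_reversible": ...}
def reasoner(prompt: str) -> dict:
    raise NotImplementedError("replace with a real reasoning-model call")

# Hypothetical, deliberately tiny tool set.
def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

TOOLS = {"read_file": read_file}

def agent_step(goal: str, history: list[str]) -> str:
    """Ask the model to reason about the consequence of its next action,
    and only execute it if the step is known, allowed, and reversible."""
    plan = reasoner(
        f"Goal: {goal}\nHistory so far: {history}\n"
        "Propose the next tool call as JSON with keys: action, argument, "
        "expected_consequence, is_reversible."
    )
    if plan["action"] not in TOOLS or not plan.get("is_reversible", False):
        return f"Held back '{plan.get('action')}': unknown tool or irreversible step."
    result = TOOLS[plan["action"]](plan["argument"])
    history.append(f"{plan['action']}({plan['argument']}) -> {str(result)[:200]}")
    return f"Executed {plan['action']}."
```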
We aren't just getting better chatbots. We are getting digital employees that actually think before they speak.