The Reasoning Revolution: How AI Learned to “Think” Before It Speaks

AI no longer just predicts the next word. We explore the tectonic shift from pattern matching to genuine reasoning with Chain of Thought models like OpenAI’s o1 and Google’s Gemini 1.5 Pro.

Digital brain visualization showing the neural pathways of a chain of thought lighting up in sequence

Key Takeaways

  • The Shift: We have moved from “System 1” thinking (fast, intuitive pattern matching) to “System 2” thinking (slow, deliberate reasoning) in AI models.
  • Test-Time Compute: The new scaling law isn’t just about training data size anymore; it’s about how much compute the model spends while generating the answer.
  • The Players: OpenAI’s o1 and Google’s Gemini 1.5 Pro are leading this charge, but with fundamentally different architectural approaches.

Introduction

For the last five years, the recipe for better AI was simple: Make it bigger.

More parameters, more training data, more GPUs. This brute-force approach gave us GPT-4 and Claude 3. But in 2025, the “scaling laws”—the mathematical curves that predicted how much smarter AI would get with more data—started to flatten. We were hitting a wall. The models were hallucination machines, excellent at sounding confident but terrible at logic.

Then came the “Reasoning Revolution.”

It wasn’t about making the model bigger; it was about changing how it answers. Instead of blurting out the first statistically probable token, new models like OpenAI’s o1 and Google’s Gemini 1.5 Pro were trained to pause, “think” (generate hidden chains of thought), and self-correct before presenting a final answer.

The Scaling Law Cliff

To understand why this shift happened now, we have to look at the “Scaling Law Cliff” of 2024.

From 2020 onward, the Kaplan Scaling Laws (named after Jared Kaplan’s 2020 paper) held true: model loss falls as a power law in data and compute, roughly Loss ∝ (Dataset Size)^(−α) and Loss ∝ (Compute)^(−ÎČ). Basically, if you doubled the data and the GPUs, you got a predictable drop in error rates.
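
As a rough illustration of why that curve eventually bites, the power-law form implies each doubling of data buys the same small fractional improvement. The exponent below is a made-up placeholder, not a value from the paper:

```python
# Toy illustration of a power-law scaling curve.
# ALPHA is a made-up exponent, not a value from Kaplan et al. (2020).
ALPHA = 0.07

def relative_loss(dataset_size: float, baseline: float = 1.0) -> float:
    """Loss relative to the baseline, assuming Loss ∝ (Dataset Size)^(-ALPHA)."""
    return (dataset_size / baseline) ** (-ALPHA)

# Each doubling of data shaves off the same ~5% of loss: cheap at first,
# ruinously expensive once you are already at web scale.
for doublings in range(5):
    d = 2 ** doublings
    print(f"{d:>2}x data -> loss is {relative_loss(d):.3f} of baseline")
```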

By late 2024, this curve started to flatten.

  • Data Scarcity: We ran out of high-quality internet. Companies had already scraped Wikipedia, Reddit, arXiv, and every digitized book in existence. The remaining data (synthetic data or low-quality social media) produced “model collapse,” where AI trained on AI-generated slop became dumber.
  • Diminishing Returns: It was costing $1 billion to train a model that was only 2% better than the previous version. The economics were breaking.

The industry faced an existential crisis: If we can’t make the models smarter by making them bigger, how do we proceed?

The answer was to stop optimizing pre-training (cramming knowledge in) and start optimizing post-training (reasoning about that knowledge). It turns out, small models that “think” for 10 seconds can outperform massive models that answer instantly.

The Technical Shift: From Training to Inference

To understand why this is a breakthrough, we have to talk about compute.

Pre-Training vs. Inference

Traditionally, 99.9% of the computational cost happened during pre-training (teaching the model). Inference (answering your question) was cheap and fast.

The new “System 2” models flip this logic. They introduce the concept of “Test-Time Compute”.

  • Old Way: Question -> [Instant Statistical Guess] -> Answer
  • New Way: Question -> [Reasoning Step 1] -> [Critique] -> [Reasoning Step 2] -> [Verification] -> Answer

OpenAI’s o1, for example, generates thousands of internal “thoughts” to solve a hard math problem. It might try an approach, realize it’s a dead end, backtrack, and try again—all before showing you a single word. This internal monologue (Chain of Thought) allows the model to “reason” its way through problems that stump standard LLMs.
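
OpenAI has not published o1’s internals, but the control flow is conceptually a propose-critique-retry loop. Here is a minimal sketch under that assumption; `llm` is a placeholder for any text-generation call, not a real API client:

```python
# Minimal sketch of a test-time compute loop: propose, critique, retry.
# `llm` is a stand-in for any text-generation call -- not a real API client.
from typing import Callable

def solve_with_reasoning(question: str, llm: Callable[[str], str],
                         max_attempts: int = 5) -> str:
    last_draft = ""
    for _ in range(max_attempts):
        # 1. Propose: generate a candidate chain of thought plus answer.
        last_draft = llm(f"Question: {question}\nThink step by step, then answer.")
        # 2. Critique: have the model check its own work.
        verdict = llm(f"Question: {question}\nProposed solution:\n{last_draft}\n"
                      "Is this correct? Reply VALID, or describe the flaw.")
        # 3. Verified? Only now does the user see anything. Otherwise backtrack.
        if verdict.strip().upper().startswith("VALID"):
            return last_draft
    return last_draft  # Fall back to the last attempt if nothing verified.
```

Every pass through that loop costs latency and tokens, which is the trade-off the rest of this article keeps coming back to.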

Inside the Black Box: How “Thinking” Tokens Work

When you ask o1 a question, it doesn’t just sit there silently. Under the hood, it is generating “Hidden Chain of Thought” (CoT) tokens.

  1. Generation: The model breaks the problem down. “Step 1: Define variables to clarify the user’s intent.”
  2. Critique: It reviews its own step. “Wait, the user mentioned Python, but my logic assumes C++. I need to adjust.”
  3. Refinement: It re-generates the step. “Revised Step 1: Use Python’s asyncio library.”

These tokens are hidden from the user (mostly to prevent competitors from stealing the reasoning data), but they are real computation.

Crucially, Reinforcement Learning (RL) is used to train this behavior. During training, the model is given a hard math problem. If it gets the right answer, the entire chain of thought that led there is rewarded. If it fails, the chain is penalized. Over billions of cycles, the model “learns” which patterns of thinking (breaking down problems, checking for edge cases) lead to correct answers.
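
In pseudocode, that outcome-based reward looks roughly like the sketch below. The `model.generate_chain` and `model.update` methods are hypothetical stand-ins; the actual training recipe is unpublished:

```python
# Simplified outcome-reward RL sketch (not OpenAI's actual training code).
# `model.generate_chain` and `model.update` are hypothetical stand-ins.

def training_step(model, problem: str, correct_answer: str) -> None:
    # Sample a full chain of thought that ends in a final answer.
    chain_of_thought, answer = model.generate_chain(problem)

    # Outcome reward: the whole chain is judged only by the final answer.
    reward = 1.0 if answer == correct_answer else -1.0

    # Reinforce every step in the chain with that single reward, so habits
    # like decomposing the problem and checking edge cases get credit
    # whenever they lead to correct answers.
    model.update(chain_of_thought, reward)
```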

Case Studies: o1 vs. Gemini 1.5 Pro

While OpenAI kicked off the “reasoning” hype train, Google has been playing a different game involving massive context.

OpenAI o1: The Deep Thinker

  • Strategy: Reinforcement Learning for Reasoning. The model was rewarded during training specifically for generating correct chains of thought.
  • Strength: Math, Coding, and Logic. It dominates benchmarks like the AIME (math competition) because it behaves like a human mathematician, checking its work step-by-step.
  • Weakness: It’s slow. That “thinking time” adds latency that is usually unacceptable for a chatbot, though worth it for a complex coding task.

Google Gemini 1.5 Pro: The Context King

  • Strategy: Extensive Context Window (2 Million+ tokens) + In-Context Learning.
  • Strength: Information Synthesis. Gemini 1.5 Pro doesn’t just “think” deep; it “reads” wide. It can load an entire codebase, a movie, or a library of books into its short-term memory and reason across that specific data.
  • The Nuance: Recent research shows that while Gemini excels at retrieval and synthesis, it sometimes struggles with the “pure logic” puzzles that o1 crushes, unless explicitly prompted to use Chain of Thought (CoT).
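
That nuance is mostly a prompting detail. The snippet below is a generic illustration of wrapping a question in an explicit chain-of-thought instruction; it deliberately avoids any vendor-specific SDK:

```python
# Generic chain-of-thought prompt wrapper -- no vendor SDK assumed.
def cot_prompt(question: str, context: str = "") -> str:
    return (
        f"{context}\n\n"
        f"Question: {question}\n"
        "Work through this step by step: show your reasoning, check each "
        "step against the context above, and only then state the final "
        "answer on its own line."
    )

# Example: reasoning over a long context pasted into the prompt.
print(cot_prompt("Which module handles token refresh?",
                 context="<paste the relevant source files here>"))
```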

The Cost of Thinking

This revolution comes with a price tag.

In the GPT-4 era, pricing was simple: Input tokens (cheap) + Output tokens (expensive). With reasoning models, there is a new hidden cost: Reasoning Tokens.

  • Standard Prompt: “What is the capital of France?” -> 5 tokens output. Cost: Micro-pennies.
  • Reasoning Prompt: “Plan a logistics route for 50 trucks avoiding tolls.” -> The model might generate 5,000 “thinking” tokens to solve the puzzle, but only output 100 tokens of the final plan.

You pay for the thinking. This shifts the economics of AI. It makes simple tasks (chatbots, summaries) more expensive if you use the wrong model, but it makes impossible tasks (coding complex apps, legal discovery) possible for the first time.
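
A back-of-the-envelope cost model makes the shift concrete. The per-token prices below are invented placeholders rather than any vendor’s actual price list; the point is that hidden reasoning tokens, billed like output, dominate the bill:

```python
# Toy cost model for a reasoning request. Prices are invented placeholders,
# not real vendor pricing. Reasoning tokens are assumed to bill at the output rate.
PRICE_PER_TOKEN_INPUT = 0.005 / 1000    # USD per token, hypothetical
PRICE_PER_TOKEN_OUTPUT = 0.015 / 1000   # USD per token, hypothetical

def request_cost(input_tokens: int, reasoning_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_TOKEN_INPUT
            + (reasoning_tokens + output_tokens) * PRICE_PER_TOKEN_OUTPUT)

# The logistics example above: 100 visible output tokens, 5,000 hidden ones.
print(f"${request_cost(200, 5000, 100):.4f}")  # reasoning model
print(f"${request_cost(200, 0, 100):.4f}")     # same prompt, no reasoning
```

Roughly thirty times the cost for the same visible output, which is why routing simple tasks to the “fast” tier matters.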

We are seeing a bifurcation in the market:

  1. Fast Models (GPT-4o mini, Gemini Flash): Cheap, instant, “System 1” thinkers for UI interaction.
  2. Deep Models (o1, Claude 3 Opus): Expensive, slow, “System 2” thinkers for high-value work.

Breaking the 5-Minute Wall

Why does this matter to you?

Because purely generative AI (the “System 1” type) had a ceiling. It could never be fully trusted to write code or offer medical advice because it didn’t know when it was wrong. It was just guessing.

Reasoning models unlock the 5-Minute Task—and eventually the 5-Hour Task.

  • 2023: “Write a Python function to calculate Fibonacci numbers.” (Success)
  • 2025: “Write a Python backend for this specific app architecture, audit it for security flaws, and write the test suite.” (Possible with o1/Gemini)

By spending more compute at inference time, we are trading speed for reliability. We are moving from AI that relies on memorization to AI that relies on deduction.

The Future: Agentic Workflows

This is the precursor to true Agents. An agent that takes an action (like browsing the web or writing a file) needs to reason about the consequence of that action before doing it.

Standard LLMs were too impulsive for this. They would delete a file because the probability distribution said so. Reasoning models, with their ability to “think” and self-correct, are the missing brain cells needed for safe, autonomous agents that can work for hours without supervision.
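
A minimal sketch of that “think before acting” gate, with `llm` and `execute_action` as hypothetical stand-ins rather than any real agent framework:

```python
# Sketch of an agent step that reasons about consequences before acting.
# `llm` and `execute_action` are hypothetical stand-ins, not a real framework.
DESTRUCTIVE_HINTS = ("delete", "drop table", "rm -rf", "overwrite")

def agent_step(goal: str, llm, execute_action) -> str:
    # 1. Propose the next action toward the goal.
    action = llm(f"Goal: {goal}\nPropose the single next action to take.")

    # 2. Reason about consequences before touching anything.
    review = llm(f"Proposed action: {action}\n"
                 "What could go wrong if this runs? Reply APPROVE only if it "
                 "is safe and reversible; otherwise explain the risk.")

    # 3. An impulsive model would execute immediately; a reasoning agent gates it.
    if "APPROVE" not in review.upper() or any(h in action.lower() for h in DESTRUCTIVE_HINTS):
        return f"Held for human review: {action}"
    return execute_action(action)
```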

We aren’t just getting better chatbots. We are getting digital employees that actually think before they speak.
