Why "The Multivac"?

In Isaac Asimov's 1956 short story "The Last Question," Multivac is a supercomputer that humanity consults, in ever more advanced incarnations, across trillions of years, asking increasingly profound questions about the universe.

We're doing something similar — asking frontier AI models the questions that matter, and watching how they answer. Unlike Asimov's Multivac, we don't expect a single oracle. We test them all, compare them blindly, and let the data speak.

"The last question was asked for the first time, half in jest... 'How can the net amount of entropy of the universe be massively decreased?'"

— Isaac Asimov, "The Last Question" (1956)

The Evolution

Phase 1 • Complete

Foundation Questions

Phase 1 ran from December 23, 2025 to January 12, 2026 — 21 daily questions exploring AI capabilities across diverse domains: physics, philosophy, AI research, economics, and more.

Each question was posed to multiple frontier models. Their raw responses were published in full, allowing readers to compare and judge for themselves.

  • 📚 Knowledge Exploration: questions spanning thermodynamics, nanotechnology, game theory, world models, and AI scaling laws
  • 🎯 Full Transparency: complete, unedited model responses published for reader evaluation
  • 🔬 Capability Mapping: understanding where models excel and where they struggle
Phase 2 • Active

10×10 Peer Matrix Evaluation

Starting January 13, 2026, we upgraded to a rigorous evaluation framework: 10 frontier models answer each question, then all 10 models judge all 10 responses — 100 total judgments per evaluation.

This eliminates single-judge bias (a problem with benchmarks that use GPT-4 as the sole evaluator) and generates rich meta-data about which models are strictest, most lenient, and most consistent.

  • 🔮 Blind Evaluation: judges don't know which model produced which response
  • ⚖️ Peer Consensus: 100 judgments average out individual model biases
  • 📊 Meta-Analysis: track which models are the strictest judges, the most lenient, and the most consistent
  • 📅 Structured Categories: Code (Mon), Reasoning (Tue), Analysis (Wed), Communication (Thu), Edge Cases (Fri), Meta/Alignment (Sat)

How Phase 2 Works

Step 1: Generation

All 10 models receive the identical prompt. They don't know they're being compared or who else is participating.

Step 2: Blind Judgment

Each model evaluates all 10 responses, including its own, though self-judgments are later excluded from the rankings. Judges see only the response text, never the model names.

10 judges × 10 responses = 100 judgments (diagonal = self-judgments, excluded)
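A minimal sketch of how these judgment tasks could be assembled, assuming responses are keyed by model name; the function and field names here are illustrative, not The Multivac's actual pipeline:

```python
# Illustrative sketch of the 10x10 judgment matrix. Names and record fields
# are assumptions for explanation, not the production code.
import random

MODELS = [
    "claude-opus-4.5", "claude-sonnet-4.5", "gpt-4o", "o1", "gemini-3-pro",
    "grok-4", "llama-4-scout", "deepseek-v3.2", "mistral-large", "command-a",
]

def build_judgment_tasks(responses: dict[str, str]) -> list[dict]:
    """Pair every judge with every response, anonymized.

    `responses` maps model name -> response text. Judges only ever see an
    opaque label ("Response A", "Response B", ...), never the model name.
    Self-judgments are still collected, but flagged so they can be
    excluded from the rankings later.
    """
    labels = [f"Response {chr(ord('A') + i)}" for i in range(len(responses))]
    order = list(responses.items())
    random.shuffle(order)  # decouple label order from model order

    tasks = []
    for judge in MODELS:
        for label, (author, text) in zip(labels, order):
            tasks.append({
                "judge": judge,
                "label": label,              # what the judge sees
                "author": author,            # hidden from the judge
                "text": text,
                "is_self": judge == author,  # diagonal of the matrix
            })
    return tasks  # 10 judges x 10 responses = 100 tasks
```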

Step 3: Scoring

Each judgment scores the response on five criteria:

Criterion      Description                                        Weight
Correctness    Is the answer factually and logically correct?     30%
Completeness   Does it address all aspects of the question?       20%
Clarity        Is the response clear and well-structured?         20%
Depth          Does it show deep understanding?                   15%
Usefulness     Would this actually help someone?                  15%
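For concreteness, here is a minimal sketch of how those weights could be combined into a single score per judgment; the 0-10 rating scale is an assumption for illustration, not something specified above:

```python
# Minimal sketch of combining the five criterion ratings into one weighted
# score per judgment. The 0-10 scale is an assumption for illustration.
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.20,
    "clarity": 0.20,
    "depth": 0.15,
    "usefulness": 0.15,
}

def weighted_score(ratings: dict[str, float]) -> float:
    """Weighted average of the five criteria; weights sum to 1.0."""
    return sum(WEIGHTS[criterion] * ratings[criterion] for criterion in WEIGHTS)

# Example: a response that is correct and clear but somewhat shallow.
print(round(weighted_score({
    "correctness": 9, "completeness": 7, "clarity": 8, "depth": 5, "usefulness": 6,
}), 2))  # 0.3*9 + 0.2*7 + 0.2*8 + 0.15*5 + 0.15*6 = 7.35
```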

Step 4: Aggregation

Each model's score is the average of all judgments it received (excluding self-judgments). With 9 judges scoring each response, individual biases are smoothed out.
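A sketch of that aggregation step, assuming each judgment record carries the judge, the response's author, and the weighted score from the previous step:

```python
# Sketch of per-model aggregation: average every judgment a model received,
# skipping the diagonal (self-judgments). Record fields are assumptions.
from collections import defaultdict

def aggregate(judgments: list[dict]) -> dict[str, float]:
    """judgments: [{"judge": ..., "author": ..., "score": ...}, ...]"""
    received = defaultdict(list)
    for j in judgments:
        if j["judge"] == j["author"]:
            continue  # exclude self-judgments from the rankings
        received[j["author"]].append(j["score"])
    # With 10 judges, each model keeps 9 peer scores after the exclusion.
    return {model: sum(scores) / len(scores) for model, scores in received.items()}
```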

The 10 Models

Phase 2 evaluates the current frontier across major AI labs:

  • 🟠 Claude Opus 4.5 (Anthropic)
  • 🟣 Claude Sonnet 4.5 (Anthropic)
  • 🟢 GPT-4o (OpenAI)
  • 🌀 o1 (OpenAI)
  • 🔵 Gemini 3 Pro (Google)
  • 🔴 Grok 4 (xAI)
  • 🦙 Llama 4 Scout (Meta)
  • 🌊 DeepSeek V3.2 (DeepSeek)
  • Mistral Large (Mistral AI)
  • 💎 Command A (Cohere)
New models are added as they reach frontier capability. Models can be deactivated if they're deprecated or significantly outpaced.

Why Peer Evaluation?

🚫 No Single-Judge Bias

Many LLM-judged benchmarks rely on a single evaluator, often GPT-4, so that one model's biases contaminate every result. We use 10 judges.

📊 Statistical Validity

With 9 judgments per response (excluding self), outlier opinions are smoothed out. The rankings reflect consensus.

🔍 Meta-Insights

We learn which models are harsh critics and which are lenient graders, meta-data that single-judge benchmarks cannot produce.
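One simple way to quantify judge strictness, sketched under the same assumed judgment records: compare each judge's average awarded score with the overall peer average.

```python
# Sketch: rank judges by how harshly or leniently they grade, measured as
# the gap between their average awarded score and the global peer average.
from collections import defaultdict

def judge_bias(judgments: list[dict]) -> dict[str, float]:
    given = defaultdict(list)
    for j in judgments:
        if j["judge"] == j["author"]:
            continue  # keep the comparison on peer judgments only
        given[j["judge"]].append(j["score"])
    overall = sum(s for scores in given.values() for s in scores) / sum(
        len(scores) for scores in given.values()
    )
    # Negative = stricter than average, positive = more lenient.
    return {judge: sum(scores) / len(scores) - overall for judge, scores in given.items()}
```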

🎭 Self-Awareness Check

By having models judge their own responses (then excluding them), we can detect if models rate themselves higher than peers do.
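A sketch of that check, again with assumed field names: compare each model's self-assigned score with the average score its peers gave the same response.

```python
# Sketch: does a model grade itself higher than its peers grade it?
# Positive gap = self-inflation; near zero = well-calibrated self-assessment.
from collections import defaultdict

def self_assessment_gap(judgments: list[dict]) -> dict[str, float]:
    self_scores, peer_scores = {}, defaultdict(list)
    for j in judgments:
        if j["judge"] == j["author"]:
            self_scores[j["author"]] = j["score"]
        else:
            peer_scores[j["author"]].append(j["score"])
    return {
        model: self_scores[model] - sum(scores) / len(scores)
        for model, scores in peer_scores.items()
        if model in self_scores
    }
```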

Weekly Schedule

Phase 2 follows a structured weekly rotation to test different capabilities:

Day         Category             Focus
Monday      💻 Code              Debugging, generation, review, optimization
Tuesday     🧠 Reasoning         Logic puzzles, math, multi-step problems
Wednesday   📊 Analysis          Data interpretation, research critique, synthesis
Thursday    💬 Communication     Explanation, teaching, documentation
Friday      ⚠️ Edge Cases        Adversarial inputs, failure modes, stress tests
Saturday    🎯 Meta/Alignment    Honesty, calibration, sycophancy resistance
Sunday      📈 Summary           Weekly leaderboard and analysis

Follow The Multivac

Daily evaluations. All frontier models. Blind judgments.