How The Multivac evaluates frontier AI models through blind, rigorous, daily testing — and why we evolved from single questions to peer matrix evaluation.
In Isaac Asimov's 1956 short story "The Last Question," Multivac is a supercomputer that humanity consults across millions of years, asking increasingly profound questions about the universe.
We're doing something similar — asking frontier AI models the questions that matter, and watching how they answer. Unlike Asimov's Multivac, we don't expect a single oracle. We test them all, compare them blindly, and let the data speak.
"The last question was asked for the first time, half in jest... 'How can the net amount of entropy of the universe be massively decreased?'"
— Isaac Asimov, "The Last Question" (1956)

Phase 1 ran from December 23, 2025 to January 12, 2026 — 21 daily questions exploring AI capabilities across diverse domains: physics, philosophy, AI research, economics, and more.
Each question was posed to multiple frontier models. Their raw responses were published in full, allowing readers to compare and judge for themselves.
Starting January 13, 2026, we upgraded to a rigorous evaluation framework: 10 frontier models answer each question, then all 10 models judge all 10 responses — 100 total judgments per evaluation.
This eliminates single-judge bias (a problem with benchmarks that use GPT-4 as the sole evaluator) and generates rich meta-data about which models are strictest, most lenient, and most consistent.
All 10 models receive the identical prompt. They don't know they're being compared or who else is participating.
Each model evaluates all 10 responses (including others' and their own, though self-judgments are excluded from rankings). Judges see only the response text — no model names.
10 judges × 10 responses = 100 judgments (diagonal = self-judgments, excluded)
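To make the matrix concrete, here is a minimal sketch in Python of how a single evaluation could be assembled. The helpers `get_response(model, prompt)` and `judge(judge_model, response_text)` are assumptions standing in for each provider's API, and the model names are placeholders, not the real Phase 2 roster.

```python
from itertools import product

# Placeholder roster; the real Phase 2 line-up is maintained separately.
MODELS = [f"model_{i}" for i in range(10)]

def run_evaluation(prompt, get_response, judge):
    """Collect 10 blind responses, then have every model judge every response.

    `get_response(model, prompt)` returns a response string and
    `judge(judge_model, response_text)` returns a numeric score; both are
    assumed wrappers around each provider's API.
    """
    # 1. Every model answers the same prompt independently; none of them is
    #    told it is being compared or who else is participating.
    responses = {model: get_response(model, prompt) for model in MODELS}

    # 2. Every model judges every response. Judges receive only the response
    #    text, never the author's name, so the comparison stays blind.
    matrix = {
        (judge_model, author): judge(judge_model, responses[author])
        for judge_model, author in product(MODELS, MODELS)
    }

    # 3. The diagonal (judge == author) is kept for meta-analysis but excluded
    #    from rankings: 100 judgments, 90 of which count toward the scores.
    peer_judgments = {k: v for k, v in matrix.items() if k[0] != k[1]}
    return matrix, peer_judgments
```

Keeping the full matrix (diagonal included) is deliberate: the self-judgments feed the meta-analyses described further down, even though they never touch the rankings.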
Each judgment scores the response on five criteria:
| Criterion | Description | Weight |
|---|---|---|
| Correctness | Is the answer factually and logically correct? | 30% |
| Completeness | Does it address all aspects of the question? | 20% |
| Clarity | Is the response clear and well-structured? | 20% |
| Depth | Does it show deep understanding? | 15% |
| Usefulness | Would this actually help someone? | 15% |
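As a rough sketch of how those weights combine: only the weights come from the table above; the criterion keys and the 0–10 scale are assumptions.

```python
# Weights from the criteria table; keys and the 0-10 scale are assumptions.
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.20,
    "clarity": 0.20,
    "depth": 0.15,
    "usefulness": 0.15,
}

def weighted_score(criterion_scores: dict[str, float]) -> float:
    """Collapse one judgment's five criterion scores into a single number."""
    assert set(criterion_scores) == set(WEIGHTS), "all five criteria must be scored"
    return sum(WEIGHTS[c] * criterion_scores[c] for c in WEIGHTS)

# e.g. 0.3*9 + 0.2*7 + 0.2*8 + 0.15*6 + 0.15*7 = 7.65
print(weighted_score({"correctness": 9, "completeness": 7, "clarity": 8,
                      "depth": 6, "usefulness": 7}))
```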
Each model's score is the average of all judgments it received (excluding self-judgments). With 9 judges scoring each response, individual biases are smoothed out.
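Continuing the same sketch, the aggregation step might look like this, operating on the (judge, author) → score matrix built above:

```python
from statistics import mean

def final_scores(matrix: dict) -> dict:
    """Average the judgments each model received, excluding its self-judgment.

    `matrix` maps (judge_model, author) pairs to weighted scores, as in the
    run_evaluation sketch above.
    """
    authors = {author for _, author in matrix}
    return {
        author: mean(
            score
            for (judge_model, judged), score in matrix.items()
            if judged == author and judge_model != author  # the 9 peer judgments
        )
        for author in authors
    }
```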
Phase 2 evaluates the current frontier across major AI labs.
New models are added as they reach frontier capability. Models can be deactivated if they're deprecated or significantly outpaced.
Many automated benchmarks use a single model, typically GPT-4, as the sole evaluator. That means one model's biases contaminate every result. We use 10 judges.
With 9 judgments per response (excluding self), outlier opinions are smoothed out. The rankings reflect consensus.
We learn which models are harsh critics and which are lenient graders, meta-data that single-judge benchmarks can't produce.
By having models judge their own responses (then excluding them), we can detect if models rate themselves higher than peers do.
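A sketch of both meta-analyses, again operating on the full (judge, author) → score matrix from the earlier sketch, with no new assumptions beyond those already noted:

```python
from statistics import mean

def judge_strictness(matrix: dict) -> dict:
    """How far above or below the overall peer average each judge scores.

    Negative values mark harsh critics, positive values lenient graders.
    """
    peer = {k: v for k, v in matrix.items() if k[0] != k[1]}
    overall = mean(peer.values())
    judges = {judge for judge, _ in peer}
    return {
        judge: mean(s for (j, _), s in peer.items() if j == judge) - overall
        for judge in judges
    }

def self_preference(matrix: dict) -> dict:
    """Gap between each model's self-score and the average its peers gave it.

    A positive gap means the model rates itself higher than its peers do.
    """
    authors = {author for _, author in matrix}
    return {
        a: matrix[(a, a)] - mean(s for (j, author), s in matrix.items()
                                 if author == a and j != a)
        for a in authors
    }
```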
Phase 2 follows a structured weekly rotation to test different capabilities:
| Day | Category | Focus |
|---|---|---|
| Monday | 💻 Code | Debugging, generation, review, optimization |
| Tuesday | 🧠 Reasoning | Logic puzzles, math, multi-step problems |
| Wednesday | 📊 Analysis | Data interpretation, research critique, synthesis |
| Thursday | 💬 Communication | Explanation, teaching, documentation |
| Friday | ⚠️ Edge Cases | Adversarial inputs, failure modes, stress tests |
| Saturday | 🎯 Meta/Alignment | Honesty, calibration, sycophancy resistance |
| Sunday | 📈 Summary | Weekly leaderboard and analysis |
Daily evaluations. All frontier models. Blind judgments.