How The Multivac evaluates frontier AI models through blind, rigorous, daily testing — using category-specific model pools and peer matrix evaluation.
In Isaac Asimov's 1956 short story "The Last Question," Multivac is a supercomputer that humanity consults across millions of years, asking increasingly profound questions about the universe.
We're doing something similar — asking frontier AI models the questions that matter, and watching how they answer. Unlike Asimov's Multivac, we don't expect a single oracle. We test them all, compare them blindly, and let the data speak.
"The last question was asked for the first time, half in jest... 'How can the net amount of entropy of the universe be massively decreased?'"
— Isaac Asimov, "The Last Question" (1956)

Phase 1 ran from December 23, 2025 to January 12, 2026: 21 daily questions exploring AI capabilities across diverse domains, including physics, philosophy, AI research, economics, and more.
Each question was posed to multiple frontier models. Their raw responses were published in full, allowing readers to compare and judge for themselves.
Starting January 13, 2026, we upgraded to a rigorous evaluation framework with category-specific model pools. Each category (Code, Reasoning, Analysis, Communication, Edge Cases) has its own optimized pool of 10 models based on OpenRouter rankings.
All 10 models answer each question, then all 10 models judge all 10 responses — 100 total judgments per evaluation. This eliminates single-judge bias.
Each category uses a specialized pool of 10 models optimized for that domain, based on OpenRouter's category rankings and our testing.
Software engineering, debugging, code review, and agentic coding tasks
Logical reasoning, scientific thinking, mathematics, and complex problem solving
Data analysis, financial modeling, academic research, and document review
Marketing copy, translation, creative writing, and interpersonal communication
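Concretely, the pool setup amounts to a category-to-models mapping. The sketch below is illustrative only: the model IDs are placeholders, not the actual pool members, which are drawn from OpenRouter's category rankings.

```python
# Placeholder pool definitions: one pool of 10 model IDs per category.
# The IDs are hypothetical; the real pools come from OpenRouter rankings.
CATEGORIES = ["code", "reasoning", "analysis", "communication", "edge_cases"]

CATEGORY_POOLS = {
    cat: [f"{cat}-model-{i:02d}" for i in range(1, 11)]  # 10 models each
    for cat in CATEGORIES
}

def pool_for(category: str) -> list[str]:
    """Return the 10-model pool used for a question in this category."""
    return CATEGORY_POOLS[category]
```

Every question is routed to exactly one pool, so the same 10 models both answer and judge within a category.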
All 10 models from the category pool receive the identical prompt. They don't know they're being compared or who else is participating.
Each model evaluates all 10 responses, including its own (self-judgments are later excluded from the rankings). Judges see only the response text, never model names.
10 judges × 10 responses = 100 judgments (diagonal = self-judgments, excluded)
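The matrix arithmetic above can be sketched in a few lines. This is an illustrative implementation of the diagonal exclusion, not the actual evaluation code:

```python
from statistics import mean

def peer_scores(judgments: list[list[float]]) -> list[float]:
    """judgments[j][r] is the score judge j gave to response r
    (a 10x10 matrix, 100 judgments total). Each response's final
    score averages the 9 peer judgments, skipping the diagonal
    entry j == r, which is a self-judgment."""
    n = len(judgments)
    return [
        mean(judgments[j][r] for j in range(n) if j != r)
        for r in range(n)
    ]
```

With 10 judges, each response's score rests on 9 independent opinions, so no single judge's quirks dominate.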
Each judgment scores the response on five criteria:
| Criterion | Description | Weight |
|---|---|---|
| Correctness | Is the answer factually and logically correct? | 25% |
| Completeness | Does it address all aspects of the question? | 20% |
| Clarity | Is the response clear and well-structured? | 20% |
| Depth | Does it show deep understanding? | 20% |
| Usefulness | Would this actually help someone? | 15% |
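Assuming the five criteria combine linearly into a single per-judgment score (the table gives the weights but not the aggregation formula), a minimal sketch:

```python
# Weights from the criteria table; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.25,
    "completeness": 0.20,
    "clarity": 0.20,
    "depth": 0.20,
    "usefulness": 0.15,
}

def weighted_score(criteria: dict[str, float]) -> float:
    """Collapse one judge's five criterion scores (e.g. 0-10 each)
    into a single weighted score for one response."""
    return sum(WEIGHTS[name] * criteria[name] for name in WEIGHTS)
```

For example, a response scoring 10 on correctness but 0 everywhere else would earn only 2.5, reflecting that correctness alone carries a quarter of the weight.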
Each model's score is the average of all judgments it received (excluding self-judgments). With 9 judges scoring each response, individual biases are smoothed out.
Many benchmarks rely on a single judge model (often GPT-4), so that one model's biases contaminate every result. We use 10 judges.
Code questions are judged by coding specialists. Communication by communication-focused models. Better signal, less noise.
With 9 judgments per response (excluding self), outlier opinions are smoothed out. The rankings reflect consensus.
We learn which models are harsh critics vs. lenient graders. This data is valuable and unique.
Daily evaluations. Category-optimized pools. Blind judgments.