Why "The Multivac"?

In Isaac Asimov's 1956 short story "The Last Question," Multivac is a supercomputer that humanity consults across millions of years, asking increasingly profound questions about the universe.

We're doing something similar — asking frontier AI models the questions that matter, and watching how they answer. Unlike Asimov's Multivac, we don't expect a single oracle. We test them all, compare them blindly, and let the data speak.

"The last question was asked for the first time, half in jest... 'How can the net amount of entropy of the universe be massively decreased?'"

— Isaac Asimov, "The Last Question" (1956)

The Evolution

Phase 1 • Complete

Foundation Questions

Phase 1 ran from December 23, 2025 to January 12, 2026 — 21 daily questions exploring AI capabilities across diverse domains: physics, philosophy, AI research, economics, and more.

Each question was posed to multiple frontier models. Their raw responses were published in full, allowing readers to compare and judge for themselves.

Phase 2 • Active

10×10 Peer Matrix Evaluation

Starting January 13, 2026, we upgraded to a rigorous evaluation framework with category-specific model pools. Each category (Code, Reasoning, Analysis, Communication, Edge Cases) has its own optimized pool of 10 models based on OpenRouter rankings.

All 10 models answer each question, then all 10 models judge all 10 responses — 100 total judgments per evaluation. This eliminates single-judge bias.

  • 🎯 Category-Optimized Pools: Code questions judged by coding-focused models. Reasoning questions by reasoning specialists.
  • 🔮 Blind Evaluation: Judges don't know which model produced which response.
  • ⚖️ Peer Consensus: 100 judgments average out individual model biases.
  • 📊 Meta-Analysis: Track which models are strictest judges, most lenient, most consistent.

Category Model Pools

Each category uses a specialized pool of 10 models optimized for that domain, based on OpenRouter's category rankings and our testing.

💻 Programming & Code • Monday

Software engineering, debugging, code review, and agentic coding tasks

#1 Grok Code Fast 1 (xAI)
#2 Claude Opus 4.5 (Anthropic)
#3 Gemini 3 Flash (Google)
#4 Claude Sonnet 4.5 (Anthropic)
#5 Gemini 3 Pro (Google)
#6 MiniMax M2.1 (MiniMax)
#7 GLM 4.7 (Z.AI)
#8 DeepSeek V3.2 (DeepSeek)
#9 GPT-5.2-Codex (OpenAI)
#10 Grok 3 (xAI)
🧠 Reasoning & Logic • Tuesday

Logical reasoning, scientific thinking, mathematics, and complex problem solving

#1 MiMo-V2-Flash (Xiaomi)
#2 Gemini 3 Flash (Google)
#3 Claude Sonnet 4.5 (Anthropic)
#4 DeepSeek V3.2 (DeepSeek)
#5 Claude Opus 4.5 (Anthropic)
#6 Gemini 3 Pro (Google)
#7 Gemini 2.5 Flash (Google)
#8 GPT-OSS-120B (OpenAI)
#9 Olmo 3.1 32B Think (AllenAI)
#10 Grok 3 (xAI)
📊 Analysis & Research • Wednesday

Data analysis, financial modeling, academic research, and document review

#1 MiMo-V2-Flash (Xiaomi)
#2 Gemini 3 Flash (Google)
#3 Gemini 2.5 Flash (Google)
#4 GPT-OSS-120B (OpenAI)
#5 DeepSeek V3.2 (DeepSeek)
#6 Claude Sonnet 4.5 (Anthropic)
#7 Claude Opus 4.5 (Anthropic)
#8 GPT-OSS-120B (OpenAI)
#9 Gemini 3 Pro (Google)
#10 Grok 4.1 Fast (xAI)
💬 Communication & Writing • Thursday

Marketing copy, translation, creative writing, and interpersonal communication

#1 Gemini 2.5 Flash-Lite (Google)
#2 Seed 1.6 Flash (ByteDance)
#3 Gemini 2.5 Flash (Google)
#4 GPT-OSS-120B (OpenAI)
#5 Grok 4.1 Fast (xAI)
#6 DeepSeek V3.2 (DeepSeek)
#7 GLM 4.7 (Z.AI)
#8 Claude Sonnet 4.5 (Anthropic)
#9 Claude Opus 4.5 (Anthropic)
#10 Mistral Small Creative (Mistral)
🎯 Meta & Alignment / Edge Cases • Friday & Saturday

AI alignment, safety, edge cases, and meta-level reasoning about AI systems

#1 Claude Opus 4.5 (Anthropic)
#2 Gemini 3 Pro (Google)
#3 Claude Sonnet 4.5 (Anthropic)
#4 GPT-5.2-Codex (OpenAI)
#5 GPT-OSS-120B (OpenAI)
#6 Gemini 3 Flash (Google)
#7 DeepSeek V3.2 (DeepSeek)
#8 MiMo-V2-Flash (Xiaomi)
#9 Grok 4.1 Fast (xAI)
#10 Grok 3 (xAI)

How Evaluation Works

Step 1: Generation

All 10 models from the category pool receive the identical prompt. They don't know they're being compared or who else is participating.
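For illustration, here is a minimal sketch of that fan-out, assuming a hypothetical `ask_model` helper that wraps whatever completion API the pool models are served through; the identifiers below are illustrative, not exact production slugs.

```python
# Sketch of Step 1: fan the identical prompt out to every model in a category pool.
# `ask_model` is a hypothetical placeholder for the real completion API call.

def ask_model(model_id: str, prompt: str) -> str:
    """Stand-in for a real API call (e.g. an OpenRouter-compatible client)."""
    raise NotImplementedError

CODE_POOL = [  # illustrative identifiers for the Programming & Code pool
    "grok-code-fast-1", "claude-opus-4.5", "gemini-3-flash",
    "claude-sonnet-4.5", "gemini-3-pro", "minimax-m2.1",
    "glm-4.7", "deepseek-v3.2", "gpt-5.2-codex", "grok-3",
]

def generate_responses(pool: list[str], prompt: str) -> dict[str, str]:
    """Each model answers independently; none of them sees the others."""
    return {model_id: ask_model(model_id, prompt) for model_id in pool}
```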

Step 2: Blind Judgment

Each model evaluates all 10 responses, including its own (self-judgments are excluded from the rankings). Judges see only the response text, never model names.

10 judges × 10 responses = 100 judgments (diagonal = self-judgments, excluded)
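A sketch of how that matrix could be assembled; `judge_fn` is a stand-in for the real judging call and receives only the anonymized response text:

```python
from typing import Callable

def collect_judgments(
    pool: list[str],
    responses: dict[str, str],              # author model -> response text
    judge_fn: Callable[[str, str], float],  # (judge model, response text) -> score
) -> dict[tuple[str, str], float]:
    """Build the full judge-by-author matrix. Judges get only the response
    text, never the author's name. The diagonal (judge == author) is kept
    here and filtered out later, during aggregation."""
    return {
        (judge, author): judge_fn(judge, text)
        for judge in pool
        for author, text in responses.items()
    }

# With a 10-model pool: 10 judges x 10 authors = 100 entries, 10 of them self-judgments.
```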

Step 3: Scoring

Each judgment scores the response on five criteria:

  • Correctness (25%): Is the answer factually and logically correct?
  • Completeness (20%): Does it address all aspects of the question?
  • Clarity (20%): Is the response clear and well-structured?
  • Depth (20%): Does it show deep understanding?
  • Usefulness (15%): Would this actually help someone?
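As a worked example, collapsing one judgment's five criterion scores (assumed here to be on a 0-10 scale) into a single weighted number:

```python
# Criterion weights from the list above.
WEIGHTS = {
    "correctness": 0.25,
    "completeness": 0.20,
    "clarity": 0.20,
    "depth": 0.20,
    "usefulness": 0.15,
}

def weighted_score(criteria: dict[str, float]) -> float:
    """Combine the five criterion scores into one number."""
    return sum(criteria[name] * weight for name, weight in WEIGHTS.items())

# e.g. 9, 8, 7, 8, 9 on the five criteria:
# 9*0.25 + 8*0.20 + 7*0.20 + 8*0.20 + 9*0.15 = 8.2
print(round(weighted_score({
    "correctness": 9, "completeness": 8, "clarity": 7, "depth": 8, "usefulness": 9,
}), 2))  # -> 8.2
```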

Step 4: Aggregation

Each model's score is the average of all judgments it received (excluding self-judgments). With 9 judges scoring each response, individual biases are smoothed out.
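A sketch of that aggregation, assuming `judgments` maps (judge, author) pairs to the weighted scores from Step 3:

```python
def aggregate(judgments: dict[tuple[str, str], float]) -> dict[str, float]:
    """Average the scores each model received, dropping the diagonal so every
    final score is the mean of 9 independent peer judgments."""
    received: dict[str, list[float]] = {}
    for (judge, author), score in judgments.items():
        if judge == author:  # self-judgment: excluded from rankings
            continue
        received.setdefault(author, []).append(score)
    return {author: sum(s) / len(s) for author, s in received.items()}

# Models are then ranked by this mean peer score, highest first.
```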

Why Peer Evaluation?

🚫 No Single-Judge Bias

Many LLM-as-judge benchmarks rely on a single evaluator, often GPT-4, which means that one model's biases contaminate every result. We use 10 judges.

🎯 Domain Expertise

Code questions are judged by coding specialists. Communication by communication-focused models. Better signal, less noise.

📊 Statistical Validity

With 9 judgments per response (excluding self), outlier opinions are smoothed out. The rankings reflect consensus.

🔍 Meta-Insights

We learn which models are harsh critics vs. lenient graders. This data is valuable and unique.


Follow The Multivac

Daily evaluations. Category-optimized pools. Blind judgments.