GetAI · The AI benchmark you can audit yourself

How it works

One pipeline. Five proofs.

Every model invocation that lands on the leaderboard travels the same path. Each step is independently verifiable; we publish the cryptographic glue between them so you don't have to take our word for anything.

Step 1

Sandboxed call

Deterministic params, captured headers, header-hash baseline.

Step 2

Predicate eval

8-axis scorers + 3-judge ensemble (Phase 1).

Step 3

Evidence bundle

Canonical orjson, SHA-256 manifest, signatures.

Step 4

Merkle anchor

Daily root published 00:00 UTC, RFC 6962-style tree.

Step 5

Public verify

CLI, edge function, third party — same answer.
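
To make step 3 concrete, here is a minimal sketch of a SHA-256 manifest over a canonical JSON encoding. It is an illustration under stated assumptions, not GetAI's actual bundle format: the function names and the manifest layout (`files`, `manifest_sha256`) are hypothetical, the standard-library `json` module with sorted keys stands in for canonical orjson, and signatures are left out.

```python
import hashlib
import json
from typing import Dict


def canonical_bytes(obj) -> bytes:
    # Stand-in for canonical orjson output (assumption): sorted keys,
    # no insignificant whitespace, UTF-8 encoded.
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def build_manifest(files: Dict[str, bytes]) -> dict:
    # Map every file name in the bundle to the SHA-256 of its raw bytes.
    entries = {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}
    manifest = {"files": entries}
    # Hash the canonical encoding so that key order cannot change the digest.
    digest = hashlib.sha256(canonical_bytes(manifest)).hexdigest()
    return {"manifest": manifest, "manifest_sha256": digest}
```

Because the manifest is hashed in canonical form, two parties that hold the same files arrive at the same digest regardless of the order in which they listed them.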

Why GetAI

Built for procurement-grade decisions.

Most AI benchmarks publish a number. GetAI publishes the number, the prompt, the response bytes, the judge verdicts, the cost snapshot, and a cryptographic proof you can replay six months from now.

Tenant-private eval

Distill your support tickets into a private benchmark pack in 48 hours. NDA-bound, RLS-isolated, never on the public board.

Silent update probe

2-of-3 fusion (header hash + fingerprint cosine + vendor notes) catches model swaps your dashboard misses for weeks.
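
A 2-of-3 fusion of this kind can be sketched as a simple vote across the three signals. The sketch below is an assumption about how such a probe might be wired, not the production detector: the function names, the representation of a fingerprint as a list of floats, the boolean `vendor_note_flag`, and the 0.95 cosine threshold are all hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Standard cosine similarity; returns 0.0 if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def silent_update_suspected(
    baseline_header_hash: str,
    current_header_hash: str,
    baseline_fingerprint: Sequence[float],
    current_fingerprint: Sequence[float],
    vendor_note_flag: bool,
    cosine_threshold: float = 0.95,  # hypothetical threshold
) -> bool:
    signals = [
        # Signal 1: the response header hash drifted from the baseline.
        baseline_header_hash != current_header_hash,
        # Signal 2: the behavioural fingerprint moved away from the baseline.
        cosine_similarity(baseline_fingerprint, current_fingerprint) < cosine_threshold,
        # Signal 3: vendor release notes mention a change.
        vendor_note_flag,
    ]
    # Flag a suspected swap only when at least two of the three signals agree.
    return sum(signals) >= 2
```

Requiring two signals keeps a single noisy source, such as a header that changes for unrelated infrastructure reasons, from raising an alert on its own.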

Verifiable evidence chain

SHA-256 content-addressable storage + daily Merkle root + envelope encryption + GDPR tombstones. Every byte accountable.
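
For readers who want to recompute a daily root, the Merkle Tree Hash defined in RFC 6962 can be sketched as below. This follows the RFC's construction (a 0x00 prefix for leaves, a 0x01 prefix for interior nodes, a split at the largest power of two smaller than the leaf count). How GetAI orders and serialises its leaves is not stated here, so the sketch assumes each leaf is an opaque byte string.

```python
import hashlib
from typing import List


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: List[bytes]) -> bytes:
    n = len(leaves)
    if n == 0:
        # RFC 6962: the hash of an empty tree is the hash of the empty string.
        return _sha256(b"")
    if n == 1:
        # Leaf hashes are domain-separated with a 0x00 prefix.
        return _sha256(b"\x00" + leaves[0])
    # k is the largest power of two strictly less than n.
    k = 1
    while k * 2 < n:
        k *= 2
    left = merkle_root(leaves[:k])
    right = merkle_root(leaves[k:])
    # Interior nodes are domain-separated with a 0x01 prefix.
    return _sha256(b"\x01" + left + right)
```

The distinct leaf and node prefixes prevent an interior node from being passed off as a leaf, which is the second-preimage defence the RFC describes.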

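Traditional Chinese vertical packs

Real Taiwan workloads (invoice OCR · National Health Insurance and Labor Insurance official documents · customer-service claims handling · regulatory compliance), not translated MMLU.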

Live leaderboard

Phase 0 baseline. Full cohort joins Phase 1.

One model is currently being measured against the tw-coding-daily-v1 smoke pack. The queued rows below ship in Phase 1 (Q3 2026) under the D8 three-judge ensemble. Phase 0 scores are provisional and not eligible for public ranking until then.

# | Vendor | Model | Score | Bundles | Last seen | Status

Launch-Gate 12.7

14 days of consecutive Merkle roots.

One of the hard pre-GA gates: the daily Merkle root must be published and resolvable on each of 14 consecutive days. Each green cell marks a day with at least one bundle anchored.
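
The gate condition can be checked with a few lines of date arithmetic. This is a sketch under assumptions: the input is taken to be a set of UTC dates on which a root was published and resolvable, and the function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Set


def consecutive_days_ending(anchored: Set[date], end: date) -> int:
    # Walk backwards from `end`, counting days until the first gap.
    count = 0
    day = end
    while day in anchored:
        count += 1
        day -= timedelta(days=1)
    return count


def gate_passed(anchored: Set[date], end: date, required: int = 14) -> bool:
    # The gate holds when the unbroken run ending at `end` is long enough.
    return consecutive_days_ending(anchored, end) >= required
```

A single missed day resets the run, so the count reflects the current streak and not the total number of anchored days in the window.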

Trailing 14 days · UTC

… / 14 consecutive days

Legend: day with anchored bundle · no bundle

Evidence stream

Every bundle, downloadable, replay-verified.

The 10 most recent bundles in the chain. Click any row to see the full integrity check run live at the edge: Cloudflare fetches the ZIP from R2, recomputes the SHA-256, and reports verified / tampered / missing.
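
The same three-way check is easy to reproduce offline once you have downloaded a bundle. The sketch below mirrors only the logic described above; it is not the edge function itself, and the function name and the use of `None` to represent a bundle that could not be fetched are assumptions.

```python
import hashlib
from typing import Optional


def check_bundle(zip_bytes: Optional[bytes], expected_sha256_hex: str) -> str:
    # No object could be fetched for this bundle.
    if zip_bytes is None:
        return "missing"
    # Recompute the digest over the exact bytes of the ZIP.
    actual = hashlib.sha256(zip_bytes).hexdigest()
    # Hex digests are compared case-insensitively.
    if actual == expected_sha256_hex.lower():
        return "verified"
    return "tampered"
```

Any change to the ZIP, even a single byte, yields a different digest and therefore a "tampered" result.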
