Methodology

How GetAI scores models without earning your distrust.

Every choice below is documented in the SDD (Project_GetAI_SDD_v1.0.md) and locked by a numbered decision (D1–D24). Changes require a Spectra change proposal, not a Notion edit.

The eight core axes

Every trial is scored along the same eight dimensions. Track-specific axes (efficiency, recovery, refusal appropriateness, tool-use efficacy, plan coherence, locale fidelity) attach when a pack opts into a given track.

correctness

Does the output do what the task demanded? Where a task defines sub-rubrics, their scores may be averaged.

spec_compliance

Pass-rate over task-declared acceptance predicates (returncode/contains/regex/equals).

code_quality

Readability, structure, and idioms, scored by the judge ensemble and calibrated against human ratings.

stability

Variance across N trials of the same task. High variance = low score.

robustness

Behaviour under adversarial / mutated inputs.

evidence_groundedness

Are claims supported by retrieved or task-provided evidence?

evidence_traceability

Can each claim be traced to a specific source span?

uncertainty_calibration

Does stated confidence match empirical accuracy?
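
Three of the axes above reduce to simple arithmetic, sketched here in Python. This is an illustration only. The predicate schema (`type`, `expected`), the result fields (`returncode`, `stdout`), the variance-to-score mapping, and the choice of expected calibration error as the calibration metric are all assumptions, not definitions taken from the SDD.

```python
import re
from statistics import pstdev
from typing import Any


def check_predicate(pred: dict[str, Any], result: dict[str, Any]) -> bool:
    """Evaluate one acceptance predicate (hypothetical schema)."""
    kind, expected = pred["type"], pred["expected"]
    if kind == "returncode":
        return result["returncode"] == expected
    if kind == "contains":
        return expected in result["stdout"]
    if kind == "regex":
        return re.search(expected, result["stdout"]) is not None
    if kind == "equals":
        return result["stdout"].strip() == expected
    raise ValueError(f"unknown predicate type: {kind}")


def spec_compliance(predicates: list[dict[str, Any]], result: dict[str, Any]) -> float:
    """Pass-rate over the task-declared predicates."""
    if not predicates:
        raise ValueError("task declares no acceptance predicates")
    return sum(check_predicate(p, result) for p in predicates) / len(predicates)


def stability_score(trial_scores: list[float]) -> float:
    """Map spread across N trials to [0, 1]; assumes scores lie in [0, 1].

    The population standard deviation of values in [0, 1] is at most 0.5,
    so 1 - 2 * std is 1.0 for identical trials and 0.0 at maximum spread.
    """
    if len(trial_scores) < 2:
        raise ValueError("need at least two trials")
    return max(0.0, 1.0 - 2.0 * pstdev(trial_scores))


def expected_calibration_error(confidences: list[float], correct: list[bool],
                               n_bins: int = 10) -> float:
    """Weighted mean gap between stated confidence and accuracy per bin."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and the same length")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
            ece += len(bucket) / len(confidences) * abs(avg_conf - accuracy)
    return ece
```

A lower calibration error is better, so a real scorer would still need to convert it to an axis score. That conversion is left out because the source does not specify it.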

8-axis profile · live

The radar below shows the current Phase 0 baseline (MiniMax-M2.7, single-provider) against the Phase 1 target envelope under the D8 three-judge ensemble. Both shapes are computed from the same axis weights — no normalisation tricks.

[Figure: 8-axis radar of the scoring profile, MiniMax-M2.7 (Phase 0) vs Phase 1 ensemble target]

Daily anchor activity · 14-day window

Launch-Gate 12.7 demands 14 consecutive days of published Merkle roots before public ranking opens. Today is day 1 of the streak.

[Figure: evidence bundles anchored per day over the trailing 14 days; target: 14 consecutive non-zero bars]
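
The streak rule can be checked mechanically. The sketch below assumes a hypothetical mapping from calendar day to the number of bundles anchored that day; the function names and the data shape are illustrative, not part of the SDD.

```python
from datetime import date, timedelta


def anchor_streak(bundles_per_day: dict[date, int], today: date) -> int:
    """Count consecutive days, ending at `today`, with at least one anchored bundle."""
    streak = 0
    day = today
    while bundles_per_day.get(day, 0) > 0:
        streak += 1
        day -= timedelta(days=1)
    return streak


def launch_gate_open(bundles_per_day: dict[date, int], today: date,
                     required_days: int = 14) -> bool:
    """True once the streak reaches the required number of consecutive days."""
    return anchor_streak(bundles_per_day, today) >= required_days
```

A single day with zero bundles resets the count, which matches the "consecutive non-zero bars" target.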

Judge ensemble (D8)

Phase 1 enforces n ≥ 3 heterogeneous closed-model judges per scored axis, spanning at least 2 distinct vendor families. Inter-rater agreement is measured continuously via Krippendorff's α.
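
As a rough illustration of the agreement measure, the sketch below computes Krippendorff's α with the interval (squared-difference) metric. It assumes judge scores are numeric and that each scored item carries a list of ratings; whether GetAI uses the interval metric or another variant is not stated in the source.

```python
from itertools import permutations


def krippendorff_alpha_interval(units: list[list[float]]) -> float:
    """Krippendorff's alpha, interval metric.

    `units` holds one list of ratings per scored item. Items with fewer
    than two ratings cannot be paired and are ignored.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")
    # Observed disagreement: squared differences within each item.
    observed = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences across all ratings.
    expected = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if expected == 0:
        raise ValueError("ratings have no variation; alpha is undefined")
    return 1.0 - observed / expected
```

A value of 1.0 means perfect agreement, 0.0 means agreement no better than chance, and negative values indicate systematic disagreement. The pairwise loops are quadratic in the number of ratings, which is fine for a sketch but not for continuous monitoring at scale.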

Bi-weekly refresh of n ≥ 100 human-calibration seeds
Degraded-mode auto-trip: if any judge exceeds a 10% error rate for 2h, scoring drops to 2-judge mode with a provisional flag and a 48h backfill SLA
Phase 0 today: single-judge / degraded-mode / provisional. Not eligible for public ranking.
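
The ensemble and degraded-mode rules can be read as a small decision function. The sketch below is one possible interpretation: the `Judge` fields, the mode labels, and the treatment of cases the text does not cover (for example, three healthy judges from a single vendor family) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judge:
    name: str
    vendor_family: str
    error_rate_2h: float  # fraction of failed calls over the trailing 2h window


def judge_mode(judges: list[Judge], max_error: float = 0.10) -> tuple[str, bool]:
    """Return (mode, provisional) for the current judge pool."""
    healthy = [j for j in judges if j.error_rate_2h <= max_error]
    families = {j.vendor_family for j in healthy}
    if len(healthy) >= 3 and len(families) >= 2:
        return ("ensemble", False)
    if len(healthy) >= 2:
        # Covers the 2-judge fallback; scores carry the provisional flag.
        return ("degraded", True)
    if len(healthy) == 1:
        return ("single_judge", True)
    raise RuntimeError("no healthy judges available")
```

Under this reading, any result with `provisional` set to `True` would be excluded from public ranking until the backfill completes.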

Drift detection

Per-axis drift is monitored with a stack of four detectors:

Stack	Purpose	Tunable
MAD-z	Outlier flag	z > 3.5
CUSUM	Sustained shift	k = 0.5σ, h = 5σ
Page-Hinkley	Change-point	λ = 50
Mann-Whitney U + BH	Distribution test + FDR control	monthly FP < 0.5%

Silent update probe (D5)

Vendors swap models without telling you. GetAI catches it via 2-of-3 signal fusion:

S1 — header hash: SHA-256 over canonicalised response headers (CDN noise stripped).
S2 — fingerprint cosine: embeddings of model self-identification responses; threshold 0.08.
S3 — vendor notes scraper: changelog + release notes parsing.

Two of three must trigger to raise an incident. Single-signal trips are queued for review but never auto-published.
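
A minimal sketch of the fusion logic follows. The list of headers treated as noise, the canonical form, and the reading of 0.08 as a cosine-distance threshold are all assumptions made for illustration; the source does not define them.

```python
import hashlib
import math

# Hypothetical set of volatile headers to strip before hashing.
NOISY_HEADERS = frozenset({"date", "age", "cf-ray", "x-request-id", "set-cookie"})


def header_hash(headers: dict[str, str]) -> str:
    """S1: SHA-256 over lower-cased, sorted headers with noisy ones removed."""
    kept = sorted(
        (k.strip().lower(), v.strip())
        for k, v in headers.items()
        if k.strip().lower() not in NOISY_HEADERS
    )
    canonical = "\n".join(f"{k}:{v}" for k, v in kept)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cosine_distance(a: list[float], b: list[float]) -> float:
    """S2 helper: 1 minus the cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-length embedding")
    return 1.0 - dot / norm


def should_raise_incident(s1: bool, s2: bool, s3: bool) -> bool:
    """Raise an incident only when at least two of the three signals trip."""
    return sum((s1, s2, s3)) >= 2
```

In this sketch S2 would trip when `cosine_distance(baseline, current) > 0.08`, and a single tripped signal would be routed to the review queue instead of calling `should_raise_incident`.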

Evidence chain

Each trial produces a content-addressable Evidence Bundle:

manifest.json — canonical orjson, sorted keys, naive UTC
{inputs,outputs,tool_events,judge_verdicts,scores}.ndjson
SIGNATURES.json — SHA-256 of manifest, optional vendor sigs
merkle_proof.json — leaf hash + sibling path + root
attribution.json — phase, judge_mode, comparability marker
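
To make the chain concrete, the sketch below hashes a canonical manifest and verifies a Merkle proof. It uses the standard-library `json` module so that it runs without dependencies; with orjson, the "sorted keys, naive UTC" description appears to correspond to the `OPT_SORT_KEYS` and `OPT_NAIVE_UTC` options, and the resulting bytes (and therefore hashes) may differ from this version. The proof layout, with a `hash` and a `side` per sibling, is a hypothetical format.

```python
import hashlib
import json


def canonical_bytes(obj: dict) -> bytes:
    """Stable byte form: sorted keys, compact separators, UTF-8."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def manifest_digest(manifest: dict) -> str:
    """SHA-256 of the canonical manifest, as a hex string."""
    return hashlib.sha256(canonical_bytes(manifest)).hexdigest()


def verify_merkle_proof(leaf_hash: str, sibling_path: list[dict], root: str) -> bool:
    """Recompute the root from a leaf hash and its sibling path.

    Each step is {"hash": <hex>, "side": "left" | "right"}, where `side`
    says on which side the sibling sits when the pair is concatenated.
    """
    current = bytes.fromhex(leaf_hash)
    for step in sibling_path:
        sibling = bytes.fromhex(step["hash"])
        pair = sibling + current if step["side"] == "left" else current + sibling
        current = hashlib.sha256(pair).digest()
    return current.hex() == root
```

Because the digest depends only on content, two manifests with the same keys and values in a different order produce the same hash, which is what makes the bundle content-addressable.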

Non-goals (Codex-pruned)

Carbon estimation, public hash-chain ledger, DOI/academic tier
Live replay UI, DAO marketplace, long-context-only track
Auto-routing bandit, three-region deployment, full SOC 2 in v1
Legal / medical content generation
A single universal "AI score" — every score has a context

"Don't make a feature-richer aistupidlevel. Make the AI regression system that survives a procurement review."

Methodology version 1.0 · matches Project_GetAI_SDD_v1.0.md Appendix A. All deviations require a numbered decision in the Spectra change log.