What the score measures.

The strict v4 composite is the unweighted mean of the scores on 20 public benchmark tasks. Each task score is reported on a 0-100 scale, after the benchmark scorer has applied its validity gates and outcome scoring.
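As a minimal sketch of the arithmetic described above, the composite can be computed as a plain mean of per-task scores. The function name `composite_score` and the task identifiers are hypothetical and used only for illustration; the benchmark's own scorer code is not shown on this page.

```python
from statistics import fmean


def composite_score(task_scores):
    """Return the unweighted mean of per-task scores.

    task_scores maps a task identifier (hypothetical names here) to a
    score that is already on the 0-100 scale, i.e. after gating and
    outcome scoring.
    """
    scores = list(task_scores.values())
    if not scores:
        raise ValueError("at least one task score is required")
    for task, score in task_scores.items():
        if not 0.0 <= score <= 100.0:
            raise ValueError(f"score for {task!r} is outside 0-100: {score}")
    # Every task counts equally, whatever its difficulty or domain.
    return fmean(scores)


example = {"task_a": 100.0, "task_b": 0.0, "task_c": 50.0}
print(composite_score(example))
```

Because the mean is unweighted, a task zeroed by a validity gate lowers the composite by exactly as much as any other task scoring zero.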

Validity first

Each task applies its public validity gates before any outcome scoring. An invalid data basis, missing packet constraints, lookahead, or omitted required costs can zero or cap that task's score.
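The effect of a failed gate can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's scorer: the `GateResult` type, the gate names, and the cap values are all hypothetical, and the page does not say which failures zero a score and which only cap it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GateResult:
    """Outcome of one validity gate (hypothetical structure)."""
    name: str
    passed: bool
    # Assumption: a failed gate with cap=None zeroes the score;
    # a failed gate with a numeric cap limits the score to that value.
    cap: Optional[float] = None


def gated_score(outcome_score, gates):
    """Apply gate effects to a 0-100 outcome score."""
    score = min(max(outcome_score, 0.0), 100.0)
    for gate in gates:
        if gate.passed:
            continue
        if gate.cap is None:
            return 0.0  # a zeroing failure overrides everything else
        score = min(score, gate.cap)  # the tightest cap wins
    return score


print(gated_score(82.0, [GateResult("lookahead", passed=False)]))
print(gated_score(82.0, [GateResult("required_costs", passed=False, cap=40.0)]))
```

In this sketch a strong outcome cannot compensate for a validity failure, which matches the ordering the page describes: validity is settled first, and the outcome score counts only within whatever limit the gates leave.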

Outcome-weighted

The benchmark is designed to reward the quality of the final quantitative output, judged against hidden holdout objectives, deterministic economic metrics, or canonical portfolio exposures.

Private references

Task prompts and packet contracts are public. Hidden holdout windows, scorer tolerances, oracle normalizers, and private reference outputs remain unavailable to agents during a run.

Public task contracts

Samples were removed from this page

The main benchmark page now publishes all task briefs in full, so a separate representative-samples page would be a weaker duplicate. Use the task ledger for the actual prompts, objectives, packet files, and per-task scores.

View all task briefs