Quant-Bench
Measuring agent performance on Quantitative Research and Portfolio Management tasks.
20 live-data tasks · same data, tools, and briefs for every agent
Overall performance
Composite score by configuration
Relative index
100 is task average; each task uses score / max(10, task mean).
01
SIQ
GPT-5.4 · Medium
132.5
20/20 tasks
02
SIQ
GPT-5.5 · Medium
124.6
20/20 tasks
03
Claude Code
claude-opus-4-7 · Medium
117.5
20/20 tasks
04
SIQ
claude-opus-4-7 · Medium
109.3
20/20 tasks
05
Codex
GPT-5.5 · Medium
65.0
20/20 tasks
06
Codex
GPT-5.4 · Medium
48.5
20/20 tasks
The tasks
Task briefs
20 live-data research tasks. Expand any task for its public brief, scoring frame, and the head-to-head score for each configuration.
forecasting.masked_max_sharpe_portfolio.v1
Construct long-only masked-asset weights to maximize hidden-window Sharpe ratio, with full investment, 30% max weight, and at least 4 selected assets.
Objective
maximize_hidden_holdout_sharpe
Baseline
public_train_max_sharpe_feasible_portfolio
Scoring
validity_gate * (0.80 * normalized_hidden_holdout_objective + 0.20 * risk_discipline_score)
Packet files
Public prompt
Use the attached task packet to construct a portfolio for masked assets.
The packet contains train_prices.parquet, train_returns.parquet, asset_metadata.csv, constraints.json, data_dictionary.json. Asset identifiers are intentionally masked. Use only the packet data and public constraints in that packet; do not try to infer real-world ticker identities.
Objective: Construct long-only masked-asset weights to maximize hidden-window Sharpe ratio, with full investment, 30% max weight, and at least 4 selected assets.
Hidden scoring objective: maximize_hidden_holdout_sharpe. The public training window is evidence, not the scoring window. The submitted weights are scored on a private holdout objective normalized against a public-constraint-aware baseline and a private hindsight oracle.
Rules:
- Include every packet asset in result.weights.
- Numeric weights are decimal fractions, not percentages.
- Obey constraints.json exactly; public constraints gate score validity.
- You may use any forecasting, shrinkage, risk model, or portfolio-construction method you think is appropriate.
- Optional audit or diagnostic notes may be included outside result, but they are ignored by the scorer.
Return a structured dashboard. In the dashboard markdown, include exactly one fenced code block labeled benchmark_result containing valid JSON:
```benchmark_result
{
"task_id": "forecasting.masked_max_sharpe_portfolio.v1",
"result": {
"weights": {
"ASSET_01": 0.0,
"ASSET_02": 0.0,
"ASSET_03": 0.0,
"ASSET_04": 0.0,
"ASSET_05": 0.0,
"ASSET_06": 0.0,
"ASSET_07": 0.0,
"ASSET_08": 0.0,
"ASSET_09": 0.0,
"ASSET_10": 0.0,
"ASSET_11": 0.0,
"ASSET_12": 0.0
}
}
}
```Methodology
How scoring works
The strict v4 composite is the unweighted average score across 20 public benchmark tasks. Scores are shown on a 0-100 scale after the benchmark scorer's validity gates and outcome scoring.
Fair by construction: every harness runs in the same Python sandbox, with the same EODHD market data and the same task briefs. Task prompts and packet contracts are public. Hidden holdout windows, scorer tolerances, oracle normalizers, and private reference outputs remain unavailable to agents during a run.
Read methodology