Release v4

Live results

Quant-Bench

Measuring agent performance on Quantitative Research and Portfolio Management tasks.

20 live-data tasks · same data, tools, and briefs for every agent

Overall performance

Composite score by configuration

SIQ leads · +5.2

Relative index

An index of 100 equals the task average; each task's index is computed from score / max(10, task mean), so the denominator never falls below 10.
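As a rough illustration of the relative index described above, the sketch below divides each raw score by max(10, task mean), scales by 100, and averages the per-task values for each configuration. The scaling by 100, the averaging across tasks, and the names `relative_index`, `task_a`, `agent_x`, and so on are assumptions made for illustration; the page only states the per-task ratio.

```python
from statistics import mean


def relative_index(scores_by_task, floor=10.0):
    """Map {task: {config: raw_score}} to {config: mean relative index}.

    Assumption: per-task index = 100 * score / max(floor, task mean),
    then an unweighted mean across tasks for each configuration.
    """
    per_config = {}
    for task, scores in scores_by_task.items():
        # The floor keeps the denominator from dropping below 10.
        denom = max(floor, mean(scores.values()))
        for config, score in scores.items():
            per_config.setdefault(config, []).append(100.0 * score / denom)
    return {config: mean(values) for config, values in per_config.items()}


# Hypothetical scores, not taken from the benchmark.
example = {
    "task_a": {"agent_x": 60.0, "agent_y": 40.0},  # task mean 50
    "task_b": {"agent_x": 4.0, "agent_y": 2.0},    # task mean 3, floor of 10 applies
}
index = relative_index(example)
print(index)
```

In this toy input the floor matters for `task_b`: without it, a score of 4 against a mean of 3 would contribute an index above 130 even though both agents scored close to zero.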

Experimental

SIQ · GPT-5.4 · Medium · 132.5 · 20/20 tasks

SIQ · GPT-5.5 · Medium · 124.6 · 20/20 tasks

Claude Code · claude-opus-4-7 · Medium · 117.5 · 20/20 tasks

SIQ · claude-opus-4-7 · Medium · 109.3 · 20/20 tasks

Codex · GPT-5.5 · Medium · 65.0 · 20/20 tasks

Codex · GPT-5.4 · Medium · 48.5 · 20/20 tasks

Legend: SIQ · Other harnesses · 0–100 scale · higher is better

The tasks

Task briefs

20 live-data research tasks. Expand any task for its public brief, scoring frame, and the head-to-head score for each configuration.

forecasting.masked_max_sharpe_portfolio.v1

Construct long-only masked-asset weights to maximize hidden-window Sharpe ratio, with full investment, 30% max weight, and at least 4 selected assets.

Objective

maximize_hidden_holdout_sharpe

Baseline

public_train_max_sharpe_feasible_portfolio

Scoring

validity_gate * (0.80 * normalized_hidden_holdout_objective + 0.20 * risk_discipline_score)
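The scoring frame above can be sketched in a few lines. This is a minimal illustration, not the benchmark's scorer: it assumes the validity gate is binary, that both components lie in [0, 1], that the result is scaled to 0-100, and that the hidden objective is normalized linearly between the baseline and the oracle. The function names `normalize` and `task_score` are hypothetical.

```python
def clip01(x):
    """Clamp a value to the closed interval [0, 1]."""
    return min(1.0, max(0.0, x))


def normalize(value, baseline, oracle):
    """Assumed linear normalization: 0 at the baseline, 1 at the oracle."""
    if oracle <= baseline:
        return 0.0  # assumption: a degenerate range earns no credit
    return clip01((value - baseline) / (oracle - baseline))


def task_score(valid, normalized_objective, risk_discipline):
    """validity_gate * (0.80 * objective + 0.20 * risk discipline), on 0-100."""
    gate = 1.0 if valid else 0.0  # assumption: the gate is all-or-nothing
    blended = 0.80 * clip01(normalized_objective) + 0.20 * clip01(risk_discipline)
    return 100.0 * gate * blended


# Hypothetical numbers: holdout Sharpe 1.0, baseline 0.5, oracle 1.5.
objective = normalize(1.0, baseline=0.5, oracle=1.5)
score = task_score(True, objective, risk_discipline=1.0)
rejected = task_score(False, objective, risk_discipline=1.0)
print(score, rejected)
```

Under these assumptions, a submission that violates a public constraint scores zero regardless of how well its weights perform in the hidden window.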

Packet files

train_prices.parquet, train_returns.parquet, asset_metadata.csv, constraints.json, data_dictionary.json

Public prompt

Use the attached task packet to construct a portfolio for masked assets.

The packet contains train_prices.parquet, train_returns.parquet, asset_metadata.csv, constraints.json, data_dictionary.json. Asset identifiers are intentionally masked. Use only the packet data and public constraints in that packet; do not try to infer real-world ticker identities.

Objective: Construct long-only masked-asset weights to maximize hidden-window Sharpe ratio, with full investment, 30% max weight, and at least 4 selected assets.

Hidden scoring objective: maximize_hidden_holdout_sharpe. The public training window is evidence, not the scoring window. The submitted weights are scored on a private holdout objective normalized against a public-constraint-aware baseline and a private hindsight oracle.

Rules:
- Include every packet asset in result.weights.
- Numeric weights are decimal fractions, not percentages.
- Obey constraints.json exactly; public constraints gate score validity.
- You may use any forecasting, shrinkage, risk model, or portfolio-construction method you think is appropriate.
- Optional audit or diagnostic notes may be included outside result, but they are ignored by the scorer.

Return a structured dashboard. In the dashboard markdown, include exactly one fenced code block labeled benchmark_result containing valid JSON:

```benchmark_result
{
  "task_id": "forecasting.masked_max_sharpe_portfolio.v1",
  "result": {
    "weights": {
      "ASSET_01": 0.0,
      "ASSET_02": 0.0,
      "ASSET_03": 0.0,
      "ASSET_04": 0.0,
      "ASSET_05": 0.0,
      "ASSET_06": 0.0,
      "ASSET_07": 0.0,
      "ASSET_08": 0.0,
      "ASSET_09": 0.0,
      "ASSET_10": 0.0,
      "ASSET_11": 0.0,
      "ASSET_12": 0.0
    }
  }
}
```
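Because the public constraints gate score validity, it may help to pre-check a candidate before emitting the benchmark_result block. The sketch below hard-codes the limits stated in the brief (long-only, full investment, 30% max weight, at least 4 selected assets, every packet asset present). In a real run these would come from constraints.json, whose schema is not shown here. The tolerances and the helper name `check_weights` are assumptions; the scorer's actual tolerances are private.

```python
import json


def check_weights(weights, assets, max_weight=0.30, min_selected=4, tol=1e-9):
    """Return a list of constraint violations; an empty list means it passes."""
    problems = []
    if set(weights) != set(assets):
        problems.append("weights must cover every packet asset exactly once")
    values = list(weights.values())
    if any(w < -tol for w in values):
        problems.append("long-only: negative weight found")
    if abs(sum(values) - 1.0) > 1e-6:
        problems.append("full investment: weights must sum to 1.0")
    if any(w > max_weight + tol for w in values):
        problems.append("max weight exceeded")
    if sum(1 for w in values if w > tol) < min_selected:
        problems.append("too few selected assets")
    return problems


assets = [f"ASSET_{i:02d}" for i in range(1, 13)]

# The all-zero template from the prompt is not a valid submission.
template = {a: 0.0 for a in assets}

# Hypothetical candidate: equal weight on the first four assets.
candidate = {a: (0.25 if i < 4 else 0.0) for i, a in enumerate(assets)}

payload = json.dumps(
    {"task_id": "forecasting.masked_max_sharpe_portfolio.v1",
     "result": {"weights": candidate}},
    indent=2,
)
print(check_weights(template, assets))
print(check_weights(candidate, assets))
```

The equal-weight candidate only demonstrates the checks; it is not a suggested portfolio.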

Methodology

How scoring works

The strict v4 composite is the unweighted average score across 20 public benchmark tasks. Scores are shown on a 0-100 scale after the benchmark scorer's validity gates and outcome scoring.
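The unweighted average can be reproduced from per-task scores. The sketch below uses the 20 per-task SIQ scores listed in the task index on this page; the page does not say which SIQ configuration they belong to, so the result is an illustration of the arithmetic rather than a published composite. The helper name `composite` and the choice to reject incomplete score lists are assumptions.

```python
from statistics import mean

# Per-task SIQ scores for tasks 01-20, as listed in the task index on this page.
siq_task_scores = [
    34.8, 20.0, 49.9, 73.9, 72.2, 64.2, 67.1, 55.0, 36.9, 14.3,
    19.5, 32.1, 81.7, 89.0, 45.7, 59.8, 43.4, 36.4, 51.3, 56.7,
]


def composite(task_scores, expected_tasks=20):
    """Unweighted mean of per-task scores on the 0-100 scale."""
    if len(task_scores) != expected_tasks:
        # Assumption: a strict composite requires a score for every task.
        raise ValueError(f"expected {expected_tasks} task scores, got {len(task_scores)}")
    return mean(task_scores)


siq_composite = composite(siq_task_scores)
print(round(siq_composite, 1))
```

Since every task carries equal weight, a single task scoring near zero pulls the composite down by as much as five points.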

Fair by construction: every harness runs in the same Python sandbox, with the same EODHD market data and the same task briefs. Task prompts and packet contracts are public. Hidden holdout windows, scorer tolerances, oracle normalizers, and private reference outputs remain unavailable to agents during a run.

Task briefs

01 · Masked Max Sharpe Portfolio · SIQ 34.8

02 · Masked EM ex China Direct Index · SIQ 20.0

03 · ETF Lookthrough Diversification · SIQ 49.9

04 · Masked Tax Aware Transition · SIQ 73.9

05 · Masked Drawdown Constrained Return · SIQ 72.2

06 · ETF Overlap Hidden Concentration · SIQ 64.2

07 · Masked Earnings Drift Without Dates · SIQ 67.1

08 · Masked ETF Regime Rotation Macro · SIQ 55.0

09 · Masked Factor Crowding Reversal · SIQ 36.9

10 · Volatility Targeted ETF Strategy · SIQ 14.3

11 · Masked Smallcap Liquidity Shock · SIQ 19.5

12 · Turnover Constrained Rebalance · SIQ 32.1

13 · Masked Max Calmar Portfolio · SIQ 81.7

14 · Masked Cross Asset Sector Spillover · SIQ 89.0

15 · Black Litterman ETF Allocation · SIQ 45.7

16 · Masked Noisy Alt Data Selection · SIQ 59.8

17 · Masked Market Neutral Pair Basket · SIQ 43.4

18 · Earnings Event Study · SIQ 36.4

19 · Masked Distribution Shift Robust Alpha · SIQ 51.3

20 · Masked Long Short Alpha Portfolio · SIQ 56.7
