arcade-mcp/libs/tests/cli
jottakka 7472b18106
Fixing bug with multiple providers + stats for multiple runs (#752)
@EricGustin, you can use this CLI command:
```
uv run arcade evals mcp_building_evals_results/eval_toolkit_iteration_dict.py \
    -p openai:gpt-4o,gpt-4o-mini \
    -p anthropic:claude-sonnet-4-20250514 \
    -k openai:$OPENAI_API_KEY \
    -k anthropic:$ANTHROPIC_API_KEY \
    -d \
    --num-runs 3 \
    --seed random \
    --multi-run-pass-rule majority \
    --max-concurrent 6 \
    -o mcp_building_evals_results/results
```
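
To illustrate the repeatable `-p provider:model1,model2` form used in the command above, here is a minimal sketch of how such values could be expanded into (provider, model) pairs. This is not the parser from the PR: `parse_provider_specs` is a hypothetical helper, and requiring at least one model after the colon is an assumption made for the example.

```python
def parse_provider_specs(specs: list[str]) -> list[tuple[str, str]]:
    """Expand repeated 'provider:model_a,model_b' values into (provider, model) pairs.

    Hypothetical helper for illustration only; the real CLI may accept
    other forms (for example a provider with no explicit model).
    """
    pairs: list[tuple[str, str]] = []
    for spec in specs:
        provider, sep, models = spec.partition(":")
        provider = provider.strip()
        if not sep or not provider or not models.strip():
            raise ValueError(f"expected 'provider:model[,model...]', got {spec!r}")
        for model in models.split(","):
            model = model.strip()
            if model:  # skip empty entries such as a trailing comma
                pairs.append((provider, model))
    return pairs


# One list entry per occurrence of -p on the command line.
pairs = parse_provider_specs(
    ["openai:gpt-4o,gpt-4o-mini", "anthropic:claude-sonnet-4-20250514"]
)
for provider, model in pairs:
    print(f"{provider} -> {model}")
```

Keeping one spec per `-p` occurrence avoids the ambiguity of splitting a single space-delimited string, which is the change the summary below describes.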

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches core eval execution and all result formatters while adding new
CLI inputs and output schema (`run_stats`/`critic_stats` and capture
`runs`), so regressions could affect evaluation results and report
compatibility despite being additive and validated.
> 
> **Overview**
> Adds **multi-run evaluation support** to `arcade evals` via new flags
`--num-runs`, `--seed`, and `--multi-run-pass-rule`, with upfront
validation and plumbing through the CLI runner into eval/capture suite
execution.
> 
> Fixes provider selection UX/bug by making `--use-provider/-p`
**repeatable** (instead of a space-delimited string), updates
docs/examples accordingly, and extends capture mode to optionally record
**per-run tool calls** (`CapturedRun`) when `num_runs > 1`.
> 
> Enhances all output formatters (HTML/Markdown/Text/JSON) to
**propagate and display** per-case `run_stats` and `critic_stats`,
including new HTML UI for run tabs/cards and comparative tables showing
mean ± stddev when multi-run data is present.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
2ee1654b7d1fbb9538373507355636164b16a066. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
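
As a rough illustration of what per-case multi-run statistics with a `majority` pass rule could look like, the sketch below aggregates the scores and pass flags of several runs. It is not the implementation from this PR: `summarize_runs` and the returned keys are hypothetical, the sample standard deviation is an assumed choice, and treating "majority" as "more than half of the runs passed" is also an assumption.

```python
from statistics import mean, stdev


def summarize_runs(scores: list[float], passed: list[bool]) -> dict:
    """Aggregate one eval case across several runs.

    Hypothetical sketch: the key names and the use of the sample
    standard deviation are assumptions, not the PR's actual schema.
    """
    if not scores or len(scores) != len(passed):
        raise ValueError("scores and passed must be non-empty and equal in length")
    return {
        "num_runs": len(scores),
        "mean_score": mean(scores),
        # stdev() needs at least two data points; report 0.0 for a single run.
        "stddev_score": stdev(scores) if len(scores) > 1 else 0.0,
        # Assumed reading of "majority": strictly more than half of the runs passed.
        "passed": sum(passed) > len(passed) / 2,
    }


# Three runs of the same case, as with --num-runs 3.
stats = summarize_runs([1.0, 0.5, 0.0], [True, True, False])
print(f"{stats['mean_score']:.2f} ± {stats['stddev_score']:.2f}, passed={stats['passed']}")
```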
2026-02-09 14:25:28 -03:00
| Name | Last commit | Date |
| --- | --- | --- |
| deploy | Adding MCP Servers supports to Arcade Evals (#689) | 2026-01-07 20:26:23 -03:00 |
| usage | feat: Support multiple orgs & projects in Arcade CLI (#717) | 2025-12-11 12:58:55 -08:00 |
| test_capture_formatters.py | Fixing bug with multiple providers + stats for multiple runs (#752) | 2026-02-09 14:25:28 -03:00 |
| test_dashboard.py | Adding MCP Servers supports to Arcade Evals (#689) | 2026-01-07 20:26:23 -03:00 |
| test_display.py | Adding MCP Servers supports to Arcade Evals (#689) | 2026-01-07 20:26:23 -03:00 |
| test_evals_runner.py | Fixing bug with multiple providers + stats for multiple runs (#752) | 2026-02-09 14:25:28 -03:00 |
| test_formatter_edge_cases.py | Fixing bug with multiple providers + stats for multiple runs (#752) | 2026-02-09 14:25:28 -03:00 |
| test_formatters.py | Fixing bug with multiple providers + stats for multiple runs (#752) | 2026-02-09 14:25:28 -03:00 |
| test_main_evals.py | Fixing bug with multiple providers + stats for multiple runs (#752) | 2026-02-09 14:25:28 -03:00 |
| test_secret.py | Adding MCP Servers supports to Arcade Evals (#689) | 2026-01-07 20:26:23 -03:00 |
| test_show.py | MCP Local (#563) | 2025-09-25 15:28:15 -07:00 |
| test_utils.py | feat: Support multiple orgs & projects in Arcade CLI (#717) | 2025-12-11 12:58:55 -08:00 |
| test_utils_multi_provider.py | Fixing bug with multiple providers + stats for multiple runs (#752) | 2026-02-09 14:25:28 -03:00 |