arcade-mcp

jottakka 7472b18106 Fixing bug with multiple providers + stats for multiple runs (#752)		2026-02-09 14:25:28 -03:00

@EricGustin you can use this cli command:

```
uv run arcade evals mcp_building_evals_results/eval_toolkit_iteration_dict.py \
  -p openai:gpt-4o,gpt-4o-mini \
  -p anthropic:claude-sonnet-4-20250514 \
  -k openai:$OPENAI_API_KEY \
  -k anthropic:$ANTHROPIC_API_KEY \
  -d \
  --num-runs 3 \
  --seed random \
  --multi-run-pass-rule majority \
  --max-concurrent 6 \
  -o mcp_building_evals_results/results
```

---

> [!NOTE]
> Medium Risk
> Touches core eval execution and all result formatters while adding new CLI inputs and output schema (`run_stats`/`critic_stats` and capture `runs`), so regressions could affect evaluation results and report compatibility despite being additive and validated.
>
> Overview
> Adds multi-run evaluation support to `arcade evals` via new flags `--num-runs`, `--seed`, and `--multi-run-pass-rule`, with upfront validation and plumbing through the CLI runner into eval/capture suite execution.
>
> Fixes provider selection UX/bug by making `--use-provider/-p` repeatable (instead of a space-delimited string), updates docs/examples accordingly, and extends capture mode to optionally record per-run tool calls (`CapturedRun`) when `num_runs > 1`.
>
> Enhances all output formatters (HTML/Markdown/Text/JSON) to propagate and display per-case `run_stats` and `critic_stats`, including new HTML UI for run tabs/cards and comparative tables showing mean ± stddev when multi-run data is present.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 2ee1654b7d1fbb9538373507355636164b16a066.</sup>
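
The commit description above mentions a `majority` pass rule and mean ± stddev reporting across runs. As a rough illustration only, the sketch below shows one way such an aggregation could work. The function name `aggregate_runs`, its return keys, and the choice of sample standard deviation are assumptions made for this example, not the actual `arcade evals` implementation.

```python
from statistics import mean, stdev


def aggregate_runs(scores, passed, rule="majority"):
    """Summarize one eval case across several runs.

    Hypothetical helper: the name, the returned keys, and the use of the
    sample standard deviation are assumptions for this sketch.
    """
    if not scores or len(scores) != len(passed):
        raise ValueError("scores and passed must be non-empty and of equal length")
    if rule != "majority":
        # Only the rule named in the commit description is sketched here.
        raise ValueError(f"unsupported pass rule: {rule!r}")

    num_runs = len(scores)
    num_passed = sum(1 for p in passed if p)
    return {
        "num_runs": num_runs,
        "mean_score": mean(scores),
        # stdev() needs at least two data points, so report 0.0 for a single run.
        "stddev_score": stdev(scores) if num_runs > 1 else 0.0,
        # Majority: strictly more than half of the runs must pass.
        "passed": num_passed > num_runs / 2,
    }


if __name__ == "__main__":
    stats = aggregate_runs([1.0, 0.5, 0.75], [True, False, True])
    print(f"{stats['mean_score']:.2f} ± {stats['stddev_score']:.2f}, passed={stats['passed']}")
```

With three runs, as in `--num-runs 3`, two passing runs are enough for the case to pass under this reading of the majority rule.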
..
deploy	Adding MCP Servers supports to Arcade Evals (#689)	2026-01-07 20:26:23 -03:00
usage	feat: Support multiple orgs & projects in Arcade CLI (#717)	2025-12-11 12:58:55 -08:00
test_capture_formatters.py	Fixing bug with multiple providers + stats for multiple runs (#752)	2026-02-09 14:25:28 -03:00
test_dashboard.py	Adding MCP Servers supports to Arcade Evals (#689)	2026-01-07 20:26:23 -03:00
test_display.py	Adding MCP Servers supports to Arcade Evals (#689)	2026-01-07 20:26:23 -03:00
test_evals_runner.py	Fixing bug with multiple providers + stats for multiple runs (#752)	2026-02-09 14:25:28 -03:00
test_formatter_edge_cases.py	Fixing bug with multiple providers + stats for multiple runs (#752)	2026-02-09 14:25:28 -03:00
test_formatters.py	Fixing bug with multiple providers + stats for multiple runs (#752)	2026-02-09 14:25:28 -03:00
test_main_evals.py	Fixing bug with multiple providers + stats for multiple runs (#752)	2026-02-09 14:25:28 -03:00
test_secret.py	Adding MCP Servers supports to Arcade Evals (#689)	2026-01-07 20:26:23 -03:00
test_show.py	MCP Local (#563)	2025-09-25 15:28:15 -07:00
test_utils.py	feat: Support multiple orgs & projects in Arcade CLI (#717)	2025-12-11 12:58:55 -08:00
test_utils_multi_provider.py	Fixing bug with multiple providers + stats for multiple runs (#752)	2026-02-09 14:25:28 -03:00