arcade-mcp

History

jottakka 7472b18106 Fixing bug with multiple providers + stats for multiple runs (#752)

@EricGustin you can use this cli command:

```
uv run arcade evals mcp_building_evals_results/eval_toolkit_iteration_dict.py \
  -p openai:gpt-4o,gpt-4o-mini \
  -p anthropic:claude-sonnet-4-20250514 \
  -k openai:$OPENAI_API_KEY \
  -k anthropic:$ANTHROPIC_API_KEY \
  -d \
  --num-runs 3 \
  --seed random \
  --multi-run-pass-rule majority \
  --max-concurrent 6 \
  -o mcp_building_evals_results/results
```

---

> [!NOTE]
> Medium Risk
> Touches core eval execution and all result formatters while adding new CLI inputs and output schema (`run_stats`/`critic_stats` and capture `runs`), so regressions could affect evaluation results and report compatibility despite being additive and validated.
>
> Overview
> Adds multi-run evaluation support to `arcade evals` via new flags `--num-runs`, `--seed`, and `--multi-run-pass-rule`, with upfront validation and plumbing through the CLI runner into eval/capture suite execution.
>
> Fixes provider selection UX/bug by making `--use-provider/-p` repeatable (instead of a space-delimited string), updates docs/examples accordingly, and extends capture mode to optionally record per-run tool calls (`CapturedRun`) when `num_runs > 1`.
>
> Enhances all output formatters (HTML/Markdown/Text/JSON) to propagate and display per-case `run_stats` and `critic_stats`, including new HTML UI for run tabs/cards and comparative tables showing mean ± stddev when multi-run data is present.
>
> Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 2ee1654b7d1fbb9538373507355636164b16a066.

2026-02-09 14:25:28 -03:00
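
The commit description above mentions a `majority` pass rule and tables showing mean ± stddev across runs. The sketch below illustrates one plausible way such per-case statistics could be computed; it is not taken from the arcade-mcp source. The names `majority_pass` and `summarize_runs` are hypothetical, and both the use of the sample standard deviation and the strict "more than half" reading of `majority` are assumptions.

```python
from statistics import mean, stdev


def majority_pass(run_passed: list[bool]) -> bool:
    """Hypothetical 'majority' rule: strictly more than half of the runs pass."""
    if not run_passed:
        raise ValueError("at least one run is required")
    return sum(run_passed) * 2 > len(run_passed)


def summarize_runs(scores: list[float], run_passed: list[bool]) -> dict:
    """Hypothetical per-case summary across repeated runs of one eval case."""
    if not scores or len(scores) != len(run_passed):
        raise ValueError("scores and run_passed must be non-empty and equal in length")
    return {
        "num_runs": len(scores),
        "mean_score": mean(scores),
        # Assumption: sample standard deviation; undefined for one run, so use 0.0.
        "stddev_score": stdev(scores) if len(scores) > 1 else 0.0,
        "pass_rate": sum(run_passed) / len(run_passed),
        "passed": majority_pass(run_passed),
    }


if __name__ == "__main__":
    stats = summarize_runs([0.9, 0.8, 1.0], [True, False, True])
    print(f"{stats['mean_score']:.2f} ± {stats['stddev_score']:.2f}, passed={stats['passed']}")
```

With three runs, two passing runs are enough to pass the case under this reading; with an even number of runs, a tie counts as a failure.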
..
__init__.py	Adding MCP Servers supports to Arcade Evals (#689 )	2026-01-07 20:26:23 -03:00
test_errors.py	Adding MCP Servers supports to Arcade Evals (#689 )	2026-01-07 20:26:23 -03:00
test_eval.py	Adding MCP Servers supports to Arcade Evals (#689 )	2026-01-07 20:26:23 -03:00
test_eval_anthropic.py	Adding MCP Servers supports to Arcade Evals (#689 )	2026-01-07 20:26:23 -03:00
test_eval_capture.py	Fixing bug with multiple providers + stats for multiple runs (#752 )	2026-02-09 14:25:28 -03:00
test_eval_critic.py	Adding MCP Servers supports to Arcade Evals (#689 )	2026-01-07 20:26:23 -03:00
test_eval_multi_run.py	Fixing bug with multiple providers + stats for multiple runs (#752 )	2026-02-09 14:25:28 -03:00
test_evalsuite_convenience.py	Adding MCP Servers supports to Arcade Evals (#689 )	2026-01-07 20:26:23 -03:00
test_fuzzy_weight.py	Adding MCP Servers supports to Arcade Evals (#689 )	2026-01-07 20:26:23 -03:00
test_google_adapter.py	Tool Error Handling (#539 )	2025-09-10 10:45:18 -07:00
test_graphql_adapter.py	[TOO-145]GQL Error Adaptor (#692 )	2025-11-25 23:00:26 -03:00
test_graphql_tooling.py	[TOO-145]GQL Error Adaptor (#692 )	2025-11-25 23:00:26 -03:00
test_httpx_adapter.py	Fixing 403 error always being RateLimiting (#685 )	2025-11-13 22:20:32 -03:00
test_loaders.py	Adding MCP Servers supports to Arcade Evals (#689 )	2026-01-07 20:26:23 -03:00
test_microsoft_adapter.py	[READY][PROD-215][TDK] Adding MS error adapter (#575 )	2025-09-24 09:04:42 -03:00
test_slack_adapter.py	[READY][PROD-215][TDK] Adding Slack error adaptor (#577 )	2025-09-24 17:26:37 -03:00
test_tool_decorator.py	PagerDuty typed OAuth object (#718 )	2025-12-15 17:42:11 -03:00