arcade-mcp/libs/tests/sdk
jottakka 7472b18106
Fixing bug with multiple providers + stats for multiple runs (#752)
@EricGustin you can use this CLI command:
```
uv run arcade evals mcp_building_evals_results/eval_toolkit_iteration_dict.py \
    -p openai:gpt-4o,gpt-4o-mini \
    -p anthropic:claude-sonnet-4-20250514 \
    -k openai:$OPENAI_API_KEY \
    -k anthropic:$ANTHROPIC_API_KEY \
    -d \
    --num-runs 3 \
    --seed random \
    --multi-run-pass-rule majority \
    --max-concurrent 6 \
    -o mcp_building_evals_results/results
```
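
The command above passes `-p` more than once, each time as `provider:model[,model...]`. A minimal sketch of how a repeatable flag of that shape could be collected and split is shown here. It uses `argparse` purely for illustration; `parse_provider_spec`, `build_parser`, and the handling of a spec without `:` are assumptions, not the actual `arcade evals` implementation.

```python
import argparse


def parse_provider_spec(spec: str) -> tuple[str, list[str]]:
    """Split 'provider:model_a,model_b' into (provider, [models]).

    Assumption for this sketch: a spec without ':' names a provider
    with no explicit models.
    """
    provider, _, models = spec.partition(":")
    provider = provider.strip().lower()
    if not provider:
        raise ValueError(f"missing provider name in {spec!r}")
    return provider, [m.strip() for m in models.split(",") if m.strip()]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the -p/--use-provider flag is modeled."""
    parser = argparse.ArgumentParser(prog="evals-sketch")
    # action="append" collects every occurrence of -p into one list,
    # which is what makes the flag repeatable.
    parser.add_argument("-p", "--use-provider", action="append", default=[])
    return parser


args = build_parser().parse_args(
    ["-p", "openai:gpt-4o,gpt-4o-mini", "-p", "anthropic:claude-sonnet-4-20250514"]
)
providers = dict(parse_provider_spec(spec) for spec in args.use_provider)
print(providers)
```

Keeping each provider in its own `-p` occurrence avoids splitting one string on spaces, so model lists stay attached to the provider they belong to.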

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches core eval execution and all result formatters while adding new
CLI inputs and output schema (`run_stats`/`critic_stats` and capture
`runs`), so regressions could affect evaluation results and report
compatibility despite being additive and validated.
> 
> **Overview**
> Adds **multi-run evaluation support** to `arcade evals` via new flags
`--num-runs`, `--seed`, and `--multi-run-pass-rule`, with upfront
validation and plumbing through the CLI runner into eval/capture suite
execution.
> 
> Fixes provider selection UX/bug by making `--use-provider/-p`
**repeatable** (instead of a space-delimited string), updates
docs/examples accordingly, and extends capture mode to optionally record
**per-run tool calls** (`CapturedRun`) when `num_runs > 1`.
> 
> Enhances all output formatters (HTML/Markdown/Text/JSON) to
**propagate and display** per-case `run_stats` and `critic_stats`,
including new HTML UI for run tabs/cards and comparative tables showing
mean ± stddev when multi-run data is present.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
2ee1654b7d1fbb9538373507355636164b16a066. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
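
To make the multi-run summary concrete, here is a rough sketch of how per-case scores from several runs might be reduced to a mean ± stddev and a `majority` pass decision. Everything in it is an assumption for illustration: the `RunStats` fields, the `summarize_runs` helper, the 0.8 pass threshold, the use of the sample standard deviation, and the strict-majority tie handling are not taken from the actual `run_stats` schema.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RunStats:
    """Hypothetical per-case summary across repeated runs."""
    mean: float
    stddev: float
    passed_runs: int
    num_runs: int
    passed: bool


def summarize_runs(
    scores: list[float],
    pass_threshold: float = 0.8,  # assumed threshold, not from the source
    pass_rule: str = "majority",
) -> RunStats:
    """Aggregate the scores of one eval case over several runs."""
    if not scores:
        raise ValueError("at least one run is required")
    if pass_rule != "majority":
        # Only the rule shown in the command above is modeled here.
        raise ValueError(f"unsupported pass rule: {pass_rule!r}")

    num_runs = len(scores)
    passed_runs = sum(score >= pass_threshold for score in scores)
    # Strict majority: a tie (e.g. 1 of 2 runs) counts as a failure.
    passed = passed_runs * 2 > num_runs
    # Sample standard deviation needs at least two data points.
    stddev = statistics.stdev(scores) if num_runs > 1 else 0.0
    return RunStats(statistics.fmean(scores), stddev, passed_runs, num_runs, passed)


stats = summarize_runs([0.9, 0.7, 1.0])
print(f"{stats.mean:.2f} ± {stats.stddev:.2f}, "
      f"{stats.passed_runs}/{stats.num_runs} runs passed, passed={stats.passed}")
```

With three runs, two scores at or above the threshold are enough for the case to pass under this rule, even though one run fell below it.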
2026-02-09 14:25:28 -03:00
| File | Last commit | Date |
| --- | --- | --- |
| `__init__.py` | Adding MCP Servers supports to Arcade Evals (#689) | 2026-01-07 20:26:23 -03:00 |
| `test_errors.py` | Adding MCP Servers supports to Arcade Evals (#689) | 2026-01-07 20:26:23 -03:00 |
| `test_eval.py` | Adding MCP Servers supports to Arcade Evals (#689) | 2026-01-07 20:26:23 -03:00 |
| `test_eval_anthropic.py` | Adding MCP Servers supports to Arcade Evals (#689) | 2026-01-07 20:26:23 -03:00 |
| `test_eval_capture.py` | Fixing bug with multiple providers + stats for multiple runs (#752) | 2026-02-09 14:25:28 -03:00 |
| `test_eval_critic.py` | Adding MCP Servers supports to Arcade Evals (#689) | 2026-01-07 20:26:23 -03:00 |
| `test_eval_multi_run.py` | Fixing bug with multiple providers + stats for multiple runs (#752) | 2026-02-09 14:25:28 -03:00 |
| `test_evalsuite_convenience.py` | Adding MCP Servers supports to Arcade Evals (#689) | 2026-01-07 20:26:23 -03:00 |
| `test_fuzzy_weight.py` | Adding MCP Servers supports to Arcade Evals (#689) | 2026-01-07 20:26:23 -03:00 |
| `test_google_adapter.py` | Tool Error Handling (#539) | 2025-09-10 10:45:18 -07:00 |
| `test_graphql_adapter.py` | [TOO-145]GQL Error Adaptor (#692) | 2025-11-25 23:00:26 -03:00 |
| `test_graphql_tooling.py` | [TOO-145]GQL Error Adaptor (#692) | 2025-11-25 23:00:26 -03:00 |
| `test_httpx_adapter.py` | Fixing 403 error always being RateLimiting (#685) | 2025-11-13 22:20:32 -03:00 |
| `test_loaders.py` | Adding MCP Servers supports to Arcade Evals (#689) | 2026-01-07 20:26:23 -03:00 |
| `test_microsoft_adapter.py` | [READY][PROD-215][TDK] Adding MS error adapter (#575) | 2025-09-24 09:04:42 -03:00 |
| `test_slack_adapter.py` | [READY][PROD-215][TDK] Adding Slack error adaptor (#577) | 2025-09-24 17:26:37 -03:00 |
| `test_tool_decorator.py` | PagerDuty typed OAuth object (#718) | 2025-12-15 17:42:11 -03:00 |