History

jottakka 7472b18106 Fixing bug with multiple providers + stats for multiple runs (#752)	2026-02-09 14:25:28 -03:00
arcade_evals	Fixing bug with multiple providers + stats for multiple runs (#752)	2026-02-09 14:25:28 -03:00
README.md	Fixing bug with multiple providers + stats for multiple runs (#752)	2026-02-09 14:25:28 -03:00

Fixing bug with multiple providers + stats for multiple runs (#752)

@EricGustin you can use this CLI command:
```
uv run arcade evals mcp_building_evals_results/eval_toolkit_iteration_dict.py \
    -p openai:gpt-4o,gpt-4o-mini \
    -p anthropic:claude-sonnet-4-20250514 \
    -k openai:$OPENAI_API_KEY \
    -k anthropic:$ANTHROPIC_API_KEY \
    -d \
    --num-runs 3 \
    --seed random \
    --multi-run-pass-rule majority \
    --max-concurrent 6 \
    -o mcp_building_evals_results/results
```

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches core eval execution and all result formatters while adding new
CLI inputs and output schema (`run_stats`/`critic_stats` and capture
`runs`), so regressions could affect evaluation results and report
compatibility despite being additive and validated.
> 
> **Overview**
> Adds **multi-run evaluation support** to `arcade evals` via new flags
`--num-runs`, `--seed`, and `--multi-run-pass-rule`, with upfront
validation and plumbing through the CLI runner into eval/capture suite
execution.
> 
> Fixes provider selection UX/bug by making `--use-provider/-p`
**repeatable** (instead of a space-delimited string), updates
docs/examples accordingly, and extends capture mode to optionally record
**per-run tool calls** (`CapturedRun`) when `num_runs > 1`.
> 
> Enhances all output formatters (HTML/Markdown/Text/JSON) to
**propagate and display** per-case `run_stats` and `critic_stats`,
including new HTML UI for run tabs/cards and comparative tables showing
mean ± stddev when multi-run data is present.
<!-- /CURSOR_SUMMARY -->

README.md

Arcade Evals

Evaluation toolkit for testing Arcade tools.

Overview

Arcade Evals provides the following evaluation capabilities for Arcade tools:

- Evaluation Framework: Cases, suites, and rubrics for systematic testing
- Critics: Comparison types for expected vs. actual values (binary, numeric, similarity, datetime)
- Tool Evaluation: Decorators and utilities for evaluating tool performance
- Multi-Run Statistics: Run each case multiple times with configurable seed policies and pass rules to measure consistency
- Comparative Evaluation: Compare tool performance across multiple sources/tracks side by side
- Capture Mode: Record model tool calls without scoring, for debugging and baseline generation
- Result Analysis: Evaluation results and reports in multiple formats (text, Markdown, HTML, JSON)

Installation

pip install 'arcade-mcp[evals]'

Usage

Basic Evaluation

from arcade_evals import EvalCase, EvalSuite, tool_eval

# Create evaluation cases
case1 = EvalCase(
    input={"query": "What is 2+2?"},
    expected_output="4"
)

case2 = EvalCase(
    input={"query": "What is the capital of France?"},
    expected_output="Paris"
)

# Create evaluation suite
suite = EvalSuite(cases=[case1, case2])

# Evaluate a tool
@tool_eval(suite)
def my_calculator(query: str) -> str:
    # Tool implementation
    return "4" if "2+2" in query else "Unknown"

Using Critics

from arcade_evals import NumericCritic, SimilarityCritic

# Numeric comparison
numeric_critic = NumericCritic(tolerance=0.1)
result = numeric_critic.evaluate(expected=10.0, actual=10.05)

# Similarity comparison
similarity_critic = SimilarityCritic(threshold=0.8)
result = similarity_critic.evaluate(
    expected="The capital of France is Paris",
    actual="Paris is the capital of France"
)
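
To make the two comparisons above concrete, here is a minimal standalone sketch of what a tolerance check and a thresholded similarity check could compute. It does not use or reproduce arcade_evals internals: `within_tolerance` and `cosine_similarity` are hypothetical helpers, the tolerance is assumed to be an absolute difference, and bag-of-words cosine similarity is a stand-in for whatever measure `SimilarityCritic` actually applies.

```python
import math
import re
from collections import Counter


def within_tolerance(expected: float, actual: float, tolerance: float) -> bool:
    """Assumed semantics: pass when the absolute difference is within tolerance."""
    return abs(expected - actual) <= tolerance


def cosine_similarity(expected: str, actual: str) -> float:
    """Bag-of-words cosine similarity over lowercase word counts (stand-in measure)."""
    a = Counter(re.findall(r"\w+", expected.lower()))
    b = Counter(re.findall(r"\w+", actual.lower()))
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    # Guard against empty inputs, which would otherwise divide by zero.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Same inputs as the critic examples above.
numeric_ok = within_tolerance(expected=10.0, actual=10.05, tolerance=0.1)
similarity = cosine_similarity(
    "The capital of France is Paris",
    "Paris is the capital of France",
)
similarity_ok = similarity >= 0.8  # threshold taken from the example above
```

Both sentences contain the same words, so a measure that ignores word order scores them as near-identical; an order-sensitive measure would score them lower, which is why the choice of similarity measure matters when picking a threshold.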

Advanced Evaluation

from arcade_evals import EvalRubric, EvalSuite, ExpectedToolCall

# Create rubric with tool calls
rubric = EvalRubric(
    expected_tool_calls=[
        ExpectedToolCall(
            tool_name="calculator",
            parameters={"operation": "add", "a": 2, "b": 2}
        )
    ]
)

# Evaluate with rubric (reuses case1 from the Basic Evaluation example)
suite = EvalSuite(cases=[case1], rubric=rubric)

Multi-Run Evaluation

Run each case multiple times to measure consistency:

# Run via the CLI
# arcade evals eval_file.py --num-runs 5 --seed random --multi-run-pass-rule majority

# Or programmatically, from inside an async function
# (`suite` and `client` are assumed to be set up already)
result = await suite.run(
    client,
    model="gpt-4o",
    num_runs=5,            # Run each case 5 times
    seed="random",         # Different seed per run
    multi_run_pass_rule="majority",  # Pass if >50% of runs pass
)

Multi-run results include per-case statistics:

- Mean score and standard deviation across runs
- Per-run pass/fail with individual scores
- Per-critic field score breakdowns across runs
- Configurable pass rules: last (default), mean, or majority
- Configurable seed policies: constant (fixed seed 42), random, or a specific integer
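
The statistics and rules listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the arcade_evals implementation: `RunResult`, `summarize_runs`, `resolve_seed`, and the `pass_threshold` parameter are hypothetical names, the `mean` rule is assumed to compare the mean score against a threshold, and the sketch uses the population standard deviation because this README does not say which variant the library reports.

```python
import random
import statistics
from dataclasses import dataclass
from typing import Union


@dataclass
class RunResult:
    score: float  # score of a single run, assumed to lie in [0, 1]
    passed: bool  # whether that single run passed


def resolve_seed(policy: Union[str, int]) -> int:
    """Turn a seed policy into a concrete seed for one run."""
    if policy == "constant":
        return 42  # fixed seed, as listed above
    if policy == "random":
        return random.randrange(2**31)  # fresh seed on every call
    return int(policy)  # a specific integer


def summarize_runs(runs, rule="last", pass_threshold=0.8):
    """Aggregate per-run results into mean, stddev, and an overall verdict."""
    if not runs:
        raise ValueError("at least one run is required")
    scores = [run.score for run in runs]
    mean = statistics.fmean(scores)
    stddev = statistics.pstdev(scores)  # population stddev (assumption)
    if rule == "last":
        passed = runs[-1].passed
    elif rule == "mean":
        passed = mean >= pass_threshold  # assumed meaning of the mean rule
    elif rule == "majority":
        passed = sum(run.passed for run in runs) > len(runs) / 2  # >50% pass
    else:
        raise ValueError(f"unknown pass rule: {rule!r}")
    return {"mean": mean, "stddev": stddev, "passed": passed, "runs": len(runs)}


runs = [RunResult(0.9, True), RunResult(0.5, False), RunResult(1.0, True)]
summary = summarize_runs(runs, rule="majority")
seeds = [resolve_seed("random") for _ in runs]  # one seed per run
```

With these three runs, two of three pass, so the majority rule passes the case, while the mean and standard deviation still expose how much the scores varied between runs.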

License

MIT License - see LICENSE file for details.