
Arcade Evals

Evaluation toolkit for testing Arcade tools.

Overview

Arcade Evals provides the following for evaluating Arcade tools:

  • Evaluation Framework: Cases, suites, and rubrics for systematic testing
  • Critics: Different types of comparisons (binary, numeric, similarity, datetime)
  • Tool Evaluation: Decorators and utilities for evaluating tool performance
  • Multi-Run Statistics: Run each case multiple times with configurable seed policies and pass rules to measure consistency
  • Comparative Evaluation: Compare tool performance across multiple sources/tracks side-by-side
  • Capture Mode: Record model tool calls without scoring for debugging and baseline generation
  • Result Analysis: Evaluation results and reports in text, Markdown, HTML, or JSON

Installation

pip install 'arcade-mcp[evals]'

Usage

Basic Evaluation

from arcade_evals import EvalCase, EvalSuite, tool_eval

# Create evaluation cases
case1 = EvalCase(
    input={"query": "What is 2+2?"},
    expected_output="4"
)

case2 = EvalCase(
    input={"query": "What is the capital of France?"},
    expected_output="Paris"
)

# Create evaluation suite
suite = EvalSuite(cases=[case1, case2])

# Evaluate a tool
@tool_eval(suite)
def my_calculator(query: str) -> str:
    # Tool implementation
    return "4" if "2+2" in query else "Unknown"

Using Critics

from arcade_evals import NumericCritic, SimilarityCritic

# Numeric comparison
numeric_critic = NumericCritic(tolerance=0.1)
result = numeric_critic.evaluate(expected=10.0, actual=10.05)

# Similarity comparison
similarity_critic = SimilarityCritic(threshold=0.8)
result = similarity_critic.evaluate(
    expected="The capital of France is Paris",
    actual="Paris is the capital of France"
)
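
To make the two comparisons above concrete, here is a minimal sketch of what a tolerance check and a similarity-threshold check could look like. This is not the `arcade_evals` implementation: `within_tolerance`, `similarity`, and `similar_enough` are hypothetical helpers, and `difflib.SequenceMatcher` is only a stand-in similarity measure chosen because it ships with Python.

```python
from difflib import SequenceMatcher


def within_tolerance(expected: float, actual: float, tolerance: float) -> bool:
    """Hypothetical helper: pass when the absolute difference is within tolerance."""
    return abs(expected - actual) <= tolerance


def similarity(expected: str, actual: str) -> float:
    """Stand-in similarity score in [0, 1] (assumption: character-based, via difflib)."""
    return SequenceMatcher(None, expected, actual).ratio()


def similar_enough(expected: str, actual: str, threshold: float) -> bool:
    """Hypothetical helper: pass when the similarity score reaches the threshold."""
    return similarity(expected, actual) >= threshold


# Same inputs as the critic examples above.
print(within_tolerance(10.0, 10.05, tolerance=0.1))
print(similarity("The capital of France is Paris", "Paris is the capital of France"))
```

A character-based measure like this one is sensitive to word order, so a reordered sentence may score lower than it would under a semantic measure. Which measure the real critic uses is not stated here.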

Advanced Evaluation

from arcade_evals import EvalRubric, ExpectedToolCall

# Create rubric with tool calls
rubric = EvalRubric(
    expected_tool_calls=[
        ExpectedToolCall(
            tool_name="calculator",
            parameters={"operation": "add", "a": 2, "b": 2}
        )
    ]
)

# Evaluate with rubric (case1 is defined in the Basic Evaluation example)
suite = EvalSuite(cases=[case1], rubric=rubric)

Multi-Run Evaluation

Run each case multiple times to measure consistency:

# Run via the CLI
# arcade evals eval_file.py --num-runs 5 --seed random --multi-run-pass-rule majority

# Or programmatically (call from inside an async function; `client` is assumed to be defined elsewhere)
result = await suite.run(
    client,
    model="gpt-4o",
    num_runs=5,            # Run each case 5 times
    seed="random",         # Different seed per run
    multi_run_pass_rule="majority",  # Pass if >50% of runs pass
)

Multi-run results include per-case statistics:

  • Mean score and standard deviation across runs
  • Per-run pass/fail with individual scores
  • Per-critic field score breakdowns across runs
  • Configurable pass rules: last (default), mean, or majority
  • Configurable seed policies: constant (fixed seed 42), random, or a specific integer
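
The statistics, pass rules, and seed policies listed above can be illustrated with a short sketch. It is not the library's code: `resolve_seed`, `summarize_runs`, and the `pass_threshold` parameter are hypothetical, the use of population standard deviation is an assumption, and the exact meaning of the `mean` rule (here: mean score reaches a threshold) is a guess.

```python
import random
from statistics import mean, pstdev
from typing import Union

CONSTANT_SEED = 42  # the "constant" policy uses fixed seed 42, per the list above


def resolve_seed(policy: Union[str, int]) -> int:
    """Hypothetical helper: turn a seed policy into a concrete seed for one run."""
    if isinstance(policy, int):
        return policy  # a specific integer is used as-is
    if policy == "constant":
        return CONSTANT_SEED
    if policy == "random":
        return random.randrange(2**31)  # a fresh seed for each run
    raise ValueError(f"unknown seed policy: {policy!r}")


def summarize_runs(
    scores: list[float],
    passed: list[bool],
    rule: str = "last",
    pass_threshold: float = 0.8,  # assumption: only used by the "mean" rule
) -> dict:
    """Hypothetical helper: aggregate per-run results for a single case."""
    mean_score = mean(scores)
    if rule == "last":
        overall = passed[-1]  # only the final run decides
    elif rule == "mean":
        overall = mean_score >= pass_threshold
    elif rule == "majority":
        overall = sum(passed) > len(passed) / 2  # pass if >50% of runs pass
    else:
        raise ValueError(f"unknown pass rule: {rule!r}")
    return {
        "mean_score": mean_score,
        "std_dev": pstdev(scores),  # assumption: population standard deviation
        "runs": list(zip(scores, passed)),
        "passed": overall,
    }


summary = summarize_runs([1.0, 1.0, 0.1], [True, True, False], rule="majority")
print(summary["passed"], summary["mean_score"])
```

With the same three runs, the `last` rule would fail the case because the final run failed, while `majority` passes it. This is why the choice of rule matters when measuring consistency.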

License

MIT License - see LICENSE file for details.