History

Eric Gustin a4160dd9fe Four bug fixes (#754)	2026-01-29 15:12:06 -08:00

1. Resolves [TOO-363](https://linear.app/arcadedev/issue/TOO-363/arcade-deploy-fails-when-additional-deps-are-added-to-the-server).
2. Resolves [TOO-364](https://linear.app/arcadedev/issue/TOO-364/arcade-cores-tool-skip-logic-is-missing-case-for-direct-execution).
3. Resolves [TOO-358](https://linear.app/arcadedev/issue/TOO-358/missing-evals-error-message-shows-wrong-command).
4. Resolves [TOO-365](https://linear.app/arcadedev/issue/TOO-365/arcade-evals-unit-tests-are-hanging).

> [!NOTE]
> **Medium Risk**
> Medium risk because it changes how `arcade deploy` spawns the server process and adjusts toolkit discovery skip logic, which can affect deployments and tool discovery; however, the changes are small and covered by new unit/integration tests.
>
> **Overview**
> `arcade deploy` now starts the validation server using the project’s `.venv` interpreter (via `find_python_interpreter`) instead of the CLI’s own `sys.executable`, preventing missing dependency failures when the CLI is installed in an isolated env.
>
> `arcade-core`’s `Toolkit.tools_from_directory` skip logic is hardened to also skip the currently executing entrypoint by module name (`__main__.__spec__.name`) when file paths don’t match (e.g., bundled execution). CLI error printing now escapes plain messages to avoid rich markup issues, and `arcade-evals` lock acquisition accepts an optional timeout default.
>
> Adds unit tests for the new toolkit skip behavior and an integration test that boots the MCP server via direct Python invocation to mirror deployment behavior, and bumps `arcade-core`, `arcade-mcp-server`, and root dependency versions accordingly.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit e7785634c231c059f2e0bd1bc73a56bd7470a494.</sup>
arcade_evals	Four bug fixes (#754)	2026-01-29 15:12:06 -08:00
README.md	MCP Local (#563)	2025-09-25 15:28:15 -07:00

README.md

Arcade Evals

Evaluation toolkit for testing Arcade tools.

Overview

Arcade Evals provides the following evaluation capabilities for Arcade tools:

- Evaluation Framework: Cases, suites, and rubrics for systematic testing
- Critics: Different types of comparisons (binary, numeric, similarity, datetime)
- Tool Evaluation: Decorators and utilities for evaluating tool performance
- Result Analysis: Evaluation results and reporting

Installation

pip install 'arcade-mcp[evals]'

Usage

Basic Evaluation

from arcade_evals import EvalCase, EvalSuite, tool_eval

# Create evaluation cases
case1 = EvalCase(
    input={"query": "What is 2+2?"},
    expected_output="4"
)

case2 = EvalCase(
    input={"query": "What is the capital of France?"},
    expected_output="Paris"
)

# Create evaluation suite
suite = EvalSuite(cases=[case1, case2])

# Evaluate a tool
@tool_eval(suite)
def my_calculator(query: str) -> str:
    # Tool implementation
    return "4" if "2+2" in query else "Unknown"
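To make the flow above concrete, here is a rough, library-free sketch of what running a suite against a tool amounts to: call the tool on each case's input and compare the result with the expected output. `SketchCase`, `run_suite`, and the exact-match comparison are hypothetical stand-ins chosen for illustration; they are not the `arcade_evals` implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SketchCase:
    """Hypothetical stand-in for an evaluation case (not the arcade_evals class)."""

    input: dict[str, Any]
    expected_output: str


def run_suite(tool: Callable[..., str], cases: list[SketchCase]) -> float:
    """Call the tool on every case and return the fraction of exact matches."""
    if not cases:
        return 0.0
    passed = sum(1 for case in cases if tool(**case.input) == case.expected_output)
    return passed / len(cases)


def my_calculator(query: str) -> str:
    # Same toy implementation as in the example above
    return "4" if "2+2" in query else "Unknown"


cases = [
    SketchCase(input={"query": "What is 2+2?"}, expected_output="4"),
    SketchCase(input={"query": "What is the capital of France?"}, expected_output="Paris"),
]

pass_rate = run_suite(my_calculator, cases)
print(f"pass rate: {pass_rate:.0%}")
```

With the toy `my_calculator`, only the first case matches, so the sketch reports a 50% pass rate: the second case fails because the tool returns "Unknown" rather than "Paris".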

Using Critics

from arcade_evals import NumericCritic, SimilarityCritic

# Numeric comparison
numeric_critic = NumericCritic(tolerance=0.1)
result = numeric_critic.evaluate(expected=10.0, actual=10.05)

# Similarity comparison
similarity_critic = SimilarityCritic(threshold=0.8)
result = similarity_critic.evaluate(
    expected="The capital of France is Paris",
    actual="Paris is the capital of France"
)
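As a rough intuition for the comparison types listed in the overview (binary, numeric, similarity, datetime), the sketch below implements each one with the standard library only. Every function name here is hypothetical, and the word-overlap (Jaccard) metric is a simple stand-in chosen for illustration; the metrics used by the real critics may differ.

```python
from datetime import datetime, timedelta


def binary_match(expected: object, actual: object) -> bool:
    """Binary comparison: pass only on exact equality."""
    return expected == actual


def numeric_match(expected: float, actual: float, tolerance: float) -> bool:
    """Numeric comparison: pass when the absolute difference is within tolerance."""
    return abs(expected - actual) <= tolerance


def word_overlap(expected: str, actual: str) -> float:
    """Jaccard similarity over lowercase word sets (an assumed stand-in metric)."""
    a, b = set(expected.lower().split()), set(actual.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def datetime_match(expected: datetime, actual: datetime, tolerance: timedelta) -> bool:
    """Datetime comparison: pass when the two timestamps are within tolerance."""
    return abs(expected - actual) <= tolerance


numeric_ok = numeric_match(10.0, 10.05, tolerance=0.1)
similarity = word_overlap(
    "The capital of France is Paris",
    "Paris is the capital of France",
)
similarity_ok = similarity >= 0.8
print(numeric_ok, similarity_ok)
```

Because the word-overlap metric ignores word order, the two sentences above share exactly the same word set and clear the 0.8 threshold. An order-sensitive or embedding-based metric could score the same pair differently.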

Advanced Evaluation

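from arcade_evals import EvalCase, EvalRubric, EvalSuite, ExpectedToolCall

# Create rubric with the tool calls the model is expected to make
rubric = EvalRubric(
    expected_tool_calls=[
        ExpectedToolCall(
            tool_name="calculator",
            parameters={"operation": "add", "a": 2, "b": 2}
        )
    ]
)

# Same case as case1 in Basic Evaluation, repeated so this snippet runs on its own
case1 = EvalCase(
    input={"query": "What is 2+2?"},
    expected_output="4"
)

# Evaluate with rubric
suite = EvalSuite(cases=[case1], rubric=rubric)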
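One way to think about checking an expected tool call is to compare the tool name first and then the parameters one by one. The sketch below does that with plain dataclasses. `SketchToolCall`, `score_tool_call`, and the scoring rule (fraction of expected parameters matched) are assumptions made for illustration, not the scoring that `arcade_evals` applies.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchToolCall:
    """Hypothetical stand-in for a tool call: a name plus keyword parameters."""

    tool_name: str
    parameters: dict[str, Any] = field(default_factory=dict)


def score_tool_call(expected: SketchToolCall, actual: SketchToolCall) -> float:
    """Return 0.0 on a tool-name mismatch, else the fraction of expected parameters matched."""
    if expected.tool_name != actual.tool_name:
        return 0.0
    if not expected.parameters:
        return 1.0
    matched = sum(
        1
        for key, value in expected.parameters.items()
        if key in actual.parameters and actual.parameters[key] == value
    )
    return matched / len(expected.parameters)


expected_call = SketchToolCall("calculator", {"operation": "add", "a": 2, "b": 2})
exact_call = SketchToolCall("calculator", {"operation": "add", "a": 2, "b": 2})
partial_call = SketchToolCall("calculator", {"operation": "add", "a": 2, "b": 3})
wrong_tool = SketchToolCall("search", {"query": "2+2"})

exact_score = score_tool_call(expected_call, exact_call)
partial_score = score_tool_call(expected_call, partial_call)
wrong_score = score_tool_call(expected_call, wrong_tool)
print(exact_score, partial_score, wrong_score)
```

Partial credit per parameter is only one possible design; a stricter rubric could require every parameter to match before awarding any score.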

License

MIT License - see LICENSE file for details.