# MCP Server Tool Evaluation Support
## Overview
Add support for evaluating tools from remote MCP servers without
requiring Python callables. Enables direct evaluation of any
MCP-compatible tool server.
## What's New
### Core Features
- **`MCPToolRegistry`**: Evaluate tools from a single MCP server
- **`CompositeMCPRegistry`**: Evaluate tools from multiple MCP servers
simultaneously
- **Automatic loaders**: `load_from_stdio()` and `load_from_http()` to
fetch tools from running servers
- **Automatic namespacing**: Tools are prefixed with their server name (e.g.,
`github_list_issues`)
- **Smart name resolution**: Use short names if unique, full names if
ambiguous
- **OpenAI strict mode**: Automatic schema conversion prevents parameter
hallucinations
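
To make the strict-mode bullet concrete, here is a minimal sketch of the kind of schema conversion it describes. It is not the code in `registry.py`; `to_strict_schema` and `make_nullable` are hypothetical helper names. It assumes the usual OpenAI strict-mode rules: every property is listed in `required`, `additionalProperties` is `false`, and formerly optional properties are made nullable.

```python
from typing import Any


def make_nullable(prop: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `prop` whose `type` also admits null (hypothetical helper)."""
    prop = dict(prop)
    declared = prop.get("type")
    if isinstance(declared, str):
        prop["type"] = [declared, "null"]
    elif isinstance(declared, list) and "null" not in declared:
        prop["type"] = [*declared, "null"]
    return prop


def to_strict_schema(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a strict-mode copy of a JSON schema (hypothetical helper).

    Only plain object schemas and array items are handled; `anyOf`, `$ref`,
    and similar constructs are passed through unchanged.
    """
    result = dict(schema)  # copy the top level so the input is not mutated
    if result.get("type") == "object":
        originally_required = set(result.get("required", []))
        strict_props: dict[str, Any] = {}
        for name, prop in result.get("properties", {}).items():
            prop = to_strict_schema(prop)
            if name not in originally_required:
                # Strict mode has no optional fields, so optional becomes nullable.
                prop = make_nullable(prop)
            strict_props[name] = prop
        result["properties"] = strict_props
        result["required"] = list(strict_props)
        result["additionalProperties"] = False
    elif result.get("type") == "array" and isinstance(result.get("items"), dict):
        result["items"] = to_strict_schema(result["items"])
    return result


schema = {
    "type": "object",
    "properties": {"repo": {"type": "string"}, "limit": {"type": "integer"}},
    "required": ["repo"],
}
strict = to_strict_schema(schema)
```

With `additionalProperties` set to `false`, the model cannot emit parameters that the tool does not declare, which is the effect the bullet refers to.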
### Usage
**Automatic Loading:**
```python
from arcade_evals import load_from_stdio, MCPToolRegistry
# Load tools automatically from MCP server
tools = load_from_stdio(["npx", "-y", "@modelcontextprotocol/server-github"])
registry = MCPToolRegistry(tools)
```
**Single MCP Server:**
```python
from arcade_evals import EvalSuite, ExpectedToolCall, MCPToolRegistry

# mcp_tools: a list of MCP tool definitions, e.g. from load_from_stdio()
registry = MCPToolRegistry(mcp_tools)
suite = EvalSuite(catalog=registry)
suite.add_case(
    expected_tool_calls=[
        ExpectedToolCall(tool_name="tool_name", args={...})
    ]
)
```
**Multiple MCP Servers:**
```python
from arcade_evals import (
    CompositeMCPRegistry,
    EvalSuite,
    ExpectedToolCall,
    load_from_stdio,
)

# Load from multiple servers
github_tools = load_from_stdio(["npx", "-y", "@modelcontextprotocol/server-github"])
slack_tools = load_from_stdio(["npx", "-y", "@modelcontextprotocol/server-slack"])
composite = CompositeMCPRegistry(
    tool_lists={
        "github": github_tools,
        "slack": slack_tools,
    }
)
suite = EvalSuite(catalog=composite)
suite.add_case(
    expected_tool_calls=[
        # Full (namespaced) name: "<server>_<tool>"
        ExpectedToolCall(tool_name="github_list_issues", args={...})
    ]
)
```
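
The namespacing and short-name resolution used above can be pictured with a small stand-alone sketch. It is an illustration only: `TinyCompositeRegistry` is a hypothetical class that tracks tool names and nothing else, and the real `CompositeMCPRegistry` may resolve names differently.

```python
class TinyCompositeRegistry:
    """Hypothetical stand-in for a multi-server registry (names only)."""

    def __init__(self, tool_names_by_server: dict[str, list[str]]) -> None:
        self._full_names: set[str] = set()
        self._short_to_full: dict[str, list[str]] = {}
        for server, names in tool_names_by_server.items():
            for short in names:
                full = f"{server}_{short}"  # namespaced as "<server>_<tool>"
                if full in self._full_names:
                    # e.g. server "a" + tool "b_c" vs. server "a_b" + tool "c"
                    raise ValueError(f"Tool name collision: {full}")
                self._full_names.add(full)
                self._short_to_full.setdefault(short, []).append(full)

    def resolve(self, name: str) -> str:
        """Map a full or short tool name to its full, namespaced name."""
        if name in self._full_names:
            return name
        candidates = self._short_to_full.get(name, [])
        if len(candidates) == 1:
            return candidates[0]  # short name is unique across servers
        if not candidates:
            raise KeyError(f"Unknown tool: {name}")
        raise ValueError(
            f"Ambiguous tool name {name!r}; use one of {sorted(candidates)}"
        )


reg = TinyCompositeRegistry(
    {
        "github": ["list_issues", "search"],
        "slack": ["send_message", "search"],
    }
)
resolved_short = reg.resolve("list_issues")
resolved_full = reg.resolve("slack_search")
```

In this sketch `list_issues` resolves because only one server exposes it, while `search` must be written as `github_search` or `slack_search`.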
## Implementation
### Files Changed
- **`libs/arcade-evals/arcade_evals/registry.py`** (NEW): Registry
abstractions and implementations
- **`libs/arcade-evals/arcade_evals/loaders.py`** (NEW): Automatic tool
loading from MCP servers
- **`libs/arcade-evals/arcade_evals/eval.py`** (MODIFIED): Enhanced
`ExpectedToolCall` and evaluation logic
- **`libs/arcade-evals/arcade_evals/__init__.py`** (MODIFIED): Exported
new registries and loaders
### Key Technical Details
- Added `BaseToolRegistry` interface for abstraction
- `MCPToolRegistry` handles single server tools
- `CompositeMCPRegistry` manages multiple servers with collision
detection
- `load_from_stdio()` and `load_from_http()` for automatic tool
discovery
- Fixed name normalization bug: MCP tools use underscores (not dots)
- Optimized tool copying: 2.5x faster via shallow copy
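
As a rough illustration of the shallow-copy point, the sketch below renames a tool definition without deep-copying its schema. It assumes tool definitions are plain dicts with the MCP `name` and `inputSchema` keys; `namespaced_copy` is a hypothetical helper, not a function from this PR.

```python
import copy
from typing import Any


def namespaced_copy(tool: dict[str, Any], server: str) -> dict[str, Any]:
    """Return a shallow copy of `tool` with a server-prefixed name (hypothetical)."""
    renamed = copy.copy(tool)  # copies only the top-level mapping
    renamed["name"] = f"{server}_{tool['name']}"
    return renamed


tool = {
    "name": "list_issues",
    "inputSchema": {"type": "object", "properties": {"repo": {"type": "string"}}},
}
shallow = namespaced_copy(tool, "github")
deep = copy.deepcopy(tool)

# Rebinding a top-level key on the copy leaves the original untouched...
original_name_unchanged = tool["name"] == "list_issues"
# ...but the nested schema object is shared, so it must be treated as read-only.
schema_is_shared = shallow["inputSchema"] is tool["inputSchema"]
deep_schema_is_separate = deep["inputSchema"] is not tool["inputSchema"]
```

The trade-off is that a shallow copy skips walking the nested schema, but any code that later mutates `inputSchema` in place would affect every copy that shares it.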
## Testing
- ✅ 41 tests passing (25 new tests added)
- ✅ `test_eval_mcp_registry.py`: MCPToolRegistry functionality
- ✅ `test_eval_composite_mcp.py`: CompositeMCPRegistry with multiple
servers
- ✅ Verified backward compatibility with Python tools
## Backward Compatibility
✅ **100% backward compatible** - No breaking changes
## Breaking Changes
**None**
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> Adds end-to-end eval UX: examples, a robust CLI runner, and rich
outputs.
>
> - **New examples**: `eval_arcade_gateway.py`,
`eval_stdio_mcp_server.py`, `eval_http_mcp_server.py`,
`eval_comprehensive_comparison.py` with timeouts, error handling, and
track-based comparisons; detailed `README.md`
> - **CLI runner**: `arcade_cli/evals_runner.py` to execute
evals/capture in parallel with progress, error isolation, failed-only
filtering, context inclusion, and multi-provider/model support
> - **Output formatters**: `arcade_cli/formatters/` (txt, md, html,
json) for evals and capture; comparative and multi-model HTML with tabs
and context rendering
> - **Display refactor**: `display.py` now supports writing multiple
formats, failed-only disclaimers, include-context, and improved console
summaries
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
ff8acf9c34a6b61462a019a1ee9df081006517d0. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Francisco Liberal <francisco@arcade.dev>
Co-authored-by: Mateo Torres <torresmateo@gmail.com>
231 lines
7.4 KiB
Python
from typing import Annotated
from unittest.mock import MagicMock

import pytest
from arcade_core.errors import ToolDefinitionError
from arcade_core.schema import (
    ToolCallRequest,
    ToolCallResponse,
    ToolContext,
    ToolReference,
)
from arcade_serve.core.base import BaseWorker
from arcade_serve.core.common import RequestData, Router
from arcade_serve.core.components import (
    CallToolComponent,
    CatalogComponent,
    HealthCheckComponent,
)
from arcade_tdk import tool


@tool()
def sample_tool(
    context: ToolContext, a: Annotated[int, "a"], b: Annotated[int, "b"]
) -> Annotated[int, "output"]:
    """Sample tool for testing."""
    return a + b


# Define error tool at module level to avoid indentation issues with getsource
@tool()
def error_tool(context: ToolContext) -> int:
    """This tool always raises an error."""
    raise ValueError("Something went wrong")


@pytest.fixture
def mock_router():
    router = MagicMock(spec=Router)
    router.add_route = MagicMock()
    return router


@pytest.fixture
def base_worker(mock_router, monkeypatch):
    # Set env var temporarily for testing secret loading
    monkeypatch.setenv("ARCADE_WORKER_SECRET", "test_secret_env")
    worker = BaseWorker()
    worker.register_routes(mock_router)  # Register routes using the mock router
    return worker


@pytest.fixture
def base_worker_no_auth():
    return BaseWorker(disable_auth=True)


# --- BaseWorker Tests ---


def test_base_worker_init_with_secret():
    worker = BaseWorker(secret="explicit_secret")  # noqa: S106
    assert worker.secret == "explicit_secret"  # noqa: S105
    assert not worker.disable_auth


def test_base_worker_init_with_env_secret(monkeypatch):
    monkeypatch.setenv("ARCADE_WORKER_SECRET", "env_secret_value")
    worker = BaseWorker()
    assert worker.secret == "env_secret_value"  # noqa: S105
    assert not worker.disable_auth


def test_base_worker_init_no_secret_raises_error(monkeypatch):
    # Ensure env var is not set
    monkeypatch.delenv("ARCADE_WORKER_SECRET", raising=False)
    with pytest.raises(ValueError, match="No secret provided for worker"):
        BaseWorker()


def test_base_worker_init_disable_auth():
    worker = BaseWorker(disable_auth=True)
    assert worker.secret == ""
    assert worker.disable_auth


def test_register_tool(base_worker_no_auth):
    assert len(base_worker_no_auth.catalog) == 0
    base_worker_no_auth.register_tool(sample_tool, toolkit_name="test_kit")
    assert len(base_worker_no_auth.catalog) == 1
    tool_def = base_worker_no_auth.get_catalog()[0]
    assert tool_def.name == "SampleTool"
    assert tool_def.toolkit.name == "TestKit"


def test_get_catalog(base_worker_no_auth):
    base_worker_no_auth.register_tool(sample_tool, toolkit_name="test_kit")
    catalog = base_worker_no_auth.get_catalog()
    assert isinstance(catalog, list)
    assert len(catalog) == 1
    assert catalog[0].name == "SampleTool"


def test_health_check(base_worker_no_auth):
    base_worker_no_auth.register_tool(sample_tool, toolkit_name="test_kit")
    health = base_worker_no_auth.health_check()
    assert health == {"status": "ok", "tool_count": "1"}


@pytest.mark.asyncio
async def test_call_tool_success(base_worker_no_auth):
    base_worker_no_auth.register_tool(sample_tool, toolkit_name="test_kit")
    # Create ToolReference WITHOUT version, as register_tool doesn't seem to set it
    tool_ref = ToolReference(toolkit="TestKit", name="SampleTool")
    tool_request = ToolCallRequest(
        execution_id="test_exec_id",
        tool=tool_ref,
        inputs={"a": 5, "b": 3},
    )

    response = await base_worker_no_auth.call_tool(tool_request)

    assert response.success is True
    assert response.output.value == 8
    assert response.output.error is None
    assert response.execution_id == "test_exec_id"
    assert response.duration > 0


@pytest.mark.asyncio
async def test_call_tool_execution_error(base_worker_no_auth):
    # Tool is now defined at module level
    try:
        base_worker_no_auth.register_tool(error_tool, toolkit_name="error_kit")
    except ToolDefinitionError as e:
        pytest.fail(f"Failed to register error_tool: {e}")

    # Create ToolReference WITHOUT version
    tool_ref = ToolReference(toolkit="ErrorKit", name="ErrorTool")
    tool_request = ToolCallRequest(
        execution_id="test_exec_error",
        tool=tool_ref,
        inputs={},
    )

    response = await base_worker_no_auth.call_tool(tool_request)

    assert response.success is False
    assert response.output.value is None
    assert response.output.error is not None


@pytest.mark.asyncio
async def test_call_tool_not_found(base_worker_no_auth):
    # Use ToolReference without version for lookup consistency
    tool_ref = ToolReference(toolkit="nonexistent", name="nosuchtool")
    tool_request = ToolCallRequest(
        execution_id="test_exec_notfound",
        tool=tool_ref,
        inputs={},
    )

    # Update regex to match actual error format
    with pytest.raises(ValueError):
        await base_worker_no_auth.call_tool(tool_request)


# --- Component Tests (tested via BaseWorker registration) ---


def test_register_routes_registers_default_components(base_worker, mock_router):
    # BaseWorker calls register_routes in its init via the fixture
    assert mock_router.add_route.call_count == len(BaseWorker.default_components)

    calls = mock_router.add_route.call_args_list
    expected_paths = ["tools", "tools/invoke", "health"]
    registered_paths = [
        call[0][0] for call in calls
    ]  # call[0] are positional args, call[0][0] is endpoint_path

    assert sorted(registered_paths) == sorted(expected_paths)

    # Check if components were instantiated and passed to add_route
    assert any(isinstance(call[0][1], CatalogComponent) for call in calls)
    assert any(isinstance(call[0][1], CallToolComponent) for call in calls)
    assert any(isinstance(call[0][1], HealthCheckComponent) for call in calls)


@pytest.mark.asyncio
async def test_catalog_component_call(base_worker_no_auth):
    base_worker_no_auth.register_tool(sample_tool, toolkit_name="test_kit")
    component = CatalogComponent(base_worker_no_auth)
    # Mock request data - not actually used by this component's __call__
    mock_request = MagicMock(spec=RequestData)
    catalog_response = await component(mock_request)

    assert isinstance(catalog_response, list)
    assert len(catalog_response) == 1
    assert catalog_response[0].name == "SampleTool"


@pytest.mark.asyncio
async def test_call_tool_component_call(base_worker_no_auth):
    base_worker_no_auth.register_tool(sample_tool, toolkit_name="test_kit")
    component = CallToolComponent(base_worker_no_auth)

    # Create ToolReference WITHOUT version
    tool_ref = ToolReference(toolkit="TestKit", name="SampleTool")
    request_body = {
        "execution_id": "comp_test_exec",
        "tool": tool_ref.model_dump(),
        "inputs": {"a": 10, "b": 5},
    }
    mock_request = MagicMock(spec=RequestData)
    mock_request.body_json = request_body

    response = await component(mock_request)

    assert isinstance(response, ToolCallResponse)
    assert response.success is True
    assert response.output.value == 15
    assert response.execution_id == "comp_test_exec"


@pytest.mark.asyncio
async def test_health_check_component_call(base_worker_no_auth):
    component = HealthCheckComponent(base_worker_no_auth)
    mock_request = MagicMock(spec=RequestData)
    health_response = await component(mock_request)

    assert health_response == {"status": "ok", "tool_count": "0"}