arcade-mcp/toolkits/github/evals/eval_github_issues.py
Sam Partee b6b4cd0a4c
🏗️ Restructure: Multi-Package Architecture + uv Migration (#412)
### Overview
Major restructuring from monolithic `arcade-ai` package to modular
library architecture with standardized uv-based dependency management.

![arcade-ai Monorepo (2)](https://github.com/user-attachments/assets/25f102b0-bb87-4a04-9701-d227d05664b1)

### New Package Structure
- **`arcade-tdk`** - Lightweight toolkit development kit (core
decorators, auth)
- **`arcade-core`** - Core execution engine and catalog functionality  
- **`arcade-serve`** - FastAPI/MCP server components
- **`arcade-ai`** - Meta package that includes CLI functionality.
Optionally include evals via the `evals` extra. Optionally include all
packages via the `all` extra.

### Key Benefits
- **Lighter Dependencies**: Toolkits now depend only on `arcade-tdk` (~2
deps) vs full `arcade-ai` (~30+ deps)
- **Faster Builds**: uv provides 10-100x faster dependency resolution
and installation
- **Better Modularity**: Clear separation of concerns, consumers import
only what they need
- **Standard Tooling**: Eliminates custom poetry scripts, uses standard
Python packaging

### Migration Impact
- All 20 toolkits converted from poetry → uv with `arcade-tdk`
dependencies plus `arcade-ai[evals]` and `arcade-serve` dev
dependencies. When developing locally, devs should install toolkits via
`make install-local`.
- Modern Python 3.10+ type hints throughout
- Standardized build system with hatchling backend
- Enhanced Makefile with robust toolkit management commands
- Removed `arcade dev` CLI command
- Reduced the number of files created by `arcade new` and added an option
to skip generating the tests and evals folders.

This foundation enables faster development cycles and cleaner dependency
chains for the growing toolkit ecosystem.

### Todo After this PR is merged
- [ ] Post-merge workflow(s) (release & publish containers, etc)
- [ ] Release order plan. @EricGustin suggests releasing in the
following order:
    1. `arcade-core` version 0.1.0
    2. `arcade-serve` version 0.1.0 and `arcade-tdk` version 0.1.0
    3. `arcade-ai` version 2.0.0
    4. Patch release for all toolkits (all changes in toolkits are internal
    refactors)
- [ ] [Update docs](https://github.com/ArcadeAI/docs/pull/318)
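
The suggested release order is consistent with a topological ordering of the package dependency graph. The sketch below derives such an ordering with the standard-library `graphlib` module. The dependency edges are assumptions: the toolkit edges follow the "Migration Impact" notes, while the edges between `arcade-core`, `arcade-serve`, `arcade-tdk`, and `arcade-ai` are inferred from the release order and not read from any `pyproject.toml`.

```python
from graphlib import TopologicalSorter

# Assumed graph: package -> packages that must be released before it.
# These edges are illustrative, not taken from the actual package metadata.
ASSUMED_DEPS: dict[str, set[str]] = {
    "arcade-core": set(),
    "arcade-serve": {"arcade-core"},
    "arcade-tdk": {"arcade-core"},
    "arcade-ai": {"arcade-core", "arcade-serve", "arcade-tdk"},
    # Toolkits depend on arcade-tdk, with arcade-ai[evals] and arcade-serve as dev deps.
    "toolkits": {"arcade-tdk", "arcade-ai", "arcade-serve"},
}


def release_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group packages into waves; each wave depends only on earlier waves."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything releasable right now
        waves.append(ready)
        sorter.done(*ready)  # unblock packages that depended on this wave
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(release_waves(ASSUMED_DEPS), start=1):
        print(number, ", ".join(wave))
```

Under these assumed edges the waves come out in the same four steps as the list above, with `arcade-serve` and `arcade-tdk` sharing the second step.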

---------

Co-authored-by: Eric Gustin <eric@arcade.dev>
Co-authored-by: Eric Gustin <34000337+EricGustin@users.noreply.github.com>
2025-06-11 16:48:17 -07:00


from arcade_evals import (
    BinaryCritic,
    EvalRubric,
    EvalSuite,
    ExpectedToolCall,
    SimilarityCritic,
    tool_eval,
)
from arcade_tdk import ToolCatalog

import arcade_github
from arcade_github.tools.issues import (
    create_issue,
    create_issue_comment,
)

# Evaluation rubric
rubric = EvalRubric(
    fail_threshold=0.9,
    warn_threshold=0.95,
)

catalog = ToolCatalog()
# Register the GitHub tools
catalog.add_module(arcade_github)


@tool_eval()
def github_issues_eval_suite() -> EvalSuite:
    """Evaluation suite for GitHub Issues tools."""
    suite = EvalSuite(
        name="GitHub Issues Tools Evaluation Suite",
        system_message="You are an AI assistant that helps users interact with GitHub issues using the provided tools.",
        catalog=catalog,
        rubric=rubric,
    )

    # Create Issue
    suite.add_case(
        name="Create a new issue",
        user_message="Create a new issue in the 'ArcadeAI/arcade-ai' repository with the title 'Bug: Login not working' and description 'Users are unable to log in to the application.' Assign the issue to TestUser, add it to milestone 1, and add the labels 'bug', and 'critical'.",
        expected_tool_calls=[
            ExpectedToolCall(
                func=create_issue,
                args={
                    "owner": "ArcadeAI",
                    "repo": "arcade-ai",
                    "title": "Bug: Login not working",
                    "body": "Users are unable to log in to the application.",
                    "assignees": ["TestUser"],
                    "milestone": 1,
                    "labels": ["bug", "critical"],
                    "include_extra_data": False,
                },
            )
        ],
        critics=[
            BinaryCritic(critic_field="owner", weight=0.2),
            BinaryCritic(critic_field="repo", weight=0.2),
            SimilarityCritic(critic_field="title", weight=0.2),
            SimilarityCritic(critic_field="body", weight=0.1),
            BinaryCritic(critic_field="assignees", weight=0.1),
            BinaryCritic(critic_field="milestone", weight=0.1),
            BinaryCritic(critic_field="labels", weight=0.1),
        ],
    )

    # Create Issue Comment
    suite.add_case(
        name="Add a comment to an existing issue",
        user_message="Add a comment to issue #42 in the 'ArcadeAI/test' repository saying 'This issue is being investigated by the dev team.'",
        expected_tool_calls=[
            ExpectedToolCall(
                func=create_issue_comment,
                args={
                    "owner": "ArcadeAI",
                    "repo": "test",
                    "issue_number": 42,
                    "body": "This issue is being investigated by the dev team.",
                    "include_extra_data": False,
                },
            )
        ],
        critics=[
            SimilarityCritic(critic_field="owner", weight=0.2),
            SimilarityCritic(critic_field="repo", weight=0.2),
            BinaryCritic(critic_field="issue_number", weight=0.3),
            SimilarityCritic(critic_field="body", weight=0.2),
        ],
    )

    return suite
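
The suite above pairs per-field critics (each with a weight) with a rubric holding `fail_threshold` and `warn_threshold`. The following standalone sketch shows one way such weights and thresholds could combine into a verdict. It does not use `arcade_evals` and is not its implementation: the helper names are hypothetical, `difflib` stands in for whatever similarity metric the library applies, and dividing by the total weight is an assumption (note that the weights in the second case sum to 0.9, not 1.0).

```python
from collections.abc import Callable
from difflib import SequenceMatcher

Scorer = Callable[[object, object], float]


def binary_score(expected: object, actual: object) -> float:
    """Return 1.0 on exact equality, otherwise 0.0."""
    return 1.0 if expected == actual else 0.0


def similarity_score(expected: object, actual: object) -> float:
    """Return a string similarity ratio in [0, 1] (difflib is a stand-in metric)."""
    return SequenceMatcher(None, str(expected), str(actual)).ratio()


def weighted_score(
    expected: dict[str, object],
    actual: dict[str, object],
    critics: list[tuple[str, float, Scorer]],
) -> float:
    """Combine per-field scores, normalised by total weight (an assumption)."""
    total_weight = sum(weight for _, weight, _ in critics)
    earned = sum(
        weight * scorer(expected[field], actual.get(field))
        for field, weight, scorer in critics
    )
    return earned / total_weight


def verdict(score: float, fail_threshold: float = 0.9, warn_threshold: float = 0.95) -> str:
    """Map a score onto fail / warn / pass using the rubric thresholds."""
    if score < fail_threshold:
        return "fail"
    if score < warn_threshold:
        return "warn"
    return "pass"


# Mirrors the weights of the "Add a comment to an existing issue" case.
COMMENT_CRITICS: list[tuple[str, float, Scorer]] = [
    ("owner", 0.2, similarity_score),
    ("repo", 0.2, similarity_score),
    ("issue_number", 0.3, binary_score),
    ("body", 0.2, similarity_score),
]
```

With this scheme, a call that gets every field right except `issue_number` loses a third of the available weight and falls below the 0.9 fail threshold, which shows why that field carries the largest weight in the case.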