MCP Server Framework and Tool Development library for building custom capabilities into agents.
Sam Partee b52f6daa6f
Eval Framework Improvements (#86)
This PR introduces enhancements and fixes to the Evaluation Framework to
improve accuracy, robustness, and performance.

## Key Improvements

- **Refactored Evaluation Loading**: Extracted evaluation file loading
and suite loading logic into utility functions `get_eval_files` and
`load_eval_suites` in `arcade/cli/utils.py`.
  - **Benefit**: Enhances code modularity and maintainability.

- **Asynchronous Execution of Evaluations**: Modified the evaluations to
run asynchronously using `asyncio`.
  - **Benefit**: Significantly reduces total execution time when running
multiple evaluations.

- **Improved Error Handling**: Wrapped critic evaluations in try-except
blocks to handle exceptions gracefully.
  - **Benefit**: Ensures that a single failing critic doesn't halt the
entire evaluation process.
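
The asynchronous execution and per-critic error handling described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `run_case`, `run_all`, `exact_match`, and `broken_critic` are hypothetical names, and the semaphore bound is only an assumption about how a concurrency limit such as `-c 8` might be applied.

```python
import asyncio
from typing import Any, Callable

# Hypothetical critic type: takes (expected, actual), returns a score in [0, 1].
Critic = Callable[[Any, Any], float]


def exact_match(expected: Any, actual: Any) -> float:
    return 1.0 if expected == actual else 0.0


def broken_critic(expected: Any, actual: Any) -> float:
    # Simulates a critic that fails on its input.
    raise ValueError("critic could not compare values")


async def run_case(name: str, expected: Any, actual: Any,
                   critics: list[Critic], limit: asyncio.Semaphore) -> dict:
    """Score one case; a critic that raises contributes 0 instead of aborting."""
    async with limit:
        await asyncio.sleep(0)  # stand-in for the awaited model call
        scores, errors = [], []
        for critic in critics:
            try:
                scores.append(critic(expected, actual))
            except Exception as exc:  # isolate the failure to this critic
                scores.append(0.0)
                errors.append(f"{critic.__name__}: {exc}")
        return {"name": name, "score": sum(scores) / len(scores), "errors": errors}


async def run_all(cases: list[tuple[str, Any, Any]], critics: list[Critic],
                  max_concurrent: int = 8) -> list[dict]:
    """Run all cases concurrently, with at most max_concurrent in flight."""
    limit = asyncio.Semaphore(max_concurrent)
    tasks = [run_case(n, e, a, critics, limit) for n, e, a in cases]
    return await asyncio.gather(*tasks)  # results keep the input order


cases = [("Create calendar event", "a", "a"), ("List calendar events", "a", "b")]
results = asyncio.run(run_all(cases, [exact_match, broken_critic]))
```

In this sketch the failing critic is recorded in `errors` and scored as 0, so the remaining critics and cases still run to completion.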

## Other Changes

- **Case-Insensitive Tool Name Comparison**: Made tool name comparisons
case-insensitive to improve robustness against casing differences.

- **Refactored Cost Matrix Creation**: Revised cost matrix creation to
handle varying numbers of expected and actual tool calls properly,
ensuring accurate assignment and scoring.

- **Type Casting in `BinaryCritic`**: Added type casting for actual
values to match the expected value's type before comparison, improving
accuracy in evaluations.

- **Removed Synchronous Code Paths**: Simplified the codebase by
removing synchronous evaluation methods, focusing on asynchronous
execution.

- **General Code Cleanup**: Removed unused imports and performed general
code cleanup to enhance readability and maintainability.
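
A rough sketch of three of the changes above: case-insensitive tool name matching, casting actual values to the expected value's type, and padding the cost matrix when the numbers of expected and actual tool calls differ. All names here (`names_match`, `cast_like`, `call_cost`, `build_cost_matrix`, `best_assignment`) are hypothetical, the scoring rule is an assumption, and the brute-force search is a stand-in for whatever assignment solver the framework actually uses.

```python
from itertools import permutations
from typing import Any


def names_match(expected: str, actual: str) -> bool:
    """Compare tool names without regard to case."""
    return expected.casefold() == actual.casefold()


def cast_like(expected: Any, actual: Any) -> Any:
    """Try to cast actual to the type of expected; fall back to actual."""
    try:
        return type(expected)(actual)
    except (TypeError, ValueError):
        return actual


def call_cost(expected: dict, actual: dict) -> float:
    """Assumed cost in [0, 1]: 0 means the name and all expected args match."""
    if not names_match(expected["name"], actual["name"]):
        return 1.0
    args = expected["args"]
    if not args:
        return 0.0
    hits = sum(
        1 for key, want in args.items()
        if key in actual["args"] and cast_like(want, actual["args"][key]) == want
    )
    return 1.0 - hits / len(args)


def build_cost_matrix(expected: list[dict], actual: list[dict]) -> list[list[float]]:
    """Square matrix; rows or columns with no counterpart keep the padding cost 1."""
    n = max(len(expected), len(actual))
    matrix = [[1.0] * n for _ in range(n)]
    for i, exp in enumerate(expected):
        for j, act in enumerate(actual):
            matrix[i][j] = call_cost(exp, act)
    return matrix


def best_assignment(matrix: list[list[float]]) -> tuple[float, tuple[int, ...]]:
    """Brute-force minimum-cost assignment; only suitable for a handful of calls."""
    n = len(matrix)
    return min(
        (sum(matrix[i][p[i]] for i in range(n)), p) for p in permutations(range(n))
    )


expected = [
    {"name": "CreateEvent", "args": {"duration": 30}},
    {"name": "ListEvents", "args": {"limit": 5}},
]
actual = [{"name": "listevents", "args": {"limit": "5"}}]

matrix = build_cost_matrix(expected, actual)
total_cost, assignment = best_assignment(matrix)
```

Here the unmatched expected call `CreateEvent` is assigned to the padded column and contributes the full cost of 1, while `ListEvents` matches despite the different casing and the string-typed argument.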


## Example

For the Google Calendar and Gmail tool eval suites, you can run:

```text
> arcade evals . -c 8 --models gpt-4o,gpt-4o-mini,gpt-4,gpt-4-turbo

Running evaluations in calendar_eval_suite
Running evaluations in gmail_eval_suite
Model: gpt-4o
PASSED Create calendar event -- Score: 1.00
FAILED List calendar events -- Score: 0.86
PASSED Update a calendar event -- Score: 1.00
PASSED Delete a calendar event -- Score: 1.00
Model: gpt-4o-mini
FAILED Create calendar event -- Score: 0.84
FAILED List calendar events -- Score: 0.86
PASSED Update a calendar event -- Score: 1.00
PASSED Delete a calendar event -- Score: 1.00
Model: gpt-4
PASSED Create calendar event -- Score: 1.00
FAILED List calendar events -- Score: 0.86
PASSED Update a calendar event -- Score: 1.00
PASSED Delete a calendar event -- Score: 1.00
Model: gpt-4-turbo
PASSED Create calendar event -- Score: 1.00
FAILED List calendar events -- Score: 0.87
PASSED Update a calendar event -- Score: 1.00
PASSED Delete a calendar event -- Score: 1.00
Model: gpt-4o
PASSED Send email to user with clear username -- Score: 1.00
Model: gpt-4o-mini
PASSED Send email to user with clear username -- Score: 1.00
Model: gpt-4
PASSED Send email to user with clear username -- Score: 1.00
Model: gpt-4-turbo
PASSED Send email to user with clear username -- Score: 1.00
Summary -- Total: 20 -- Passed: 15 -- Failed: 5
```
2024-10-04 12:09:58 -07:00
.github Fix Package Builds (#82) 2024-10-02 10:51:18 -07:00
.vscode Fix ruff (#64) 2024-09-25 09:47:30 -07:00
arcade Eval Framework Improvements (#86) 2024-10-04 12:09:58 -07:00
docker Fix Poetry install order (#80) 2024-10-01 23:48:51 -07:00
examples Fix small bugs in pyproject.tomls (#88) 2024-10-04 12:08:03 -07:00
schemas/preview SDK: Generic OAuth 2.0 connector (#81) 2024-10-03 16:40:02 -07:00
toolkits Fix small bugs in pyproject.tomls (#88) 2024-10-04 12:08:03 -07:00
.editorconfig Fix ruff (#64) 2024-09-25 09:47:30 -07:00
.gitignore Fix ruff (#64) 2024-09-25 09:47:30 -07:00
.pre-commit-config.yaml Fix ruff (#64) 2024-09-25 09:47:30 -07:00
.prettierignore Fix ruff (#64) 2024-09-25 09:47:30 -07:00
.prettierrc.toml Fix ruff (#64) 2024-09-25 09:47:30 -07:00
.ruff.toml Toolkit lint cleanup (#72) 2024-10-01 10:41:38 -07:00
CONTRIBUTING.md MyPy Compliant (#5) 2024-07-16 17:01:38 -07:00
cspell.config.yaml SDK: Generic OAuth 2.0 connector (#81) 2024-10-03 16:40:02 -07:00
LICENSE Tool SDK, Schemas (#2) 2024-07-14 23:37:46 -07:00
Makefile Fix small bugs in pyproject.tomls (#88) 2024-10-04 12:08:03 -07:00
README.md Add initial X toolkit, remove Github toolkit, rename math toolkit (#52) 2024-09-23 13:42:22 -07:00


Arcade AI

Arcade AI is the developer platform for building tools designed to be used with language models. With Arcade, developers can create, deploy, and easily integrate new tools with language models to enhance their capabilities.

arcade-ai

The arcade-ai package contains:

  • arcade CLI
  • arcade.sdk Tool SDK
  • arcade.actor serving tools with FastAPI, Flask, or Django

Installation

To install the Arcade AI package, execute the following command:

pip install arcade-ai

or install from source:

git clone https://github.com/arcadeai/arcade-ai.git
cd arcade-ai
pip install poetry
poetry install

First steps

Follow these steps if you've cloned the repo and installed the package from source:

cd examples/search
poetry install

arcade show arcade_search

This prints output similar to:

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┓
┃ Name         ┃ Description                                                    ┃ Toolkit   ┃ Version ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━┩
│ SearchGoogle │ Search Google using SerpAPI and return organic search results. │ search    │ 0.1.0   │
└──────────────┴────────────────────────────────────────────────────────────────┴───────────┴─────────┘

Arcade adds the execute choice to the tool, which predicts the parameters with a model and then runs the tool with those parameters, all in a single request.

> arcade run arcade_search "who is Sam Partee?" --choice "execute"
Running tool: SearchGoogle with params: {'query': 'Sam Partee'}

[{"position": 1, "title": "Sam Partee (@SamPartee) / X", "link": "https://twitter.com/sampartee", "redirect_link":
"https://www.google.com/url?sa=t&source=web&rct=j&opi=89978449&url=https://twitter.com/sampartee&ved=2ahUKEwjBwKiz3b6HAxV1VTABHXL8BZQQFnoECAYQAQ",
"displayed_link": "1.5K+ followers", "thumbnail":
.....
.. (truncated)

Arcade also adds the predict choice to the tool, which allows you to predict the parameters with a model.

> arcade run arcade_search "who is Sam Partee?" --choice "predict" # also the default
Running tool: SearchGoogle with params: {'query': 'Sam Partee'}

Sam Partee is a CTO, Co-founder of Arcade AI and former Machine Learning Engineer at companies like RedisInc and HPE_Cray. They have
expertise in AI/ML, vector search, Python, HPC, and are a sports fan.