This PR introduces enhancements and fixes to the Evaluation Framework to improve accuracy, robustness, and performance.

## Key Improvements

- **Refactored Evaluation Loading**: Extracted evaluation file loading and suite loading logic into utility functions `get_eval_files` and `load_eval_suites` in `arcade/cli/utils.py`.
  - **Benefit**: Enhances code modularity and maintainability.
- **Asynchronous Execution of Evaluations**: Modified the evaluations to run asynchronously using `asyncio`.
  - **Benefit**: Significantly reduces total execution time when running multiple evaluations.
- **Improved Error Handling**: Wrapped critic evaluations in try-except blocks to handle exceptions gracefully.
  - **Benefit**: Ensures that a single failing critic doesn't halt the entire evaluation process.

## Other Changes

- **Case-Insensitive Tool Name Comparison**: Made tool name comparisons case-insensitive to improve robustness against casing differences.
- **Refactored Cost Matrix Creation**: Revised cost matrix creation to handle varying numbers of expected and actual tool calls properly, ensuring accurate assignment and scoring.
- **Type Casting in `BinaryCritic`**: Added type casting for actual values to match the expected value's type before comparison, improving accuracy in evaluations.
- **Removed Synchronous Code Paths**: Simplified the codebase by removing synchronous evaluation methods, focusing on asynchronous execution.
- **General Code Cleanup**: Removed unused imports and performed general code cleanup to enhance readability and maintainability.

## Example

For the Google Calendar and Gmail tools eval set, you can run:

```console
> arcade evals . -c 8 --models gpt-4o,gpt-4o-mini,gpt-4,gpt-4-turbo
Running evaluations in calendar_eval_suite
Running evaluations in gmail_eval_suite

Model: gpt-4o
PASSED Create calendar event -- Score: 1.00
FAILED List calendar events -- Score: 0.86
PASSED Update a calendar event -- Score: 1.00
PASSED Delete a calendar event -- Score: 1.00

Model: gpt-4o-mini
FAILED Create calendar event -- Score: 0.84
FAILED List calendar events -- Score: 0.86
PASSED Update a calendar event -- Score: 1.00
PASSED Delete a calendar event -- Score: 1.00

Model: gpt-4
PASSED Create calendar event -- Score: 1.00
FAILED List calendar events -- Score: 0.86
PASSED Update a calendar event -- Score: 1.00
PASSED Delete a calendar event -- Score: 1.00

Model: gpt-4-turbo
PASSED Create calendar event -- Score: 1.00
FAILED List calendar events -- Score: 0.87
PASSED Update a calendar event -- Score: 1.00
PASSED Delete a calendar event -- Score: 1.00

Model: gpt-4o
PASSED Send email to user with clear username -- Score: 1.00

Model: gpt-4o-mini
PASSED Send email to user with clear username -- Score: 1.00

Model: gpt-4
PASSED Send email to user with clear username -- Score: 1.00

Model: gpt-4-turbo
PASSED Send email to user with clear username -- Score: 1.00

Summary -- Total: 20 -- Passed: 15 -- Failed: 5
```
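
The changes described above (asynchronous execution, per-critic error handling, case-insensitive tool names, and type casting before comparison) can be illustrated with a minimal sketch. This is not the code from the PR: `BinaryCriticSketch`, `score_case`, `run_all`, the case dictionary layout, and the `max_concurrent` parameter are all hypothetical names chosen for illustration, and the model call is replaced by a placeholder.

```python
import asyncio
from dataclasses import dataclass
from typing import Any


@dataclass
class BinaryCriticSketch:
    """Hypothetical stand-in for a binary critic; not the actual arcade class."""

    field: str
    weight: float = 1.0

    def evaluate(self, expected: Any, actual: Any) -> float:
        # Cast the actual value to the expected value's type before comparing.
        # A failed cast raises (e.g. int("five")), which the caller handles.
        cast_actual = type(expected)(actual)
        return self.weight if cast_actual == expected else 0.0


def score_case(critics: list, expected: dict, actual: dict) -> float:
    """Sum critic scores; a critic that raises contributes 0 instead of aborting."""
    total = 0.0
    for critic in critics:
        try:
            total += critic.evaluate(expected[critic.field], actual.get(critic.field))
        except Exception as exc:
            print(f"critic on '{critic.field}' failed: {exc!r}")
    return total


async def run_all(cases: list, critics: list, max_concurrent: int = 8) -> list:
    """Run all cases concurrently, bounded by a semaphore (assumed design)."""
    sem = asyncio.Semaphore(max_concurrent)

    async def run_one(case: dict) -> tuple:
        async with sem:
            await asyncio.sleep(0)  # placeholder for the awaited model call
            # Case-insensitive tool name comparison.
            if case["expected_tool"].lower() != case["actual_tool"].lower():
                return case["name"], 0.0
            return case["name"], score_case(critics, case["expected"], case["actual"])

    # gather returns results in the same order as the input cases.
    return await asyncio.gather(*(run_one(c) for c in cases))
```

With this sketch, a case whose actual arguments are `{"count": "5"}` scores the same as one with `{"count": 5}` when the expected value is the integer `5`, and a value that cannot be cast only zeroes out that one critic rather than stopping the run.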
Arcade AI
Arcade AI is the developer platform for building tools designed to be used with language models. With Arcade, developers can create, deploy, and easily integrate new tools with language models to enhance their capabilities.
arcade-ai
The arcade-ai package contains:
- `arcade` CLI
- `arcade.sdk` Tool SDK
- `arcade.actor` serving tools with FastAPI, Flask, or Django
Installation
To install the Arcade AI package, execute the following command:
pip install arcade-ai
or install from source:
git clone https://github.com/arcadeai/arcade-ai.git
cd arcade-ai
pip install poetry
poetry install
First steps
Follow these steps if you've cloned the repo and installed the package from source:
cd examples/search
poetry install
arcade show arcade_search
This prints output that looks like:
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┓
┃ Name         ┃ Description                                                    ┃ Toolkit   ┃ Version ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━┩
│ SearchGoogle │ Search Google using SerpAPI and return organic search results. │ search    │ 0.1.0   │
└──────────────┴────────────────────────────────────────────────────────────────┴───────────┴─────────┘
Predict the parameters with a model and run the tool with the predicted parameters. Arcade adds the `execute` choice to the tool, which lets you predict the parameters and run the tool with them in a single request.
> arcade run arcade_search "who is Sam Partee?" --choice "execute"
Running tool: SearchGoogle with params: {'query': 'Sam Partee'}
[{"position": 1, "title": "Sam Partee (@SamPartee) / X", "link": "https://twitter.com/sampartee", "redirect_link":
"https://www.google.com/url?sa=t&source=web&rct=j&opi=89978449&url=https://twitter.com/sampartee&ved=2ahUKEwjBwKiz3b6HAxV1VTABHXL8BZQQFnoECAYQAQ",
"displayed_link": "1.5K+ followers", "thumbnail":
.....
.. (truncated)
Arcade also adds the `predict` choice to the tool, which lets you predict the parameters with a model.
> arcade run arcade_search "who is Sam Partee?" --choice "predict" # also the default
Running tool: SearchGoogle with params: {'query': 'Sam Partee'}
Sam Partee is a CTO, Co-founder of Arcade AI and former Machine Learning Engineer at companies like RedisInc and HPE_Cray. They have
expertise in AI/ML, vector search, Python, HPC, and are a sports fan.