arcade-mcp

Author	SHA1	Message	Date
Eric Gustin	be2539602f	Evals New Features (#208 ) # PR Description This PR adds ~~four~~ three improvements to evals. ~~## 1. Add parameterized eval cases~~ ~~Adds a new method named `add_parameterized_case`. Just like pytest’s parameterized tests, eval cases can be parameterized with multiple user messages. Adds a case to the `EvalSuite` for each user message. All cases have the same expected tool call(s), params, additional_messages. This reduces duplicate code and makes it easy to observe how a model performs based on increasingly more difficult prompts.~~ ```python """ NO LONGER IN THIS PR user_messages = [ "Call the delete tweet by id tool with the tweet ID '148975632'.", "Delete the tweet with ID '148975632'.", "I don't want to have this tweet (148975632) on my account anymore.", "do the opposite of post for https://x.com/x/status/148975632", ] suite.add_parameterized_case( name="Delete a tweet by ID", user_messages=user_messages, expected_tool_calls=[ ExpectedToolCall( func=delete_tweet_by_id, args={"tweet_id": "148975632"}, ) ], critics=[ BinaryCritic( critic_field="tweet_id", weight=1.0, ), ], ) """ ``` ~~PASSED Delete a tweet by ID (user_message 1 of 4) -- Score: 100.00%~~ ~~PASSED Delete a tweet by ID (user_message 2 of 4) -- Score: 100.00%~~ ~~PASSED Delete a tweet by ID (user_message 3 of 4) -- Score: 100.00%~~ ~~FAILED Delete a tweet by ID (user_message 4 of 4) -- Score: 0.00%~~ ~~Summary -- Total: 4 -- Passed: 3 -- Failed: 1~~ ## 2. Parameters that are not explicitly criticized are assigned a `NoneCritic`. A NoneCritic has no effect on the evaluation results and does not actually evaluate. Parameters that have a NoneCritic will be displayed as ‘un-criticized’ in the evaluation summary (if `-d` flag is used). ![image](https://github.com/user-attachments/assets/300756ec-9b53-436a-9cf9-fc61d0b00c01) ## 3. Add a hardcoded `seed` parameter for evals. 
The seed parameter helps produce (mostly) consistent outputs, which aids reproducibility for evaluations. ## 4. Disallow more than one critic for the same field. Raises a `ValueError` if more than one critic is assigned to a field. --------- Co-authored-by: Eric Gustin <eric@arcade-ai.com>	2025-02-05 15:22:08 -08:00
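Items 2 and 4 of the commit above (a default `NoneCritic` for un-criticized parameters, and a `ValueError` for duplicate critics on one field) can be illustrated with a minimal sketch. This is not the Evals SDK's implementation: the two critic classes are simplified stand-ins that only borrow the names from the PR description, and `resolve_critics` is a hypothetical helper invented for this example.

```python
from dataclasses import dataclass


@dataclass
class BinaryCritic:
    """Simplified stand-in: awards the full weight on an exact match."""
    critic_field: str
    weight: float = 1.0

    def evaluate(self, expected, actual):
        return self.weight if expected == actual else 0.0


@dataclass
class NoneCritic:
    """Simplified stand-in: a placeholder that never affects the score."""
    critic_field: str
    weight: float = 0.0

    def evaluate(self, expected, actual):
        return 0.0


def resolve_critics(expected_args, critics):
    """Hypothetical helper: map every expected parameter to one critic.

    Raises ValueError if two critics target the same field, and fills in
    a NoneCritic for any parameter that has no explicit critic.
    """
    by_field = {}
    for critic in critics:
        if critic.critic_field in by_field:
            raise ValueError(
                f"More than one critic assigned to field '{critic.critic_field}'"
            )
        by_field[critic.critic_field] = critic
    for name in expected_args:
        by_field.setdefault(name, NoneCritic(critic_field=name))
    return by_field
```

Under these assumptions, calling `resolve_critics({"tweet_id": "148975632", "reason": "spam"}, [BinaryCritic("tweet_id")])` returns a `BinaryCritic` for `tweet_id` and a `NoneCritic` for `reason`, while passing two critics for `tweet_id` raises `ValueError`.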
Eric Gustin	ab889f9f1d	Lint all toolkits (#183 ) # PR Description * Adds/updates the following files to all toolkits: - `.pre-commit-config.yaml` - `.ruff.toml` - `LICENSE` - `Makefile` - `pyproject.toml` * Lint all toolkits such that they pass `make check` and `make test` (a total doozy). This includes adding some unit tests and evals. * GitHub workflow for testing toolkits before merge into main (courtesy of @sdreyer) * Added a QOL improvement for tool developers for when they need to get the context's auth token. * Minor updates to `arcade new` template.	2024-12-20 09:49:45 -08:00
Eric Gustin	7c228a59d5	Update Evals SDK (#175 ) # PR Description This PR renames `ExpectedToolCall` to `NamedExpectedToolCall` and then creates a new dataclass called `ExpectedToolCall`. `ExpectedToolCall` can be passed to the `EvalSuite.add_case` and `EvalSuite.extend_case` methods. 1. Enhance `EvalSuite.add_case` and `EvalSuite.extend_case` by accepting a list of `ExpectedToolCall` as their `expected_tool_calls` input parameter. This helps create a scaffolding for developers. Previously, the expected type was `list[tuple[Callable, dict[str, Any]]]`, which is still valid for backward compatibility. ```python # Before (still valid for backward compatibility) expected_tool_calls=[ ( adjust_playback_position, { "absolute_position_ms": 10000, }, ) ] # After expected_tool_calls=[ ExpectedToolCall( func=adjust_playback_position, args={"absolute_position_ms": 10000}, ) ] ``` 2. Removed any references to arcade.core in toolkits directory. 3. Some linting for import organization.	2024-12-19 10:29:13 -08:00
Nate Barbettini	036ad54ac6	Remove arcade.core from all examples (#121 ) This PR ensures that `arcade.core` does not show up anywhere in "user space". This is crucial for helping developers understand what objects are safe to use, and helps maintain a good developer experience. Specific changes: - `ToolAuthorizationContext` and `ToolContext` are now visible via `arcade.sdk` - `ToolCatalog` is now visible via `arcade.sdk` - `Toolkit` is now visible via `arcade.sdk` - `config` is now visible via `arcade.sdk.config`	2024-10-24 17:08:04 -07:00
Sam Partee	63cabe8f1f	Google Toolkit (Drive, Docs) (#97 ) New Tools Added - `docs.py`: Provides tools for Google Docs functionalities, including creating documents and inserting text. - `drive.py`: Introduces tools for Google Drive operations, such as listing documents. This PR also simplifies the error handling logic in the Google toolkit, specifically within the Calendar and Gmail tools. The primary change is removing redundant `try-except` blocks that caught `HttpError` and general exceptions and re-raised them as `ToolExecutionError`. With these blocks removed, exceptions propagate naturally and are handled by the `ToolExecutor`. --------- Co-authored-by: Eric Gustin <eric@arcade-ai.com>	2024-10-08 17:04:16 -07:00
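The error-handling change described in the Google Toolkit commit can be sketched as follows. This is an illustration of the pattern only, not Arcade's code: `ToolExecutionError` here is a locally defined stand-in, and `list_documents` and `run_tool` are hypothetical names, with `run_tool` playing the role the commit message assigns to the `ToolExecutor`.

```python
class ToolExecutionError(Exception):
    """Stand-in for the error type named in the commit message."""


def list_documents(fetch):
    """Hypothetical tool body with no try/except of its own.

    Any exception raised by `fetch` propagates to the caller unchanged.
    """
    return fetch()


def run_tool(tool, *args):
    """Hypothetical executor: the single place where errors are wrapped."""
    try:
        return tool(*args)
    except Exception as exc:
        # Chain the original exception so the root cause stays visible.
        raise ToolExecutionError(
            f"Error executing {tool.__name__}: {exc}"
        ) from exc
```

Keeping the wrapping in one place means each tool stays short, and the original exception remains available through `__cause__` on the wrapped error.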

5 commits