# PR Description
This PR adds ~~four~~ three improvements to evals.
~~## 1. Add parameterized eval cases~~
~~Adds a new method named `add_parameterized_case`. Just like pytest’s
parameterized tests, eval cases can be parameterized with multiple user
messages. Adds a case to the `EvalSuite` for each user message. All
cases have the same expected tool call(s), params, additional_messages.
This reduces duplicate code and makes it easy to observe how a model
performs based on increasingly more difficult prompts.~~
```python
""" NO LONGER IN THIS PR
user_messages = [
"Call the delete tweet by id tool with the tweet ID '148975632'.",
"Delete the tweet with ID '148975632'.",
"I don't want to have this tweet (148975632) on my account anymore.",
"do the opposite of post for https://x.com/x/status/148975632",
]
suite.add_parameterized_case(
name="Delete a tweet by ID",
user_messages=user_messages,
expected_tool_calls=[
ExpectedToolCall(
func=delete_tweet_by_id,
args={"tweet_id": "148975632"},
)
],
critics=[
BinaryCritic(
critic_field="tweet_id",
weight=1.0,
),
],
)
"""
```
~~PASSED Delete a tweet by ID (user_message 1 of 4) -- Score: 100.00%~~
~~PASSED Delete a tweet by ID (user_message 2 of 4) -- Score: 100.00%~~
~~PASSED Delete a tweet by ID (user_message 3 of 4) -- Score: 100.00%~~
~~FAILED Delete a tweet by ID (user_message 4 of 4) -- Score: 0.00%~~
~~Summary -- Total: 4 -- Passed: 3 -- Failed: 1~~
## 2. Parameters that are not explicitly criticized are assigned a
`NoneCritic`.
A NoneCritic has no effect on the evaluation results and does not
actually evaluate. Parameters that have a NoneCritic will be displayed
as ‘un-criticized’ in the evaluation summary (if `-d` flag is used).

## 3. Add a hardcoded `seed` parameter for evals.
The seed parameter aides in receiving (mostly) consistent outputs -
aiding in reproducibility for evaluations.
## 4. Disallow more than one critic for the same field.
Raises a `ValueError` if more than one critic is assigned to a field.
---------
Co-authored-by: Eric Gustin <eric@arcade-ai.com>
# PR Description
* Adds/updates the following files to all toolkits:
- `.pre-commit-config.yaml`
- `.ruff.toml`
- `LICENSE`
- `Makefile`
- `pyproject.toml`
* Lint all toolkits such that they pass `make check` and `make test` (a
total doozy). This includes adding some unit tests and evals.
* Github workflow for testing toolkits before merge into main (courtesy
of @sdreyer)
* Added a QOL improvement for tool developers for when they need to get
the context's auth token.
* Minor updates to `arcade new` template.
# PR Description
This PR renames `ExpectedToolCall` to `NamedExpectedToolCall` and then
creates a new dataclass called `ExpectedToolCall`. `ExpectedToolCall`
can be passed to the `EvalSuite.add_case` and `EvalSuite.extend_case`
methods.
1. Enhance `EvalSuite.add_case` and `EvalSuite.extend_case` by accepting
a list of `ExpectedToolCall` as their `expected_tool_calls` input
parameter. This helps create a scaffolding for developers. Previously,
the expected type was `list[tuple[Callable, dict[str, Any]]]`, which is
still valid for backward compatibility.
```python
# Before (still valid for backward compatibility)
expected_tool_calls=[
(
adjust_playback_position,
{
"absolute_position_ms": 10000,
},
)
]
# After
expected_tool_calls=[
ExpectedToolCall(
func=adjust_playback_position,
args={"absolute_position_ms": 10000},
)
]
```
2. Removed any references to arcade.core in toolkits directory.
3. Some linting for import organization.
This PR ensures that `arcade.core` does not show up anywhere in "user
space". This is crucial for helping developers understand what objects
are safe to use, and helps maintain a good developer experience.
Specific changes:
- `ToolAuthorizationContext` and `ToolContext` are now visible via
`arcade.sdk`
- `ToolCatalog` is now visible via `arcade.sdk`
- `Toolkit` is now visible via `arcade.sdk`
- `config` is now visible via `arcade.sdk.config`
**New Tools Added**
- `docs.py`: Provides tools for Google Docs functionalities, including
creating documents and inserting text.
- `drive.py`: Introduces tools for Google Drive operations, such as
listing documents.
This PR also focuses on simplifying the error handling logic in the Google
toolkit, specifically within the Calendar and Gmail tools. The primary
change involves removing redundant `try-except` blocks that were
catching `HttpError` and general exceptions, and re-raising them as
`ToolExecutionError`. By removing these blocks, we allow exceptions to
propagate naturally, and be handled by the ``ToolExecutor``
---------
Co-authored-by: Eric Gustin <eric@arcade-ai.com>