arcade-mcp/toolkits/math/evals/eval_math_tools.py
Nate Barbettini 894fa878f1
Fix ruff (#64)
On the last few PRs I have noticed two problems:
1. `ruff format` fails even though it seems OK on our local machines
(sometimes, not always)
2. Nate's and Sam's machines kept flip-flopping a specific piece of
formatting back and forth, indicating a subtle difference of config
hiding somewhere

This was reproducible by running `ruff format` in the terminal,
followed by `make check`. The former would edit files, and then `make
check` would edit them back!

This PR addresses both issues, and further standardizes our editor &
linter configs to be super stable.
Specifically:
1. The main fix for the above: updated the pre-commit hook, which was
pinned to a super old version of ruff.
This resulted in subtle differences in behavior between our machines,
and on CI.

2. Moved ruff settings from `pyproject.toml` to `.ruff.toml`.
Pyproject files in subdirectories (e.g. `toolkits/**`) were overriding
the main pyproject file and erasing the custom ruff config we set at the
root. This meant that our ruff config was applied to `arcade` but not to
any of the other packages.
By moving the config to `.ruff.toml` at the root, all projects will
inherit the same ruff linting & formatting config.

3. Un-ignored the `.vscode/` directory so that we can share
vscode/cursor workspace settings.
This is valuable for standardizing settings like the default formatter
(ruff) and default test framework (pytest).
However, it's important that going forward we _only_ commit things here
that should apply across all of our machines.

4. To avoid any conflict between prettier and ruff, prettier now
explicitly ignores `*.py` files.

5. Finally, `ruff format` and `make check` agree. A number of files are
newly auto-formatted.
2024-09-25 09:47:30 -07:00

68 lines
1.6 KiB
Python

import arcade_math
from arcade_math.tools.arithmetic import add, sqrt

from arcade.core.catalog import ToolCatalog
from arcade.sdk.eval import (
    BinaryCritic,
    EvalRubric,
    EvalSuite,
    tool_eval,
)

# Evaluation rubric
rubric = EvalRubric(
    fail_threshold=0.85,
    warn_threshold=0.95,
)

catalog = ToolCatalog()
catalog.add_module(arcade_math)


@tool_eval()
def math_eval_suite():
    suite = EvalSuite(
        name="Math Tools Evaluation",
        system_message="You are an AI assistant with access to math tools. Use them to help the user with their math-related tasks.",
        catalog=catalog,
        rubric=rubric,
    )

    suite.add_case(
        name="Add two large numbers",
        user_message="Add 12345 and 987654321",
        expected_tool_calls=[
            (
                add,
                {
                    "a": 12345,
                    "b": 987654321,
                },
            )
        ],
        rubric=rubric,
        critics=[
            BinaryCritic(critic_field="a", weight=0.5),  # TODO: weight should be optional
            BinaryCritic(critic_field="b", weight=0.5),
        ],
    )

    suite.add_case(
        name="Take the square root of a large number",
        user_message="What is the square root of 3224990521?",
        expected_tool_calls=[
            (
                sqrt,
                {
                    "a": 3224990521,
                },
            )
        ],
        rubric=rubric,
        critics=[
            BinaryCritic(critic_field="a", weight=1.0),
        ],
    )

    return suite
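
The critic weights and rubric thresholds in the suite above can be read as a weighted pass/warn/fail check. The following is a minimal, self-contained sketch of that idea and does not use the arcade SDK. `score_case` and `classify` are hypothetical helpers, and the scoring rule (each binary critic contributes its full weight on an exact match, and the summed score is compared against `fail_threshold` and `warn_threshold`) is an assumption for illustration, not something the source confirms.

```python
# Hypothetical sketch: NOT the arcade SDK implementation.
# Assumption: a binary critic awards its full weight on an exact match and
# nothing otherwise; the summed score is compared against the rubric thresholds.


def score_case(expected: dict, actual: dict, weights: dict) -> float:
    """Sum the weights of the fields whose actual value equals the expected one."""
    return sum(
        weight
        for field, weight in weights.items()
        if field in actual and actual[field] == expected[field]
    )


def classify(score: float, fail_threshold: float = 0.85, warn_threshold: float = 0.95) -> str:
    """Map a score to 'fail', 'warn', or 'pass' using the two thresholds."""
    if score < fail_threshold:
        return "fail"
    if score < warn_threshold:
        return "warn"
    return "pass"


# Mirrors the "Add two large numbers" case: two critics weighted 0.5 each.
expected_args = {"a": 12345, "b": 987654321}
weights = {"a": 0.5, "b": 0.5}

full_match = score_case(expected_args, {"a": 12345, "b": 987654321}, weights)
half_match = score_case(expected_args, {"a": 12345, "b": 987654320}, weights)

print(classify(full_match))
print(classify(half_match))
```

Under these assumptions, a single wrong argument in the two-argument case drops the score to 0.5, which is below the 0.85 fail threshold, so with equal weights any mismatch fails the case.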