On the last few PRs I have noticed two problems:

1. `ruff format` fails on CI even though it seems OK on our local machines (sometimes, not always).
2. Nate's and Sam's machines kept flip-flopping a specific piece of formatting back and forth, indicating a subtle config difference hiding somewhere.

This was reproducible by running `ruff format` in the terminal, followed by `make check`. The former would edit files, and then `make check` would edit them back!

This PR addresses both issues and further standardizes our editor and linter configs so that they behave the same everywhere. Specifically:

1. The main fix for the above: updated the pre-commit hook, which was pinned to a very old version of ruff. The old pin resulted in subtle differences in behavior between our machines and on CI.
2. Moved ruff settings from `pyproject.toml` to `.ruff.toml`. pyproject files in subdirectories (e.g. `toolkits/**`) were overriding the main pyproject file and erasing the custom ruff config we set at the root. This meant that our ruff config was applied to `arcade` but not to any of the other packages. By moving the config to `.ruff.toml` at the root, all projects will inherit the same ruff linting and formatting config.
3. Un-ignored the `.vscode/` directory so that we can share vscode/cursor workspace settings. This is valuable for standardizing settings like the default formatter (ruff) and default test framework (pytest). However, it's important that going forward we _only_ commit things here that should apply across all of our machines.
4. To avoid any conflict between prettier and ruff, prettier now explicitly ignores `*.py` files.
5. Finally, `ruff format` and `make check` agree. A number of files are newly auto-formatted.
68 lines
1.6 KiB
Python
import arcade_math
from arcade_math.tools.arithmetic import add, sqrt

from arcade.core.catalog import ToolCatalog
from arcade.sdk.eval import (
    BinaryCritic,
    EvalRubric,
    EvalSuite,
    tool_eval,
)

# Evaluation rubric
rubric = EvalRubric(
    fail_threshold=0.85,
    warn_threshold=0.95,
)


catalog = ToolCatalog()
catalog.add_module(arcade_math)


@tool_eval()
def math_eval_suite():
    suite = EvalSuite(
        name="Math Tools Evaluation",
        system_message="You are an AI assistant with access to math tools. Use them to help the user with their math-related tasks.",
        catalog=catalog,
        rubric=rubric,
    )

    suite.add_case(
        name="Add two large numbers",
        user_message="Add 12345 and 987654321",
        expected_tool_calls=[
            (
                add,
                {
                    "a": 12345,
                    "b": 987654321,
                },
            )
        ],
        rubric=rubric,
        critics=[
            BinaryCritic(critic_field="a", weight=0.5),  # TODO: weight should be optional
            BinaryCritic(critic_field="b", weight=0.5),
        ],
    )

    suite.add_case(
        name="Take the square root of a large number",
        user_message="What is the square root of 3224990521?",
        expected_tool_calls=[
            (
                sqrt,
                {
                    "a": 3224990521,
                },
            )
        ],
        rubric=rubric,
        critics=[
            BinaryCritic(critic_field="a", weight=1.0),
        ],
    )

    return suite