arcade-mcp/toolkits/x/evals/eval_x_tools.py
Nate Barbettini 894fa878f1
Fix ruff (#64)
On the last few PRs I have noticed two problems:
1. `ruff format` fails (sometimes, not always) even though it seems OK
on our local machines
2. Nate's and Sam's machines kept flip-flopping a specific piece of
formatting back and forth, indicating a subtle config difference
hiding somewhere

This was reproducible by running `ruff format` in the terminal,
followed by `make check`. The former would edit files, and then `make
check` would edit them back!
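
The flip-flop described above can be detected mechanically. The following is a minimal sketch, not part of this PR: it hashes every `*.py` file, runs one command, hashes again, and reports which files changed. The helper names (`snapshot`, `changed_files`, `run_and_diff`) are hypothetical; `ruff format` and `make check` are the commands named above.

```python
import hashlib
import subprocess
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each *.py file under root (relative path) to the SHA-256 of its bytes."""
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*.py"))
    }


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return paths that were added, removed, or modified between two snapshots."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )


def run_and_diff(command: list[str], root: Path) -> list[str]:
    """Run a command in root and list the *.py files it rewrote."""
    before = snapshot(root)
    # check=False: formatters and linters often exit non-zero after editing files.
    subprocess.run(command, cwd=root, check=False)
    return changed_files(before, snapshot(root))
```

Calling `run_and_diff(["ruff", "format"], Path("."))` and then `run_and_diff(["make", "check"], Path("."))` from the repository root, and intersecting the two results, would list the files being formatted back and forth.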

This PR addresses both issues, and further standardizes our editor &
linter configs to be super stable.
Specifically:
1. The main fix for the above: the pre-commit hook was pinned to a very
old version of ruff.
This resulted in subtle differences in behavior between our machines,
and on CI.

2. Moved ruff settings from `pyproject.toml` to `.ruff.toml`.
`pyproject.toml` files in subdirectories (e.g. `toolkits/**`) were overriding
the main pyproject file and erasing the custom ruff config we set at the
root. This meant that our ruff config was applied to `arcade` but not to
any of the other packages.
By moving the config to `.ruff.toml` at the root, all projects will
inherit the same ruff linting & formatting config.

3. Un-ignored the `.vscode/` directory so that we can share
vscode/cursor workspace settings.
This is valuable for standardizing settings like the default formatter
(ruff) and default test framework (pytest).
However, it's important that going forward we _only_ commit things here
that should apply across all of our machines.

4. To avoid any conflict between prettier and ruff, prettier now
explicitly ignores `*.py` files.

5. Finally, `ruff format` and `make check` agree. A number of files are
newly auto-formatted.
2024-09-25 09:47:30 -07:00

57 lines
1.5 KiB
Python

import arcade_x
from arcade_x.tools.tweets import post_tweet

# TODO
# delete_tweet_by_id,
# search_recent_tweets_by_keywords,
# search_recent_tweets_by_username,
# from arcade_x.tools.users import lookup_single_user_by_username
from arcade.core.catalog import ToolCatalog
from arcade.sdk.eval import (
    EvalRubric,
    EvalSuite,
    SimilarityCritic,
    tool_eval,
)

# Evaluation rubric
rubric = EvalRubric(
    fail_threshold=0.7,
    warn_threshold=0.9,
)

catalog = ToolCatalog()
# Register the X tools
catalog.add_module(arcade_x)


@tool_eval()
def x_eval_suite() -> EvalSuite:
    """Evaluation suite for X (Twitter) tools."""
    suite = EvalSuite(
        name="X Tools Evaluation Suite",
        system_message="You are an AI assistant with access to the X (Twitter) tools. Use them to help answer the user's X-related tasks/questions.",
        catalog=catalog,
        rubric=rubric,
    )

    # Add cases
    suite.add_case(
        name="Post a tweet",
        user_message="Send out a tweet that says 'Hello World! Exciting stuff is happening over at Arcade AI!'",
        expected_tool_calls=[
            (
                post_tweet,
                {"tweet_text": "Hello World! Exciting stuff is happening over at Arcade AI!"},
            )
        ],
        critics=[
            SimilarityCritic(
                critic_field="tweet_text",
                weight=1.0,
                similarity_threshold=0.9,
            ),
        ],
    )

    return suite
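
To make the rubric thresholds above concrete, here is a minimal sketch of how a score in [0, 1] could map to an outcome. This is an assumption about the semantics, not the `arcade.sdk.eval` implementation: `classify_score` is a hypothetical helper, and it assumes that scores below `fail_threshold` fail and scores below `warn_threshold` produce a warning.

```python
def classify_score(
    score: float,
    fail_threshold: float = 0.7,
    warn_threshold: float = 0.9,
) -> str:
    """Hypothetical helper: bucket a score in [0, 1] as 'fail', 'warn', or 'pass'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    # Assumed semantics: strictly below a threshold lands in the lower bucket.
    if score < fail_threshold:
        return "fail"
    if score < warn_threshold:
        return "warn"
    return "pass"
```

Under these assumed semantics, a case whose single critic (weight 1.0) scores 0.8 would be reported as a warning rather than a failure.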