On the last few PRs I have noticed two problems:

1. `ruff format` intermittently fails even though it looks fine on our local machines.
2. Nate's and Sam's machines kept flip-flopping a specific piece of formatting back and forth, indicating a subtle config difference hiding somewhere.

This was reproducible by running `ruff format` in the terminal, followed by `make check`. The former would edit files, and then `make check` would edit them back!

This PR addresses both issues, and further standardizes our editor and linter configs so they stay stable. Specifically:

1. The main fix for the above: the pre-commit hook was pinned to a very old version of ruff. This resulted in subtle differences in behavior between our machines and CI.
2. Moved ruff settings from `pyproject.toml` to `.ruff.toml`. pyproject files in subdirectories (e.g. `toolkits/**`) were overriding the main pyproject file and erasing the custom ruff config we set at the root. This meant that our ruff config was applied to `arcade` but not to any of the other packages. By moving the config to `.ruff.toml` at the root, all projects will inherit the same ruff linting and formatting config.
3. Un-ignored the `.vscode/` directory so that we can share vscode/cursor workspace settings. This is valuable for standardizing settings like the default formatter (ruff) and default test framework (pytest). However, it's important that going forward we _only_ commit things here that should apply across all of our machines.
4. To avoid any conflict between prettier and ruff, prettier now explicitly ignores `*.py` files.
5. Finally, `ruff format` and `make check` agree. A number of files are newly auto-formatted.
import arcade_x
from arcade_x.tools.tweets import post_tweet

# TODO
# delete_tweet_by_id,
# search_recent_tweets_by_keywords,
# search_recent_tweets_by_username,
# from arcade_x.tools.users import lookup_single_user_by_username
from arcade.core.catalog import ToolCatalog
from arcade.sdk.eval import (
    EvalRubric,
    EvalSuite,
    SimilarityCritic,
    tool_eval,
)

# Evaluation rubric
rubric = EvalRubric(
    fail_threshold=0.7,
    warn_threshold=0.9,
)

catalog = ToolCatalog()
# Register the X tools
catalog.add_module(arcade_x)


@tool_eval()
def x_eval_suite() -> EvalSuite:
    """Evaluation suite for X (Twitter) tools."""

    suite = EvalSuite(
        name="X Tools Evaluation Suite",
        system_message="You are an AI assistant with access to the X (Twitter) tools. Use them to help answer the user's X-related tasks/questions.",
        catalog=catalog,
        rubric=rubric,
    )

    # Add cases
    suite.add_case(
        name="Post a tweet",
        user_message="Send out a tweet that says 'Hello World! Exciting stuff is happening over at Arcade AI!'",
        expected_tool_calls=[
            (
                post_tweet,
                {"tweet_text": "Hello World! Exciting stuff is happening over at Arcade AI!"},
            )
        ],
        critics=[
            SimilarityCritic(
                critic_field="tweet_text",
                weight=1.0,
                similarity_threshold=0.9,
            ),
        ],
    )
    return suite