@EricGustin you can use this CLI command:
```
uv run arcade evals mcp_building_evals_results/eval_toolkit_iteration_dict.py \
-p openai:gpt-4o,gpt-4o-mini \
-p anthropic:claude-sonnet-4-20250514 \
-k openai:$OPENAI_API_KEY \
-k anthropic:$ANTHROPIC_API_KEY \
-d \
--num-runs 3 \
--seed random \
--multi-run-pass-rule majority \
--max-concurrent 6 \
-o mcp_building_evals_results/results
```
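For anyone unfamiliar with the repeatable `-p` form in the command above, here is a minimal sketch of how repeated `provider:model,model` values could be collected and split. It uses `argparse` purely for illustration; the helper name `parse_provider_spec` is hypothetical and this is not the actual `arcade evals` parser.

```python
import argparse


def parse_provider_spec(spec: str) -> tuple[str, list[str]]:
    """Split 'provider:model1,model2' into (provider, [models]).

    Hypothetical helper for illustration; not part of arcade.
    """
    provider, _, models = spec.partition(":")
    return provider.strip(), [m.strip() for m in models.split(",") if m.strip()]


parser = argparse.ArgumentParser()
# action="append" lets the flag be passed several times, one provider per use.
parser.add_argument("-p", "--use-provider", action="append", default=[])

args = parser.parse_args(
    ["-p", "openai:gpt-4o,gpt-4o-mini", "-p", "anthropic:claude-sonnet-4-20250514"]
)
providers = dict(parse_provider_spec(s) for s in args.use_provider)
print(providers)
```

Each `-p` occurrence contributes one provider entry, which avoids the ambiguity of packing several providers into one space-delimited string.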
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Touches core eval execution and all result formatters while adding new
CLI inputs and output schema (`run_stats`/`critic_stats` and capture
`runs`), so regressions could affect evaluation results and report
compatibility despite being additive and validated.
>
> **Overview**
> Adds **multi-run evaluation support** to `arcade evals` via new flags
`--num-runs`, `--seed`, and `--multi-run-pass-rule`, with upfront
validation and plumbing through the CLI runner into eval/capture suite
execution.
>
> Fixes provider selection UX/bug by making `--use-provider/-p`
**repeatable** (instead of a space-delimited string), updates
docs/examples accordingly, and extends capture mode to optionally record
**per-run tool calls** (`CapturedRun`) when `num_runs > 1`.
>
> Enhances all output formatters (HTML/Markdown/Text/JSON) to
**propagate and display** per-case `run_stats` and `critic_stats`,
including new HTML UI for run tabs/cards and comparative tables showing
mean ± stddev when multi-run data is present.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
2ee1654b7d1fbb9538373507355636164b16a066. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
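
To make the multi-run terms in the summary concrete, the sketch below shows one way per-case scores from several runs might be reduced to a mean, a standard deviation, and a `majority` pass decision. The function name `summarize_runs`, the use of sample standard deviation, the "score at or above `fail_threshold` counts as a pass" rule, and the strict-majority tie handling are all assumptions for illustration, not the behavior implemented in this PR.

```python
from statistics import mean, stdev


def summarize_runs(scores: list[float], fail_threshold: float = 0.85) -> dict:
    """Aggregate per-run scores for a single eval case (illustrative only)."""
    # Assumption: a run passes when its score reaches the fail threshold.
    passes = [score >= fail_threshold for score in scores]
    return {
        "mean": mean(scores),
        # Sample standard deviation needs at least two runs.
        "stddev": stdev(scores) if len(scores) > 1 else 0.0,
        # Assumption: "majority" means strictly more than half of the runs pass.
        "passed": sum(passes) > len(scores) / 2,
    }


result = summarize_runs([0.9, 0.8, 0.95])
print(f"{result['mean']:.3f} ± {result['stddev']:.3f}, passed={result['passed']}")
```

With `--num-runs 3`, two passing runs out of three would satisfy a majority rule under these assumptions even though one run falls below the threshold.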
```python
from arcade_evals import (
    EvalRubric,
    EvalSuite,
    ExpectedToolCall,
    SimilarityCritic,
    tool_eval,
)
from arcade_tdk import ToolCatalog

import arcade_linkedin
from arcade_linkedin.tools.share import create_text_post

rubric = EvalRubric(
    fail_threshold=0.85,
    warn_threshold=0.95,
)


catalog = ToolCatalog()
catalog.add_module(arcade_linkedin)


@tool_eval()
def linkedin_eval_suite() -> EvalSuite:
    suite = EvalSuite(
        name="LinkedIn Tools Evaluation",
        system_message="You are an AI assistant with access to LinkedIn tools. Use them to help the user with their tasks.",
        catalog=catalog,
        rubric=rubric,
    )

    suite.add_case(
        name="Run code",
        user_message="post this transcription to linkedin. there may be some things that you need to clean up since it was spoken.: 'It is with great pleasure that I announce that I am now a member of the LinkedIn community! I'd like to thank the LinkedIn team for their support and encouragement in my journey to success. hash tag Y2K'",
        expected_tool_calls=[
            ExpectedToolCall(
                func=create_text_post,
                args={
                    "text": "It is with great pleasure that I announce that I am now a member of the LinkedIn community! I'd like to thank the LinkedIn team for their support and encouragement in my journey to success. #Y2K",
                },
            )
        ],
        critics=[
            SimilarityCritic(critic_field="text", weight=1.0),
        ],
    )

    return suite
```