.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "tutorial/task_eval.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. .. rst-class:: sphx-glr-example-title .. _sphx_glr_tutorial_task_eval.py: .. _eval: Evaluation ========================= AgentScope provides a built-in evaluation framework for assessing agent performance across different tasks and benchmarks, featuring: - `Ray `_-based parallel and distributed evaluation - Support continuation after interruption - 🚧 Visualization of evaluation results .. note:: We are keeping integrating new benchmarks into AgentScope: - ✅ `ACEBench `_ - 🚧 `GAIA `_ Benchmark Overview --------------------------- The AgentScope evaluation framework consists of several key components: - **Benchmark**: Collections of tasks for systematic evaluation - **Task**: Individual evaluation units with inputs, ground truth, and metrics - **Metric**: Measurement functions that assess solution quality - **Evaluator**: Engine that runs evaluation, aggregates results, and analyzes performance - **Evaluator Storage**: Persistent storage for recording and retrieving evaluation results - **Solution**: The user-defined solution .. figure:: ../../_static/images/evaluation.png :width: 90% :alt: AgentScope Evaluation Framework *AgentScope Evaluation Framework* The current implementation in AgentScope includes: - Evaluator: - ``RayEvaluator``: A ray-based evaluator that supports parallel and distributed evaluation. - ``GeneralEvaluator``: A general evaluator that runs tasks sequentially, friendly for debugging. - Benchmark: - ``ACEBench``: A benchmark for evaluating agent capabilities. We have provided a toy example in our `GitHub repository `_ with ``RayEvaluator`` and the agent multistep tasks in ACEBench. Core Components --------------- We are going to build a simple toy math question benchmark to demonstrate how to use the AgentScope evaluation module. .. GENERATED FROM PYTHON SOURCE LINES 53-75 .. code-block:: Python TOY_BENCHMARK = [ { "id": "math_problem_1", "question": "What is 2 + 2?", "ground_truth": 4.0, "tags": { "difficulty": "easy", "category": "math", }, }, { "id": "math_problem_2", "question": "What is 12345 + 54321 + 6789 + 9876?", "ground_truth": 83331, "tags": { "difficulty": "medium", "category": "math", }, }, ] .. GENERATED FROM PYTHON SOURCE LINES 76-82 From Tasks, Solutions and Metrics to Benchmark ~~~~~~~~~~~~~~~~~~~ - A ``SolutionOutput`` contains all information generated by the agent, including the trajectory and final output. - A ``Metric`` represents a single evaluation callable instance that compares the generated solution (e.g., trajectory or final output) to the ground truth. In the toy example, we define a metric that simply checks whether the ``output`` field in the solution matches the ground truth. .. GENERATED FROM PYTHON SOURCE LINES 82-122 .. code-block:: Python from agentscope.evaluate import ( SolutionOutput, MetricBase, MetricResult, MetricType, ) class CheckEqual(MetricBase): def __init__( self, ground_truth: float, ): super().__init__( name="math check number equal", metric_type=MetricType.NUMERICAL, description="Toy metric checking if two numbers are equal", categories=[], ) self.ground_truth = ground_truth def __call__( self, solution: SolutionOutput, ) -> MetricResult: if solution.output == self.ground_truth: return MetricResult( name=self.name, result=1.0, message="Correct", ) else: return MetricResult( name=self.name, result=0.0, message="Incorrect", ) .. GENERATED FROM PYTHON SOURCE LINES 123-125 - A ``Task`` is a unit in the benchmark that includes all information for the agent to execute and evaluate (e.g., input/query and its ground truth). - A ``Benchmark`` organizes multiple tasks for systematic evaluation. .. GENERATED FROM PYTHON SOURCE LINES 125-173 .. code-block:: Python from typing import Generator from agentscope.evaluate import ( Task, BenchmarkBase, ) class ToyBenchmark(BenchmarkBase): def __init__(self): super().__init__( name="Toy bench", description="A toy benchmark for demonstrating the evaluation module.", ) self.dataset = self._load_data() @staticmethod def _load_data() -> list[Task]: dataset = [] for item in TOY_BENCHMARK: dataset.append( Task( id=item["id"], input=item["question"], ground_truth=item["ground_truth"], tags=item.get("tags", {}), metrics=[ CheckEqual(item["ground_truth"]), ], metadata={}, ), ) return dataset def __iter__(self) -> Generator[Task, None, None]: """Iterate over the benchmark.""" for task in self.dataset: yield task def __getitem__(self, index: int) -> Task: """Get a task by index.""" return self.dataset[index] def __len__(self) -> int: """Get the length of the benchmark.""" return len(self.dataset) .. GENERATED FROM PYTHON SOURCE LINES 174-185 Evaluators ~~~~~~~~~~ Evaluators manage the evaluation process. They can automatically iterate through the tasks in the benchmark and feed each task into a solution-generation function, where developers need to define the logic for running agents and retrieving the execution result and trajectory. Below is an example of running ``GeneralEvaluator`` with our toy benchmark. If there is a large benchmark and the developer wants to get the evaluation more efficiently through parallelization, ``RayEvaluator`` is available as a built-in solution as well. .. GENERATED FROM PYTHON SOURCE LINES 185-258 .. code-block:: Python import os import asyncio from typing import Callable from pydantic import BaseModel from agentscope.message import Msg from agentscope.model import DashScopeChatModel from agentscope.formatter import DashScopeChatFormatter from agentscope.agent import ReActAgent from agentscope.evaluate import ( GeneralEvaluator, FileEvaluatorStorage, ) class ToyBenchAnswerFormat(BaseModel): answer_as_number: float async def toy_solution_generation( task: Task, pre_hook: Callable, ) -> SolutionOutput: agent = ReActAgent( name="Friday", sys_prompt="You are a helpful assistant named Friday. " "Your target is to solve the given task with your tools. " "Try to solve the task as best as you can.", model=DashScopeChatModel( api_key=os.environ.get("DASHSCOPE_API_KEY"), model_name="qwen-max", stream=False, ), formatter=DashScopeChatFormatter(), ) agent.register_instance_hook( "pre_print", "save_logging", pre_hook, ) msg_input = Msg("user", task.input, role="user") res = await agent( msg_input, structured_model=ToyBenchAnswerFormat, ) return SolutionOutput( success=True, output=res.metadata.get("answer_as_number", None), trajectory=[], ) async def main() -> None: evaluator = GeneralEvaluator( name="ACEbench evaluation", benchmark=ToyBenchmark(), # Repeat how many times n_repeat=1, storage=FileEvaluatorStorage( save_dir="./results", ), # How many workers to use n_workers=1, ) # Run the evaluation await evaluator.run(toy_solution_generation) asyncio.run(main()) .. rst-class:: sphx-glr-script-out .. code-block:: none Friday: The answer to 2 + 2 is 4. Friday: The sum of 12345, 54321, 6789, and 9876 is 83331. Repeat ID: 0 Metric: math check number equal Type: MetricType.NUMERICAL Involved tasks: 2 Completed tasks: 2 Incomplete tasks: 0 Aggregation: { "mean": 1.0, "max": 1.0, "min": 1.0 } .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 7.152 seconds) .. _sphx_glr_download_tutorial_task_eval.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: task_eval.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: task_eval.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: task_eval.zip ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_