
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorial/task_eval_openjudge.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_tutorial_task_eval_openjudge.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorial_task_eval_openjudge.py:

Evaluation with OpenJudge
=========================

This guide introduces how to use `OpenJudge <https://github.com/agentscope-ai/OpenJudge>`_
graders as AgentScope metrics to evaluate your multi-agent applications.

OpenJudge is a comprehensive evaluation system designed to assess the quality of LLM
applications. By integrating OpenJudge into AgentScope, you can extend AgentScope's
native evaluation capabilities from basic execution checks to deep, semantic quality
analysis.

.. note:: Install dependencies before running:

    .. code-block:: bash

        pip install agentscope py-openjudge

Overview
--------

While AgentScope provides a robust ``MetricBase`` for defining evaluation logic,
implementing complex, semantic-level metrics (like "Hallucination Detection" or
"Response Relevance") often requires significant effort in prompt engineering and
pipeline construction.

Integrating OpenJudge brings three dimensions of capability extension to AgentScope:

1. **Enhance Evaluation Depth**: Move beyond simple success/failure checks to
   multi-dimensional assessments (Accuracy, Safety, Tone, etc.).
2. **Leverage Verified Graders**: Instantly access 50+ pre-built, expert-level graders
   without writing custom evaluation prompts; see the
   `OpenJudge documentation <https://agentscope-ai.github.io/OpenJudge/built_in_graders/overview/>`_
   for details.
3. **Closed-loop Iteration**: Seamlessly embed OpenJudge into AgentScope's
   ``Benchmark``, obtaining quantitative scores and qualitative reasoning.

How to Evaluate with OpenJudge
------------------------------

We are going to build a simple QA benchmark to demonstrate how to use the AgentScope
evaluation module by integrating OpenJudge's graders.

.. GENERATED FROM PYTHON SOURCE LINES 37-68

.. code-block:: Python

    QA_BENCHMARK_DATASET = [
        {
            "id": "qa_task_1",
            "question": "What are the health benefits of regular exercise?",
            "reference_output": "Regular exercise improves cardiovascular health, strengthens muscles and bones, "
            "helps maintain a healthy weight, and can improve mental health by reducing anxiety and depression.",
            "ground_truth": "Answers should cover physical and mental health benefits",
            "difficulty": "medium",
            "category": "health",
        },
        {
            "id": "qa_task_2",
            "question": "Describe the main causes of climate change.",
            "reference_output": "Climate change is primarily caused by increased concentrations of greenhouse gases "
            "in the atmosphere due to human activities like burning fossil fuels, deforestation, and industrial processes.",
            "ground_truth": "Answers should mention greenhouse gases and human activities",
            "difficulty": "hard",
            "category": "environment",
        },
        {
            "id": "qa_task_3",
            "question": "What is the significance of the Turing Test in AI?",
            "reference_output": "The Turing Test, proposed by Alan Turing, is a measure of a machine's ability to exhibit"
            " intelligent behavior equivalent to, or indistinguishable from, that of a human.",
            "ground_truth": "Should mention Alan Turing, purpose of the test, and its implications for AI",
            "difficulty": "hard",
            "category": "technology",
        },
    ]
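
Each item carries both what the agent will see (``question``) and what the graders will
see (``ground_truth`` and ``reference_output``). The sketch below is purely illustrative
and uses a made-up ``sample_response``; it previews the grader-facing view (``query``,
``response``, ``context``, ``reference_response``) that the adapter in the next section
assembles automatically at evaluation time.

.. code-block:: Python

    # Illustrative only: the grader-facing view of a single dataset item.
    # "sample_response" is a made-up agent answer used purely for demonstration;
    # the adapter defined in the next section builds this mapping for you.
    sample_item = QA_BENCHMARK_DATASET[0]
    sample_response = (
        "Regular exercise strengthens the heart and muscles, helps with weight "
        "control, and improves mood."
    )

    grader_inputs_preview = {
        "query": sample_item["question"],            # what the user asked
        "response": sample_response,                 # what the agent answered
        "context": sample_item["ground_truth"],      # what a good answer should cover
        "reference_response": sample_item["reference_output"],  # a gold answer
    }
    print(grader_inputs_preview["query"])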

.. GENERATED FROM PYTHON SOURCE LINES 69-79

AgentScope Metric vs. OpenJudge Grader
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To make OpenJudge compatible with AgentScope, we need an adapter that inherits from
AgentScope's ``MetricBase`` and acts as a bridge to OpenJudge's ``BaseGrader``.

* **AgentScope Metric**: A generic unit of evaluation that accepts a
  ``SolutionOutput`` and returns a ``MetricResult``.
* **OpenJudge Grader**: A specialized evaluation unit (e.g., ``RelevanceGrader``)
  that requires specific, semantic inputs (like ``query``, ``response``, ``context``)
  and returns a structured result (a ``GraderScore``, or a ``GraderError`` on failure).

This "Adapter" allows you to plug *any* OpenJudge grader into your AgentScope
benchmark seamlessly.

.. GENERATED FROM PYTHON SOURCE LINES 81-178

.. code-block:: Python

    from openjudge.graders.base_grader import BaseGrader
    from openjudge.graders.schema import GraderScore, GraderError
    from openjudge.utils.mapping import parse_data_with_mapper

    from agentscope.evaluate import (
        MetricBase,
        MetricType,
        MetricResult,
        SolutionOutput,
    )


    class OpenJudgeMetric(MetricBase):
        """A wrapper that converts an OpenJudge grader into an AgentScope Metric."""

        def __init__(
            self,
            grader_cls: type[BaseGrader],
            data: dict,
            mapper: dict,
            name: str | None = None,
            description: str | None = None,
            **grader_kwargs,
        ):
            # Initialize the OpenJudge grader
            self.grader = grader_cls(**grader_kwargs)

            super().__init__(
                name=name or self.grader.name,
                metric_type=MetricType.NUMERICAL,
                description=description or self.grader.description,
            )

            self.data = data
            self.mapper = mapper

        async def __call__(self, solution: SolutionOutput) -> MetricResult:
            """Execute the wrapped OpenJudge grader against the agent solution."""
            if not solution.success:
                return MetricResult(
                    name=self.name,
                    result=0.0,
                    message="Solution failed",
                )

            try:
                # 1. Context construction:
                # combine the static task context (item) and the dynamic agent output (solution)
                combined_data = {
                    "data": self.data,
                    "solution": {
                        "output": solution.output,
                        "meta": solution.meta,
                        "trajectory": getattr(solution, "trajectory", []),
                    },
                }

                # 2. Data mapping:
                # use the mapper to extract 'query', 'response', 'context' from the combined data
                grader_inputs = parse_data_with_mapper(
                    combined_data,
                    self.mapper,
                )

                # 3. Evaluation execution
                result = await self.grader.aevaluate(**grader_inputs)

                # 4. Result formatting
                if isinstance(result, GraderScore):
                    return MetricResult(
                        name=self.name,
                        result=result.score,
                        # Preserve the detailed reasoning provided by OpenJudge
                        message=result.reason or "",
                    )
                elif isinstance(result, GraderError):
                    return MetricResult(
                        name=self.name,
                        result=0.0,
                        message=f"Error: {result.error}",
                    )
                else:
                    return MetricResult(
                        name=self.name,
                        result=0.0,
                        message="Unknown result type",
                    )
            except Exception as e:
                return MetricResult(
                    name=self.name,
                    result=0.0,
                    message=f"Exception: {str(e)}",
                )
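
Before wiring the adapter into a full benchmark, it can be useful to smoke-test it on a
single hand-written answer. The sketch below is minimal and illustrative: it assumes
``DASHSCOPE_API_KEY`` is set and that ``RelevanceGrader`` accepts a ``model``
configuration keyword (as used in the benchmark later in this guide); the sample answer
is made up.

.. code-block:: Python

    # Minimal smoke test for the adapter (illustrative sketch).
    # Assumes DASHSCOPE_API_KEY is set and that RelevanceGrader accepts a
    # "model" configuration keyword, as in the benchmark defined below.
    import asyncio
    import os

    from openjudge.graders.common.relevance import RelevanceGrader


    async def smoke_test() -> None:
        metric = OpenJudgeMetric(
            grader_cls=RelevanceGrader,
            data=QA_BENCHMARK_DATASET[0],
            mapper={
                "query": "data.question",
                "response": "solution.output",
                "context": "data.ground_truth",
                "reference_response": "data.reference_output",
            },
            name="Relevance",
            model={
                "model": "qwen3-32b",
                "api_key": os.environ.get("DASHSCOPE_API_KEY"),
                "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
            },
        )
        # A hand-written answer standing in for a real agent's output
        solution = SolutionOutput(
            success=True,
            output="Exercise benefits the heart, muscles, weight, and mental health.",
            trajectory=[],
        )
        result = await metric(solution)
        print(result.result, result.message)


    # Uncomment to run the smoke test directly:
    # asyncio.run(smoke_test())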

.. GENERATED FROM PYTHON SOURCE LINES 179-194

From OpenJudge's Graders to AgentScope's Benchmark
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OpenJudge provides a rich collection of built-in graders. In this example, we select
two common graders suitable for question-answering tasks:

* **RelevanceGrader**: Evaluates whether the agent's response directly addresses the
  user's query.
* **CorrectnessGrader**: Verifies the factual accuracy of the response against a
  provided ground truth.

.. tip:: OpenJudge offers 50+ built-in graders covering diverse dimensions like
    **Hallucination**, **Safety**, **Code Quality**, and **JSON Formatting**. Please
    refer to the
    `OpenJudge Documentation <https://agentscope-ai.github.io/OpenJudge/built_in_graders/overview/>`_
    for the full list of available graders.

.. note:: Ensure you have set your ``DASHSCOPE_API_KEY`` environment variable before
    running the example below.

.. GENERATED FROM PYTHON SOURCE LINES 196-278

.. code-block:: Python

    import os
    from typing import Generator

    from openjudge.graders.common.relevance import RelevanceGrader
    from openjudge.graders.common.correctness import CorrectnessGrader

    from agentscope.evaluate import (
        Task,
        BenchmarkBase,
    )


    class QABenchmark(BenchmarkBase):
        """A benchmark for QA tasks using OpenJudge metrics."""

        def __init__(self):
            super().__init__(
                name="QA Quality Benchmark",
                description="Benchmark to evaluate QA systems using OpenJudge grader classes",
            )
            self.dataset = self._load_data()

        def _load_data(self):
            tasks = []

            # Configuration for LLM-based graders
            # Ensure DASHSCOPE_API_KEY is set in your environment variables
            model_config = {
                "model": "qwen3-32b",
                "api_key": os.environ.get("DASHSCOPE_API_KEY"),
                "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
            }

            for data in QA_BENCHMARK_DATASET:
                # Define the mapping: keys are OpenJudge grader inputs, values are
                # paths into the combined data built by OpenJudgeMetric
                mapper = {
                    "query": "data.question",
                    "response": "solution.output",
                    "context": "data.ground_truth",
                    "reference_response": "data.reference_output",
                }

                # Instantiate metrics via the wrapper
                metrics = [
                    OpenJudgeMetric(
                        grader_cls=RelevanceGrader,
                        data=data,
                        mapper=mapper,
                        name="Relevance",
                        model=model_config,
                    ),
                    OpenJudgeMetric(
                        grader_cls=CorrectnessGrader,
                        data=data,
                        mapper=mapper,
                        name="Correctness",
                        model=model_config,
                    ),
                ]

                # Create the task
                task = Task(
                    id=data["id"],
                    input=data["question"],
                    ground_truth=data["ground_truth"],
                    metrics=metrics,
                )
                tasks.append(task)

            return tasks

        def __iter__(self) -> Generator[Task, None, None]:
            """Iterate over the benchmark."""
            yield from self.dataset

        def __getitem__(self, index: int) -> Task:
            """Get a task by index."""
            return self.dataset[index]

        def __len__(self) -> int:
            """Get the length of the benchmark."""
            return len(self.dataset)
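
At this point the benchmark can already be instantiated and inspected on its own, which
is a quick way to confirm that tasks and metrics are wired up as expected. The snippet
below is a small illustrative check; it assumes that constructing the graders does not
require a live API call.

.. code-block:: Python

    # Illustrative sanity check: inspect the benchmark before running any agent.
    # Assumes grader construction itself does not require a live API call.
    benchmark = QABenchmark()

    print(len(benchmark))                    # 3 tasks
    first_task = benchmark[0]
    print(first_task.id, "->", first_task.input)
    print([metric.name for metric in first_task.metrics])  # ['Relevance', 'Correctness']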

.. GENERATED FROM PYTHON SOURCE LINES 279-284

Run Evaluation
~~~~~~~~~~~~~~

Finally, use AgentScope's ``GeneralEvaluator`` to run the benchmark on a QA agent. The
results will include both the **quantitative score** and the **qualitative reasoning**
from the OpenJudge graders. Launch the ``main`` coroutine (for example with
``asyncio.run(main())``) to start the evaluation; results are written to the storage
directory configured below.

.. GENERATED FROM PYTHON SOURCE LINES 286-342

.. code-block:: Python

    from typing import Callable

    from agentscope.agent import ReActAgent
    from agentscope.evaluate import GeneralEvaluator
    from agentscope.evaluate import FileEvaluatorStorage
    from agentscope.formatter import DashScopeChatFormatter
    from agentscope.message import Msg
    from agentscope.model import OpenAIChatModel


    async def qa_agent(task: Task, pre_hook: Callable) -> SolutionOutput:
        """Solution function that generates answers to QA tasks."""
        model = OpenAIChatModel(
            model_name="qwen3-32b",
            api_key=os.getenv("DASHSCOPE_API_KEY"),
        )

        # Create a QA agent
        agent = ReActAgent(
            name="QAAgent",
            sys_prompt="You are an expert at answering questions. "
            "Provide clear, accurate, and comprehensive answers.",
            model=model,
            formatter=DashScopeChatFormatter(),
        )

        # Generate the response
        msg_input = Msg(name="User", content=task.input, role="user")
        response = await agent(msg_input)
        response_text = response.content

        return SolutionOutput(
            success=True,
            output=response_text,
            trajectory=[
                task.input,
                response_text,
            ],  # Store the interaction trajectory
        )


    async def main() -> None:
        evaluator = GeneralEvaluator(
            name="OpenJudge Integration Demo",
            benchmark=QABenchmark(),
            # How many times to repeat each task
            n_repeat=1,
            storage=FileEvaluatorStorage(
                save_dir="./results",
            ),
            # How many parallel workers to use
            n_workers=1,
        )

        await evaluator.run(qa_agent)


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.273 seconds)


.. _sphx_glr_download_tutorial_task_eval_openjudge.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: task_eval_openjudge.ipynb <task_eval_openjudge.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: task_eval_openjudge.py <task_eval_openjudge.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: task_eval_openjudge.zip <task_eval_openjudge.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_