MultiModality

In this section, we will show how to build multimodal applications in AgentScope with two examples.

The first example demonstrates how to use vision LLMs within an agent, and
the second example shows how to use text-to-image generation within an agent.

Building Vision Agent

For most LLM APIs, vision and non-vision LLMs share the same API and differ only in the input format. In AgentScope, the format function of the model wrapper converts the input Msg objects into the format that the vision LLM requires.
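To make the idea of format conversion concrete, here is a minimal sketch of what such a step might look like. It is not AgentScope's actual format function: the helper name `to_multimodal_message` is hypothetical, and the list-of-parts layout with `image` and `text` keys is an assumption about the target format.

```python
from typing import Any, Dict, List, Optional


def to_multimodal_message(
    role: str,
    text: str,
    image_urls: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build one chat message whose content is a
    list of parts, with image parts first and the text part last.

    The key names ("image", "text") are assumptions for illustration.
    """
    content: List[Dict[str, str]] = [
        {"image": url} for url in (image_urls or [])
    ]
    content.append({"text": text})
    return {"role": role, "content": content}


# A text-only message and a message with one attached image share the
# same structure; only the list of parts differs.
text_only = to_multimodal_message("user", "Hello!")
with_image = to_multimodal_message(
    "user",
    "Describe the attached image for me.",
    ["./bar.png"],
)
```

Because the conversion is isolated in one place, the code that creates and sends messages does not need to know which kind of model will receive them.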

As a result, we only need to specify the vision LLM; the agent’s code stays unchanged. Taking “qwen-vl-max” as an example, its model configuration has the same structure as that of the non-vision LLMs in the DashScope Chat API.

Refer to the Model APIs section for the vision LLM APIs supported in AgentScope.

model_config = {
    "config_name": "my-qwen-vl",
    "model_type": "dashscope_multimodal",
    "model_name": "qwen-vl-max",
}

As usual, we initialize AgentScope with the above configuration, and create a new agent with the vision LLM.

from agentscope.agents import DialogAgent
import agentscope

agentscope.init(model_configs=model_config)

agent = DialogAgent(
    name="Monday",
    sys_prompt="You're a helpful assistant named Monday.",
    model_config_name="my-qwen-vl",
)

2025-08-07 11:24:55 | INFO     | agentscope.manager._model:load_model_configs:138 - Load configs for model wrapper: my-qwen-vl

To send pictures to the vision agent, the Msg class provides a url field. It accepts both local paths and online image URLs.
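If you need to tell the two kinds of URL apart in your own code, for example to check that a local file exists before sending it, a small check like the following may help. This is a sketch using only the Python standard library; `is_online_url` is a hypothetical helper and not part of AgentScope.

```python
from urllib.parse import urlparse


def is_online_url(url: str) -> bool:
    """Return True for http(s) URLs; anything else is treated as a local path."""
    return urlparse(url).scheme in ("http", "https")


local_result = is_online_url("./bar.png")
online_result = is_online_url("https://example.com/bar.png")
```

Here `https://example.com/bar.png` is only a placeholder address.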

Let’s first create an image with matplotlib:

import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.bar(range(3), [2, 1, 4])
plt.xticks(range(3), ["Alice", "Bob", "Charlie"])
plt.title("The Apples Each Person Has in 2023")
plt.xlabel("Number of Apples")

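# Save before showing: with interactive backends the figure may be
# closed after plt.show(), which would leave savefig with an empty canvas.
plt.savefig("./bar.png")
plt.show()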

Then, we create a Msg object with the image URL:

from agentscope.message import Msg

msg = Msg(
    name="User",
    content="Describe the attached image for me.",
    role="user",
    url="./bar.png",
)

After that, we can send the message to the vision agent and get the response.

response = agent(msg)

2025-08-07 11:24:55 | ERROR    | agentscope.message.msg:__init__:112 - The url argument will be deprecated in the future. Consider using the ContentBlock instead to attach files to the message
Monday: The image is a bar chart titled **"The Apples Each Person Has in 2023"**. It displays the number of apples that three individuals—Alice, Bob, and Charlie—possess. Here are the details:

- **X-axis**: Labeled as "Number of Apples," it represents the names of the individuals: Alice, Bob, and Charlie.
- **Y-axis**: Represents the quantity of apples, ranging from 0 to 4.

### Data Representation:
- **Alice** has **2 apples**.
- **Bob** has **1 apple**.
- **Charlie** has **4 apples**.

### Visual Characteristics:
- The bars are colored in blue.
- Charlie's bar is the tallest, indicating he has the most apples.
- Bob's bar is the shortest, indicating he has the fewest apples.

This chart effectively compares the number of apples each person has in a clear and straightforward manner.
