agentscope.tts

The TTS (Text-to-Speech) module.

class TTSModelBase[source]

Bases: ABC

Base class for TTS models in AgentScope.

This base class provides a general abstraction for both realtime and non-realtime TTS models (distinguished by whether streaming input is supported).

For non-realtime TTS models, the synthesize method is used to synthesize speech from the input text. You only need to implement the _call_api method to handle the TTS API calls.

For realtime TTS models, the lifecycle is managed via the async context manager or by calling the connect and close methods. The push method appends text chunks and returns any TTS response received so far, while the synthesize method blocks until the full speech is synthesized. You need to implement the connect, close, and _call_api methods to handle the TTS API calls and resource management.
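A minimal sketch of both lifecycles, assuming a concrete subclass instance with stream=False and that Msg (from agentscope.message) is constructed with name, content, and role; the convention that chunks of one streaming request accumulate under a single msg.id is also an assumption:

    from agentscope.message import Msg

    async def non_realtime_demo(tts):
        # Non-realtime: a single blocking call.
        msg = Msg(name="assistant", content="Hello, world!", role="assistant")
        res = await tts.synthesize(msg)
        print(res.content)                 # the synthesized AudioBlock

    async def realtime_demo(tts):
        # Realtime: connect() on enter, close() on exit.
        async with tts:
            msg = Msg(name="assistant", content="", role="assistant")
            for chunk in ["Hello, ", "world!"]:
                msg.content += chunk       # chunks of one request share msg.id (assumption)
                await tts.push(msg)        # non-blocking; may return an empty response
            res = await tts.synthesize()   # block until all pushed text is synthesized
            print(res.content)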

supports_streaming_input: bool = False

Whether the TTS model class supports streaming input.

__init__(model_name, stream)[source]

Initialize the TTS model base class.

Parameters:
  • model_name (str) – The name of the TTS model

  • stream (bool) – Whether to use streaming synthesis if supported by the model.

Return type:

None

model_name: str

The name of the TTS model.

stream: bool

Whether to use streaming synthesis if supported by the model.

async connect()[source]

Connect to the TTS model and initialize resources. For non-realtime TTS models, leave this method empty.

Note

Only needs to be implemented for realtime TTS models.

Return type:

None

async close()[source]

Close the connection to the TTS model and clean up resources. For non-realtime TTS models, leave this method empty.

Note

Only needs to be implemented for realtime TTS models.

Return type:

None

async push(msg, **kwargs)[source]

Append text to be synthesized and return the received TTS response. Note that this method is non-blocking and may return an empty response if no audio has been received yet.

To receive all the synthesized speech, call the synthesize method after pushing all the text chunks.

Note

Only needs to be implemented for realtime TTS models.

Parameters:
  • msg (Msg) – The message to be synthesized. The msg.id identifies the streaming input request.

  • **kwargs (Any) – Additional keyword arguments to pass to the TTS API call.

Returns:

The TTSResponse containing an audio block.

Return type:

TTSResponse

abstract async synthesize(msg=None, **kwargs)[source]

Synthesize speech from the appended text. Unlike the push method, this method blocks until the full speech is synthesized.

Parameters:
  • msg (Msg | None, defaults to None) – The message to be synthesized. If None, this method will wait for all previously pushed text to be synthesized, and return the last synthesized TTSResponse.

  • kwargs (Any)

Returns:

The TTSResponse containing the audio block, or an async generator yielding TTSResponse objects in streaming mode.

Return type:

TTSResponse | AsyncGenerator[TTSResponse, None]
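Because synthesize returns either a single TTSResponse or an async generator depending on the stream flag, a caller that must handle both modes can branch on the returned object. A minimal sketch, relying only on the signatures documented above:

    from collections.abc import AsyncGenerator

    async def collect_audio_blocks(tts, msg):
        # Normalize both return shapes of synthesize() into a list of audio blocks.
        res = await tts.synthesize(msg)
        if isinstance(res, AsyncGenerator):        # streaming mode
            blocks = []
            async for chunk in res:
                if chunk.content is not None:
                    blocks.append(chunk.content)
                if chunk.is_last:                  # last response in the stream
                    break
            return blocks
        return [res.content] if res.content is not None else []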

class TTSResponse[source]

Bases: DictMixin

The response of TTS models.

content: AudioBlock | None

The content of the TTS response, which contains the audio block if available.

__init__(content, id=<factory>, created_at=<factory>, type=<factory>, usage=<factory>, metadata=<factory>, is_last=True)
Parameters:
  • content (AudioBlock | None)

  • id (str)

  • created_at (str)

  • type (Literal['tts'])

  • usage (TTSUsage | None)

  • metadata (dict[str, str | int | float | bool | None | list[JSONSerializableObject] | dict[str, JSONSerializableObject]] | None)

  • is_last (bool)

Return type:

None

id: str

The unique identifier of the response.

created_at: str

When the response was created.

type: Literal['tts']

The type of the response, which is always ‘tts’.

usage: TTSUsage | None

The usage information of the TTS response, if available.

metadata: dict[str, str | int | float | bool | None | list[JSONSerializableObject] | dict[str, JSONSerializableObject]] | None

The metadata of the TTS response.

is_last: bool = True

Whether this is the last response in a stream of TTS responses.

class TTSUsage[source]

Bases: DictMixin

The usage of a TTS model API invocation.

input_tokens: int

The number of input tokens.

output_tokens: int

The number of output tokens.

time: float

The time used in seconds.

type: Literal['tts']

The type of the usage, which is always ‘tts’.

__init__(input_tokens, output_tokens, time, type=<factory>)
Parameters:
  • input_tokens (int)

  • output_tokens (int)

  • time (float)

  • type (Literal['tts'])

Return type:

None
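A small hedged illustration of reading the usage statistics off a response, using only the fields defined above:

    def report_usage(res):
        # res: a TTSResponse; usage may be None for some backends.
        if res.usage is not None:
            print(f"{res.usage.input_tokens} input / {res.usage.output_tokens} output tokens")
            print(f"synthesis took {res.usage.time:.2f} seconds")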

class DashScopeTTSModel[source]

Bases: TTSModelBase

DashScope TTS model implementation using the MultiModalConversation API. For more details, please see the official document.

supports_streaming_input: bool = False

Whether the model supports streaming input.

__init__(api_key, model_name='qwen3-tts-flash', voice='Cherry', language_type='Auto', stream=True, generate_kwargs=None)[source]

Initialize the DashScope SDK TTS model.

Note

More details about the parameters, such as model_name, voice, and language_type, can be found in the official document.

Parameters:
  • api_key (str) – The DashScope API key. Required.

  • model_name (str, defaults to “qwen3-tts-flash”) – The TTS model name. Supported models are “qwen3-tts-flash”, “qwen-tts”, etc.

  • voice (Literal[“Cherry”, “Serena”, “Ethan”, “Chelsie”] | str, defaults to “Cherry”) – The voice to use. Supported voices are “Cherry”, “Serena”, “Ethan”, “Chelsie”, etc.

  • language_type (str, defaults to “Auto”) – The language type. Should match the text language for correct pronunciation and natural intonation.

  • stream (bool, defaults to True) – Whether to use streaming synthesis if supported by the model.

  • generate_kwargs (dict[str, JSONSerializableObject] | None, optional) – The extra keyword arguments used in DashScope TTS API generation, e.g. temperature, seed.

Return type:

None

async synthesize(msg=None, **kwargs)[source]

Call the DashScope TTS API to synthesize speech from text.

Parameters:
  • msg (Msg | None, optional) – The message to be synthesized.

  • **kwargs (Any) – Additional keyword arguments to pass to the TTS API call.

Returns:

The TTS response or an async generator yielding TTSResponse objects in streaming mode.

Return type:

TTSResponse | AsyncGenerator[TTSResponse, None]
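A hedged usage sketch (the Msg import path and constructor arguments are assumptions; stream=False is passed so a single TTSResponse comes back):

    import asyncio
    import os

    from agentscope.message import Msg
    from agentscope.tts import DashScopeTTSModel

    async def main():
        tts = DashScopeTTSModel(
            api_key=os.environ["DASHSCOPE_API_KEY"],
            model_name="qwen3-tts-flash",
            voice="Cherry",
            stream=False,                  # one blocking TTSResponse
        )
        msg = Msg(name="assistant", content="你好，世界！", role="assistant")
        res = await tts.synthesize(msg)
        print(res.content)                 # AudioBlock with the synthesized speech

    asyncio.run(main())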

class DashScopeRealtimeTTSModel[source]

Bases: TTSModelBase

TTS implementation for DashScope Qwen Realtime TTS API, which supports streaming input. The supported models include “qwen3-tts-flash-realtime”, “qwen-tts-realtime”, etc.

For more details, please see the official document.

Note

The DashScopeRealtimeTTSModel can only handle one streaming input request at a time, and cannot process multiple streaming input requests concurrently. For example, it cannot handle input sequences like [msg_1_chunk0, msg_1_chunk1, msg_2_chunk0], where the prefixes “msg_x” indicate different streaming input requests.

supports_streaming_input: bool = True

Whether the model supports streaming input.

__init__(api_key, model_name='qwen3-tts-flash-realtime', voice='Cherry', stream=True, cold_start_length=None, cold_start_words=None, client_kwargs=None, generate_kwargs=None)[source]

Initialize the DashScope TTS model by specifying the model, voice, and other parameters.

Note

More details about the parameters, such as model_name, voice, and mode, can be found in the official document.

Note

You can use cold_start_length and cold_start_words simultaneously to set both character and word thresholds for the first TTS request. For Chinese text, word segmentation (based on spaces) may not be effective.

Parameters:
  • api_key (str) – The DashScope API key.

  • model_name (str, defaults to “qwen3-tts-flash-realtime”) – The TTS model name, e.g. “qwen3-tts-flash-realtime”, “qwen-tts-realtime”, etc.

  • voice (Literal[“Cherry”, “Serena”, “Ethan”, “Chelsie”] | str, defaults to “Cherry”) – The voice to use for synthesis. Refer to the official document for the supported voices of each model.

  • stream (bool, defaults to True) – Whether to use streaming synthesis.

  • cold_start_length (int | None, optional) – The minimum number of characters to buffer before sending the first TTS request, which avoids pauses in the synthesized speech when the initial input text is too short.

  • cold_start_words (int | None, optional) – The minimum number of words to buffer before sending the first TTS request, which avoids pauses in the synthesized speech when the initial input text is too short. Words are identified by spaces in the text.

  • client_kwargs (dict[str, JSONSerializableObject] | None, optional) – The extra keyword arguments used to initialize the DashScope realtime TTS client.

  • generate_kwargs (dict[str, JSONSerializableObject] | None, optional) – The extra keyword arguments used in DashScope realtime TTS API generation.

Return type:

None

async connect()[source]

Initialize the DashScope TTS model and establish connection.

Return type:

None

async close()[source]

Close the TTS model and clean up resources.

Return type:

None

async push(msg, **kwargs)[source]

Append text to be synthesized and return the received TTS response. Note that this method is non-blocking and may return an empty response if no audio has been received yet.

To receive all the synthesized speech, call the synthesize method after pushing all the text chunks.

Parameters:
  • msg (Msg) – The message to be synthesized. The msg.id identifies the streaming input request.

  • **kwargs (Any) – Additional keyword arguments to pass to the TTS API call.

Returns:

The TTSResponse containing an audio block.

Return type:

TTSResponse

async synthesize(msg=None, **kwargs)[source]

Synthesize speech from the given text, or, if msg is None, block until all previously pushed text has been synthesized.

Parameters:
  • msg (Msg | None, optional) – The message to be synthesized.

  • **kwargs (Any) – Additional keyword arguments to pass to the TTS API call.

Returns:

The TTSResponse object in non-streaming mode, or an async generator yielding TTSResponse objects in streaming mode.

Return type:

TTSResponse | AsyncGenerator[TTSResponse, None]
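A hedged end-to-end sketch of the realtime flow, combining cold-start buffering with the push/synthesize cycle (Msg usage and the accumulated-content convention are assumptions):

    import asyncio
    import os

    from agentscope.message import Msg
    from agentscope.tts import DashScopeRealtimeTTSModel

    async def main():
        tts = DashScopeRealtimeTTSModel(
            api_key=os.environ["DASHSCOPE_API_KEY"],
            model_name="qwen3-tts-flash-realtime",
            voice="Cherry",
            stream=False,                  # synthesize() returns one TTSResponse
            cold_start_length=20,          # buffer at least 20 characters first
        )
        async with tts:                    # connect() on enter, close() on exit
            msg = Msg(name="assistant", content="", role="assistant")
            for chunk in ["Streaming ", "text ", "to ", "speech."]:
                msg.content += chunk       # one request accumulates under one msg.id (assumption)
                await tts.push(msg)        # non-blocking; may return an empty response
            res = await tts.synthesize()   # block until all pushed text is synthesized
            print(res.content)

    asyncio.run(main())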

class GeminiTTSModel[source]

Bases: TTSModelBase

Gemini TTS model implementation. For more details, please see the official document.

supports_streaming_input: bool = False

Whether the model supports streaming input.

__init__(api_key, model_name='gemini-2.5-flash-preview-tts', voice='Kore', stream=True, client_kwargs=None, generate_kwargs=None)[source]

Initialize the Gemini TTS model.

Note

More details about the parameters, such as model_name and voice can be found in the official document.

Parameters:
  • api_key (str) – The Gemini API key.

  • model_name (str, defaults to “gemini-2.5-flash-preview-tts”) – The TTS model name. Supported models are “gemini-2.5-flash-preview-tts”, “gemini-2.5-pro-preview-tts”, etc.

  • voice (Literal[“Zephyr”, “Kore”, “Orus”, “Autonoe”] | str, defaults to “Kore”) – The voice name to use. Supported voices are “Zephyr”, “Kore”, “Orus”, “Autonoe”, etc.

  • stream (bool, defaults to True) – Whether to use streaming synthesis if supported by the model.

  • client_kwargs (dict[str, JSONSerializableObject] | None, optional) – The extra keyword arguments to initialize the Gemini client.

  • generate_kwargs (dict[str, JSONSerializableObject] | None, optional) – The extra keyword arguments used in Gemini API generation, e.g. temperature, seed.

Return type:

None

async synthesize(msg=None, **kwargs)[source]

Call the Gemini TTS API to synthesize speech from text.

Parameters:
  • msg (Msg | None, optional) – The message to be synthesized.

  • **kwargs (Any) – Additional keyword arguments to pass to the TTS API call.

Returns:

The TTSResponse object in non-streaming mode, or an async generator yielding TTSResponse objects in streaming mode.

Return type:

TTSResponse | AsyncGenerator[TTSResponse, None]
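A hedged sketch of consuming Gemini synthesis in streaming mode (Msg usage is an assumption; with stream=True the awaited result is taken to be an async generator, per the return type above):

    import asyncio
    import os

    from agentscope.message import Msg
    from agentscope.tts import GeminiTTSModel

    async def main():
        tts = GeminiTTSModel(
            api_key=os.environ["GEMINI_API_KEY"],
            model_name="gemini-2.5-flash-preview-tts",
            voice="Kore",
            stream=True,                   # synthesize() yields chunks
        )
        msg = Msg(name="assistant", content="Hello from Gemini!", role="assistant")
        gen = await tts.synthesize(msg)
        async for chunk in gen:            # each chunk is a TTSResponse
            print(chunk.is_last, chunk.content)

    asyncio.run(main())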

class OpenAITTSModel[source]

Bases: TTSModelBase

OpenAI TTS model implementation. For more details, please see the official document.

supports_streaming_input: bool = False

Whether the model supports streaming input.

__init__(api_key, model_name='gpt-4o-mini-tts', voice='alloy', stream=True, client_kwargs=None, generate_kwargs=None)[source]

Initialize the OpenAI TTS model.

Note

More details about the parameters, such as model_name and voice can be found in the official document.

Parameters:
  • api_key (str) – The OpenAI API key.

  • model_name (str, defaults to “gpt-4o-mini-tts”) – The TTS model name. Supported models are “gpt-4o-mini-tts”, “tts-1”, etc.

  • voice (Literal[“alloy”, “ash”, “ballad”, “coral”] | str, defaults to “alloy”) – The voice to use. Supported voices are “alloy”, “ash”, “ballad”, “coral”, etc.

  • stream (bool, defaults to True) – Whether to use streaming synthesis if supported by the model.

  • client_kwargs (dict | None, optional) – The extra keyword arguments used to initialize the OpenAI client.

  • generate_kwargs (dict[str, JSONSerializableObject] | None, optional) – The extra keyword arguments used in OpenAI API generation, e.g. temperature, seed.

Return type:

None

async synthesize(msg=None, **kwargs)[source]

Call the OpenAI TTS API to synthesize speech from text.

Parameters:
  • msg (Msg | None, optional) – The message to be synthesized.

  • **kwargs (Any) – Additional keyword arguments to pass to the TTS API call.

Returns:

The TTSResponse object in non-streaming mode, or an async generator yielding TTSResponse objects in streaming mode.

Return type:

TTSResponse | AsyncGenerator[TTSResponse, None]
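And a hedged OpenAI sketch in non-streaming mode (Msg usage is an assumption; usage may be None for some responses):

    import asyncio
    import os

    from agentscope.message import Msg
    from agentscope.tts import OpenAITTSModel

    async def main():
        tts = OpenAITTSModel(
            api_key=os.environ["OPENAI_API_KEY"],
            model_name="gpt-4o-mini-tts",
            voice="alloy",
            stream=False,                  # one blocking TTSResponse
        )
        msg = Msg(name="assistant", content="Hello from OpenAI!", role="assistant")
        res = await tts.synthesize(msg)
        print(res.content)                 # the synthesized AudioBlock
        if res.usage is not None:
            print(f"took {res.usage.time:.2f}s")

    asyncio.run(main())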