agentscope.tts

The TTS (Text-to-Speech) module.

class TTSModelBase[source]

Bases: ABC

Base class for TTS models in AgentScope.

This base class provides a general abstraction for both realtime and non-realtime TTS models, depending on whether streaming input is supported.

For non-realtime TTS models, the synthesize method synthesizes speech from the input text. You only need to implement the _call_api method to handle the TTS API calls.

For realtime TTS models, the lifecycle is managed via the async context manager or by calling the connect and close methods. The push method appends text chunks and returns the received TTS response, while the synthesize method blocks until the full speech is synthesized. You need to implement the connect, close, and _call_api methods to handle the TTS API calls and resource management.
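The realtime lifecycle (connect, repeated push, final blocking synthesize, close) can be illustrated with a minimal self-contained sketch. Note that DummyRealtimeTTS and the stripped-down Msg and TTSResponse below are simplified stand-ins for illustration only, not agentscope's real classes:

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Msg:
    """Stand-in for agentscope's Msg (id identifies the streaming request)."""
    id: str
    content: str


@dataclass
class TTSResponse:
    """Stand-in for agentscope's TTSResponse."""
    audio: bytes = b""
    is_last: bool = True


class DummyRealtimeTTS:
    """Sketch of the realtime lifecycle described above."""

    supports_streaming_input = True

    async def __aenter__(self):
        await self.connect()
        return self

    async def __aexit__(self, *exc):
        await self.close()

    async def connect(self):
        self._buffer = []  # pretend connection state

    async def close(self):
        self._buffer = None

    async def push(self, msg, **kwargs):
        # Non-blocking: queue the chunk and return whatever audio is
        # ready so far (here: an empty response, as the docs allow).
        self._buffer.append(msg.content)
        return TTSResponse(audio=b"", is_last=False)

    async def synthesize(self, msg=None, **kwargs):
        # Blocks until the full speech is synthesized.
        audio = " ".join(self._buffer).encode()
        return TTSResponse(audio=audio, is_last=True)


async def main():
    async with DummyRealtimeTTS() as tts:
        await tts.push(Msg(id="msg_1", content="Hello,"))
        await tts.push(Msg(id="msg_1", content="world!"))
        return await tts.synthesize()


resp = asyncio.run(main())
print(resp.audio, resp.is_last)  # b'Hello, world!' True
```

The async context manager guarantees that close runs even if a push raises, which is why the base class recommends it over calling connect and close by hand.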

supports_streaming_input: bool = False

Whether the TTS model class supports streaming input.

__init__(model_name, stream)[source]

Initialize the TTS model base class.

Parameters:
  • model_name (str) -- The name of the TTS model

  • stream (bool) -- Whether to use streaming synthesis if supported by the model.

Return type:

None

model_name: str

The name of the TTS model.

stream: bool

Whether to use streaming synthesis if supported by the model.

async connect()[source]

Connect to the TTS model and initialize resources. For non-realtime TTS models, leave this method empty.

Note

Only needs to be implemented for realtime TTS models.

Return type:

None

async close()[source]

Close the connection to the TTS model and clean up resources. For non-realtime TTS models, leave this method empty.

Note

Only needs to be implemented for realtime TTS models.

Return type:

None

async push(msg, **kwargs)[source]

Append text to be synthesized and return the received TTS response. Note this method is non-blocking and may return an empty response if no audio has been received yet.

To receive all the synthesized speech, call the synthesize method after pushing all the text chunks.

Note

Only needs to be implemented for realtime TTS models.

Parameters:
  • msg (Msg) -- The message to be synthesized. The msg.id identifies the streaming input request.

  • **kwargs (Any) -- Additional keyword arguments to pass to the TTS API call.

Returns:

The TTSResponse containing an audio block.

Return type:

TTSResponse

abstract async synthesize(msg=None, **kwargs)[source]

Synthesize speech from the appended text. Unlike the push method, this method blocks until the full speech is synthesized.

Parameters:
  • msg (Msg | None, defaults to None) -- The message to be synthesized. If None, this method will wait for all previously pushed text to be synthesized, and return the last synthesized TTSResponse.

  • **kwargs (Any) -- Additional keyword arguments to pass to the TTS API call.

Returns:

The TTSResponse containing audio blocks, or an async generator yielding TTSResponse objects in streaming mode.

Return type:

TTSResponse | AsyncGenerator[TTSResponse, None]
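In streaming mode the return value is an async generator, which is consumed with `async for`. The sketch below shows the consumption pattern using a stub generator of plain dicts in place of a real model's TTSResponse stream; only the `is_last` flag is modeled:

```python
import asyncio


# Stub standing in for a model's streaming synthesize() output; each
# dict mimics a TTSResponse, with is_last set on the final chunk.
async def fake_synthesize():
    for i in range(3):
        yield {"chunk": i, "is_last": i == 2}


async def collect_audio(stream):
    # Accumulate chunks until the response marked is_last arrives.
    chunks = []
    async for resp in stream:
        chunks.append(resp["chunk"])
        if resp["is_last"]:
            break
    return chunks


chunks = asyncio.run(collect_audio(fake_synthesize()))
print(chunks)  # [0, 1, 2]
```

With a real model the loop body would typically append each response's audio block to a playback buffer instead of collecting chunk indices.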

class TTSResponse[source]

Bases: DictMixin

The response of TTS models.

content: AudioBlock | None

The content of the TTS response, which contains the audio block.

__init__(content, id=<factory>, created_at=<factory>, type=<factory>, usage=<factory>, metadata=<factory>, is_last=True)
Parameters:
  • content (AudioBlock | None)

  • id (str)

  • created_at (str)

  • type (Literal['tts'])

  • usage (TTSUsage | None)

  • metadata (dict[str, str | int | float | bool | None | list[JSONSerializableObject] | dict[str, JSONSerializableObject]] | None)

  • is_last (bool)

Return type:

None

id: str

The unique identifier of the response.

created_at: str

When the response was created.

type: Literal['tts']

The type of the response, which is always 'tts'.

usage: TTSUsage | None

The usage information of the TTS response, if available.

metadata: dict[str, str | int | float | bool | None | list[JSONSerializableObject] | dict[str, JSONSerializableObject]] | None

The metadata of the TTS response.

is_last: bool = True

Whether this is the last response in a stream of TTS responses.

class TTSUsage[source]

Bases: DictMixin

The usage of a TTS model API invocation.

input_tokens: int

The number of input tokens.

output_tokens: int

The number of output tokens.

time: float

The time used in seconds.

type: Literal['tts']

The type of the usage, which must be 'tts'.

__init__(input_tokens, output_tokens, time, type=<factory>)
Parameters:
  • input_tokens (int)

  • output_tokens (int)

  • time (float)

  • type (Literal['tts'])

Return type:

None

class DashScopeTTSModel[source]

Bases: TTSModelBase

DashScope TTS model implementation using the MultiModalConversation API. For more details, please see the official document.

supports_streaming_input: bool = False

Whether the model supports streaming input.

__init__(api_key, model_name='qwen3-tts-flash', voice='Cherry', language_type='Auto', stream=True, generate_kwargs=None)[source]

Initialize the DashScope SDK TTS model.

Note

More details about the parameters, such as model_name, voice, and language_type, can be found in the official document.

Parameters:
  • api_key (str) -- The DashScope API key. Required.

  • model_name (str, defaults to "qwen3-tts-flash") -- The TTS model name. Supported models are qwen3-tts-flash, qwen-tts, etc.

  • voice (Literal["Cherry", "Serena", "Ethan", "Chelsie"] | str, defaults to "Cherry") -- The voice to use. Supported voices are "Cherry", "Serena", "Ethan", "Chelsie", etc.

  • language_type (str, defaults to "Auto") -- The language type. Should match the text language for correct pronunciation and natural intonation.

  • stream (bool, defaults to True) -- Whether to use streaming synthesis if supported by the model.

  • generate_kwargs (dict[str, JSONSerializableObject] | None, optional) -- The extra keyword arguments used in DashScope TTS API generation, e.g. temperature, seed.

Return type:

None

async synthesize(msg=None, **kwargs)[source]

Call the DashScope TTS API to synthesize speech from text.

Parameters:
  • msg (Msg | None, optional) -- The message to be synthesized.

  • **kwargs (Any) -- Additional keyword arguments to pass to the TTS API call.

Returns:

The TTS response or an async generator yielding TTSResponse objects in streaming mode.

Return type:

TTSResponse | AsyncGenerator[TTSResponse, None]

class DashScopeRealtimeTTSModel[source]

Bases: TTSModelBase

TTS implementation for the DashScope Qwen Realtime TTS API, which supports streaming input. The supported models include "qwen3-tts-flash-realtime", "qwen-tts-realtime", etc.

For more details, please see the official document.

Note

The DashScopeRealtimeTTSModel can only handle one streaming input request at a time, and cannot process multiple streaming input requests concurrently. For example, it cannot handle input sequences like [msg_1_chunk0, msg_1_chunk1, msg_2_chunk0], where the prefixes "msg_x" indicate different streaming input requests.

supports_streaming_input: bool = True

Whether the model supports streaming input.

__init__(api_key, model_name='qwen3-tts-flash-realtime', voice='Cherry', stream=True, cold_start_length=None, cold_start_words=None, client_kwargs=None, generate_kwargs=None)[source]

Initialize the DashScope TTS model by specifying the model, voice, and other parameters.

Note

More details about the parameters, such as model_name, voice, and mode, can be found in the official document.

Note

You can use cold_start_length and cold_start_words simultaneously to set both character and word thresholds for the first TTS request. For Chinese text, word segmentation (based on spaces) may not be effective.

Parameters:
  • api_key (str) -- The DashScope API key.

  • model_name (str, defaults to "qwen3-tts-flash-realtime") -- The TTS model name, e.g. "qwen3-tts-flash-realtime", "qwen-tts-realtime", etc.

  • voice (Literal["Cherry", "Serena", "Ethan", "Chelsie"] | str, defaults to "Cherry") -- The voice to use for synthesis. Refer to the official document for the supported voices for each model.

  • stream (bool, defaults to True) -- Whether to use streaming synthesis.

  • cold_start_length (int | None, optional) -- The minimum length threshold, in characters, for sending the first TTS request, which prevents pauses in the synthesized speech when the input text is too short.

  • cold_start_words (int | None, optional) -- The minimum word-count threshold for sending the first TTS request, which prevents pauses in the synthesized speech when the input text is too short. Words are identified by spaces in the text.

  • client_kwargs (dict[str, JSONSerializableObject] | None, optional) -- The extra keyword arguments to initialize the DashScope realtime TTS client.

  • generate_kwargs (dict[str, JSONSerializableObject] | None, optional) -- The extra keyword arguments used in DashScope realtime TTS API generation.

Return type:

None

async connect()[source]

Initialize the DashScope TTS model and establish the connection.

Return type:

None

async close()[source]

Close the TTS model and clean up resources.

Return type:

None

async push(msg, **kwargs)[source]

Append text to be synthesized and return the received TTS response. Note this method is non-blocking and may return an empty response if no audio has been received yet.

To receive all the synthesized speech, call the synthesize method after pushing all the text chunks.

Parameters:
  • msg (Msg) -- The message to be synthesized. The msg.id identifies the streaming input request.

  • **kwargs (Any) -- Additional keyword arguments to pass to the TTS API call.

Returns:

The TTSResponse containing audio blocks.

Return type:

TTSResponse

async synthesize(msg=None, **kwargs)[source]

Append text to be synthesized and return the TTS response.

Parameters:
  • msg (Msg | None, optional) -- The message to be synthesized.

  • **kwargs (Any) -- Additional keyword arguments to pass to the TTS API call.

Returns:

The TTSResponse object in non-streaming mode, or an async generator yielding TTSResponse objects in streaming mode.

Return type:

TTSResponse | AsyncGenerator[TTSResponse, None]

class GeminiTTSModel[source]

Bases: TTSModelBase

Gemini TTS model implementation. For more details, please see the official document.

supports_streaming_input: bool = False

Whether the model supports streaming input.

__init__(api_key, model_name='gemini-2.5-flash-preview-tts', voice='Kore', stream=True, client_kwargs=None, generate_kwargs=None)[source]

Initialize the Gemini TTS model.

Note

More details about the parameters, such as model_name and voice, can be found in the official document.

Parameters:
  • api_key (str) -- The Gemini API key.

  • model_name (str, defaults to "gemini-2.5-flash-preview-tts") -- The TTS model name. Supported models are "gemini-2.5-flash-preview-tts", "gemini-2.5-pro-preview-tts", etc.

  • voice (Literal["Zephyr", "Kore", "Orus", "Autonoe"] | str, defaults to "Kore") -- The voice name to use. Supported voices are "Zephyr", "Kore", "Orus", "Autonoe", etc.

  • stream (bool, defaults to True) -- Whether to use streaming synthesis if supported by the model.

  • client_kwargs (dict[str, JSONSerializableObject] | None, optional) -- The extra keyword arguments to initialize the Gemini client.

  • generate_kwargs (dict[str, JSONSerializableObject] | None, optional) -- The extra keyword arguments used in Gemini API generation, e.g. temperature, seed.

Return type:

None

async synthesize(msg=None, **kwargs)[source]

Append text to be synthesized and return the TTS response.

Parameters:
  • msg (Msg | None, optional) -- The message to be synthesized.

  • **kwargs (Any) -- Additional keyword arguments to pass to the TTS API call.

Returns:

The TTSResponse object in non-streaming mode, or an async generator yielding TTSResponse objects in streaming mode.

Return type:

TTSResponse | AsyncGenerator[TTSResponse, None]

class OpenAITTSModel[source]

Bases: TTSModelBase

OpenAI TTS model implementation. For more details, please see the official document.

supports_streaming_input: bool = False

Whether the model supports streaming input.

__init__(api_key, model_name='gpt-4o-mini-tts', voice='alloy', stream=True, client_kwargs=None, generate_kwargs=None)[source]

Initialize the OpenAI TTS model.

Note

More details about the parameters, such as model_name and voice, can be found in the official document.

Parameters:
  • api_key (str) -- The OpenAI API key.

  • model_name (str, defaults to "gpt-4o-mini-tts") -- The TTS model name. Supported models are "gpt-4o-mini-tts", "tts-1", etc.

  • voice (Literal["alloy", "ash", "ballad", "coral"] | str, defaults to "alloy") -- The voice to use. Supported voices are "alloy", "ash", "ballad", "coral", etc.

  • stream (bool, defaults to True) -- Whether to use streaming synthesis if supported by the model.

  • client_kwargs (dict | None, optional) -- The extra keyword arguments to initialize the OpenAI client.

  • generate_kwargs (dict[str, JSONSerializableObject] | None, optional) -- The extra keyword arguments used in OpenAI API generation, e.g. temperature, seed.

Return type:

None

async synthesize(msg=None, **kwargs)[source]

Append text to be synthesized and return the TTS response.

Parameters:
  • msg (Msg | None, optional) -- The message to be synthesized.

  • **kwargs (Any) -- Additional keyword arguments to pass to the TTS API call.

Returns:

The TTSResponse object in non-streaming mode, or an async generator yielding TTSResponse objects in streaming mode.

Return type:

TTSResponse | AsyncGenerator[TTSResponse, None]