agentscope.rag

The retrieval-augmented generation (RAG) module in AgentScope.

class ReaderBase[source]

Bases: object

The reader base class, which is responsible for reading the original data, splitting it into chunks, and converting each chunk into a Document object.

abstract async __call__(*args, **kwargs)[source]

The async call function that takes the input files and returns the resulting Document objects.

Parameters:
  • args (Any)

  • kwargs (Any)

Return type:

list[Document]

abstract get_doc_id(*args, **kwargs)[source]

Get a unique document ID for the input data. This method exposes the document ID generation logic to developers.

Returns:

A unique document ID for the input data.

Return type:

str

Parameters:
  • args (Any)

  • kwargs (Any)

class TextReader[source]

Bases: ReaderBase

The text reader that splits text into chunks by a fixed chunk size and chunk overlap.

__init__(chunk_size=512, split_by='sentence')[source]

Initialize the text reader.

Parameters:
  • chunk_size (int, default to 512) -- The size of each chunk, in number of characters.

  • split_by (Literal["char", "sentence", "paragraph"], default to "sentence") -- The unit to split the text, can be "char", "sentence", or "paragraph". The "sentence" option is implemented using the "nltk" library, which only supports English text.

Return type:

None

async __call__(text)[source]

Read a text string, split it into chunks, and return a list of Document objects.

Parameters:

text (str) -- The input text string, or a path to the local text file.

Returns:

A list of Document objects, where the metadata contains the chunked text, doc id and chunk id.

Return type:

list[Document]

get_doc_id(text)[source]

Get the document ID. This function can be used to check if the doc_id already exists in the knowledge base.

Parameters:

text (str)

Return type:

str
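
As a rough illustration of what splitting by a fixed chunk size with split_by="char" amounts to, the sketch below cuts a string into consecutive pieces and records a chunk id and total chunk count for each. It is a standalone stand-in, not AgentScope's implementation: the helper name chunk_by_chars, the 0-based chunk ids, and the plain-dict metadata are assumptions made for this example.

```python
def chunk_by_chars(text: str, chunk_size: int = 512) -> list:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Hypothetical metadata layout mirroring the DocMetadata field names.
    return [
        {"content": piece, "chunk_id": idx, "total_chunks": len(pieces)}
        for idx, piece in enumerate(pieces)
    ]


chunks = chunk_by_chars("abcdefghij", chunk_size=4)
print([c["content"] for c in chunks])
```

The last chunk may be shorter than chunk_size, which is why total_chunks is stored alongside each chunk id.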

class PDFReader[source]

Bases: ReaderBase

The PDF reader that splits text into chunks by a fixed chunk size.

__init__(chunk_size=512, split_by='sentence')[source]

Initialize the PDF reader.

Parameters:
  • chunk_size (int, default to 512) -- The size of each chunk, in number of characters.

  • split_by (Literal["char", "sentence", "paragraph"], default to "sentence") -- The unit to split the text, can be "char", "sentence", or "paragraph". The "sentence" option is implemented using the "nltk" library, which only supports English text.

Return type:

None

async __call__(pdf_path)[source]

Read a PDF file, split it into chunks, and return a list of Document objects.

Parameters:

pdf_path (str) -- The input PDF file path.

Return type:

list[Document]

get_doc_id(pdf_path)[source]

Get the document ID. This function can be used to check if the doc_id already exists in the knowledge base.

Parameters:

pdf_path (str)

Return type:

str

class ImageReader[source]

Bases: ReaderBase

A simple image reader that wraps the image into a Document object.

This class is only a simple implementation to support multimodal RAG.

async __call__(image_url)[source]

Read an image and return the wrapped Document object.

Parameters:

image_url (str | list[str]) -- The image URL(s) or path(s).

Returns:

A list of Document objects containing the image data.

Return type:

list[Document]

get_doc_id(image_path)[source]

Generate a document ID based on the image path.

Parameters:

image_path (str) -- The image path or URL.

Returns:

The generated document ID.

Return type:

str

class WordReader[source]

Bases: ReaderBase

The reader that supports reading text, image, and table content from Word documents (.docx files), and chunking the text content into smaller pieces.

Note

The table content can be extracted in Markdown or JSON format.

__init__(chunk_size=512, split_by='sentence', include_image=True, separate_table=False, table_format='markdown')[source]

Initialize the Word reader.

Parameters:
  • chunk_size (int, default to 512) -- The size of each chunk, in number of characters.

  • split_by (Literal["char", "sentence", "paragraph"], default to "sentence") -- The unit to split the text, can be "char", "sentence", or "paragraph". The "sentence" option is implemented using the "nltk" library, which only supports English text.

  • include_image (bool, default to True) -- Whether to include image content in the returned document. If enabled, the embedding model you use must support image input, e.g. DashScopeMultiModalEmbedding.

  • separate_table (bool, default to False) -- If True, each table is placed in its own chunk to avoid truncation. Note that a table exceeding the chunk size will still be truncated.

  • table_format (Literal["markdown", "json"], default to "markdown") -- The format used to extract table content. If a table cell contains a newline character (`\n`), the Markdown format may not render correctly. In that case, use the "json" format, which extracts the table as a JSON string of a list[list[str]] object.

Return type:

None
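
To see why the "json" table format is safer when cells contain newlines, the sketch below renders the same small table both ways. The to_markdown helper and the sample table are assumptions made for illustration; they are not the reader's actual extraction code.

```python
import json

# A table whose second row has a cell containing a newline character.
table = [["Name", "Notes"], ["Alice", "line one\nline two"]]


def to_markdown(rows):
    """Render rows as a simple Markdown table (first row is the header)."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


markdown_text = to_markdown(table)
json_text = json.dumps(table, ensure_ascii=False)

print(markdown_text)  # the data row is broken across two physical lines
print(json_text)      # the newline is escaped, so the table stays intact
```

Because json.dumps escapes the newline inside the string, json.loads recovers the original list[list[str]] exactly, whereas the Markdown row is split in two.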

async __call__(word_path)[source]

Read a Word document, split it into chunks, and return a list of Document objects. The text, image, and table content will be returned in the same order as they appear in the Word document.

Parameters:

word_path (str) -- The input Word document file path (.docx file).

Returns:

A list of Document objects, where the metadata contains the chunked text, doc id and chunk id.

Return type:

list[Document]

get_doc_id(word_path)[source]

Generate a document ID based on the Word file path.

Parameters:

word_path (str) -- The Word file path.

Returns:

The generated document ID.

Return type:

str

class ExcelReader[source]

Bases: ReaderBase

The Excel reader that supports reading text, image, and table content from Excel files (.xlsx, .xls files), and chunking the text content into smaller pieces.

Note

The table content can be extracted in Markdown or JSON format.

Markdown format example (include_cell_coordinates=False):

| Name  | Age | City     |
|-------|-----|----------|
| Alice | 25  | New York |
| Bob   | 30  | London   |

Markdown format example (include_cell_coordinates=True):

| [A1] Name  | [B1] Age | [C1] City     |
|------------|----------|---------------|
| [A2] Alice | [B2] 25  | [C2] New York |
| [A3] Bob   | [B3] 30  | [C3] London   |

JSON format example (include_cell_coordinates=False):

["Name", "Age", "City"]
["Alice", "25", "New York"]
["Bob", "30", "London"]

JSON format example (include_cell_coordinates=True):

{"A1": "Name", "B1": "Age", "C1": "City"}
{"A2": "Alice", "B2": "25", "C2": "New York"}
{"A3": "Bob", "B3": "30", "C3": "London"}

__init__(chunk_size=512, split_by='sentence', include_sheet_names=True, include_cell_coordinates=False, include_image=False, separate_sheet=False, separate_table=False, table_format='markdown')[source]

Initialize the Excel reader.

Parameters:
  • chunk_size (int, default to 512) -- The size of each chunk, in number of characters.

  • split_by (Literal["char", "sentence", "paragraph"], default to "sentence") -- The unit to split the text, can be "char", "sentence", or "paragraph". The "sentence" option is implemented using the "nltk" library, which only supports English text.

  • include_sheet_names (bool, default to True) -- Whether to include sheet names in the extracted text.

  • include_cell_coordinates (bool, default to False) -- Whether to include cell coordinates (e.g., A1, B2) in the extracted text.

  • include_image (bool, default to False) -- Whether to include image content in the document. If True, images will be extracted and included as base64-encoded images.

  • separate_sheet (bool, default to False) -- Whether to treat each sheet as a separate document. If True, each sheet will be extracted as a separate Document object instead of being merged together.

  • separate_table (bool, default to False) -- If True, each table is placed in its own chunk to avoid truncation. Note that a table exceeding the chunk size will still be truncated.

  • table_format (Literal["markdown", "json"], default to "markdown") -- The format used to extract table content. If a table cell contains a newline character (`\n`), the Markdown format may not render correctly. In that case, use the "json" format, which extracts the table as a JSON string of a list[list[str]] object.

Return type:

None
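
The include_cell_coordinates=True layout from the JSON example in the class note can be reproduced with a few lines of plain Python. This is only a sketch of the mapping from row and column positions to labels such as A1 or C3; the helper names column_letter and rows_with_coordinates are hypothetical and not part of AgentScope.

```python
def column_letter(index: int) -> str:
    """Convert a 0-based column index to an Excel column label (0 -> A, 26 -> AA)."""
    label = ""
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, 26)
        label = chr(ord("A") + rem) + label
    return label


def rows_with_coordinates(rows):
    """Map every cell value to its coordinate key, one dict per row."""
    return [
        {f"{column_letter(c)}{r + 1}": value for c, value in enumerate(row)}
        for r, row in enumerate(rows)
    ]


rows = [["Name", "Age", "City"], ["Alice", "25", "New York"]]
coords = rows_with_coordinates(rows)
print(coords[0])
```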

async __call__(excel_path)[source]

Read an Excel file, split it into chunks, and return a list of Document objects. The text, image, and table content will be returned in the same order as they appear in the Excel file.

Parameters:

excel_path (str) -- The input Excel file path (.xlsx or .xls file).

Returns:

A list of Document objects, where the metadata contains the chunked text, doc id and chunk id.

Return type:

list[Document]

get_doc_id(excel_path)[source]

Generate a unique document ID from the file path.

Parameters:

excel_path (str) -- The path to the Excel file.

Returns:

The document ID (SHA256 hash of the file path).

Return type:

str
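
A document ID derived from a SHA256 hash of the file path could look like the sketch below. The function name doc_id_from_path, the UTF-8 encoding, and the hexadecimal digest format are assumptions; the reference above only states that the ID is a SHA256 hash of the path.

```python
import hashlib


def doc_id_from_path(path: str) -> str:
    """Return a deterministic ID: the hex SHA256 digest of the path string."""
    return hashlib.sha256(path.encode("utf-8")).hexdigest()


first = doc_id_from_path("reports/q1.xlsx")
second = doc_id_from_path("reports/q1.xlsx")
other = doc_id_from_path("reports/q2.xlsx")
print(first)
```

Since the hash depends only on the path, re-reading the same file yields the same ID, which is what makes it usable for checking whether a document is already in the knowledge base.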

class PowerPointReader[source]

Bases: ReaderBase

The PowerPoint reader that supports reading text, image, and table content from PowerPoint presentations (.pptx files), and chunking the text content into smaller pieces.

Note

The table content can be extracted in Markdown or JSON format.

__init__(chunk_size=512, split_by='sentence', include_image=True, separate_slide=False, separate_table=False, table_format='markdown', slide_prefix='<slide index={index}>', slide_suffix='</slide>')[source]

Initialize the PowerPoint reader.

Parameters:
  • chunk_size (int, default to 512) -- The size of each chunk, in number of characters.

  • split_by (Literal["char", "sentence", "paragraph"], default to "sentence") -- The unit to split the text, can be "char", "sentence", or "paragraph". The "sentence" option is implemented using the "nltk" library, which only supports English text.

  • include_image (bool, default to True) -- Whether to include image content in the document. If True, images will be extracted and included as base64-encoded images.

  • separate_slide (bool, default to False) -- Whether to treat each slide as a separate document. If True, each slide will be extracted as a separate Document object instead of being merged together.

  • separate_table (bool, default to False) -- If True, each table is placed in its own chunk to avoid truncation. Note that a table exceeding the chunk size will still be truncated.

  • table_format (Literal["markdown", "json"], default to "markdown") -- The format used to extract table content. If a table cell contains a newline character (`\n`), the Markdown format may not render correctly. In that case, use the "json" format, which extracts the table as a JSON string of a list[list[str]] object.

  • slide_prefix (str | None, default to "<slide index={index}>") -- Optional prefix to add before each slide's content. Supports the {index} placeholder for the 1-based slide number. For example, "<slide index={index}>" will produce "<slide index=1>" for the first slide. If None, no prefix is added.

  • slide_suffix (str | None, default to "</slide>") -- Optional suffix to add after each slide's content. For example, "</slide>". If None, no suffix is added.

Return type:

None
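
The effect of slide_prefix and slide_suffix can be approximated with str.format, as in the sketch below. The wrap_slide helper and the choice to join the parts with newlines are assumptions for illustration; the reader's actual joining behavior is not described above.

```python
from typing import Optional


def wrap_slide(
    content: str,
    index: int,
    prefix: Optional[str] = "<slide index={index}>",
    suffix: Optional[str] = "</slide>",
) -> str:
    """Wrap one slide's text with an optional prefix and suffix."""
    parts = []
    if prefix is not None:
        # {index} is filled with the 1-based slide number.
        parts.append(prefix.format(index=index))
    parts.append(content)
    if suffix is not None:
        parts.append(suffix)
    return "\n".join(parts)


wrapped = wrap_slide("Quarterly results", 1)
bare = wrap_slide("Quarterly results", 2, prefix=None, suffix=None)
print(wrapped)
```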

async __call__(ppt_path)[source]

Read a PowerPoint file, split it into chunks, and return a list of Document objects. The text, image, and table content will be returned in the same order as they appear in the PowerPoint presentation.

Parameters:

ppt_path (str) -- The input PowerPoint file path (.pptx file).

Returns:

A list of Document objects, where the metadata contains the chunked text, doc id and chunk id.

Return type:

list[Document]

get_doc_id(ppt_path)[source]

Generate a unique document ID from the file path.

Parameters:

ppt_path (str) -- The path to the PowerPoint file.

Returns:

The document ID (SHA256 hash of the file path).

Return type:

str

class DocMetadata[source]

Bases: DictMixin

The metadata of the document.

content: TextBlock | ImageBlock | VideoBlock

The data content, e.g., text, image, video.

doc_id: str

The document ID.

chunk_id: int

The chunk ID.

total_chunks: int

The total number of chunks.

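__init__(content, doc_id, chunk_id, total_chunks)

Parameters:
  • content (TextBlock | ImageBlock | VideoBlock)

  • doc_id (str)

  • chunk_id (int)

  • total_chunks (int)

Return type: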

None

class Document[source]

Bases: object

The data chunk.

__init__(metadata, id=<factory>, embedding=<factory>, score=None)

Parameters:
  • metadata (DocMetadata)

  • id (str)

  • embedding (List[float] | None)

  • score (float | None)

Return type:

None

metadata: DocMetadata

The metadata of the data chunk.

id: str

The unique ID of the data chunk.

embedding: List[float] | None

The embedding of the data chunk.

score: float | None = None

The relevance score of the data chunk.

class VDBStoreBase[source]

Bases: object

The vector database store base class, serving as a middle layer between the knowledge base and the actual vector database implementation.

abstract async add(documents, **kwargs)[source]

Record the documents into the vector database.

Parameters:
  • documents (list[Document])

  • kwargs (Any)

Return type:

None

abstract async delete(*args, **kwargs)[source]

Delete documents from the vector database.

Parameters:
  • args (Any)

  • kwargs (Any)

Return type:

None

abstract async search(query_embedding, limit, score_threshold=None, **kwargs)[source]

Retrieve relevant documents for the given query embedding.

Parameters:
  • query_embedding (Embedding) -- The embedding of the query text.

  • limit (int) -- The number of relevant documents to retrieve.

  • score_threshold (float | None, optional) -- The threshold of the score to filter the results.

  • **kwargs (Any) -- Other keyword arguments for the vector database search API.

Return type:

list[Document]

get_client()[source]

Get the underlying vector database client, so that developers can access the full functionality of the vector database.

Return type:

Any
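
To make the add / search / delete / get_client contract concrete, here is a toy in-memory store with the same method names. It does not subclass the real VDBStoreBase and does not import AgentScope: the Doc dataclass, the InMemoryStore class, and the use of cosine similarity as the score are all assumptions chosen to keep the sketch self-contained.

```python
import asyncio
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for Document: an id, an embedding, and a score."""
    id: str
    embedding: list
    score: Optional[float] = None


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Toy store mirroring the add / search / delete / get_client interface."""

    def __init__(self) -> None:
        self._docs = {}

    async def add(self, documents, **kwargs) -> None:
        for doc in documents:
            self._docs[doc.id] = doc

    async def delete(self, ids) -> None:
        for doc_id in ids:
            self._docs.pop(doc_id, None)

    async def search(self, query_embedding, limit, score_threshold=None, **kwargs):
        scored = [
            Doc(d.id, d.embedding, cosine(query_embedding, d.embedding))
            for d in self._docs.values()
        ]
        if score_threshold is not None:
            scored = [d for d in scored if d.score >= score_threshold]
        scored.sort(key=lambda d: d.score, reverse=True)
        return scored[:limit]

    def get_client(self):
        # A real store would return its database client; here it is the dict.
        return self._docs


async def demo():
    store = InMemoryStore()
    await store.add([Doc("a", [1.0, 0.0]), Doc("b", [0.0, 1.0]), Doc("c", [1.0, 1.0])])
    return await store.search([1.0, 0.0], limit=2, score_threshold=0.5)


results = asyncio.run(demo())
print([(d.id, round(d.score, 3)) for d in results])
```

The threshold is applied before the limit here, so limit caps the number of results that already passed the score filter.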

class QdrantStore[source]

Bases: VDBStoreBase

The Qdrant vector store implementation, supporting both local and remote Qdrant instances.

Note

In Qdrant, we use the payload field to store the metadata, including the document ID, chunk ID, and original content.

__init__(location, collection_name, dimensions, distance='Cosine', client_kwargs=None, collection_kwargs=None)[source]

Initialize the Qdrant vector store.

Parameters:
  • location (Literal[":memory:"] | str) -- The location of the Qdrant instance. Use ":memory:" for an in-memory Qdrant instance, a URL for a remote Qdrant instance, e.g. "http://localhost:6333", or a path to a directory.

  • collection_name (str) -- The name of the collection to store the embeddings.

  • dimensions (int) -- The dimension of the embeddings.

  • distance (Literal["Cosine", "Euclid", "Dot", "Manhattan"], default to "Cosine") -- The distance metric to use for the collection. Can be one of "Cosine", "Euclid", "Dot", or "Manhattan".

  • client_kwargs (dict[str, Any] | None, optional) -- Other keyword arguments for the Qdrant client.

  • collection_kwargs (dict[str, Any] | None, optional) -- Other keyword arguments for creating the collection.

Return type:

None

async add(documents, **kwargs)[source]

Add embeddings to the Qdrant vector store.

Parameters:
  • documents (list[Document]) -- A list of embedding records to be recorded in the Qdrant store.

  • kwargs (Any)

Return type:

None

async search(query_embedding, limit, score_threshold=None, **kwargs)[source]

Search relevant documents from the Qdrant vector store.

Parameters:
  • query_embedding (Embedding) -- The embedding of the query text.

  • limit (int) -- The number of relevant documents to retrieve.

  • score_threshold (float | None, optional) -- The threshold of the score to filter the results.

  • **kwargs (Any) -- Other keyword arguments for the Qdrant client search API.

Return type:

list[Document]

async delete(*args, **kwargs)[source]

Delete is not implemented for QdrantStore.

Parameters:
  • args (Any)

  • kwargs (Any)

Return type:

None

get_client()[source]

Get the underlying Qdrant client, so that developers can access the full functionality of Qdrant.

Returns:

The underlying Qdrant client.

Return type:

AsyncQdrantClient

class MilvusLiteStore[source]

Bases: VDBStoreBase

The Milvus Lite vector store implementation, supporting both local and remote Milvus instances.

Note

In Milvus Lite, we use the scalar fields to store the metadata, including the document ID, chunk ID, and original content. The new MilvusClient API is used for simplified operations.

Note

As of 2025-10-21, Milvus Lite is not supported on Windows.

__init__(uri, collection_name, dimensions, distance='COSINE', token='', client_kwargs=None, collection_kwargs=None)[source]

Initialize the Milvus Lite vector store.

Parameters:
  • uri (str) -- The URI of the Milvus instance. For Milvus Lite, use a local file path like "./milvus_demo.db". For remote Milvus server, use URI like "http://localhost:19530".

  • collection_name (str) -- The name of the collection to store the embeddings.

  • dimensions (int) -- The dimension of the embeddings.

  • distance (Literal["COSINE", "L2", "IP"], default to "COSINE") -- The distance metric to use for the collection. Can be one of "COSINE", "L2", or "IP". Defaults to "COSINE".

  • token (str, defaults to "") -- The token for authentication when connecting to remote Milvus. Format: "username:password". Not needed for Milvus Lite.

  • client_kwargs (dict[str, Any] | None, optional) -- Other keyword arguments for the Milvus client.

  • collection_kwargs (dict[str, Any] | None, optional) -- Other keyword arguments for creating the collection.

Return type:

None

async add(documents, **kwargs)[source]

Add embeddings to the Milvus vector store.

Parameters:
  • documents (list[Document]) -- A list of embedding records to be recorded in the Milvus store.

  • **kwargs (Any) -- Additional arguments for the insert operation.

Return type:

None

async search(query_embedding, limit, score_threshold=None, **kwargs)[source]

Search relevant documents from the Milvus vector store.

Parameters:
  • query_embedding (Embedding) -- The embedding of the query text.

  • limit (int) -- The number of relevant documents to retrieve.

  • score_threshold (float | None, optional) -- The threshold of the score to filter the results.

  • **kwargs (Any) -- Additional arguments for the Milvus client search API. Supported keys include filter (str), an expression to filter the search results, and output_fields (list[str]), the fields to include in results.

Return type:

list[Document]

async delete(ids=None, filter=None, **kwargs)[source]

Delete documents from the Milvus vector store.

Parameters:
  • ids (list[str] | None, optional) -- List of entity IDs to delete.

  • filter (str | None, optional) -- Expression to filter documents to delete.

  • **kwargs (Any) -- Additional arguments for the delete operation.

Return type:

None

get_client()[source]

Get the underlying Milvus client, so that developers can access the full functionality of Milvus.

Returns:

The underlying Milvus client.

Return type:

MilvusClient

class MongoDBStore[source]

Bases: VDBStoreBase

MongoDB vector store using MongoDB Vector Search.

This class provides a vector database store implementation using MongoDB's vector search capabilities. It requires MongoDB with vector search support and creates vector search indexes automatically.

Note

Ensure your MongoDB instance supports Vector Search functionality.

Note

The store automatically creates the database, collection, and vector search index on first operation. No manual initialization is required.

__init__(host, db_name, collection_name, dimensions, index_name='vector_index', distance='cosine', filter_fields=None, client_kwargs=None, db_kwargs=None, collection_kwargs=None)[source]

Initialize the MongoDB vector store.

Parameters:
  • host (str) -- MongoDB connection host, e.g., "mongodb://localhost:27017" or "mongodb+srv://cluster.mongodb.net/".

  • db_name (str) -- Database name to store vector documents.

  • collection_name (str) -- Collection name to store vector documents.

  • dimensions (int) -- Embedding dimensions for the vector search index.

  • index_name (str, defaults to "vector_index") -- Vector search index name.

  • distance (Literal["cosine", "euclidean", "dotProduct"], defaults to "cosine") -- Distance metric for vector similarity. Can be one of "cosine", "euclidean", or "dotProduct".

  • filter_fields (list[str] | None, optional) -- List of field paths to index for filtering in $vectorSearch. For example: ["payload.doc_id", "payload.chunk_id"]. These fields can then be used in the filter parameter of the search method. MongoDB $vectorSearch filter supports: $gt, $gte, $lt, $lte, $eq, $ne, $in, $nin, $exists, $not.

  • client_kwargs (dict[str, Any] | None, optional) -- Additional kwargs for MongoDB client.

  • db_kwargs (dict[str, Any] | None, optional) -- Additional kwargs for database.

  • collection_kwargs (dict[str, Any] | None, optional) -- Additional kwargs for collection.

Raises:

ImportError -- If pymongo is not installed.

Return type:

None

async add(documents, **kwargs)[source]

Insert documents with embeddings into MongoDB.

This method automatically creates the database, collection, and vector search index if they don't exist.

Parameters:
  • documents (list[Document]) -- List of Document objects to insert.

  • **kwargs (Any) -- Additional arguments (unused).

Return type:

None

Note

Each inserted record has structure:

{
    "id": str,                # Document ID
    "vector": list[float],    # Vector embedding
    "payload": dict,          # DocMetadata as dict
}

async search(query_embedding, limit, score_threshold=None, **kwargs)[source]

Search relevant documents using MongoDB Vector Search.

This method uses MongoDB's $vectorSearch aggregation pipeline for vector similarity search. It automatically waits for the vector search index to be ready before performing the search.

Parameters:
  • query_embedding (Embedding) -- The embedding vector to search for.

  • limit (int) -- Maximum number of documents to return.

  • score_threshold (float | None, optional) -- Minimum similarity score threshold. Documents with scores below this threshold will be filtered out.

  • **kwargs (Any) -- Additional arguments for the search operation.

Returns:

List of Document objects with embedding, score, and metadata.

Return type:

list[Document]

Note

  • Requires MongoDB with vector search support

  • Uses $vectorSearch aggregation pipeline
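
For orientation, a $vectorSearch aggregation pipeline has roughly the shape sketched below. How MongoDBStore actually assembles its pipeline is not shown in this reference, so the helper name build_vector_search_pipeline, the numCandidates heuristic, and the projected fields are assumptions; the "vector", "id", and "payload" field names follow the record structure noted under add, and the stage keys are standard $vectorSearch fields.

```python
def build_vector_search_pipeline(
    query_embedding,
    limit,
    index_name="vector_index",
    path="vector",
    num_candidates=None,
    score_threshold=None,
    filter=None,
):
    """Build an aggregation pipeline (a list of stage dicts) for $vectorSearch."""
    stage = {
        "index": index_name,
        "path": path,
        "queryVector": list(query_embedding),
        # Assumed heuristic: consider 10x more candidates than results returned.
        "numCandidates": num_candidates or limit * 10,
        "limit": limit,
    }
    if filter:
        stage["filter"] = filter
    pipeline = [
        {"$vectorSearch": stage},
        {
            "$project": {
                "id": 1,
                "vector": 1,
                "payload": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
    if score_threshold is not None:
        # Drop results whose similarity score falls below the threshold.
        pipeline.append({"$match": {"score": {"$gte": score_threshold}}})
    return pipeline


pipeline = build_vector_search_pipeline(
    [0.1, 0.2, 0.3],
    limit=3,
    score_threshold=0.7,
    filter={"payload.doc_id": {"$eq": "doc-1"}},
)
print(pipeline)
```

A filter such as the one above only works on fields that were indexed for filtering, which is the role of the filter_fields constructor argument.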

async delete(ids=None)[source]

Delete documents from the MongoDB collection.

Parameters:

ids (str | list[str] | None, optional) -- A document ID or a list of document IDs to delete. If provided, deletes documents with matching doc_id in payload.

Return type:

None

get_client()[source]

Get the underlying MongoDB client for advanced operations.

Returns:

The AsyncMongoClient instance.

Return type:

AsyncMongoClient

async delete_collection()[source]

Delete the entire collection.

Warning

This will permanently delete all documents in the collection.

Return type:

None

async delete_database()[source]

Delete the entire database.

Warning

This will permanently delete the entire database and all its collections.

Return type:

None

async close()[source]

Close the MongoDB connection.

This should be called when the store is no longer needed to properly clean up resources.

Return type:

None

class AlibabaCloudMySQLStore[source]

Bases: VDBStoreBase

The AlibabaCloud MySQL vector store implementation, supporting vector search operations using MySQL's native vector functions.

Note

AlibabaCloud MySQL vector search requires MySQL 8.0+.

This implementation uses MySQL's native vector functions (VEC_DISTANCE_COSINE, VEC_DISTANCE_EUCLIDEAN, VEC_FROMTEXT) for efficient vector similarity search with ORDER BY in SQL. Only COSINE and EUCLIDEAN distance metrics are supported.

Note

Requires the mysql-connector-python package. Install with:

pip install mysql-connector-python

Note

For AlibabaCloud MySQL instances, ensure the vector search plugin is enabled. Contact AlibabaCloud support if needed.

__init__(host, port, user, password, database, table_name, dimensions, distance='COSINE', hnsw_m=16, connection_kwargs=None)[source]

Initialize the AlibabaCloud MySQL vector store.

Parameters:
  • host (str) -- The hostname of the AlibabaCloud MySQL server. Example: "rm-xxxxx.mysql.rds.aliyuncs.com"

  • port (int) -- The port number of the MySQL server (typically 3306).

  • user (str) -- The username for authentication.

  • password (str) -- The password for authentication.

  • database (str) -- The database name to use.

  • table_name (str) -- The name of the table to store the embeddings.

  • dimensions (int) -- The dimension of the embeddings.

  • distance (Literal["COSINE", "EUCLIDEAN"], default to "COSINE") -- The distance metric to use for similarity search. Can be one of "COSINE" (cosine similarity) or "EUCLIDEAN" (Euclidean distance). Defaults to "COSINE".

  • hnsw_m (int, default to 16) -- The M parameter for HNSW vector index, which controls the number of bi-directional links created for each node during construction. Higher values create denser graphs with better recall but use more memory. Typical values range from 4 to 64. Defaults to 16.

  • connection_kwargs (dict[str, Any] | None, optional) -- Other keyword arguments for the MySQL connector. Example: {"ssl_ca": "/path/to/ca.pem", "charset": "utf8mb4"}

Return type:

None

async add(documents, **kwargs)[source]

Add embeddings to the AlibabaCloud MySQL vector store.

Parameters:
  • documents (list[Document]) -- A list of embedding records to be recorded in the MySQL store.

  • **kwargs (Any) -- Additional arguments for the insert operation.

Return type:

None

async search(query_embedding, limit, score_threshold=None, **kwargs)[source]

Search relevant documents from the AlibabaCloud MySQL vector store.

Parameters:
  • query_embedding (Embedding) -- The embedding of the query text.

  • limit (int) -- The number of relevant documents to retrieve.

  • score_threshold (float | None, optional) -- The minimum similarity score threshold to filter the results. Score is calculated as 1 - distance, where higher scores indicate higher similarity. Only documents with score >= score_threshold will be returned.

  • **kwargs (Any) -- Additional arguments for the search operation. Supported keys include filter (str), a WHERE clause to filter the search results.

Return type:

list[Document]
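
The relationship between distance, score, and score_threshold described for search can be checked with a small calculation. The sketch below applies score = 1 - distance to hypothetical (id, distance) rows and keeps those at or above the threshold; the filter_by_score helper and the sample rows are illustrative assumptions, not the store's SQL.

```python
def filter_by_score(rows, score_threshold=None):
    """Convert (id, distance) rows to (id, score) and apply the threshold."""
    results = []
    for doc_id, distance in rows:
        score = 1.0 - distance  # higher score means higher similarity
        if score_threshold is None or score >= score_threshold:
            results.append((doc_id, score))
    return results


rows = [("a", 0.1), ("b", 0.45), ("c", 0.9)]
kept = filter_by_score(rows, score_threshold=0.5)
print(kept)
```

With a threshold of 0.5, rows "a" (score 0.9) and "b" (score 0.55) are kept and "c" (score 0.1) is dropped.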

async delete(ids=None, filter=None, **kwargs)[source]

Delete documents from the AlibabaCloud MySQL vector store.

Parameters:
  • ids (list[str] | None, optional) -- List of entity IDs to delete.

  • filter (str | None, optional) -- WHERE clause expression to filter documents to delete.

  • **kwargs (Any) -- Additional arguments for the delete operation.

Return type:

None

get_client()[source]

Get the underlying MySQL connection, so that developers can access the full functionality of AlibabaCloud MySQL.

Returns:

The underlying MySQL connection.

Return type:

MySQLConnection

close()[source]

Close the database connection.

Return type:

None

class KnowledgeBase[source]

Bases: object

The knowledge base abstraction for retrieval-augmented generation (RAG).

The retrieve and add_documents methods need to be implemented in the subclasses. We also provide a quick method retrieve_knowledge that enables the agent to retrieve knowledge easily.

__init__(embedding_store, embedding_model)[source]

Initialize the knowledge base.

Parameters:
  • embedding_store (VDBStoreBase)

  • embedding_model (EmbeddingModelBase)

Return type:

None

embedding_store: VDBStoreBase

The embedding store for the knowledge base.

embedding_model: EmbeddingModelBase

The embedding model for the knowledge base.

abstract async retrieve(query, limit=5, score_threshold=None, **kwargs)[source]

Retrieve relevant documents by the given query.

Parameters:
  • query (str) -- The query string to retrieve relevant documents.

  • limit (int, defaults to 5) -- The number of relevant documents to retrieve.

  • score_threshold (float | None, defaults to None) -- The score threshold to filter the retrieved documents. If provided, only documents with a score higher than the threshold will be returned.

  • **kwargs (Any) -- Other keyword arguments for the vector database search API.

Return type:

list[Document]

abstract async add_documents(documents, **kwargs)[source]

Add documents to the knowledge base, which will embed the documents and store them in the embedding store.

Parameters:
  • documents (list[Document]) -- A list of documents to add.

  • kwargs (Any)

Return type:

None

async retrieve_knowledge(query, limit=5, score_threshold=None, **kwargs)[source]

Retrieve relevant documents from the knowledge base. Retrieval quality depends directly on the query parameter, so for the same question you can try several different queries to get the best results. Adjust the limit and score_threshold parameters to get more or fewer results.

Parameters:
  • query (str) -- The query string, which should be specific and concise. For example, you should provide the specific name instead of "you", "my", "he", "she", etc.

  • limit (int, defaults to 5) -- The number of relevant documents to retrieve.

  • score_threshold (float | None, defaults to None) -- A threshold in [0, 1]; only documents whose relevance score is above this threshold will be returned. Reduce this value to get more results.

  • kwargs (Any)

Return type:

ToolResponse

class SimpleKnowledge[source]

Bases: KnowledgeBase

A simple knowledge base implementation.

async retrieve(query, limit=5, score_threshold=None, **kwargs)[source]

Retrieve relevant documents by the given query.

Parameters:
  • query (str) -- The query string to retrieve relevant documents.

  • limit (int, defaults to 5) -- The number of relevant documents to retrieve.

  • score_threshold (float | None, defaults to None) -- The threshold of the score to filter the results.

  • **kwargs (Any) -- Other keyword arguments for the vector database search API.

Returns:

A list of relevant documents.

Return type:

list[Document]

TODO: handle the case when the query is too long.

async add_documents(documents, **kwargs)[source]

Add documents to the knowledge base.

Parameters:
  • documents (list[Document]) -- The list of documents to add.

  • kwargs (Any)

Return type:

None
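
The overall flow of a knowledge base (embed on add_documents, embed the query on retrieve, then rank by similarity) can be sketched without any external service. Everything below is a toy stand-in: ToyKnowledge, the word-count toy_embed function, and the fixed VOCAB are assumptions for illustration and do not reflect how SimpleKnowledge or a real embedding model behaves.

```python
import asyncio
import math

VOCAB = ["cat", "dog", "fish"]  # hypothetical vocabulary for the toy embedding


def toy_embed(text: str) -> list:
    """Embed text as word counts over VOCAB (a stand-in for a real model)."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyKnowledge:
    """Minimal knowledge base: embed on add, embed the query on retrieve."""

    def __init__(self) -> None:
        self._records = []  # list of (text, embedding) pairs

    async def add_documents(self, texts) -> None:
        for text in texts:
            self._records.append((text, toy_embed(text)))

    async def retrieve(self, query, limit=5, score_threshold=None):
        q = toy_embed(query)
        scored = [(text, cosine(q, emb)) for text, emb in self._records]
        if score_threshold is not None:
            # Keep only scores strictly above the threshold.
            scored = [item for item in scored if item[1] > score_threshold]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:limit]


async def demo():
    kb = ToyKnowledge()
    await kb.add_documents(["the cat sat", "a dog and a cat", "fish swim"])
    return await kb.retrieve("cat", limit=2, score_threshold=0.5)


hits = asyncio.run(demo())
print(hits)
```

In the real classes the embedding step is delegated to embedding_model and the ranking to embedding_store, so the knowledge base itself stays a thin coordinator.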