agentscope.rag¶
The retrieval-augmented generation (RAG) module in AgentScope.
- class ReaderBase[source]¶
Bases: object
The reader base class, which is responsible for reading the original data, splitting it into chunks, and converting each chunk into a Document object.
- class TextReader[source]¶
Bases: ReaderBase
The text reader that splits text into chunks by a fixed chunk size and chunk overlap.
- __init__(chunk_size=512, split_by='sentence')[source]¶
Initialize the text reader.
- Parameters:
chunk_size (int, defaults to 512) -- The size of each chunk, in number of characters.
split_by (Literal["char", "sentence", "paragraph"], defaults to "sentence") -- The unit used to split the text: "char", "sentence", or "paragraph". The "sentence" option is implemented with the "nltk" library, which only supports English text.
- Return type:
None
- async __call__(text)[source]¶
Read a text string, split it into chunks, and return a list of Document objects.
- Parameters:
text (str) -- The input text string, or a path to the local text file.
- Returns:
A list of Document objects, where the metadata contains the chunked text, doc id and chunk id.
- Return type:
list[Document]
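The fixed-size splitting described above can be illustrated with a minimal sketch. The helper `split_text` is hypothetical and is not AgentScope's implementation; in particular, cutting an over-long paragraph further by characters is an assumption made for this example, and the "sentence" mode is left out because it depends on nltk.

```python
def split_text(text: str, chunk_size: int = 512, split_by: str = "char") -> list[str]:
    """Hypothetical helper: split text into chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")

    if split_by == "char":
        # Cut the raw text every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    if split_by == "paragraph":
        # Treat blank lines as paragraph boundaries (an assumption of this sketch).
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks: list[str] = []
        for paragraph in paragraphs:
            # A paragraph longer than chunk_size is cut further by characters.
            chunks.extend(
                paragraph[i:i + chunk_size]
                for i in range(0, len(paragraph), chunk_size)
            )
        return chunks

    raise ValueError(f"unsupported split_by value: {split_by!r}")


char_chunks = split_text("abcdefghij", chunk_size=4, split_by="char")
paragraph_chunks = split_text("one\n\ntwo two", chunk_size=5, split_by="paragraph")
```

With the real reader, the equivalent call would be awaited, since `__call__` is declared `async`.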
- class PDFReader[source]¶
Bases: ReaderBase
The PDF reader that splits text into chunks by a fixed chunk size.
- __init__(chunk_size=512, split_by='sentence')[source]¶
Initialize the PDF reader.
- Parameters:
chunk_size (int, defaults to 512) -- The size of each chunk, in number of characters.
split_by (Literal["char", "sentence", "paragraph"], defaults to "sentence") -- The unit used to split the text: "char", "sentence", or "paragraph". The "sentence" option is implemented with the "nltk" library, which only supports English text.
- Return type:
None
- class ImageReader[source]¶
Bases: ReaderBase
A simple image reader that wraps the image into a Document object.
This class is only a simple implementation to support multimodal RAG.
- class DocMetadata[source]¶
Bases: DictMixin
The metadata of the document.
- content: TextBlock | ImageBlock | VideoBlock¶
The data content, e.g., text, image, video.
- doc_id: str¶
The document ID.
- chunk_id: int¶
The chunk ID.
- total_chunks: int¶
The total number of chunks.
- __init__(content, doc_id, chunk_id, total_chunks)¶
- Parameters:
content (TextBlock | ImageBlock | VideoBlock)
doc_id (str)
chunk_id (int)
total_chunks (int)
- Return type:
None
- class Document[source]¶
Bases: object
The data chunk.
- __init__(metadata, id=<factory>, embedding=<factory>, score=None)¶
- Parameters:
metadata (DocMetadata)
id (str)
embedding (List[float] | None)
score (float | None)
- Return type:
None
- metadata: DocMetadata¶
The metadata of the data chunk.
- id: str¶
The unique ID of the data chunk.
- embedding: List[float] | None¶
The embedding of the data chunk.
- score: float | None = None¶
The relevance score of the data chunk.
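To make the relationship between the two classes concrete, here is a small sketch that mirrors their fields with plain dataclasses. `SketchMetadata`, `SketchDocument`, and `make_chunks` are hypothetical stand-ins, not the AgentScope classes: the content is simplified to a plain string, and generating the ID with `uuid` and defaulting the embedding to `None` are assumptions about what the `<factory>` defaults produce.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchMetadata:
    """Hypothetical stand-in for DocMetadata (content simplified to str)."""
    content: str
    doc_id: str
    chunk_id: int
    total_chunks: int


@dataclass
class SketchDocument:
    """Hypothetical stand-in for Document."""
    metadata: SketchMetadata
    # Assumption: each chunk gets a fresh random ID when none is supplied.
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    # Assumption: the embedding stays empty until an embedding model fills it in.
    embedding: Optional[list[float]] = None
    # The score is only meaningful after a search has ranked the chunk.
    score: Optional[float] = None


def make_chunks(doc_id: str, pieces: list[str]) -> list[SketchDocument]:
    """Wrap each text piece of one source document as a chunk."""
    total = len(pieces)
    return [
        SketchDocument(SketchMetadata(piece, doc_id, index, total))
        for index, piece in enumerate(pieces)
    ]


docs = make_chunks("doc-1", ["first piece", "second piece"])
```

Every chunk of the same source shares `doc_id` and `total_chunks`, while `chunk_id` records its position and `id` identifies the chunk itself.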
- class VDBStoreBase[source]¶
Bases: object
The vector database store base class, serving as a middle layer between the knowledge base and the actual vector database implementation.
- abstract async add(documents, **kwargs)[source]¶
Record the documents into the vector database.
- Parameters:
documents (list[Document])
kwargs (Any)
- Return type:
None
- abstract async delete(*args, **kwargs)[source]¶
Delete documents from the vector database.
- Parameters:
args (Any)
kwargs (Any)
- Return type:
None
- abstract async search(query_embedding, limit, score_threshold=None, **kwargs)[source]¶
Retrieve relevant documents for the given query embedding.
- Parameters:
query_embedding (Embedding) -- The embedding of the query text.
limit (int) -- The number of relevant documents to retrieve.
score_threshold (float | None, optional) -- The threshold of the score to filter the results.
**kwargs (Any) -- Other keyword arguments for the vector database search API.
- Return type:
list[Document]
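The add/delete/search contract above can be sketched with a toy in-memory store. `InMemoryStore` and `Record` are hypothetical names that do not subclass the real `VDBStoreBase`. Ranking by cosine similarity, deleting by a list of IDs, and treating the threshold as inclusive are all assumptions made for this illustration.

```python
import asyncio
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical minimal record: an ID, an embedding, and a score slot."""
    id: str
    embedding: list[float]
    score: Optional[float] = None


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Toy store exposing the same three async operations."""

    def __init__(self) -> None:
        self._records: dict[str, Record] = {}

    async def add(self, documents: list[Record], **kwargs) -> None:
        for doc in documents:
            self._records[doc.id] = doc

    async def delete(self, ids: list[str], **kwargs) -> None:
        for record_id in ids:
            self._records.pop(record_id, None)

    async def search(self, query_embedding, limit, score_threshold=None, **kwargs):
        # Score every stored record against the query embedding.
        scored = [
            Record(r.id, r.embedding, cosine(query_embedding, r.embedding))
            for r in self._records.values()
        ]
        # Assumption: the threshold is inclusive in this sketch.
        if score_threshold is not None:
            scored = [r for r in scored if r.score >= score_threshold]
        scored.sort(key=lambda r: r.score, reverse=True)
        return scored[:limit]


async def demo() -> list[Record]:
    store = InMemoryStore()
    await store.add([
        Record("a", [1.0, 0.0]),
        Record("b", [0.0, 1.0]),
        Record("c", [1.0, 1.0]),
    ])
    return await store.search([1.0, 0.0], limit=2, score_threshold=0.5)


results = asyncio.run(demo())
```

Here record "b" is orthogonal to the query and falls below the threshold, so only "a" and "c" come back, best match first.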
- class QdrantStore[source]¶
Bases: VDBStoreBase
The Qdrant vector store implementation, supporting both local and remote Qdrant instances.
Note
In Qdrant, we use the payload field to store the metadata, including the document ID, chunk ID, and original content.
- __init__(location, collection_name, dimensions, distance='Cosine', client_kwargs=None, collection_kwargs=None)[source]¶
Initialize the Qdrant vector store.
- Parameters:
location (Literal[":memory:"] | str) -- The location of the Qdrant instance. Use ":memory:" for an in-memory Qdrant instance, a URL for a remote Qdrant instance (e.g. "http://localhost:6333"), or a path to a local directory.
collection_name (str) -- The name of the collection to store the embeddings.
dimensions (int) -- The dimension of the embeddings.
distance (Literal["Cosine", "Euclid", "Dot", "Manhattan"], defaults to "Cosine") -- The distance metric to use for the collection: "Cosine", "Euclid", "Dot", or "Manhattan".
client_kwargs (dict[str, Any] | None, optional) -- Other keyword arguments for the Qdrant client.
collection_kwargs (dict[str, Any] | None, optional) -- Other keyword arguments for creating the collection.
- Return type:
None
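As a rough guide to what the four `distance` options measure, the sketch below computes each metric in plain Python. The function names and the `METRICS` table are hypothetical and only illustrate the underlying formulas; Qdrant computes these internally, and its exact score conventions should be checked in its own documentation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Angle-based similarity in [-1, 1]; higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dot_product(a: list[float], b: list[float]) -> float:
    """Unnormalized similarity; higher means more similar."""
    return sum(x * y for x, y in zip(a, b))


def euclid_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance; lower means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a: list[float], b: list[float]) -> float:
    """Sum of absolute coordinate differences; lower means closer."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Keys match the literal values accepted by the `distance` parameter.
METRICS = {
    "Cosine": cosine_similarity,
    "Dot": dot_product,
    "Euclid": euclid_distance,
    "Manhattan": manhattan_distance,
}

a, b = [3.0, 4.0], [4.0, 3.0]
values = {name: metric(a, b) for name, metric in METRICS.items()}
```

Cosine ignores vector length, which is why it is a common default for text embeddings, while Dot is sensitive to magnitude.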
- async add(documents, **kwargs)[source]¶
Add embeddings to the Qdrant vector store.
- Parameters:
documents (list[Document]) -- A list of embedding records to be recorded in the Qdrant store.
kwargs (Any)
- Return type:
None
- async search(query_embedding, limit, score_threshold=None, **kwargs)[source]¶
Search relevant documents from the Qdrant vector store.
- Parameters:
query_embedding (Embedding) -- The embedding of the query text.
limit (int) -- The number of relevant documents to retrieve.
score_threshold (float | None, optional) -- The threshold of the score to filter the results.
**kwargs (Any) -- Other keyword arguments for the Qdrant client search API.
- Return type:
list[Document]
- class KnowledgeBase[source]¶
Bases: object
The knowledge base abstraction for retrieval-augmented generation (RAG).
The retrieve and add_documents methods need to be implemented in the subclasses. We also provide a quick method retrieve_knowledge that enables the agent to retrieve knowledge easily.
- __init__(embedding_store, embedding_model)[source]¶
Initialize the knowledge base.
- Parameters:
embedding_store (VDBStoreBase)
embedding_model (EmbeddingModelBase)
- Return type:
None
- embedding_store: VDBStoreBase¶
The embedding store for the knowledge base.
- embedding_model: EmbeddingModelBase¶
The embedding model for the knowledge base.
- abstract async retrieve(query, limit=5, score_threshold=None, **kwargs)[source]¶
Retrieve relevant documents by the given query.
- Parameters:
query (str) -- The query string to retrieve relevant documents.
limit (int, defaults to 5) -- The number of relevant documents to retrieve.
score_threshold (float | None, defaults to None) -- The score threshold to filter the retrieved documents. If provided, only documents with a score higher than the threshold will be returned.
**kwargs (Any) -- Other keyword arguments for the vector database search API.
- Return type:
list[Document]
- abstract async add_documents(documents, **kwargs)[source]¶
Add documents to the knowledge base, which will embed the documents and store them in the embedding store.
- Parameters:
documents (list[Document]) -- A list of documents to add.
kwargs (Any)
- Return type:
None
- async retrieve_knowledge(query, limit=5, score_threshold=None, **kwargs)[source]¶
Retrieve relevant documents from the knowledge base. Note that the query parameter directly affects the retrieval quality; for the same question, you can try several different queries to get the best results. Adjust the limit and score_threshold parameters to get more or fewer results.
- Parameters:
query (str) -- The query string, which should be specific and concise. For example, you should provide the specific name instead of "you", "my", "he", "she", etc.
limit (int, defaults to 5) -- The number of relevant documents to retrieve.
score_threshold (float | None, defaults to None) -- A threshold in [0, 1]; only documents with a relevance score above this threshold will be returned. Reduce this value to get more results.
kwargs (Any)
- Return type:
- class SimpleKnowledge[source]¶
A simple knowledge base implementation.
- async retrieve(query, limit=5, score_threshold=None, **kwargs)[source]¶
Retrieve relevant documents by the given query.
- Parameters:
query (str) -- The query string to retrieve relevant documents.
limit (int, defaults to 5) -- The number of relevant documents to retrieve.
score_threshold (float | None, defaults to None) -- The threshold of the score to filter the results.
**kwargs (Any) -- Other keyword arguments for the vector database search API.
- Returns:
A list of relevant documents.
- Return type:
list[Document]
TODO: handle the case when the query is too long.
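The overall flow of a knowledge base (embed on add, embed the query on retrieve, then rank and filter) can be sketched end to end in plain Python. `ToyKnowledge`, `toy_embed`, and `VOCAB` are hypothetical and are not AgentScope code: the bag-of-words "embedding" stands in for a real embedding model, and the in-process list stands in for a vector store.

```python
import asyncio
import math
from typing import Optional

# Hypothetical fixed vocabulary used by the toy embedding below.
VOCAB = ["qdrant", "vector", "store", "reader", "pdf", "chunk"]


def toy_embed(text: str) -> list[float]:
    """Count vocabulary terms; a stand-in for a real embedding model."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyKnowledge:
    """Toy knowledge base: embeds texts on add and ranks them on retrieve."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    async def add_documents(self, texts: list[str]) -> None:
        for text in texts:
            self._items.append((text, toy_embed(text)))

    async def retrieve(
        self,
        query: str,
        limit: int = 5,
        score_threshold: Optional[float] = None,
    ) -> list[tuple[float, str]]:
        query_embedding = toy_embed(query)
        scored = [(cosine(query_embedding, emb), text) for text, emb in self._items]
        # Keep only scores strictly higher than the threshold, if one is given.
        if score_threshold is not None:
            scored = [pair for pair in scored if pair[0] > score_threshold]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:limit]


async def demo() -> list[tuple[float, str]]:
    kb = ToyKnowledge()
    await kb.add_documents(
        ["qdrant vector store", "pdf reader chunk", "vector store chunk"]
    )
    return await kb.retrieve("qdrant vector", limit=2, score_threshold=0.3)


hits = asyncio.run(demo())
```

Lowering `score_threshold` or raising `limit` widens the result set, which matches the tuning advice given for `retrieve_knowledge`.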