agentscope.service.web.web_digest module
Parsing and digesting web pages.
- digest_webpage(web_text_or_url: str, model: ModelWrapperBase | None = None, html_selected_tags: Sequence[str] = ('h', 'p', 'li', 'div', 'a'), digest_prompt: str = "You're a web page analyser. You job is to extract importantand useful information from html or webpage description.\n") ServiceResponse [source]
Digest the given webpage.
- Parameters:
web_text_or_url (str) – preprocessed web text or url to the web page
model (ModelWrapperBase) – the model to digest the web content
html_selected_tags (Sequence[str]) – the text in elements of html_selected_tags will be extracted and fed to the model
digest_prompt (str) – system prompt for the model to digest the web content
- Returns:
If successful, a ServiceResponse object is returned with its content field filled with the model output.
- Return type:
ServiceResponse
- is_valid_url(url: str) bool [source]
Use urlparse to check if a URL is valid.
- Parameters:
url (str) – string to be checked
- Returns:
True if url is valid, False otherwise
- Return type:
bool
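The check described above can be sketched with the standard library alone. This is a minimal illustration, assuming that "valid" means the parsed URL has both a scheme and a network location; the name is_valid_url_sketch is hypothetical and the library's own implementation may differ.

```python
from urllib.parse import urlparse


def is_valid_url_sketch(url: str) -> bool:
    """Return True when the URL has both a scheme and a network location.

    Hypothetical stand-in for is_valid_url; the validity rule used here
    is an assumption, not taken from the library source.
    """
    try:
        result = urlparse(url)
    except ValueError:
        # urlparse raises ValueError on some malformed inputs
        return False
    return bool(result.scheme and result.netloc)
```

Under this rule a bare host name such as "example.com" is rejected, because urlparse finds no scheme in it.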
- load_web(url: str, keep_raw: bool = True, html_selected_tags: Sequence[str] | None = None, self_parse_func: Callable[[Response], Any] | None = None, timeout: int = 5) ServiceResponse [source]
Function for parsing and digesting the web page.
- Parameters:
url (str) – the url of the web page
keep_raw (bool) – Whether to keep raw HTML. If True, the content is stored with key “raw”.
html_selected_tags (Optional[Sequence[str]]) – the text in elements of html_selected_tags will be extracted and stored with “html_to_text” key in return.
self_parse_func (Optional[Callable]) – if “self_parse_func” is not None, the function will be invoked with the requests.Response as input. The result is stored with the “self_define_func” key.
timeout (int) – timeout parameter for requests.
- Returns:
If successful, a ServiceResponse object is returned whose content field is a dict, where the keys are a subset of:
“raw”: exists if keep_raw is True; stores the raw HTML content;
“self_define_func”: exists if self_parse_func is provided; stores the return value of self_parse_func;
“html_to_text”: exists if html_selected_tags is provided and not empty;
“json”: exists if url links to a JSON webpage, in which case it is parsed as JSON.
For example, the ServiceResponse.content field may be
{ "raw": xxxxx, "html_to_text": xxxxx }
- Return type:
ServiceResponse
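Because the returned dict carries only a subset of the documented keys, callers usually need to check which ones are present. The sketch below shows one way to do that on a plain dict; pick_text is a hypothetical helper, and the preference order (extracted text, then JSON, then raw HTML) is an assumption made for illustration.

```python
import json
from typing import Any, Mapping


def pick_text(content: Mapping[str, Any]) -> str:
    """Choose a text representation from a load_web-style content dict.

    Hypothetical helper; the key names follow the documentation above,
    the preference order is an assumption.
    """
    # Prefer the text extracted from the selected tags when it is non-empty
    if content.get("html_to_text"):
        return str(content["html_to_text"])
    # Fall back to the parsed JSON payload, serialized back to a string
    if "json" in content:
        return json.dumps(content["json"], ensure_ascii=False)
    # Last resort: the raw HTML, or an empty string if nothing was kept
    return str(content.get("raw", ""))
```

In practice the dict would come from the content field of the ServiceResponse returned by load_web.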
- parse_html_to_text(html_text: str, html_selected_tags: Sequence[str] | None = None) str [source]
Parse the obtained HTML file.
- Parameters:
html_text (str) – HTML source code
html_selected_tags (Optional[Sequence[str]]) – the text in elements of html_selected_tags will be extracted and returned.
- Returns:
The processed text content of the selected tags.
- Return type:
str
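The idea of extracting only the text inside selected tags can be sketched with the standard library's html.parser. This is an illustration under stated assumptions, not the library's implementation: the names _SelectedTagText and extract_selected_text are hypothetical, tags are matched by exact name, and the text pieces are joined with newlines.

```python
from html.parser import HTMLParser
from typing import List, Sequence


class _SelectedTagText(HTMLParser):
    """Collect text that appears inside any of the selected tags."""

    def __init__(self, tags: Sequence[str]) -> None:
        super().__init__()
        self.tags = set(tags)
        self.depth = 0  # how many selected tags are currently open
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text found inside a selected tag
        if self.depth > 0 and text:
            self.parts.append(text)


def extract_selected_text(html_text: str, tags: Sequence[str] = ("p", "li")) -> str:
    """Return the text inside the selected tags, one piece per line."""
    parser = _SelectedTagText(tags)
    parser.feed(html_text)
    parser.close()
    return "\n".join(parser.parts)
```

Tracking a depth counter rather than a boolean keeps nested selected tags (for example a p inside a li) from switching collection off too early.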