agentscope.service.web.web_digest module

parsing and digesting the web pages

digest_webpage(web_text_or_url: str, model: ModelWrapperBase | None = None, html_selected_tags: Sequence[str] = ('h', 'p', 'li', 'div', 'a'), digest_prompt: str = "You're a web page analyser. You job is to extract importantand useful information from html or webpage description.\n") ServiceResponse[source]

Digest the given webpage.

Parameters:
  • web_text_or_url (str) – preprocessed web text or url to the web page

  • model (ModelWrapperBase) – the model to digest the web content

  • html_selected_tags (Sequence[str]) – the text in elements of html_selected_tags will be extracted and feed to the model

  • digest_prompt (str) – system prompt for the model to digest the web content

Returns:

If successful, ServiceResponse object is returned with content field filled with the model output.

Return type:

ServiceResponse

is_valid_url(url: str) bool[source]

Use urlparse to check if a URL is valid :param url: string to be checked :type url: str

Returns:

True if url is valid, False otherwise

Return type:

bool

load_web(url: str, keep_raw: bool = True, html_selected_tags: Sequence[str] | None = None, self_parse_func: Callable[[Response], Any] | None = None, timeout: int = 5) ServiceResponse[source]

Function for parsing and digesting the web page.

Parameters:
  • url (str) – the url of the web page

  • keep_raw (bool) – Whether to keep raw HTML. If True, the content is stored with key “raw”.

  • html_selected_tags (Optional[Sequence[str]]) – the text in elements of html_selected_tags will be extracted and stored with “html_to_text” key in return.

  • self_parse_func (Optional[Callable]) – if “self_parse_func” is not None, then the function will be invoked with the requests.Response as input. The result is stored with self_define_func key

  • timeout (int) – timeout parameter for requests.

Returns:

If successful, ServiceResponse object is returned with content field is a dict, where keys are subset of:

”raw”: exists if keep_raw is True, store raw HTML content`;

”self_define_func”: exists if self_parse_func is provided, store the return of self_define_func;

”html_to_text”: exists if html_selected_tags is provided and not empty;

”json”: exists if url links to a json webpage, then it is parsed as json.

For example, ServiceResponse.content field is

{
    "raw": xxxxx,
    "selected_tags_text": xxxxx
}

Return type:

ServiceResponse

parse_html_to_text(html_text: str, html_selected_tags: Sequence[str] | None = None) str[source]

Parse the obtained HTML file.

Parameters:
  • html_text (str) – HTML source code

  • html_selected_tags (Optional[Sequence[str]]) – the text in elements of html_selected_tags will be extracted and returned.

Returns:

If successful, ServiceResponse object is returned with content field is processed text content of the selected tags,

Return type:

ServiceResponse