# Web Scraper

A purpose-built scraping tool, separate from the general-purpose `http_request` / `curl`. It exists because the agent does not want raw HTML; it wants the *article*.

## What it does

* Fetches a URL.
* Strips boilerplate (navigation, ads, footers, scripts).
* Returns clean text that the agent can reason over.
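
The stripping step above can be sketched with Python's standard-library `html.parser`. This is only an illustration of the idea, not the tool's actual implementation: the `SKIP_TAGS` set, the `TextExtractor` class, and the `extract_text` function are hypothetical names, and the tag-based heuristic is an assumption (it would not, for example, catch ads rendered in ordinary `<div>` elements).

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate (an assumption for this sketch).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}


class TextExtractor(HTMLParser):
    """Collects visible text while ignoring anything inside SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0          # > 0 while inside a skipped element
        self._chunks: list[str] = []  # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside skipped elements.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the text of `html` with boilerplate elements removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `extract_text` on a page whose body holds a `<nav>`, a paragraph, and a `<script>` would return only the paragraph text. A production extractor would likely need content-density heuristics on top of a tag list like this one.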

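## Guardrails

* Caps responses at 1 MB; larger pages are truncated rather than silently dropped.
* 20-second timeout, so slow servers do not stall the conversation.
* Subject to the same proxy and URL-guard rules as the other network tools.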
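
The size cap and timeout above might look roughly like the following sketch. The names `truncate_body` and `fetch_capped` are hypothetical, and reading "1 MB" as 1024 × 1024 bytes is an assumption; the page does not say how the limit is counted. The proxy and URL-guard checks are not shown.

```python
import urllib.request

MAX_BYTES = 1024 * 1024  # assumed interpretation of the 1 MB cap
TIMEOUT_SECONDS = 20     # timeout stated in the guardrails


def truncate_body(body: bytes, limit: int = MAX_BYTES) -> tuple[bytes, bool]:
    """Return the body capped at `limit` bytes and whether anything was cut."""
    if len(body) > limit:
        return body[:limit], True
    return body, False


def fetch_capped(url: str) -> tuple[bytes, bool]:
    """Fetch `url`, keeping at most MAX_BYTES of the response body."""
    # Note: urlopen's timeout applies to individual blocking socket
    # operations, not to the total wall-clock time of the request.
    with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
        # Read one byte past the cap so an oversized page can be detected.
        raw = resp.read(MAX_BYTES + 1)
    return truncate_body(raw)
```

Returning a truncation flag alongside the bytes lets the caller tell the agent that the page was cut short, which matches the "truncated rather than silently dropped" behavior described above.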

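## When to use it

* Reading articles, blog posts, documentation pages, and GitHub READMEs without the noise.
* Following up on a [web search](/openhuman/zh/gong-neng/native-tools/web-search.md) result.
* Summarizing a single page on demand.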

## See also

* [Web Search](/openhuman/zh/gong-neng/native-tools/web-search.md) - finds the URLs to feed into the scraper.
* [Smart Token Compression](/openhuman/zh/gong-neng/token-compression.md) - trims long pages before they reach the model.

