Web Crawler Tools — Safe & Structured Web Scraping
What are the Web Crawler Tools?
Langoedge provides powerful web crawling tools designed to retrieve real-time data from external websites. Rather than returning raw, messy HTML, these tools parse pages, strip away advertisements and navigation blocks, and present clean, structured Markdown directly to your AI agents.
The platform exposes two principal crawling tools:
- `crawl_url` (Native Adaptive Crawler): Built on high-performance crawling engines with built-in sandbox controls and security layers.
- `fire_crawl_url` (Firecrawl Integration): Integrates with external scraping APIs for handling complex rendering or bypassing strict anti-bot systems.
1. The Native Web Crawler (crawl_url)
The native web crawler is optimized for speed, deep crawling, and structural output. It supports crawling a single URL or executing a breadth-first search (BFS) across multiple pages.
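To make the breadth-first behaviour concrete, here is a minimal sketch of a BFS traversal over pages. It is not Langoedge's implementation: `bfs_crawl`, the `get_links` callback, and the `max_pages` cap are hypothetical names introduced only for illustration, and link extraction is stubbed out with an in-memory graph.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 10,  # hypothetical safety cap, not a documented parameter
) -> list[str]:
    """Return URLs in the order a breadth-first crawl would visit them."""
    seen = {start_url}
    queue = deque([start_url])
    visited: list[str] = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()  # FIFO queue gives level-by-level order
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # never enqueue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


# Stand-in for real link extraction: a tiny in-memory site map.
site = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": [],
}

order = bfs_crawl("https://example.com/", lambda url: site.get(url, []))
print(order)
```

Pages closest to the starting URL are visited first, which is why a breadth-first crawl tends to surface a site's top-level content before its deep links.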
Security & SSRF Protection
For enterprise security, the native crawler enforces strict Server-Side Request Forgery (SSRF) protections. The gateway verifies every destination IP address before connection:
- Blocked Ranges: All private IP subnets (e.g., `10.0.0.0/8`, `192.168.0.0/16`, `172.16.0.0/12`), loopbacks (`127.0.0.1`, `::1`), and cloud metadata services (`169.254.169.254`).
- Domain Isolation: Any attempt to query internal workspace microservices or hostnames (like `localhost` or cloud dashboard services) will raise an `UnsafeURLError` and halt execution instantly.
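The checks described above can be sketched with Python's standard `ipaddress` module. This is an illustration under stated assumptions, not the gateway's actual code: the `UnsafeURLError` class here is a stand-in for the error named above, `is_blocked_ip`, `resolve_host`, and `assert_safe_url` are hypothetical helpers, and blocking the whole `127.0.0.0/8` range (rather than only `127.0.0.1`) is a choice made for this sketch.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


class UnsafeURLError(ValueError):
    """Hypothetical stand-in for the error the crawler raises."""


# Ranges taken from the list above. Covering all of 127.0.0.0/8 instead of
# only 127.0.0.1 is an assumption of this sketch.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
        "127.0.0.0/8", "::1/128", "169.254.169.254/32",
    )
]


def is_blocked_ip(address: str) -> bool:
    """True if the address falls inside any blocked network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BLOCKED_NETWORKS)


def resolve_host(host: str) -> list[str]:
    """Return the host itself if it is an IP literal, else its DNS results."""
    try:
        ipaddress.ip_address(host)
        return [host]
    except ValueError:
        return [info[4][0] for info in socket.getaddrinfo(host, None)]


def assert_safe_url(url: str, resolve=resolve_host) -> None:
    """Raise UnsafeURLError if the URL points at a blocked destination."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise UnsafeURLError(f"Unsupported URL: {url}")
    if parts.hostname == "localhost":
        raise UnsafeURLError(f"Blocked hostname: {parts.hostname}")
    # Check every resolved address, since one hostname can map to several IPs.
    for address in resolve(parts.hostname):
        if is_blocked_ip(address):
            raise UnsafeURLError(f"Blocked address {address} for {url}")
```

Validating the resolved addresses, and not just the hostname string, matters because a public-looking domain can resolve to a private IP.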
Tool Parameters
When configuring a crawl_url action within a graph node, you can define the following settings:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | `string` | Yes | The starting HTTP/HTTPS link to crawl. |
| `url_patterns` | `array[string]` | No | Glob/Regex patterns to match. Only URLs matching these patterns will be crawled. |
| `allowed_domains` | `array[string]` | No | Whitelist of domains. The crawler will not exit these boundaries. |
| `blocked_domains` | `array[string]` | No | Blacklist of domains to skip. |
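As an illustration of how these parameters might combine, the sketch below builds a sample parameter set and applies the three filters to candidate URLs. The matching rules are assumptions made for this example (glob matching via `fnmatchcase`, subdomains counted as part of a listed domain, the block list taking precedence); `should_crawl` is a hypothetical helper, and the crawler's real semantics may differ.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Example values only; the keys mirror the parameter table above.
crawl_params = {
    "url": "https://example.com/docs/",
    "url_patterns": ["https://example.com/docs/*"],
    "allowed_domains": ["example.com"],
    "blocked_domains": ["ads.example.com"],
}


def should_crawl(candidate: str, params: dict) -> bool:
    """Decide whether a discovered link passes the configured filters."""
    host = urlsplit(candidate).hostname or ""

    def in_domains(domains: list[str]) -> bool:
        # Assumption: a listed domain also covers its subdomains.
        return any(host == d or host.endswith("." + d) for d in domains)

    if in_domains(params.get("blocked_domains", [])):
        return False  # assumption: the block list wins over the allow list
    allowed = params.get("allowed_domains")
    if allowed and not in_domains(allowed):
        return False
    patterns = params.get("url_patterns")
    if patterns and not any(fnmatchcase(candidate, p) for p in patterns):
        return False
    return True


print(should_crawl("https://example.com/docs/intro", crawl_params))
print(should_crawl("https://ads.example.com/docs/banner", crawl_params))
```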
Technical Output Structure
The tool returns a list of web page content blocks structured as follows:
    [
      {
        "title": "Page Title",
        "content": "## Section Heading\n\nThis is structured markdown text extracted from the page.",
        "metadata": {
          "source": "https://example.com/sub-page"
        }
      }
    ]
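A downstream node will typically walk this list and stitch the pages into a single context string. The sketch below shows one way to do that; the `Page` and `PageMetadata` types and the `to_context` helper are hypothetical names for this example, and the sample data simply mirrors the structure shown above.

```python
from typing import TypedDict


class PageMetadata(TypedDict):
    source: str


class Page(TypedDict):
    title: str
    content: str
    metadata: PageMetadata


def to_context(pages: list[Page]) -> str:
    """Join crawled pages into one Markdown string, keeping each source URL."""
    sections = [
        f"# {page['title']}\nSource: {page['metadata']['source']}\n\n{page['content']}"
        for page in pages
    ]
    return "\n\n---\n\n".join(sections)


# Sample data shaped like the documented output.
pages: list[Page] = [
    {
        "title": "Page Title",
        "content": "## Section Heading\n\nThis is structured markdown text extracted from the page.",
        "metadata": {"source": "https://example.com/sub-page"},
    }
]

context = to_context(pages)
print(context)
```

Keeping the `source` URL next to each block lets an agent cite where a passage came from.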
2. The Firecrawl Scraper (fire_crawl_url)
For web pages that rely heavily on client-side React/Vue hydration or require advanced proxy rotation to avoid bot detection, use the fire_crawl_url tool.
Tool Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | `string` | Yes | The web page URL to scrape. |
[!TIP]
Which one to choose?
Use `crawl_url` by default for standard blogs, documentation, and clean text resources. Switch to `fire_crawl_url` if the target site is a single-page app (SPA) that requires browser execution to render content.
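One way to apply this guidance in code is to try the native crawler first and fall back only when it returns no usable text, which is a common symptom of a page rendered in the browser. This is a sketch of that idea, not a documented platform feature: `crawl_with_fallback` is a hypothetical helper, and the two tools are passed in as plain callables so the example stays self-contained.

```python
from typing import Callable

Pages = list[dict]
CrawlFn = Callable[[str], Pages]


def crawl_with_fallback(url: str, crawl_url: CrawlFn, fire_crawl_url: CrawlFn) -> Pages:
    """Prefer the native crawler; fall back when it yields no text content."""
    pages = crawl_url(url)
    has_text = any(page.get("content", "").strip() for page in pages)
    return pages if has_text else fire_crawl_url(url)


# Stubs standing in for the real tools.
def native_stub(url: str) -> Pages:
    # Simulates an SPA shell: the page was fetched but holds no rendered text.
    return [{"title": "App", "content": "", "metadata": {"source": url}}]


def firecrawl_stub(url: str) -> Pages:
    return [{"title": "App", "content": "## Dashboard", "metadata": {"source": url}}]


result = crawl_with_fallback("https://example.com/app", native_stub, firecrawl_stub)
print(result[0]["content"])
```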