
Web Crawler Tools — Safe & Structured Web Scraping

Langoedge Team, 3 min read

What are the Web Crawler Tools?

Langoedge provides web crawling tools that retrieve real-time data from external websites. Rather than returning raw HTML, these tools parse each page, strip out advertisements and navigation blocks, and pass clean, structured Markdown directly to your AI agents.

The platform exposes two principal crawling tools:

  1. crawl_url (Native Adaptive Crawler): Built on high-performance crawling engines with built-in sandbox controls and security layers.
  2. fire_crawl_url (Firecrawl Integration): Integrates with external scraping APIs for handling complex rendering or bypassing strict anti-bot systems.

1. The Native Web Crawler (crawl_url)

The native web crawler is optimized for speed, deep crawling, and structural output. It supports crawling a single URL or executing a breadth-first search (BFS) across multiple pages.
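
To make the breadth-first strategy concrete, here is a minimal sketch of a depth-limited BFS over a toy link graph. It is an illustration only, not the crawler's actual implementation: `bfs_crawl`, `get_links`, `max_depth`, and the `LINKS` data are hypothetical names introduced for this example, and no network requests are made.

```python
from collections import deque
from urllib.parse import urlparse


def bfs_crawl(start_url, get_links, max_depth=2, allowed_domains=None):
    """Visit pages level by level and return (url, depth) pairs in visit order.

    get_links is a caller-supplied function (hypothetical here) that
    returns the outgoing links of a page.
    """
    visited = {start_url}
    order = []
    queue = deque([(start_url, 0)])

    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            host = urlparse(link).hostname
            if allowed_domains is not None and host not in allowed_domains:
                continue  # stay inside the allowed domains
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# Toy link graph standing in for real HTTP fetches (assumed data).
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c", "https://tracker.example.net/x"],
    "https://example.com/b": ["https://example.com/a"],
    "https://example.com/c": ["https://example.com/d"],
}

pages = bfs_crawl(
    "https://example.com/",
    get_links=lambda url: LINKS.get(url, []),
    max_depth=2,
    allowed_domains={"example.com"},
)
```

Because the queue is first-in, first-out, every page at depth 1 is visited before any page at depth 2. In this toy graph the external tracker link is skipped, and `https://example.com/d` is never reached because it sits beyond the depth limit.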

Security & SSRF Protection

For enterprise security, the native crawler enforces strict Server-Side Request Forgery (SSRF) protections. The gateway verifies every destination IP address before connection:

  • Blocked Ranges: All private IP subnets (e.g., 10.0.0.0/8, 192.168.0.0/16, 172.16.0.0/12), loopbacks (127.0.0.1, ::1), and cloud metadata services (169.254.169.254).
  • Domain Isolation: Any attempt to query internal workspace microservices or hostnames (like localhost or cloud dashboard services) will raise an UnsafeURLError and halt execution instantly.
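
The check described above can be sketched with Python's standard `ipaddress` module. This is an approximation under stated assumptions, not the gateway's real code: the `UnsafeURLError` class below is a stand-in that only borrows the name from this guide, and `is_blocked_ip`, `resolve_host`, and `assert_safe_url` are hypothetical helpers.

```python
import ipaddress
import socket
from urllib.parse import urlparse


class UnsafeURLError(ValueError):
    """Stand-in for the error named in this guide; the real class may differ."""


METADATA_IP = ipaddress.ip_address("169.254.169.254")


def is_blocked_ip(ip_text):
    """Return True for private, loopback, link-local, or metadata addresses."""
    ip = ipaddress.ip_address(ip_text)
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip == METADATA_IP


def resolve_host(hostname):
    """Return the IP addresses a hostname maps to (IP literals need no DNS)."""
    try:
        return [str(ipaddress.ip_address(hostname))]
    except ValueError:
        pass
    infos = socket.getaddrinfo(hostname, None)
    # Drop any IPv6 zone suffix such as "%eth0" before parsing.
    return sorted({info[4][0].split("%")[0] for info in infos})


def assert_safe_url(url, resolve=resolve_host):
    """Raise UnsafeURLError if any resolved address falls in a blocked range."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise UnsafeURLError(f"unsupported URL: {url}")
    for ip_text in resolve(parsed.hostname):
        if is_blocked_ip(ip_text):
            raise UnsafeURLError(
                f"{parsed.hostname} resolves to blocked address {ip_text}"
            )
```

A real gateway would also need to connect to the exact address it validated. Otherwise a hostname could resolve to a public address during the check and to an internal one at connection time.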

Tool Parameters

When configuring a crawl_url action within a graph node, you can define the following settings:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | The starting HTTP/HTTPS link to crawl. |
| url_patterns | array[string] | No | Glob/Regex patterns to match. Only URLs matching these patterns will be crawled. |
| allowed_domains | array[string] | No | Whitelist of domains. The crawler will not exit these boundaries. |
| blocked_domains | array[string] | No | Blacklist of domains to skip. |
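
As a rough illustration of how these four settings might interact, the sketch below builds a parameter set and applies it to candidate URLs. The dictionary shape, the `should_crawl` helper, and the matching rules (exact hostname comparison, glob-only patterns, blocked domains checked first) are assumptions made for this example. The guide does not specify the crawler's exact semantics.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

# Hypothetical parameter set using the four documented fields.
crawl_config = {
    "url": "https://docs.example.com/",
    "url_patterns": ["https://docs.example.com/guides/*"],
    "allowed_domains": ["docs.example.com"],
    "blocked_domains": ["ads.example.com"],
}


def should_crawl(url, config):
    """Decide whether a discovered URL passes the configured filters."""
    host = urlparse(url).hostname or ""
    if host in config.get("blocked_domains", []):
        return False  # explicitly blocked
    allowed = config.get("allowed_domains")
    if allowed and host not in allowed:
        return False  # outside the allowed boundary
    patterns = config.get("url_patterns")
    if patterns and not any(fnmatchcase(url, p) for p in patterns):
        return False  # matches none of the glob patterns
    return True


candidates = [
    "https://docs.example.com/guides/intro",
    "https://docs.example.com/blog/post",
    "https://ads.example.com/guides/banner",
    "https://other.example.org/guides/intro",
]
accepted = [u for u in candidates if should_crawl(u, crawl_config)]
```

Here only the first candidate survives: the second fails the pattern, the third is blocked, and the fourth is outside the allowed domains.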

Technical Output Structure

The tool returns a list of web page content blocks structured as follows:

[
  {
    "title": "Page Title",
    "content": "## Section Heading\n\nThis is structured markdown text extracted from the page.",
    "metadata": {
      "source": "https://example.com/sub-page"
    }
  }
]
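
A downstream node might flatten this list into a single context string before handing it to a model. The sketch below assumes the result arrives as a Python list of dictionaries with the same keys as the sample above. The `to_context` helper and its separator format are hypothetical.

```python
# Sample result mirroring the structure shown above.
pages = [
    {
        "title": "Page Title",
        "content": "## Section Heading\n\nThis is structured markdown text extracted from the page.",
        "metadata": {"source": "https://example.com/sub-page"},
    }
]


def to_context(pages):
    """Join crawled pages into one Markdown string, keeping each source URL."""
    parts = []
    for page in pages:
        # Fall back to "unknown" if a page carries no source metadata.
        source = page.get("metadata", {}).get("source", "unknown")
        parts.append(f"# {page['title']}\nSource: {source}\n\n{page['content']}")
    return "\n\n---\n\n".join(parts)


context = to_context(pages)
```

Keeping the source URL next to each block lets an agent cite where a passage came from.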

2. The Firecrawl Scraper (fire_crawl_url)

For web pages that rely heavily on client-side React/Vue hydration or require advanced proxy rotation to avoid bot detection, use the fire_crawl_url tool.

Tool Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | The web page URL to scrape. |

[!TIP]
Which one to choose?
Use crawl_url by default for standard blogs, documentation, and clean text resources. Switch to fire_crawl_url if the target site is a single-page app (SPA) that requires browser execution to render content.
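
One way to apply this guidance in code is a "native first, Firecrawl second" wrapper. The sketch below is illustrative only. The two tools are passed in as plain callables because this guide does not describe how they are invoked programmatically. The `min_chars` threshold is an arbitrary heuristic for "the page came back nearly empty", and it is assumed that both tools return the same list-of-pages shape.

```python
def scrape_with_fallback(url, crawl_url, fire_crawl_url, min_chars=200):
    """Try the native crawler first; fall back when it yields little text.

    crawl_url and fire_crawl_url are caller-supplied callables (hypothetical
    stand-ins for the two tools) that take a URL and return a list of pages.
    """
    pages = crawl_url(url)
    text_length = sum(len(page.get("content", "")) for page in pages)
    if text_length >= min_chars:
        return "crawl_url", pages
    return "fire_crawl_url", fire_crawl_url(url)


# Fake tools used only to demonstrate the control flow.
def fake_native(url):
    return [{"title": "App", "content": "Loading...", "metadata": {"source": url}}]


def fake_firecrawl(url):
    return [{"title": "App", "content": "x" * 500, "metadata": {"source": url}}]


tool_used, result = scrape_with_fallback(
    "https://spa.example.com/", fake_native, fake_firecrawl
)
```

In this demonstration the native result contains only a short placeholder, so the wrapper falls through to the second tool.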


Frequently Asked Questions

Does crawl_url support JavaScript-rendered pages?
Yes. The native crawler utilizes a headless browser configuration to render pages before extracting text. If a site uses custom obfuscation, use `fire_crawl_url` as a fallback.

How deep does the native crawler go?
The deep-crawling strategy traverses link frontiers up to a maximum depth of 20 pages, ensuring it captures sub-links within the allowed domain while ignoring external marketing tracker links.

Can I crawl intranet pages?
No. Due to the SSRF firewalls protecting the Langoedge deployment cluster, the crawler cannot connect to local network addresses or non-public domains.

Langoedge Team

The Langoedge engineering team builds AI agent infrastructure that empowers businesses to deploy reliable, observable AI staff. Follow Langoedge Team on LinkedIn for product updates and architectural deep dives.