MCPSERV.CLUB
pragmar

MCP Server Webcrawl

AI‑powered search for web crawl data

Active (72) · 25 stars · 0 views · Updated 22 days ago

About

A lightweight MCP server that lets LLMs query and filter web crawler archives—supporting ArchiveBox, HTTrack, WARC, wget, and more—with full‑text boolean search, type/status filtering, and markdown snippets.

Capabilities

  • Resources: access data sources
  • Tools: execute functions
  • Prompts: pre-built templates
  • Sampling: AI model interactions

MCP Server Webcrawl

mcp-server-webcrawl is an MCP server that turns any locally‑hosted web crawl into a searchable, AI‑ready knowledge base. It bridges the gap between raw archive data—whether from ArchiveBox, HTTrack, WARC files, or other crawling tools—and the natural‑language queries that AI assistants like Claude prefer. By exposing a full‑text search interface with Boolean logic, the server lets developers and content analysts query vast amounts of web data quickly, without writing custom parsers or indexing pipelines.
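To make the Boolean search concrete, here is a minimal sketch of the kind of AND/OR/NOT matching the server exposes over crawled text. The function name, the query-tree shape, and the sample documents are illustrative assumptions, not the server's actual API:

```python
# Sketch of boolean full-text matching; `matches` and the query-tree
# shape are hypothetical, chosen only to illustrate AND/OR/NOT logic.

def matches(text: str, query: dict) -> bool:
    """Evaluate a tiny AND/OR/NOT query tree against a document's text."""
    op = query.get("op")
    if op == "term":
        return query["value"].lower() in text.lower()
    if op == "and":
        return all(matches(text, q) for q in query["args"])
    if op == "or":
        return any(matches(text, q) for q in query["args"])
    if op == "not":
        return not matches(text, query["arg"])
    raise ValueError(f"unknown op: {op}")

# Equivalent of: climate AND (policy OR emissions) NOT draft
query = {
    "op": "and",
    "args": [
        {"op": "term", "value": "climate"},
        {"op": "or", "args": [
            {"op": "term", "value": "policy"},
            {"op": "term", "value": "emissions"},
        ]},
        {"op": "not", "arg": {"op": "term", "value": "draft"}},
    ],
}

print(matches("Climate policy update for 2024", query))   # True
print(matches("Draft climate emissions report", query))   # False
```

In the real server this evaluation happens against an indexed archive rather than in-memory strings, but the composable AND/OR/NOT semantics are the same.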

The server’s core value lies in its filter‑first approach. Once a crawl is ingested, users can narrow results by content type (HTML, JSON, images), HTTP status codes, timestamps, or custom metadata. This fine‑grained filtering means an LLM can ask for “all 404 pages from the last month” or “every article tagged with ‘climate change’”, and receive precise, context‑rich snippets. The ability to return Markdown‑formatted excerpts or raw code blocks further enhances the usability of responses in documentation, debugging, or compliance checks.
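The filter-first flow can be sketched as a narrowing pass over crawl records. The `CrawlResult` shape and `filter_results` helper below are assumptions for illustration; the server's real schema and parameters may differ:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Illustrative record shape; the server's actual schema may differ.
@dataclass
class CrawlResult:
    url: str
    content_type: str   # e.g. "html", "json", "img"
    status: int         # HTTP status code
    crawled_at: datetime

def filter_results(results, *, types=None, statuses=None, since=None):
    """Filter-first narrowing by content type, HTTP status, and timestamp."""
    out = []
    for r in results:
        if types and r.content_type not in types:
            continue
        if statuses and r.status not in statuses:
            continue
        if since and r.crawled_at < since:
            continue
        out.append(r)
    return out

now = datetime(2024, 6, 1)
results = [
    CrawlResult("https://example.com/a", "html", 200, now),
    CrawlResult("https://example.com/b", "html", 404, now),
    CrawlResult("https://example.com/c", "img", 404, now - timedelta(days=60)),
]

# "All 404 pages from the last month"
recent_404s = filter_results(
    results, statuses={404}, since=now - timedelta(days=30))
print([r.url for r in recent_404s])  # ['https://example.com/b']
```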

Key capabilities include:

  • Multi‑crawler compatibility: Support for ArchiveBox, HTTrack, InterroBot, Katana, SiteOne, WARC archives, and wget mirrors.
  • Boolean search: Combine terms with AND, OR, NOT for complex queries.
  • Rich filtering: By type, HTTP status, crawl depth, and more.
  • Prompt routines: Pre‑built Markdown prompts (e.g., SEO audit, 404 audit) that can be copied into an AI session and executed automatically.
  • Claude Desktop integration: Designed to work out of the box with Claude’s desktop client, requiring only Python 3.10+.
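Wiring the server into Claude Desktop follows the standard `mcpServers` configuration pattern. The entry below is a sketch; the crawler name, archive path, and exact flags are placeholders, so check the project README for the flags your crawler requires:

```json
{
  "mcpServers": {
    "webcrawl": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "wget", "--datasrc", "/path/to/archives/"]
    }
  }
}
```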

Real‑world use cases span web compliance audits, competitive intelligence gathering, and automated content quality checks. For example, a marketing team can run an SEO audit prompt against the latest crawl of their site to surface missing meta tags or broken links, while a security analyst can filter for pages with high‑risk HTTP status codes and feed them into an automated remediation workflow.

In practice, developers embed the server in their existing AI pipelines: a user query triggers an MCP request; the server returns structured search results; the LLM formats and presents them, optionally looping with additional prompts. This tight integration removes the need for bespoke search engines or manual data wrangling, giving AI assistants a powerful, ready‑to‑use knowledge base that scales with the size of your web archives.
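The request/response loop described above can be sketched with the MCP transport stubbed out. Everything here is illustrative: `mcp_search` stands in for a real MCP tool call, and the field names are assumptions rather than the server's actual result schema:

```python
# Hedged sketch of the query -> search -> format loop; the MCP transport
# is stubbed and all names are illustrative, not the server's real API.

def mcp_search(query: str, filters: dict) -> list[dict]:
    """Stand-in for an MCP tool call against the webcrawl server."""
    index = [
        {"url": "https://example.com/pricing", "status": 200,
         "snippet": "## Pricing\nPlans start at..."},
    ]
    return [r for r in index if query.lower() in r["snippet"].lower()]

def answer(user_query: str) -> str:
    """LLM-side step: turn structured search hits into a Markdown reply."""
    hits = mcp_search(user_query, {"type": "html"})
    if not hits:
        return "No matching pages found."
    return "\n".join(f"- [{h['url']}]({h['url']}) ({h['status']})"
                     for h in hits)

print(answer("pricing"))
```

In a real deployment the stubbed `mcp_search` is replaced by an MCP client call over stdio or another transport, and the formatting step is performed by the model itself rather than a fixed template.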