About
A lightweight MCP server that lets LLMs query and filter web crawler archives—supporting ArchiveBox, HTTrack, WARC, wget, and more—with full‑text Boolean search, type/status filtering, and Markdown snippets.
Capabilities
mcp-server-webcrawl is an MCP server that turns any locally‑hosted web crawl into a searchable, AI‑ready knowledge base. It bridges the gap between raw archive data—whether from ArchiveBox, HTTrack, WARC files, or other crawling tools—and the natural‑language queries that AI assistants like Claude prefer. By exposing a full‑text search interface with Boolean logic, the server lets developers and content analysts query vast amounts of web data quickly, without writing custom parsers or indexing pipelines.
The server’s core value lies in its filter‑first approach. Once a crawl is ingested, users can narrow results by content type (HTML, JSON, images), HTTP status codes, timestamps, or custom metadata. This fine‑grained filtering means an LLM can ask for “all 404 pages from the last month” or “every article tagged with ‘climate change’”, and receive precise, context‑rich snippets. The ability to return Markdown‑formatted excerpts or raw code blocks further enhances the usability of responses in documentation, debugging, or compliance checks.
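The filter-first idea can be sketched in a few lines of Python. The record fields (`url`, `status`, `type`, `crawled_at`) and the keyword parameters below are illustrative, not the server's actual schema; they only mirror the kinds of narrowing the server performs.

```python
from datetime import datetime, timedelta

# Hypothetical crawl records; field names are illustrative only.
records = [
    {"url": "https://example.com/", "status": 200, "type": "html",
     "crawled_at": datetime(2024, 5, 20)},
    {"url": "https://example.com/old", "status": 404, "type": "html",
     "crawled_at": datetime(2024, 5, 25)},
    {"url": "https://example.com/logo.png", "status": 404, "type": "img",
     "crawled_at": datetime(2024, 1, 2)},
]

def filter_records(records, *, status=None, rtype=None, since=None):
    """Narrow results by HTTP status, resource type, and crawl timestamp,
    in the spirit of the server's filters."""
    out = records
    if status is not None:
        out = [r for r in out if r["status"] == status]
    if rtype is not None:
        out = [r for r in out if r["type"] == rtype]
    if since is not None:
        out = [r for r in out if r["crawled_at"] >= since]
    return out

# "All 404 pages from the last month" (relative to a fixed reference date).
ref = datetime(2024, 6, 1)
hits = filter_records(records, status=404, since=ref - timedelta(days=30))
```

Stacking the filters this way is what lets a natural-language request like the one above translate into a small, precise result set instead of a full-archive dump.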
Key capabilities include:
- Multi‑crawler compatibility: Support for ArchiveBox, HTTrack, InterroBot, Katana, SiteOne, WARC archives, and wget mirrors.
- Boolean search: Combine terms with AND, OR, NOT for complex queries.
- Rich filtering: By type, HTTP status, crawl depth, and more.
- Prompt routines: Pre‑built Markdown prompts (e.g., SEO audit, 404 audit) that can be copied into an AI session and executed automatically.
- Claude Desktop integration: Designed to work out of the box with Claude’s desktop client, requiring only Python 3.10+.
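The Boolean search capability above combines terms with AND, OR, and NOT. A toy matcher makes the semantics concrete; the function name and keyword parameters here are invented for illustration, and the real server parses query strings rather than taking separate argument lists.

```python
def matches(text, must=(), any_of=(), must_not=()):
    """Toy Boolean matcher: every term in `must` is ANDed, the terms in
    `any_of` are ORed together, and `must_not` terms are negated (NOT)."""
    t = text.lower()
    if any(term.lower() not in t for term in must):
        return False
    if any_of and not any(term.lower() in t for term in any_of):
        return False
    if any(term.lower() in t for term in must_not):
        return False
    return True

docs = [
    "Cookie consent banner and privacy policy",
    "Privacy policy draft, internal only",
    "Shipping and returns",
]

# Roughly: privacy AND (cookie OR consent) NOT draft
hits = [d for d in docs
        if matches(d, must=["privacy"], any_of=["cookie", "consent"],
                   must_not=["draft"])]
```

The same logic, applied against a full-text index instead of a short list, is what makes compound queries like "privacy AND (cookie OR consent) NOT draft" cheap to answer across a large archive.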
Real‑world use cases span web compliance audits, competitive intelligence gathering, and automated content quality checks. For example, a marketing team can run an SEO audit prompt against the latest crawl of their site to surface missing meta tags or broken links, while a security analyst can filter for pages with high‑risk HTTP status codes and feed them into an automated remediation workflow.
In practice, developers embed the server in their existing AI pipelines: a user query triggers an MCP request; the server returns structured search results; the LLM formats and presents them, optionally looping with additional prompts. This tight integration removes the need for bespoke search engines or manual data wrangling, giving AI assistants a powerful, ready‑to‑use knowledge base that scales with the size of your web archives.
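For the Claude Desktop integration mentioned above, MCP servers are registered in the client's `claude_desktop_config.json` under an `mcpServers` key. A minimal entry might look like the sketch below; the server name, flag names (`--crawler`, `--datasrc`), and path are assumptions for illustration, so check the project's README for the exact invocation.

```json
{
  "mcpServers": {
    "webcrawl": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "wget", "--datasrc", "/path/to/archives"]
    }
  }
}
```

Once registered, the client launches the server automatically and routes tool calls to it, so no separate indexing service needs to be running.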