Overview
Discover what makes ArchiveBox powerful
ArchiveBox is a self‑hosted web archiving engine that captures, stores, and serves snapshots of URLs in multiple durable formats (HTML, PDF, PNG, WARC, SQLite, etc.). From a developer’s standpoint it is a full‑stack application written in Python that exposes a REST API, CLI, and webhooks for programmatic interaction. Its core design revolves around *idempotent ingestion*—each URL is hashed, deduplicated, and archived once, then made searchable via a lightweight SQLite index or optionally a PostgreSQL backend for larger deployments. The application is intentionally modular: ingestion pipelines are composed of plug‑in “processors” (e.g., `wget`, `curl`, headless browser renderers, media downloaders) that can be swapped or extended without touching the core.
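The idempotent-ingestion idea described above can be illustrated with a small sketch. This is not ArchiveBox's actual code: the normalization rules, the SHA-256 choice, and the names `normalize_url`, `url_key`, and `SnapshotIndex` are all assumptions made for illustration.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host and drop the fragment (assumed rules)."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def url_key(url: str) -> str:
    """Hash the normalized URL so equivalent spellings share one key."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


class SnapshotIndex:
    """Hypothetical in-memory index that archives each URL at most once."""

    def __init__(self) -> None:
        self._seen = {}  # key -> normalized URL

    def ingest(self, url: str) -> bool:
        """Return True if the URL is new, False if it was already ingested."""
        key = url_key(url)
        if key in self._seen:
            return False
        self._seen[key] = normalize_url(url)
        return True

    def __len__(self) -> int:
        return len(self._seen)
```

Calling `ingest` twice with `https://Example.com` and `https://example.com/#top` would record a single entry under these assumed rules, because both normalize to the same string before hashing.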
Architecture
- Language & Runtime: Python 3.10+ with asyncio support for concurrent crawling.
- Frameworks: Flask‑based minimal API layer, Jinja2 templates for the web UI; optional uvicorn/ASGI support for production.
- Storage: Default SQLite (file‑based) with optional PostgreSQL or MySQL for high‑volume archives. The file system is the primary artifact store; each URL yields a directory of raw HTML, rendered PDFs, PNG screenshots, and extracted media.
- Background Workers: Celery‑like job queue powered by `multiprocessing` or `asyncio`, with optional Redis/RabbitMQ for distributed workers.
- Containerization: Official Docker image (`archivebox/archivebox`) ships with all dependencies and a pre‑configured entrypoint, making it trivial to run in Kubernetes or Docker Compose. Helm charts are available for cloud deployments.
- Extensibility: The `plugins` folder contains Python modules that implement the `Processor` interface. Developers can write custom processors to support new data sources (e.g., API feeds, social media scrapers) or output formats. Webhooks and a robust REST API (`/api/v1/...`) allow integration with CI/CD pipelines, monitoring dashboards, or third‑party services.
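To make the processor idea concrete, here is a minimal sketch of what a pluggable pipeline could look like. The text names a Processor interface but does not show its signature, so the `process(url, out_dir)` method, the `name` attribute, `UrlNoteProcessor`, and `run_pipeline` are all hypothetical.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Processor(ABC):
    """Hypothetical shape of a processor; the real interface may differ."""

    name: str

    @abstractmethod
    def process(self, url: str, out_dir: Path) -> Path:
        """Write one artifact for `url` into `out_dir` and return its path."""


class UrlNoteProcessor(Processor):
    """Toy processor that records the URL in a text file."""

    name = "url_note"

    def process(self, url: str, out_dir: Path) -> Path:
        out_dir.mkdir(parents=True, exist_ok=True)
        target = out_dir / "url.txt"
        target.write_text(url + "\n", encoding="utf-8")
        return target


def run_pipeline(url: str, out_dir: Path, processors: list) -> dict:
    """Run each processor in order and collect artifact paths by name."""
    return {p.name: p.process(url, out_dir) for p in processors}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        artifacts = run_pipeline(
            "https://example.com", Path(tmp) / "snapshot", [UrlNoteProcessor()]
        )
        print(artifacts)
```

Because the pipeline only depends on the abstract method, swapping a processor means changing the list passed to `run_pipeline`, not the pipeline itself.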
Core Capabilities
- Multi‑format Archiving: On ingestion, ArchiveBox creates a canonical snapshot (HTML+assets), renders the page to PDF/PNG via headless Chrome, extracts text, generates WARC files for archival compliance, and stores a JSON manifest.
- Search & Retrieval: A full‑text search index powered by SQLite FTS5 or Elasticsearch (optional) enables instant retrieval of archived URLs. The web UI provides faceted browsing, tag support, and direct file downloads.
- API & Webhooks: CRUD endpoints (`/api/v1/archives`) support batch ingestion, status polling, and deletion. Webhooks can be configured to trigger on `archive_created` or `archive_failed`, allowing downstream automation.
- CLI & Python API: The `archivebox` command line mirrors the REST API and can be used in scripts or as a wrapper for other tools. The `archivebox.api` module exposes programmatic access to the ingestion pipeline and query functions.
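As a rough illustration of the SQLite FTS5 search mentioned in the list above, the following sketch builds a tiny full-text index with Python's standard `sqlite3` module. The table name `snapshot_text`, its columns, and the helper names are assumptions, not ArchiveBox's schema, and the example requires a SQLite build with FTS5 enabled.

```python
import sqlite3


def build_index(docs):
    """Create an in-memory FTS5 table and load (url, title, body) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE snapshot_text USING fts5(url, title, body)")
    conn.executemany(
        "INSERT INTO snapshot_text (url, title, body) VALUES (?, ?, ?)", docs
    )
    return conn


def search(conn, query, limit=10):
    """Return URLs of matching rows, best match first."""
    rows = conn.execute(
        "SELECT url FROM snapshot_text "
        "WHERE snapshot_text MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_index(
        [
            ("https://example.com/a", "Archiving notes", "Notes on web archiving and WARC files"),
            ("https://example.com/b", "Baking", "A recipe for sourdough bread"),
        ]
    )
    print(search(conn, "archiving"))
```

The default FTS5 tokenizer matches whole tokens case-insensitively, so a query for `archiving` finds the first document but a query for `archive` would not unless a stemming tokenizer is configured.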
Deployment & Infrastructure
Running ArchiveBox locally requires Python, a database driver, and optional headless browser binaries. For production, the Docker image is recommended; it bundles Chromium, wget, and other dependencies. Scaling can be achieved by running multiple worker containers behind a load balancer, each consuming jobs from a shared Redis queue. Persistent storage should be backed by network‑attached volumes or cloud object stores (e.g., S3) to ensure durability. The application’s stateless API layer can be placed behind TLS termination and integrated with OAuth or LDAP for authentication.
Integration & Extensibility
- Plugins: New processors can be added by implementing the `Processor` interface and placing them in `plugins/`. The plugin system automatically discovers and registers them at startup.
- Webhooks: Configure URLs in `settings.py` to receive POST payloads on events. The payload includes archive metadata, file paths, and status codes.
- Custom UI: The Jinja templates are open source; developers can fork the repo and modify the front‑end to fit branding or add analytics dashboards.
- External Triggers: The CLI and REST API allow integration with cron jobs, CI pipelines, or monitoring tools (e.g., Prometheus exporters) to schedule periodic imports from RSS feeds, bookmark managers, or social media APIs.
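A receiver for the webhook events described in the list above might dispatch on the event name roughly as follows. The payload schema is not given in the text, so the JSON field names `event`, `url`, and `status` are assumptions chosen for illustration, as is the function name `handle_webhook`.

```python
import json


def handle_webhook(raw_body: bytes) -> str:
    """Decode a webhook POST body and summarize the event.

    Assumes a JSON object with hypothetical keys "event", "url", "status".
    """
    payload = json.loads(raw_body)
    event = payload.get("event")
    url = payload.get("url", "<unknown>")

    if event == "archive_created":
        return f"archived: {url}"
    if event == "archive_failed":
        return f"failed: {url} (status {payload.get('status')})"
    # Unknown events are reported rather than raising, so new event
    # types do not break the receiver.
    return f"ignored event: {event}"


if __name__ == "__main__":
    body = json.dumps({"event": "archive_created", "url": "https://example.com"})
    print(handle_webhook(body.encode("utf-8")))
```

In practice this function would sit behind whatever HTTP framework receives the POST; keeping it free of framework code makes the dispatch logic easy to test on its own.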
Developer Experience
The codebase follows PEP 8 conventions and is heavily documented in the Wiki. The archivebox command provides interactive prompts, while the Python API can be imported in any script. Community support is active on GitHub Discussions and a dedicated Discord channel, with frequent releases under the permissive MIT license. The modular design means developers can drop in custom processors or replace the storage backend without rewriting core logic.
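For scripting, one low-risk option is to wrap the command line with `subprocess` rather than rely on internal modules. The sketch below assumes the `archivebox` binary is on `PATH` and that `archivebox add <url>` is run from inside an initialized data directory; the helper names `build_add_command` and `archive_url` are hypothetical.

```python
import subprocess


def build_add_command(url: str, binary: str = "archivebox") -> list:
    """Build the argument list for adding one URL (no shell involved)."""
    return [binary, "add", url]


def archive_url(url: str, data_dir: str) -> subprocess.CompletedProcess:
    """Run the add command inside `data_dir` and capture its output.

    Assumes `data_dir` is an already-initialized ArchiveBox data folder.
    """
    return subprocess.run(
        build_add_command(url),
        cwd=data_dir,
        capture_output=True,
        text=True,
        check=False,  # let the caller inspect returncode and stderr
    )
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`, which matters when the URLs come from feeds or user input.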
Use Cases
- Legal Evidence Preservation: Capture and timestamp web pages for court filings, ensuring tamper‑evident snapshots.
- Research Archiving: Bulk ingest scholarly articles, conference proceedings, and datasets from URLs or RSS feeds for reproducibility.
- Personal Backup: Archive social media posts, photo albums, or favorite sites locally before they disappear.
- Compliance & Auditing: Maintain records of public-facing web content for regulatory reporting (e.g., GDPR, HIPAA).
Advantages
- Full Control: Self‑hosted storage guarantees data sovereignty; no reliance on third‑party services.
- Open Source