ArchiveBox

Self-Hosted

Self‑hosted web archiving for permanent preservation

25.3k stars

Updated May 19, 2025

Overview

ArchiveBox is a self-hosted web archiving engine that captures, stores, and serves snapshots of URLs in multiple durable formats (HTML, PDF, PNG, WARC, and others), with snapshot metadata kept in an SQLite index. From a developer's standpoint it is a full-stack Python application that exposes a REST API, a CLI, and webhooks for programmatic interaction. Its core design revolves around idempotent ingestion: each URL is hashed, deduplicated, and archived once, then made searchable via a lightweight SQLite index or, optionally, a PostgreSQL backend for larger deployments. The application is intentionally modular: ingestion pipelines are composed of plug-in "processors" (e.g., wget, curl, headless browser renderers, media downloaders) that can be swapped or extended without touching the core.
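The idempotent-ingestion idea can be illustrated with a minimal Python sketch. The normalization rules, the `url_key` helper, and the in-memory `SnapshotIndex` class are assumptions made for illustration; they are not taken from the ArchiveBox codebase.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def url_key(url: str) -> str:
    """Return a stable SHA-256 key for a URL (hypothetical normalization)."""
    parts = urlsplit(url.strip())
    # Lowercase scheme and host, drop the fragment, default an empty path to "/".
    normalized = urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        parts.query,
        "",
    ))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SnapshotIndex:
    """In-memory stand-in for the index; maps URL keys to original URLs."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}

    def add(self, url: str) -> bool:
        """Register a URL; return True only the first time it is seen."""
        key = url_key(url)
        if key in self._seen:
            return False
        self._seen[key] = url
        return True


index = SnapshotIndex()
results = [index.add(u) for u in (
    "https://Example.com/page#top",   # first sighting
    "https://example.com/page",       # same page after normalization
    "https://example.com/other",      # different page
)]
print(results)
```

Submitting the same URL twice leaves the index unchanged, which is what makes repeated imports from feeds or bookmark exports safe.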

Architecture

Language & Runtime: Python 3.10+ with asyncio support for concurrent crawling.
Frameworks: Django-based web application and API layer, with server-rendered templates for the web UI; optional ASGI server (e.g., uvicorn) for production.
Storage: Default SQLite (file‑based) with optional PostgreSQL or MySQL for high‑volume archives. File system is the primary artifact store; each URL yields a directory of raw HTML, rendered PDFs, PNG screenshots, and extracted media.
Background Workers: Celery‑like job queue powered by multiprocessing or asyncio, with optional Redis/RabbitMQ for distributed workers.
Containerization: Official Docker image (archivebox/archivebox) ships with all dependencies and a pre‑configured entrypoint, making it trivial to run in Kubernetes or Docker Compose. Helm charts are available for cloud deployments.
Extensibility: The plugins folder contains Python modules that implement the Processor interface. Developers can write custom processors to support new data sources (e.g., API feeds, social media scrapers) or output formats. Webhooks and a robust REST API (/api/v1/...) allow integration with CI/CD pipelines, monitoring dashboards, or third‑party services.
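A plug-in processor design of this kind can be sketched in Python as follows. The `Processor` base class, its `process` signature, the `register` decorator, and the `TitleStubProcessor` example are all hypothetical names chosen for illustration; the real interface may differ.

```python
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile


class Processor(ABC):
    """Hypothetical processor interface: one URL in, one artifact out."""

    name: str = "base"

    @abstractmethod
    def process(self, url: str, out_dir: Path) -> Path:
        """Write an artifact for `url` into `out_dir` and return its path."""


# Registry of available processors, keyed by their `name` attribute.
REGISTRY: dict[str, type[Processor]] = {}


def register(cls: type[Processor]) -> type[Processor]:
    """Class decorator that records a processor under its `name`."""
    REGISTRY[cls.name] = cls
    return cls


@register
class TitleStubProcessor(Processor):
    """Toy processor that records the URL instead of fetching it."""

    name = "title_stub"

    def process(self, url: str, out_dir: Path) -> Path:
        out_dir.mkdir(parents=True, exist_ok=True)
        target = out_dir / "title.txt"
        target.write_text(f"placeholder title for {url}\n", encoding="utf-8")
        return target


def run_pipeline(url: str, out_dir: Path) -> list[Path]:
    """Run every registered processor against one URL."""
    return [cls().process(url, out_dir) for cls in REGISTRY.values()]


with tempfile.TemporaryDirectory() as tmp:
    artifacts = run_pipeline("https://example.com", Path(tmp) / "snapshot")
    artifact_names = [p.name for p in artifacts]
    print(artifact_names)
```

Because the pipeline only iterates over the registry, adding a new output format means adding one decorated class, with no change to `run_pipeline`.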

Core Capabilities

Multi‑format Archiving: On ingestion, ArchiveBox creates a canonical snapshot (HTML+assets), renders the page to PDF/PNG via headless Chrome, extracts text, generates WARC files for archival compliance, and stores a JSON manifest.
Search & Retrieval: A full-text search index powered by SQLite FTS5, with Sonic or ripgrep as alternative search backends, enables fast retrieval of archived URLs. The web UI provides faceted browsing, tag support, and direct file downloads.
API & Webhooks: CRUD endpoints (/api/v1/archives) support batch ingestion, status polling, and deletion. Webhooks can be configured to trigger on archive_created or archive_failed, allowing downstream automation.
CLI & Python API: The archivebox command line mirrors the REST API and can be used in scripts or as a wrapper for other tools. The archivebox.api module exposes programmatic access to the ingestion pipeline and query functions.
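To show what an SQLite FTS5 full-text index looks like in practice, here is a small self-contained Python sketch. The `snapshot_text` table, its columns, and the sample rows are invented for the example and do not reflect ArchiveBox's actual schema; it also assumes the local SQLite build includes the FTS5 extension, which is the case for most Python distributions.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Hypothetical FTS5 table holding extracted page text per archived URL.
conn.execute("CREATE VIRTUAL TABLE snapshot_text USING fts5(url, body)")
conn.executemany(
    "INSERT INTO snapshot_text (url, body) VALUES (?, ?)",
    [
        ("https://example.com/a", "web archiving keeps pages readable offline"),
        ("https://example.com/b", "a recipe for sourdough bread"),
    ],
)


def search(term: str) -> list[str]:
    """Return URLs whose indexed text matches `term`, best match first."""
    rows = conn.execute(
        "SELECT url FROM snapshot_text WHERE snapshot_text MATCH ? ORDER BY rank",
        (term,),
    )
    return [url for (url,) in rows]


hits = search("archiving")
print(hits)
```

The `MATCH` operator queries the inverted index rather than scanning every row, which is why FTS5 remains usable as the archive grows.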

Deployment & Infrastructure

Running ArchiveBox locally requires Python, a database driver, and, optionally, headless browser binaries. For production, the Docker image is recommended; it bundles Chromium, wget, and other dependencies. Scaling can be achieved by spinning up multiple worker containers behind a load balancer, each consuming jobs from a shared Redis queue. Persistent storage should be backed by network-attached volumes or cloud object stores (e.g., S3) to ensure durability. The application's stateless API layer can be placed behind TLS termination and integrated with OAuth or LDAP for authentication.
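The shared-queue worker pattern described above can be sketched with the Python standard library. Here `queue.Queue` and threads stand in for Redis and worker containers, and the `archive` function is a placeholder; none of these names come from ArchiveBox itself.

```python
import queue
import threading

jobs = queue.Queue()          # stand-in for the shared job queue
done = []                     # (worker name, result) pairs
done_lock = threading.Lock()  # protects `done` across threads


def archive(url: str) -> str:
    """Placeholder for the real archiving work."""
    return f"archived:{url}"


def worker(name: str) -> None:
    """Pull URLs until a None sentinel arrives."""
    while True:
        url = jobs.get()
        try:
            if url is None:   # sentinel: no more work for this worker
                return
            result = archive(url)
            with done_lock:
                done.append((name, result))
        finally:
            jobs.task_done()  # always acknowledge the item, even the sentinel


workers = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in range(3)]
for t in workers:
    t.start()

urls = [f"https://example.com/page/{n}" for n in range(10)]
for url in urls:
    jobs.put(url)
for _ in workers:
    jobs.put(None)            # one sentinel per worker

jobs.join()                   # wait until every queued item is acknowledged
for t in workers:
    t.join()

processed = sorted(result for _, result in done)
print(len(processed))
```

Each job is taken by exactly one worker, so adding workers increases throughput without archiving the same URL twice.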

Integration & Extensibility

Plugins: New processors can be added by implementing the Processor interface and placing them in plugins/. The plugin system automatically discovers and registers them at startup.
Webhooks: Configure URLs in settings.py to receive POST payloads on events. The payload includes archive metadata, file paths, and status codes.
Custom UI: The web UI templates are open source; developers can fork the repo and modify the front-end to fit branding or add analytics dashboards.
External Triggers: The CLI and REST API allow integration with cron jobs, CI pipelines, or monitoring tools (e.g., Prometheus exporters) to schedule periodic imports from RSS feeds, bookmark managers, or social media APIs.
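On the receiving side, a webhook consumer might dispatch on the event name roughly as follows. The event names archive_created and archive_failed come from the description above, but the payload fields (`event`, `url`) and the `handle_webhook` function are assumptions for this sketch, since the exact payload schema is not given here.

```python
import json


def handle_webhook(body: bytes) -> str:
    """Decode a webhook POST body and decide what to do with it."""
    payload = json.loads(body)
    event = payload.get("event")          # assumed field name
    url = payload.get("url", "<unknown>")  # assumed field name

    if event == "archive_created":
        return f"stored snapshot for {url}"
    if event == "archive_failed":
        return f"retry scheduled for {url}"
    # Unknown events are ignored rather than treated as errors.
    return f"ignored event {event!r}"


created = handle_webhook(b'{"event": "archive_created", "url": "https://example.com"}')
failed = handle_webhook(b'{"event": "archive_failed", "url": "https://example.com"}')
print(created)
print(failed)
```

In a real deployment this function would sit behind an HTTP endpoint; keeping the dispatch logic separate from the server makes it easy to test.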

Developer Experience

The codebase follows PEP 8 conventions and is heavily documented in the Wiki. The archivebox command provides interactive prompts, while the Python API can be imported in any script. Community support is active on GitHub Discussions and a dedicated Discord channel, with frequent releases under the permissive MIT license. The modular design means developers can drop in custom processors or replace the storage backend without rewriting core logic.
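For scripting, one low-risk option is to drive the `archivebox` CLI from Python with `subprocess`. The sketch below uses the `archivebox add` subcommand and its `--depth` option; the helper names `build_add_command` and `add_url`, and the choice to return `None` when the CLI is not installed, are assumptions of this example.

```python
import shutil
import subprocess
from typing import Optional


def build_add_command(url: str, depth: int = 0) -> list[str]:
    """Build the argument list for `archivebox add`."""
    return ["archivebox", "add", f"--depth={depth}", url]


def add_url(url: str, depth: int = 0, data_dir: str = ".") -> Optional[subprocess.CompletedProcess]:
    """Run `archivebox add` inside an existing data directory.

    Returns None when the archivebox executable is not on PATH.
    """
    if shutil.which("archivebox") is None:
        return None
    return subprocess.run(
        build_add_command(url, depth),
        cwd=data_dir,          # the CLI operates on the collection in this directory
        capture_output=True,
        text=True,
        check=False,           # let the caller inspect returncode
    )


command = build_add_command("https://example.com", depth=1)
print(command)
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`.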

Use Cases

Legal Evidence Preservation: Capture and timestamp web pages for court filings, ensuring tamper‑evident snapshots.
Research Archiving: Bulk ingest scholarly articles, conference proceedings, and datasets from URLs or RSS feeds for reproducibility.
Personal Backup: Archive social media posts, photo albums, or favorite sites locally before they disappear.
Compliance & Auditing: Maintain records of public-facing web content for regulatory reporting (e.g., GDPR, HIPAA).