Paperless-ngx

Self-Hosted

Turn paper into a searchable digital archive

Overview

Paperless-ngx is a full-stack, self-hosted document management system written in Python that exposes both a REST API and a web UI. It ingests PDFs, images, plain text, and (with the optional Tika integration) Office documents, runs OCR with Tesseract to extract searchable text, and stores the processed files as PDF/A alongside the unmodified originals. Metadata (tags, correspondents, document types, and custom fields) is persisted in a relational database such as PostgreSQL or SQLite, while the file system holds the original and archived files. The backend is built on Django with Django REST Framework and uses Celery for asynchronous OCR and ML tagging. The front end is a single-page application written in Angular that communicates via the API and provides drag-and-drop uploads, bulk edits, and customizable dashboards.

Architecture

Backend: Django + Django REST Framework. PostgreSQL is the recommended database; SQLite is supported for lightweight deployments.
Asynchronous Workers: Celery workers run OCR (Tesseract) and ML models for tagging, enabling non‑blocking ingestion.
Storage: Documents are stored on the host's file system; paths are generated by configurable naming schemes. OCR and PDF/A conversion are handled by OCRmyPDF, which drives Tesseract.
API: A comprehensive REST API exposes CRUD operations for documents, tags, correspondents, types, and custom fields. Webhooks can be configured to notify external services on document events.
Web UI: An Angular single-page application consumes the API, providing a responsive interface with drag-and-drop, bulk editing, and customizable views.
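Containerization: Official Docker images and Docker Compose files cover the web server, the workers, and supporting services such as the database and the Redis broker. Kubernetes deployments via Helm are possible, though the charts are generally community-maintained rather than part of the official release.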
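
The Storage item mentions configurable naming schemes. As a rough illustration of the idea (not Paperless-ngx's own implementation), the sketch below renders a placeholder template into a relative path. The placeholder names created_year, correspondent, and title mirror ones documented for the PAPERLESS_FILENAME_FORMAT setting; the render_path and sanitize helpers and the sanitizing rule are assumptions made for this example.

```python
import re
from pathlib import PurePosixPath


def sanitize(value: str) -> str:
    """Replace characters that are awkward in file names with underscores."""
    cleaned = re.sub(r"[^\w\-. ]", "_", value).strip()
    # Fall back to a fixed token so an empty field never yields an empty path segment.
    return cleaned or "none"


def render_path(template: str, metadata: dict[str, str], suffix: str = ".pdf") -> PurePosixPath:
    """Fill a '{placeholder}' template from document metadata (hypothetical helper)."""
    safe = {key: sanitize(str(val)) for key, val in metadata.items()}
    return PurePosixPath(template.format(**safe) + suffix)


# Hypothetical metadata for a single document.
doc = {"created_year": "2023", "correspondent": "ACME Corp", "title": "Invoice 42/B"}
path = render_path("{created_year}/{correspondent}/{title}", doc)
print(path)  # 2023/ACME Corp/Invoice 42_B.pdf
```

Sanitizing each field before substitution keeps a slash inside a title from silently creating an extra directory level.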

Core Capabilities

OCR & Text Extraction: Uses Tesseract to support >100 languages; returns selectable, searchable text.
Machine‑Learning Tagging: Trains models on user tags to auto‑assign labels, correspondents, and document types.
Metadata Management: Full CRUD on tags, correspondents, types, custom fields; supports bulk operations and versioning.
API & Webhooks: Exposes endpoints for programmatic ingestion, querying, and bulk updates. Supports webhook callbacks on document lifecycle events.
Extensibility: Pre- and post-consumption scripts let developers run custom processing around the ingestion pipeline. Custom file naming conventions can be defined via configuration.
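Security: All data remains on the host; nothing is transmitted externally unless explicitly configured. HTTPS is typically terminated at a reverse proxy, and the application supports single sign-on via OpenID Connect as well as user- and group-based permissions.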
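
To make the webhook item above more concrete, here is a minimal receiver sketch using only the Python standard library. The payload shape (a JSON object with event and document_id fields), the port, and the names parse_event, WebhookHandler, and serve are all assumptions for illustration; the body an instance actually sends depends on how its webhook is configured.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_event(body: bytes) -> tuple[str, int]:
    """Extract (event, document_id) from a JSON payload; raise ValueError if malformed."""
    try:
        payload = json.loads(body)
        return str(payload["event"]), int(payload["document_id"])
    except (KeyError, TypeError, ValueError) as exc:
        # json.JSONDecodeError is a ValueError subclass, so bad JSON lands here too.
        raise ValueError(f"unexpected webhook payload: {exc!r}") from exc


class WebhookHandler(BaseHTTPRequestHandler):
    """Accepts POST callbacks and logs the document event they describe."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            event, doc_id = parse_event(self.rfile.read(length))
        except ValueError:
            self.send_response(400)  # reject payloads we do not understand
            self.end_headers()
            return
        print(f"document {doc_id}: {event}")
        self.send_response(204)  # acknowledged, no response body
        self.end_headers()


def serve(port: int = 8099) -> None:
    """Run the receiver until interrupted (the port is an arbitrary example value)."""
    HTTPServer(("127.0.0.1", port), WebhookHandler).serve_forever()
```

Calling serve() starts the listener. Keeping the parsing in parse_event, separate from the HTTP handler, makes the payload handling easy to test without opening a socket.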

Deployment & Infrastructure

Paperless-ngx is designed for self-hosting on Linux servers, cloud droplets, or virtual machines. Typical requirements and deployment considerations:

CPU: At least 2 cores for OCR workloads; higher for large batch processing.
Memory: Minimum 4 GB RAM; recommended 8 GB+ for concurrent users.
Storage: SSD preferred; separate volumes for media and database improve performance.
Scalability: Horizontal scaling of workers is straightforward with Celery; database replication can be added for read‑heavy workloads.
Container Support: Official Docker images are available; community Helm charts can simplify Kubernetes deployments. Persistent volumes are used to retain media and database state across restarts.

Integration & Extensibility

REST API: Full CRUD endpoints for all entities, pagination, filtering, and search.
Webhooks: Configurable POST callbacks on document creation, update, or deletion.
Custom Processors: Developers can write Python modules that hook into the ingestion pipeline, e.g., to integrate with external OCR services or enrich metadata.
CLI Tools: Django management commands (for example document_exporter and document_importer) expose administrative tasks such as import, export, and cleanup for automation scripts.
Event Bus: Celery signals can be used to trigger downstream services or notifications.
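
As a sketch of how the REST API item above might be used from a script, the example below builds an authenticated full-text search request with the standard library. It assumes token authentication via an "Authorization: Token ..." header, the /api/documents/ endpoint with a query parameter, and a paginated JSON response with a results list. The base URL, the token value, and the function names are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_request(base_url: str, token: str, query: str, page: int = 1) -> Request:
    """Build (but do not send) a GET request for a full-text document search."""
    params = urlencode({"query": query, "page": page})
    url = f"{base_url.rstrip('/')}/api/documents/?{params}"
    headers = {"Authorization": f"Token {token}", "Accept": "application/json"}
    return Request(url, headers=headers)


def search_titles(base_url: str, token: str, query: str) -> list[str]:
    """Send the search and return the titles on the first result page."""
    with urlopen(build_search_request(base_url, token, query)) as resp:
        data = json.load(resp)
    return [doc["title"] for doc in data["results"]]


# Placeholder host and token; no network call is made here.
req = build_search_request("http://localhost:8000/", "example-token", "electricity invoice")
print(req.full_url)
```

Only search_titles touches the network, so the URL and headers can be inspected or logged before anything is sent.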

Developer Experience

Paperless-ngx provides documentation covering architecture, API usage, and deployment scenarios. The codebase follows Django conventions, making it familiar to Python developers. Continuous integration on GitHub Actions runs automated checks on proposed changes, and the community is active on Matrix, GitHub Discussions, and translation platforms. The project encourages contributions via feature branches, PR reviews, and issue triage.

Use Cases

Enterprise Document Archiving: Deploy on a private cloud to ingest, index, and retrieve internal documents with full-text search.
Personal Knowledge Base: Run on a home server to keep scanned receipts, contracts, and PDFs searchable without cloud reliance.
Compliance Audits: Store documents in PDF/A format for long‑term retention, with audit logs exposed via API.
Automation Pipelines: Integrate the REST API into existing workflows (e.g., mail‑to‑PDF ingestion, RPA bots) to automatically tag and archive documents.
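
For the automation use case above, a script usually needs to push files into the archive. The sketch below builds a multipart upload request with the standard library only. It assumes the /api/documents/post_document/ endpoint with a document file field and an optional title field, plus token authentication; the host, token, file name, and build_upload_request helper are placeholders, and the file name is assumed not to contain double quotes.

```python
import uuid
from typing import Optional
from urllib.request import Request


def build_upload_request(
    base_url: str, token: str, filename: str, content: bytes, title: Optional[str] = None
) -> Request:
    """Build (but do not send) a multipart/form-data POST that uploads one document."""
    boundary = uuid.uuid4().hex
    parts: list[bytes] = []
    if title is not None:
        field = f'--{boundary}\r\nContent-Disposition: form-data; name="title"\r\n\r\n{title}\r\n'
        parts.append(field.encode("utf-8"))
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="document"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8") + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))  # closing boundary

    return Request(
        f"{base_url.rstrip('/')}/api/documents/post_document/",
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder values; pass the request to urllib.request.urlopen to actually upload.
upload = build_upload_request(
    "http://localhost:8000", "example-token", "scan.pdf", b"%PDF-1.4 example", title="Receipt"
)
print(upload.get_method(), upload.full_url)
```

A watch-folder or mail-processing script could call this for each new file and hand the result to urlopen; third-party HTTP libraries make the multipart encoding shorter if adding a dependency is acceptable.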

Advantages

Open Source & Self‑Hosted: No vendor lock‑in; all data stays on your premises.
Python/Django Ecosystem: Leverages mature libraries, making it easier to extend or replace components.
High Performance OCR: Tesseract integration and asynchronous workers allow fast batch processing.
Extensible Metadata Model: Custom fields and ML tagging enable sophisticated document classification.
Community‑Driven: Active contributors, frequent releases, and robust documentation reduce the learning