Overview
Discover what makes Paperless-ngx powerful
Paperless‑ngx is a full‑stack, self‑hosted document management system written in **Python** that exposes both a REST API and a web UI. The application ingests PDFs, images, Office documents, and plain text, runs OCR with Tesseract to extract searchable text, and stores the processed files in PDF/A format alongside the unmodified originals. Metadata (tags, correspondents, document types, and custom fields) is persisted in a relational database (PostgreSQL or SQLite), while the file system holds the original and processed files. The backend is built on **Django** with Django REST Framework and uses Celery for asynchronous OCR and ML tagging. The front end is a single‑page application built with **Angular** that communicates with the backend through the API, enabling drag‑and‑drop uploads, bulk edits, and customizable dashboards.
Architecture
- Backend: Django + Django REST Framework. Uses PostgreSQL as the primary database, but supports SQLite for lightweight deployments.
- Asynchronous Workers: Celery workers run OCR (Tesseract) and ML models for tagging, enabling non‑blocking ingestion.
- Storage: Documents are stored on the host’s file system; paths are generated by configurable naming schemes. PDF/A conversion is performed via pdfa libraries.
- API: A comprehensive REST API exposes CRUD operations for documents, tags, correspondents, types, and custom fields. Webhooks can be configured to notify external services on document events.
- Web UI: A single‑page application built with Angular consumes the API, providing a responsive interface with drag‑and‑drop, bulk editing, and customizable views.
- Containerization: Official Docker images are deployed with Docker Compose, which runs the web, worker, and database services together. Helm charts cover Kubernetes deployments.
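To make the "configurable naming schemes" in the Storage item concrete, here is a minimal sketch of how a placeholder template could be turned into a storage path. It is not Paperless‑ngx's own implementation: the function `render_storage_path`, the sample metadata, and the sanitising rule are assumptions made for illustration.

```python
from pathlib import PurePosixPath


def render_storage_path(template: str, metadata: dict[str, str]) -> PurePosixPath:
    """Fill a naming template with document metadata and return a relative path.

    Hypothetical helper: it only illustrates the idea of a naming scheme.
    """
    # Replace path separators inside values so a title cannot create extra
    # directories, and fall back to "none" for empty values.
    safe = {
        key: value.replace("/", "-").strip() or "none"
        for key, value in metadata.items()
    }
    return PurePosixPath(template.format(**safe) + ".pdf")


# Sample metadata (invented for this example).
doc = {
    "created_year": "2024",
    "correspondent": "ACME Corp",
    "title": "Invoice 03/2024",
}
path = render_storage_path("{created_year}/{correspondent}/{title}", doc)
print(path)  # 2024/ACME Corp/Invoice 03-2024.pdf
```

Sanitising the values before formatting keeps the directory layout fully determined by the template rather than by whatever characters appear in a document title.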
Core Capabilities
- OCR & Text Extraction: Uses Tesseract to support >100 languages; returns selectable, searchable text.
- Machine‑Learning Tagging: Trains models on user tags to auto‑assign labels, correspondents, and document types.
- Metadata Management: Full CRUD on tags, correspondents, types, custom fields; supports bulk operations and versioning.
- API & Webhooks: Exposes endpoints for programmatic ingestion, querying, and bulk updates. Supports webhook callbacks on document lifecycle events.
- Extensibility: Plugin hooks allow developers to inject custom processors or modify ingestion pipelines. Custom naming conventions can be defined via configuration.
- Security: All data remains on the host; no external transmission unless explicitly configured. Supports HTTPS, OAuth2, and role‑based access controls.
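The idea behind learning tags from documents the user has already labelled can be shown with a toy word‑count model. This is only a sketch under simplifying assumptions: the functions `train` and `suggest_tag` and the training data are hypothetical, and the approach is far simpler than a real classifier.

```python
from collections import Counter, defaultdict
from typing import Optional


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep purely alphabetic words."""
    return [word for word in text.lower().split() if word.isalpha()]


def train(examples: list[tuple[str, str]]) -> dict[str, Counter]:
    """Count how often each word appears in documents carrying each tag."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for text, tag in examples:
        counts[tag].update(tokenize(text))
    return counts


def suggest_tag(model: dict[str, Counter], text: str) -> Optional[str]:
    """Return the tag whose training words overlap most with the text."""
    words = tokenize(text)
    scores = {tag: sum(seen[w] for w in words) for tag, seen in model.items()}
    best = max(scores, key=scores.get, default=None)
    # Suggest nothing when no known word matches at all.
    return best if best is not None and scores[best] > 0 else None


# Invented training data: (extracted text, tag assigned by the user).
model = train([
    ("invoice total amount due", "finance"),
    ("payment invoice overdue", "finance"),
    ("rental contract signed landlord", "housing"),
])
print(suggest_tag(model, "Your invoice is overdue"))  # finance
print(suggest_tag(model, "hello world"))              # None
```

Returning `None` when nothing matches mirrors the practical point that an automatic suggestion should be skipped rather than guessed when the text gives no signal.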
Deployment & Infrastructure
Paperless‑ngx is optimized for self‑hosting on Linux servers, cloud droplets, or virtual machines. Resource guidelines and deployment options:
- CPU: At least 2 cores for OCR workloads; higher for large batch processing.
- Memory: Minimum 4 GB RAM; recommended 8 GB+ for concurrent users.
- Storage: SSD preferred; separate volumes for media and database improve performance.
- Scalability: Horizontal scaling of workers is straightforward with Celery; database replication can be added for read‑heavy workloads.
- Container Support: Official Docker images are available; Helm charts simplify Kubernetes deployments. Persistent volumes are used to retain media and database state across restarts.
Integration & Extensibility
- REST API: Full CRUD endpoints for all entities, pagination, filtering, and search.
- Webhooks: Configurable POST callbacks on document creation, update, or deletion.
- Custom Processors: Developers can write Python modules that hook into the ingestion pipeline, e.g., to integrate with external OCR services or enrich metadata.
- CLI Tools: paperless manage commands expose administrative tasks (import, export, cleanup) for automation scripts.
- Event Bus: Celery signals can be used to trigger downstream services or notifications.
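As an illustration of programmatic access through the REST API, the sketch below builds an authenticated full‑text search request using only the standard library. The host, the token value, and the helper name `build_search_request` are placeholders, and the endpoint path and `query` parameter should be checked against the API documentation of the installed version.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(base_url: str, token: str, query: str, page: int = 1) -> Request:
    """Build (but do not send) a GET request that searches documents.

    Hypothetical helper; base_url and token are supplied by the caller.
    """
    params = urlencode({"query": query, "page": page})
    url = f"{base_url.rstrip('/')}/api/documents/?{params}"
    headers = {
        # Token authentication header in the Django REST Framework style.
        "Authorization": f"Token {token}",
        "Accept": "application/json",
    }
    return Request(url, headers=headers)


# Placeholder host and token, for illustration only.
request = build_search_request("http://localhost:8000", "example-token", "electricity bill")
print(request.full_url)
# http://localhost:8000/api/documents/?query=electricity+bill&page=1
```

Passing the returned object to `urllib.request.urlopen` would send it; keeping construction separate from sending makes the request easy to inspect and test without a running server.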
Developer Experience
Paperless‑ngx provides comprehensive documentation covering architecture, API usage, and deployment scenarios. The codebase follows Django conventions, making it familiar to Python developers. Continuous integration (GitHub Actions) ensures high code quality, and the community is active on Matrix, GitHub Discussions, and translation platforms. The project encourages contributions via feature branches, PR reviews, and issue triage.
Use Cases
- Enterprise Document Archiving: Deploy on a private cloud to ingest, index, and retrieve internal documents with full-text search.
- Personal Knowledge Base: Run on a home server to keep scanned receipts, contracts, and PDFs searchable without cloud reliance.
- Compliance Audits: Store documents in PDF/A format for long‑term retention, with audit logs exposed via API.
- Automation Pipelines: Integrate the REST API into existing workflows (e.g., mail‑to‑PDF ingestion, RPA bots) to automatically tag and archive documents.
Advantages
- Open Source & Self‑Hosted: No vendor lock‑in; all data stays on your premises.
- Python/Django Ecosystem: Leverages mature libraries, making it easier to extend or replace components.
- High Performance OCR: Tesseract integration and asynchronous workers allow fast batch processing.
- Extensible Metadata Model: Custom fields and ML tagging enable sophisticated document classification.
- Community‑Driven: Active contributors, frequent releases, and robust documentation reduce the learning curve.