Papra

Self-Hosted

Minimalistic document archive for teams and individuals

Overview

Discover what makes Papra powerful

Papra is a lightweight, self‑hosted document management system for developers who need an extensible, API‑centric archive. At its core it exposes a RESTful service that accepts file uploads, extracts text via OCR when necessary, and indexes the resulting content for full‑text search. The platform separates its concerns cleanly: a **Go** service handles business logic, an **Elasticsearch** backend handles search, and a **PostgreSQL** relational store holds metadata. Each component can be tuned and scaled on its own, and the codebase stays approachable.

Architecture

Backend – A statically compiled Go server (≈ 5 MB) that implements the Papra API, authentication via JWT, and a CLI for bulk operations. The server is container‑friendly, exposing health checks and Prometheus metrics.
Search – Elasticsearch 8.x provides full‑text search and faceted filtering. The Go service talks to it over its REST API and relies on Elasticsearch's built‑in language analyzers for multilingual support.
Persistence – PostgreSQL 15 stores user accounts, organization data, document metadata, and tagging rules. The schema is intentionally flat to simplify migrations; the Go sqlx package handles query execution.
Storage – Files are persisted in an object store that can be swapped out via a simple interface. By default, the Docker deployment mounts a local directory (/data/files) but supports S3-compatible backends with minimal configuration.
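
The swappable storage layer described above can be pictured with a small sketch. It is written in Python for brevity rather than in the service's Go, and the names `BlobStore`, `LocalDirStore`, `InMemoryStore`, and `archive` are hypothetical illustrations, not Papra's actual interface.

```python
from pathlib import Path
from typing import Protocol


class BlobStore(Protocol):
    """Hypothetical storage contract; not Papra's actual interface."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class LocalDirStore:
    """Stores each object as a file under a root directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root).resolve()

    def _path(self, key: str) -> Path:
        # Resolve the key under the root and refuse keys that escape it.
        path = (self.root / key).resolve()
        if not path.is_relative_to(self.root):
            raise ValueError(f"key escapes storage root: {key!r}")
        return path

    def put(self, key: str, data: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return self._path(key).read_bytes()


class InMemoryStore:
    """Stand-in for a remote backend such as an S3-compatible bucket."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def get(self, key: str) -> bytes:
        return self.objects[key]


def archive(store: BlobStore, key: str, data: bytes) -> bytes:
    """Calling code depends only on the BlobStore contract."""
    store.put(key, data)
    return store.get(key)
```

Because `archive` only relies on `put` and `get`, switching from a local directory to another backend would be a matter of passing a different store object.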

Core Capabilities

API – CRUD operations for documents, tags, and organizations. Endpoints accept multipart/form-data for uploads and return JSON payloads with document IDs, extraction status, and search vectors.
Webhooks – Developers can register callbacks for events such as document_uploaded or tag_updated. The webhook payload is signed with HMAC to ensure authenticity.
SDK – A TypeScript SDK (available on npm) wraps the HTTP client, providing typed interfaces for all endpoints and automatic retry logic.
CLI – The papra-cli tool allows scripted ingestion, bulk tagging, and export of metadata to CSV or JSON. It reads environment variables for authentication (PAPRA_API_KEY).
Tagging Rules – Rule definitions are stored as JSON in PostgreSQL and evaluated server‑side using a lightweight expression language (go-expression). This lets developers enforce business logic without modifying the core code.
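
On the receiving side, a webhook consumer would typically recompute the HMAC over the raw request body and compare it with the signature sent alongside the payload. The sketch below assumes HMAC-SHA256 with a hex-encoded digest; the text above does not specify the hash algorithm, the encoding, or the header that carries the signature, so treat all three as assumptions.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (algorithm and encoding are assumed)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Check a received signature using a constant-time comparison."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), signature.encode())
```

The comparison uses `hmac.compare_digest` rather than `==` so that the check does not leak timing information, and the signature is computed over the raw bytes as received, before any JSON parsing.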

Deployment & Infrastructure

Papra ships as a single Docker Compose stack that pulls images from GitHub Container Registry. The stack comprises:

papra-api – the Go service
elasticsearch – a single-node cluster (configurable to scale horizontally)
postgres – the relational database
minio – optional S3‑compatible storage (or bind mount to host)

The stack is production‑ready: each service exposes health endpoints, supports environment‑based configuration (via .env), and can be run behind a reverse proxy such as Traefik or Nginx. For high availability, developers can run a multi‑node Elasticsearch cluster and replicate the PostgreSQL database with Patroni.
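
Since each service exposes a health endpoint, a deployment script can wait for the whole stack before sending traffic to it. The following is a minimal sketch of such a readiness check; the function names are hypothetical, and the health URLs must come from your own deployment because the paths are not given here.

```python
import time
import urllib.request
from typing import Callable, Iterable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError):
        # Connection errors, HTTP error statuses, and malformed URLs.
        return False


def wait_until_healthy(
    urls: Iterable[str],
    probe: Callable[[str], bool] = http_ok,
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Poll until every URL is healthy; return the URLs still failing."""
    pending = list(urls)
    for attempt in range(attempts):
        # Keep only the services that still fail their probe.
        pending = [url for url in pending if not probe(url)]
        if not pending:
            return []
        if attempt < attempts - 1:
            sleep(delay)
    return pending
```

An empty return value means every service answered; a non-empty list names the services that never became healthy within the allotted attempts.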

Integration & Extensibility

Email Ingestion – Papra can generate a unique address per organization; forwarding an email to this address triggers automatic parsing of attachments and inline images.
Folder Ingestion – A background worker watches a mounted directory (or an SFTP endpoint) and ingests any new files, applying tagging rules automatically.
Custom Plugins – The API exposes a hook for “pre‑process” and “post‑process” stages. Developers can write small Go plugins that register with the service, enabling custom OCR engines or third‑party indexing services.
Webhooks & Event Bus – Besides HTTP callbacks, Papra emits events to a Kafka topic (papra-events) for downstream processing. This is ideal for integrating with CI/CD pipelines or analytics dashboards.
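
The folder ingestion described above boils down to repeatedly scanning a directory and handing unseen files to an ingestion step. The sketch below shows one polling pass under simple assumptions: files are identified by name, dotfiles are skipped, and `ingest` is a hypothetical callback standing in for the real upload and rule-tagging step.

```python
from pathlib import Path
from typing import Callable


def scan_once(
    inbox: Path,
    seen: set[str],
    ingest: Callable[[Path], None],
) -> list[str]:
    """Ingest files not seen before; return the names handled this pass."""
    handled = []
    for path in sorted(inbox.iterdir()):
        # Skip directories, dotfiles, and anything already ingested.
        if not path.is_file() or path.name.startswith(".") or path.name in seen:
            continue
        ingest(path)
        # Mark as seen only after ingest returns without raising.
        seen.add(path.name)
        handled.append(path.name)
    return handled
```

A worker would call `scan_once` in a loop with a sleep between passes. A more careful version would also wait until a file has stopped growing before ingesting it, so that partially copied files are not picked up.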

Developer Experience

The project emphasizes clear documentation: the official docs contain a dedicated Self‑hosting guide, an API reference with example payloads, and a Swagger UI for interactive testing. The codebase follows idiomatic Go conventions, making it straightforward to contribute or fork. Community support is active on Discord and GitHub Discussions, where feature requests are tracked in an open roadmap. Licensing under MIT ensures there are no commercial restrictions.

Use Cases

Enterprise Archive – Companies can host Papra on-premises to comply with data residency regulations while still leveraging powerful search.
DevOps Automation – CI pipelines can upload build artifacts, logs, or test reports to Papra via the CLI and then query them programmatically.
Legal & Compliance – Law firms can ingest contracts, apply automatic tagging rules (e.g., client_id, contract_type), and retrieve documents via API for case management.
Personal Knowledge Base – Hobbyists can set up a local instance to store receipts, warranties, and PDFs, using the webhooks to sync with cloud services.
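
To make the tagging rules in the Legal & Compliance case more concrete, here is a sketch of how JSON rule definitions could be evaluated against document metadata. The rule schema (`tag`, `all`, `field`, `op`, `value`) and the operator names are hypothetical and much simpler than a real expression language; they are not Papra's rule format.

```python
import json

# Hypothetical operators; string comparisons ignore case.
OPS = {
    "equals": lambda actual, expected: actual == expected,
    "contains": lambda actual, expected: expected.lower() in str(actual).lower(),
    "startswith": lambda actual, expected: str(actual).lower().startswith(expected.lower()),
}


def matches(rule: dict, doc: dict) -> bool:
    """A rule matches when every condition holds for the document."""
    return all(
        cond["field"] in doc and OPS[cond["op"]](doc[cond["field"]], cond["value"])
        for cond in rule["all"]
    )


def apply_rules(rules_json: str, doc: dict) -> list[str]:
    """Return the sorted, de-duplicated tags of all matching rules."""
    rules = json.loads(rules_json)
    return sorted({rule["tag"] for rule in rules if matches(rule, doc)})


RULES = json.dumps([
    {"tag": "contract_type:nda",
     "all": [{"field": "content", "op": "contains", "value": "non-disclosure"}]},
    {"tag": "client_id:acme",
     "all": [{"field": "filename", "op": "startswith", "value": "acme_"},
             {"field": "mime", "op": "equals", "value": "application/pdf"}]},
])
```

With these rules, a PDF named `acme_nda.pdf` whose extracted text mentions a non-disclosure agreement would receive both tags, while a document that satisfies only one condition of the second rule would not receive `client_id:acme`.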

Advantages

Papra’s minimalistic design reduces attack surface and resource consumption, which suits constrained environments (e.g., a Raspberry Pi or a small VPS). Because it is built from standard, open‑source components (Go, PostgreSQL, Elasticsearch), it stays portable and its performance characteristics are well understood. The API‑first approach gives developers full control over integration patterns, while the built‑in ingestion mechanisms lower operational overhead. Compared with monolithic commercial DMS solutions, Papra offers a comparable feature set with zero licensing fees.