Overview
sist2 is a lightweight, self‑hosted search engine designed to index and query large collections of files on disk. It operates as a two‑tier system: a **scanning engine** that extracts text, metadata, and thumbnails from files, and a **search backend** that stores the extracted data for fast retrieval. Scanning is incremental: only new or modified files are re‑processed, which keeps the update cycle short even for terabyte‑scale repositories. The application is written in Python 3 and relies on several open‑source libraries, including `tesseract-ocr` for optical character recognition, `python-docx`, and `pdfminer.six`, to support a wide range of document formats. The web interface is built with Flask and provides a responsive UI that works in mobile and desktop browsers.
Architecture
- Language & Frameworks: Python 3 (≥3.9) for core logic; Flask for the admin and web UI, with Jinja2 templates.
- Search Backends: Dual‑mode support for Elasticsearch (v7.x+ recommended) and a lightweight SQLite index. The Elasticsearch integration uses the official `elasticsearch` Python client, while the SQLite mode employs a custom lightweight schema that stores tokens and metadata in a single file.
- Containerization: Official Docker images are available for both the admin tool (`sist2-admin`) and the web server. Docker Compose examples show how to spin up Elasticsearch, persist data volumes, and expose ports securely.
- Data Flow:
  - `sist2 scan` reads a directory tree, extracts text/metadata, and writes a `.sist2` binary file.
  - `sist2 index` (or `sqlite-index`) consumes the binary and pushes documents into Elasticsearch or SQLite.
  - The web UI queries the backend via REST endpoints, returning JSON to the browser where client‑side NER (spaCy or similar) can annotate results.
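To make the single-file SQLite mode more concrete, here is a minimal sketch of a token-and-metadata index built with Python's standard `sqlite3` module. The table names, column names, and the helper functions `open_index`, `add_document`, and `search` are all hypothetical, chosen only for illustration; they are not sist2's actual schema or API.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not sist2's real layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS document (
    id    INTEGER PRIMARY KEY,
    path  TEXT NOT NULL UNIQUE,
    mime  TEXT,
    mtime INTEGER
);
CREATE TABLE IF NOT EXISTS token (
    term   TEXT NOT NULL,
    doc_id INTEGER NOT NULL REFERENCES document(id)
);
CREATE INDEX IF NOT EXISTS token_term ON token(term);
"""


def open_index(db_path=":memory:"):
    """Open (or create) the index; everything lives in one SQLite file."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def add_document(conn, path, mime, mtime, text):
    """Store one document's metadata and its distinct lowercase tokens."""
    cur = conn.execute(
        "INSERT INTO document (path, mime, mtime) VALUES (?, ?, ?)",
        (path, mime, mtime),
    )
    doc_id = cur.lastrowid
    terms = {word.lower() for word in text.split()}
    conn.executemany(
        "INSERT INTO token (term, doc_id) VALUES (?, ?)",
        [(term, doc_id) for term in terms],
    )
    conn.commit()
    return doc_id


def search(conn, term):
    """Return the paths of documents containing the exact term."""
    rows = conn.execute(
        "SELECT d.path FROM token t JOIN document d ON d.id = t.doc_id "
        "WHERE t.term = ? ORDER BY d.path",
        (term.lower(),),
    )
    return [row[0] for row in rows]
```

Passing a file path instead of `":memory:"` to `open_index` keeps the whole index in one file on disk, which is the property the SQLite mode is described as offering.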
Core Capabilities
- Incremental Scanning: Uses file modification timestamps and checksums to skip unchanged files, reducing CPU usage during subsequent scans.
- Archive Extraction: Recursively processes ZIP, TAR, and other archive formats, treating contained files as part of the index.
- OCR & Text Extraction: Integrates Tesseract for image‑based PDFs and scanned documents, exposing raw OCR text in the index.
- Metadata & Thumbnails: Generates thumbnails for images and PDFs; stores EXIF, creation dates, tags, and custom attributes.
- Tagging & Scripting: UI‑based manual tagging plus a scripting API (`docs/scripting.md`) that lets developers write Python scripts to auto‑tag files based on content or file attributes.
- Client‑Side NER: Optional JavaScript module that runs named‑entity recognition on search results in the browser, reducing server load.
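The incremental scanning idea (skip files whose modification timestamp and checksum are unchanged) can be sketched as follows. This is an illustrative approximation under stated assumptions, not sist2's actual implementation: the function names, the SHA-256 checksum, and the in-memory state dictionary are all hypothetical choices.

```python
import hashlib
import os


def file_checksum(path, chunk_size=65536):
    """Hash a file in chunks so large files are never fully loaded in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_process(root, previous_state):
    """Walk `root` and return (changed_paths, new_state).

    `previous_state` maps path -> (mtime_ns, checksum) from the last scan.
    A file is re-processed only if it is new or its content hash differs.
    """
    changed = []
    new_state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            mtime = os.stat(path).st_mtime_ns
            old = previous_state.get(path)
            if old is not None and old[0] == mtime:
                # Timestamp unchanged: reuse the stored entry, skip hashing.
                new_state[path] = old
                continue
            checksum = file_checksum(path)
            new_state[path] = (mtime, checksum)
            if old is None or old[1] != checksum:
                changed.append(path)
    return changed, new_state
```

Checking the timestamp first keeps repeat scans cheap, because a file is hashed only when its timestamp has moved; the checksum then filters out files that were merely touched without a content change.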
Deployment & Infrastructure
- Self‑Hosting: Requires a machine with at least 4 GB RAM for the scanning engine and 2–3 GB for Elasticsearch. The Docker images expose ports `4090` (admin) and `8080` (web), but the admin port should be kept internal.
- Scalability: Elasticsearch clusters can be scaled horizontally; sist2 itself is stateless and can run behind a load balancer. The scanning component can be distributed by running multiple instances on different machines and merging their `.sist2` outputs.
- Resource Efficiency: The scanning process is multi‑threaded but keeps memory usage low (≈200 MB per worker), making it suitable for servers with limited RAM.
- Persistence: All index data is stored on disk; Elasticsearch data directories and the SQLite file can be backed up with standard filesystem tools.
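As a rough sizing aid, the ≈200 MB-per-worker figure can be turned into a thread-count estimate. The helper below is a hypothetical back-of-the-envelope calculation, not a sist2 setting: the function name, the `reserved_mb` allowance for other services, and the CPU cap are all assumptions.

```python
import os


def max_scan_workers(available_mb, per_worker_mb=200, reserved_mb=0, cpu_limit=None):
    """Estimate how many scan workers fit in a memory budget.

    per_worker_mb defaults to the ~200 MB figure quoted above;
    reserved_mb is memory set aside for other services (e.g. Elasticsearch).
    The result is capped by the CPU count and is never lower than 1.
    """
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    usable_mb = max(available_mb - reserved_mb, 0)
    by_memory = usable_mb // per_worker_mb
    by_cpu = cpu_limit if cpu_limit is not None else (os.cpu_count() or 1)
    return max(1, min(by_memory, by_cpu))
```

For example, with 4096 MB available and 3072 MB reserved for Elasticsearch, 1024 MB remains, which fits five workers at 200 MB each, provided the machine has at least five cores.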
Integration & Extensibility
- API: The web server exposes a RESTful API for querying, tagging, and job management. Developers can write custom clients or integrate sist2 into existing CI/CD pipelines.
- Webhooks: The admin UI can trigger webhooks on job completion, allowing downstream services (e.g., notification systems) to react.
- Plugins: A plugin architecture is planned for future releases, allowing third‑party modules to register new file parsers or custom search filters.
- Customization: Configuration is driven by environment variables and YAML files; the UI allows adjusting scan intervals, thread counts, and backend URLs without code changes.
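A custom client for the REST API might look like the sketch below, which uses only the Python standard library. The endpoint path `/api/search`, the JSON field names `query` and `size`, and both function names are placeholders invented for illustration; consult the project's own documentation for the real routes and payloads.

```python
import json
import urllib.parse
import urllib.request


def build_search_request(base_url, query, size=10):
    """Build (but do not send) a JSON POST request for a search query.

    "/api/search" is a placeholder path, not a documented sist2 endpoint.
    """
    url = urllib.parse.urljoin(base_url.rstrip("/") + "/", "api/search")
    body = json.dumps({"query": query, "size": size}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(request, timeout=10):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending keeps the client easy to test offline, and lets a CI/CD job log or inspect the request before it is dispatched.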
Developer Experience
- Documentation: The repository includes comprehensive docs (`docs/`), a usage guide, and inline code comments. API endpoints are self‑describing in the UI.
- Community: An active Discord channel (https://discord.gg/2PEjDy3Rfs) provides real‑time support and feature discussions.
- Licensing: Open‑source under the MIT license, giving developers full freedom to modify and redistribute.
- Testing: The repository carries a CodeFactor code‑quality badge and runs continuous integration, which helps catch regressions before release.
Use Cases
- Enterprise File Search: Index corporate document repositories (PDFs, Word, spreadsheets) for internal search portals.
- Digital Asset Management: Quickly find images or media files by metadata, tags, or OCR content.
- Compliance Audits: Scan and index logs, reports, and audit trails for keyword search during investigations.
- Personal Knowledge Bases: Build a local