sist2

Self-Hosted

Fast, incremental file search with web UI and OCR support

1.2k stars
Updated Jul 5, 2025

Overview


sist2 is a lightweight, self-hosted search engine that indexes and queries large collections of files on disk. It works in two stages: a scanning engine extracts text, metadata, and thumbnails from files, and a search backend stores the extracted data for fast retrieval. Scanning is incremental: only new or modified files are re-processed, which keeps the update cycle short even for terabyte-scale repositories. The scanner and web server are written in C and build on open-source libraries such as Tesseract for optical character recognition, MuPDF for PDF and e-book parsing, libarchive for archives, and FFmpeg for media files, which lets sist2 support a wide range of document formats. The web interface is a Vue.js front end that works in mobile and desktop browsers.

Architecture

  • Language & Frameworks: C for the scanner, indexer, and web server; Vue.js for the search UI. The optional sist2-admin tool is written in Python.
  • Search Backends: Dual-mode support for Elasticsearch (v7.x+ recommended) and a lightweight SQLite index. The Elasticsearch backend is reached over its HTTP API, while the SQLite mode stores a full-text (FTS5) index and metadata in a single file.
  • Containerization: Official Docker images are available for both the admin tool (sist2-admin) and the web server. Docker Compose examples show how to spin up Elasticsearch, persist data volumes, and expose ports securely.
  • Data Flow:
    1. sist2 scan reads a directory tree, extracts text/metadata, and writes a .sist2 binary file.
    2. sist2 index (or sist2 sqlite-index) consumes that file and pushes documents into Elasticsearch or SQLite.
    3. The web UI queries the backend via REST endpoints, returning JSON to the browser, where the optional client-side NER module can annotate results.
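
The three data-flow steps above can be sketched as a small Python helper that only builds the command lines. The subcommands (scan, index, sqlite-index, web) come from the text above; the flags --output and --search-index, the file names, and the helper build_pipeline are assumptions for illustration and should be checked against sist2 --help before use.

```python
from typing import List


def build_pipeline(root: str, index_file: str, backend: str = "elasticsearch",
                   search_index: str = "search.sist2") -> List[List[str]]:
    """Return argv lists for scan -> index -> web without running anything.

    Flag names are assumptions; verify them against `sist2 --help`.
    """
    if backend not in ("elasticsearch", "sqlite"):
        raise ValueError(f"unknown backend: {backend}")

    # Step 1: scan the directory tree into a single index file.
    scan = ["sist2", "scan", root, "--output", index_file]

    if backend == "elasticsearch":
        # Step 2: push the scanned documents into Elasticsearch.
        index = ["sist2", "index", index_file]
        # Step 3: serve the web UI for that index.
        web = ["sist2", "web", index_file]
    else:
        # Step 2: build a SQLite search index instead.
        index = ["sist2", "sqlite-index", index_file,
                 "--search-index", search_index]
        # Step 3: serve the web UI from the SQLite search index.
        web = ["sist2", "web", "--search-index", search_index, index_file]

    return [scan, index, web]
```

To actually run the pipeline, each list could be passed to subprocess.run(argv, check=True) in order; keeping construction separate from execution makes the commands easy to inspect first.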

Core Capabilities

  • Incremental Scanning: Uses file modification timestamps and checksums to skip unchanged files, reducing CPU usage during subsequent scans.
  • Archive Extraction: Recursively processes ZIP, TAR, and other archive formats, treating contained files as part of the index.
  • OCR & Text Extraction: Integrates Tesseract for image‑based PDFs and scanned documents, exposing raw OCR text in the index.
  • Metadata & Thumbnails: Generates thumbnails for images and PDFs; stores EXIF, creation dates, tags, and custom attributes.
  • Tagging & Scripting: UI‑based manual tagging plus a scripting API (docs/scripting.md) that lets developers write Python scripts to auto‑tag files based on content or file attributes.
  • Client‑Side NER: Optional JavaScript module that runs named‑entity recognition on search results in the browser, reducing server load.
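
To illustrate the Incremental Scanning idea, here is a minimal Python sketch of skipping unchanged files by comparing modification time and size first and falling back to a checksum. It is not sist2's actual implementation; the manifest layout, the choice of SHA-256, and the function names are assumptions made for the example.

```python
import hashlib
import os
from typing import Dict, List, Tuple

# Hypothetical manifest: path -> (mtime in ns, size in bytes, sha256 hex digest)
Manifest = Dict[str, Tuple[int, int, str]]


def sha256_of(path: str) -> str:
    """Hash a file in 1 MiB chunks so large files are not loaded into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def files_to_rescan(root: str, previous: Manifest) -> Tuple[List[str], Manifest]:
    """Return (paths needing re-processing, updated manifest)."""
    changed: List[str] = []
    current: Manifest = {}
    for dirpath, _dirs, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            old = previous.get(path)
            if old is not None and old[0] == st.st_mtime_ns and old[1] == st.st_size:
                # Cheap check passed: treat as unchanged and skip hashing.
                current[path] = old
                continue
            digest = sha256_of(path)
            current[path] = (st.st_mtime_ns, st.st_size, digest)
            # New file, or content really differs from the last scan.
            if old is None or old[2] != digest:
                changed.append(path)
    return changed, current
```

The first call with an empty manifest reports every file; later calls report only files whose content changed. A file that was merely touched is re-hashed but not reported, which is where the checksum saves extraction work.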

Deployment & Infrastructure

  • Self‑Hosting: Requires a machine with at least 4 GB RAM for the scanning engine and 2–3 GB for Elasticsearch. The Docker image exposes ports 4090 (web search UI) and 8080 (sist2-admin); the admin port should be kept internal.
  • Scalability: Elasticsearch clusters can be scaled horizontally; sist2 itself is stateless and can run behind a load balancer. The scanning component can be distributed by running multiple instances on different machines and merging their .sist2 outputs.
  • Resource Efficiency: The scanning process is multi‑threaded but keeps memory usage low (≈200 MB per worker), making it suitable for servers with limited RAM.
  • Persistence: All index data is stored on disk; Elasticsearch data directories and the SQLite file can be backed up with standard filesystem tools.
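
As a rough worked example of the memory figures above, the following Python sketch estimates how many scan workers fit in a RAM budget. The 200 MB per-worker default is taken from the text; the helper name and the idea of capping at the CPU count are assumptions, not sist2 behavior.

```python
import os
from typing import Optional


def max_scan_workers(available_mb: int, per_worker_mb: int = 200,
                     cpu_count: Optional[int] = None) -> int:
    """Estimate a worker count from free memory, capped by CPU cores."""
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    by_memory = available_mb // per_worker_mb
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    # Always allow at least one worker, even on a tight budget.
    return max(1, min(by_memory, cpus))
```

For instance, a 4 GB host that reserves 3 GB for Elasticsearch leaves about 1024 MB, so max_scan_workers(1024, cpu_count=8) gives 5 workers.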

Integration & Extensibility

  • API: The web server exposes a RESTful API for querying, tagging, and job management. Developers can write custom clients or integrate sist2 into existing CI/CD pipelines.
  • Webhooks: The admin UI can trigger webhooks on job completion, allowing downstream services (e.g., notification systems) to react.
  • Plugins: A plugin architecture is planned for future releases, in which third-party modules can register new file parsers or custom search filters.
  • Customization: Configuration is driven by environment variables and YAML files; the UI allows adjusting scan intervals, thread counts, and backend URLs without code changes.
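
A custom client for the query API might start from something like the Python sketch below, which only constructs the HTTP request. The endpoint path /es, the content field name, and the function name are hypothetical placeholders; the request body uses the standard Elasticsearch match query, which applies only when the Elasticsearch backend is in use. Check the project's API documentation for the real routes.

```python
import json
import urllib.request


def build_search_request(base_url: str, text: str,
                         size: int = 10) -> urllib.request.Request:
    """Build (but do not send) a full-text search request.

    The "/es" path and the "content" field are assumptions for illustration.
    """
    body = {"query": {"match": {"content": text}}, "size": size}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/es",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it would be a matter of calling urllib.request.urlopen(req) and decoding the JSON response; separating construction from sending keeps the sketch testable without a running server.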

Developer Experience

  • Documentation: The repository includes comprehensive docs (docs/), a usage guide, and inline code comments. API endpoints are self‑describing in the UI.
  • Community: An active Discord channel (https://discord.gg/2PEjDy3Rfs) provides real‑time support and feature discussions.
  • Licensing: Open-source under the GPL-3.0 license, which allows modification and redistribution under the same copyleft terms.
  • Testing: The repository displays a CodeFactor code-quality badge.

Use Cases

  • Enterprise File Search: Index corporate document repositories (PDFs, Word, spreadsheets) for internal search portals.
  • Digital Asset Management: Quickly find images or media files by metadata, tags, or OCR content.
  • Compliance Audits: Scan and index logs, reports, and audit trails for keyword search during investigations.
  • Personal Knowledge Bases: Build a local

Information

Category: other
License: GPL-3.0
Stars: 1.2k

Technical Specs

Pricing: Open Source
Database: Multiple
Docker: Official
Min RAM: 1 GB
Supported OS: Linux, Windows, macOS, Docker
Author: sist2app
Last Updated: Jul 5, 2025