Portuguese Legal PDF Metadata Extractor

MCP Server

Accurate metadata extraction from Portuguese legal PDFs

Updated Jun 2, 2025

About

A production‑ready Python tool that parses ECLI‑formatted Portuguese legal documents, extracting structured metadata with high confidence and robust error handling. It supports single or batch processing via CLI or API.

Capabilities

Resources: Access data sources
Tools: Execute functions
Prompts: Pre-built templates
Sampling: AI model interactions

Overview

The Portuguese Legal Document PDF Metadata Extractor is a dedicated MCP server that turns unstructured PDFs of Portuguese legal documents, especially those following the European Case Law Identifier (ECLI) format, into clean, structured metadata. It exposes a simple AI-friendly API through which Claude and other assistants can retrieve key document attributes such as case number, court name, decision date, and parties involved, without manual parsing or custom OCR pipelines. This is especially valuable in legal tech workflows where accurate, machine-readable data drives downstream analytics, compliance checks, or case management systems.
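Because the server targets ECLI-formatted documents, a short sketch may help show what the identifier itself carries. An ECLI has five colon-separated parts: the literal `ECLI`, a country code, a court code, the decision year, and an ordinal number. The pattern below is a simplified check and not the server's own parser; the function name `parse_ecli` and the sample identifier are hypothetical.

```python
import re
from typing import Optional

# Simplified ECLI layout: ECLI:<country>:<court>:<year>:<ordinal>
# The ordinal may contain letters, digits and dots (at most 25 characters).
ECLI_RE = re.compile(
    r"^ECLI:(?P<country>[A-Z]{2}):(?P<court>[A-Z0-9]{1,7})"
    r":(?P<year>\d{4}):(?P<ordinal>[A-Z0-9.]{1,25})$",
    re.IGNORECASE,
)


def parse_ecli(identifier: str) -> Optional[dict]:
    """Split an ECLI string into its parts, or return None if it does not match."""
    match = ECLI_RE.match(identifier.strip())
    if match is None:
        return None
    parts = {key: value.upper() for key, value in match.groupdict().items()}
    parts["year"] = int(parts["year"])
    return parts


# Made-up identifier, used only for illustration.
example = parse_ecli("ECLI:PT:STJ:2019:1234.17.5T8LSB.S1")
```

For the sample above, `example["country"]` is `"PT"` and `example["court"]` is `"STJ"`; a string that does not follow the layout yields `None`.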

At its core, the server implements two extraction engines. The robust extractor performs low‑level pattern recognition on PDF layouts, leveraging fixed relative positions, synchronized column pairs, and predictable field ordering to isolate metadata tables. The production extractor wraps this engine with a user‑friendly interface, progress reporting, and optional ground‑truth validation. Developers can invoke the extractor via a simple function call or through an integrated CLI, making it suitable for both scripted batch processing and interactive use within AI assistants. The extraction logic is tuned to Portuguese legal conventions, ensuring that field names and values are interpreted correctly even when documents vary slightly in formatting.
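The layering described above can be sketched roughly as follows, assuming the PDF text has already been split into a label column and a value column. All names here (`pair_columns`, `extract_batch`, the `Processo` and `Tribunal` labels) are hypothetical and do not reflect the server's actual API; the sketch only illustrates a low-level pairing step wrapped by a layer that adds progress reporting and optional ground-truth comparison.

```python
from typing import Callable, Dict, List, Optional, Tuple


def pair_columns(labels: List[str], values: List[str]) -> Dict[str, str]:
    """Low-level step: pair a label column with its value column, row by row."""
    if len(labels) != len(values):
        raise ValueError("label and value columns are out of sync")
    return {
        label.strip().rstrip(":"): value.strip()
        for label, value in zip(labels, values)
    }


def extract_batch(
    documents: Dict[str, Tuple[List[str], List[str]]],
    on_progress: Optional[Callable[[int, int], None]] = None,
    ground_truth: Optional[Dict[str, Dict[str, str]]] = None,
) -> Dict[str, dict]:
    """Wrapper: run the low-level step per document, report progress, validate."""
    results: Dict[str, dict] = {}
    total = len(documents)
    for done, (name, (labels, values)) in enumerate(documents.items(), start=1):
        metadata = pair_columns(labels, values)
        result: dict = {"metadata": metadata}
        if ground_truth is not None and name in ground_truth:
            expected = ground_truth[name]
            # Record every field whose extracted value differs from the reference.
            result["mismatches"] = sorted(
                field for field in expected if metadata.get(field) != expected[field]
            )
        results[name] = result
        if on_progress is not None:
            on_progress(done, total)
    return results


# Hypothetical input: one document with two label/value rows.
docs = {"a.pdf": (["Processo:", "Tribunal:"], ["123/20", " STJ "])}
progress: List[Tuple[int, int]] = []
report = extract_batch(
    docs,
    on_progress=lambda done, total: progress.append((done, total)),
    ground_truth={"a.pdf": {"Processo": "123/20", "Tribunal": "TRL"}},
)
```

Keeping the pairing step free of progress and validation concerns means it can be tested on its own, while the wrapper stays a thin loop over documents.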

Key capabilities include:

High accuracy: Reported confidence scores of 100% and an exact-match rate of 96.84% on benchmarked documents. Confidence is scored heuristically, and accuracy can be checked through optional ground-truth comparison.
Robust error handling: Validation routines detect missing or malformed fields and classify each one as either legitimately empty or truly absent, providing clear feedback to the calling assistant.
Flexible deployment: The server can be launched as a lightweight Python service or integrated directly into an MCP workflow, exposing resources for metadata extraction, progress updates, and error reports.
Performance: A typical processing time of 2–3 seconds per document is fast enough for interactive AI sessions and practical for high-volume batch jobs.
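Two of the ideas in this list, the empty-versus-absent distinction and the exact-match rate, can be illustrated with a minimal sketch. It assumes that a field whose label was found with a blank value counts as legitimately empty, that a field never found counts as absent, and that the match rate is computed per field against a reference record. The source does not specify either rule, and the function names are hypothetical.

```python
from typing import Dict, Optional


def classify_field(extracted: Dict[str, Optional[str]], field: str) -> str:
    """Return 'absent', 'empty' or 'present' for one expected field."""
    if field not in extracted or extracted[field] is None:
        return "absent"   # the field was never found in the document
    if extracted[field].strip() == "":
        return "empty"    # the field was found, but its value is blank
    return "present"


def exact_match_rate(
    extracted: Dict[str, Optional[str]], expected: Dict[str, str]
) -> float:
    """Percentage of reference fields whose extracted value matches exactly."""
    if not expected:
        return 0.0
    hits = sum(1 for field, value in expected.items() if extracted.get(field) == value)
    return 100.0 * hits / len(expected)
```

Under these assumptions, a caller can treat `"empty"` as a valid result and reserve error reporting for `"absent"` fields.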

Typical use cases span legal research platforms that need to index thousands of case PDFs, compliance monitoring tools that verify metadata against regulatory standards, and AI assistants that answer user queries about specific court decisions by quickly pulling structured data. By integrating this MCP server, developers can offload the tedious task of PDF parsing to a proven, high‑accuracy extractor, freeing their AI assistants to focus on higher‑level reasoning and user interaction.