About
A high‑performance MCP server that runs whisper.cpp locally on Apple Silicon, providing real‑time speech-to-text with speaker diarization and universal audio format support while keeping memory usage below 2 GB.
Capabilities

The Local Speech‑to‑Text MCP Server is a purpose‑built, high‑performance transcription engine that runs entirely on the user’s machine. By leveraging whisper.cpp and Apple Silicon’s Neural Engine, it delivers real‑time audio transcription without the latency or privacy concerns of cloud APIs. Developers can integrate this server into AI workflows to provide instant, on‑device speech understanding for chatbots, voice assistants, or any application that needs reliable text output from audio input.
This server removes the common drawbacks of external transcription services: dependency on internet connectivity, data privacy risks, and unpredictable costs. It also addresses the need for speaker diarization—the ability to distinguish between multiple speakers in a single recording—by incorporating the pyannote speaker‑diarization model. The result is a comprehensive tool that can transcribe long audio files, automatically convert various media formats (MP3, M4A, FLAC, etc.) to the 16 kHz mono waveform required by whisper.cpp, and output results in multiple formats such as plain text, JSON, VTT, SRT, and CSV.
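The conversion step is easy to reproduce outside the server. The following is a minimal sketch, not the server's actual implementation, of how ffmpeg can resample and downmix an arbitrary media file into the 16 kHz mono 16‑bit PCM WAV that whisper.cpp expects; the file paths are placeholders.

```python
import subprocess
from pathlib import Path

def to_whisper_wav(src: str, dst: str = "out.wav") -> Path:
    """Convert any ffmpeg-readable media file to 16 kHz mono 16-bit PCM WAV."""
    subprocess.run(
        [
            "ffmpeg",
            "-y",                 # overwrite the output file if it exists
            "-i", src,            # input in any common format (MP3, M4A, FLAC, ...)
            "-ar", "16000",       # resample to 16 kHz
            "-ac", "1",           # downmix to mono
            "-c:a", "pcm_s16le",  # 16-bit PCM, the sample layout whisper.cpp expects
            dst,
        ],
        check=True,
    )
    return Path(dst)

# Example (placeholder file names):
# to_whisper_wav("meeting.m4a", "meeting.wav")
```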
Key capabilities include:
- 100% local processing for end‑to‑end privacy and zero external dependencies after initial setup.
- Apple Silicon optimization that achieves over 15× real‑time speed, outperforming many GPU‑based solutions while keeping memory usage below 2 GB.
- Automatic audio format detection and conversion powered by ffmpeg, allowing developers to accept any common media file without manual preprocessing.
- Speaker diarization that tags each utterance with speaker identifiers, essential for meeting transcripts, podcast editing, or multi‑party conversational AI.
- Multiple output formats that fit diverse downstream needs, from simple text for search indexing to VTT/SRT for captioning services (see the sketch below).
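To illustrate the last two points, here is a minimal sketch, independent of the server's internals, that renders speaker‑tagged transcription segments as SRT captions. The segment fields (start, end, speaker, text) are an assumed structure for illustration, not the server's actual output schema.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    speaker: str   # e.g. "SPEAKER_00" from diarization (assumed label style)
    text: str

def _timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments: list[Segment]) -> str:
    """Render speaker-tagged segments as an SRT caption file."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{_timestamp(seg.start)} --> {_timestamp(seg.end)}\n"
            f"[{seg.speaker}] {seg.text}\n"
        )
    return "\n".join(blocks)

print(to_srt([Segment(0.0, 2.5, "SPEAKER_00", "Welcome to the meeting.")]))
```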
In real‑world scenarios, the server is invaluable for developers building voice‑enabled applications that must operate offline or in privacy‑sensitive environments, such as medical transcription tools, legal dictation software, or on‑device personal assistants. It also serves as a backbone for AI pipelines that require quick, accurate transcriptions before feeding the text to language models, summarization engines, or analytics modules. By exposing its functionality as MCP tools, the server can be seamlessly invoked from any MCP‑compatible client, making it a plug‑and‑play component in sophisticated AI ecosystems.
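As a rough illustration of what that integration can look like on the server side, the sketch below exposes a transcription tool with the official MCP Python SDK. The server name, tool signature, model path, and the call into whisper.cpp's command‑line binary are assumptions for illustration; they are not this project's actual interface.

```python
import subprocess
from mcp.server.fastmcp import FastMCP

# Hypothetical server name and tool signature, shown only to illustrate the MCP wiring.
mcp = FastMCP("local-speech-to-text")

@mcp.tool()
def transcribe_audio(audio_path: str, output_format: str = "txt") -> str:
    """Transcribe a local 16 kHz WAV file with whisper.cpp and return the result.

    output_format: one of "txt", "srt", "vtt", "csv", "json".
    """
    # whisper-cli is whisper.cpp's command-line front end; the model path is a placeholder.
    subprocess.run(
        [
            "whisper-cli",
            "-m", "models/ggml-base.en.bin",
            "-f", audio_path,
            f"--output-{output_format}",
        ],
        check=True,
    )
    # Assuming whisper.cpp's default behavior of writing output next to the input file.
    with open(f"{audio_path}.{output_format}", encoding="utf-8") as fh:
        return fh.read()

if __name__ == "__main__":
    mcp.run()  # serve over stdio so any MCP-compatible client can call the tool
```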