
ElevenLabs Scribe MCP Server

MCP Server

Real‑time ASR with context‑aware transcription


About

A FastAPI implementation of the Model Context Protocol for ElevenLabs' Scribe speech‑to‑text API, enabling real‑time and batch transcription with advanced context management, language detection, and event handling.

Capabilities

  • Resources: Access data sources
  • Tools: Execute functions
  • Prompts: Pre-built templates
  • Sampling: AI model interactions

ElevenLabs Scribe MCP Server

The ElevenLabs Scribe MCP Server brings the power of ElevenLabs' real‑time speech‑to‑text API into the Model Context Protocol ecosystem. By exposing a full MCP implementation, it allows AI assistants to manage transcription sessions as first‑class resources, maintaining context across multiple turns and enabling sophisticated dialogue flows that depend on live voice input.

Solving the Real‑Time Transcription Gap

Traditional speech‑to‑text solutions often treat audio as a static file, requiring separate upload steps and post‑processing. This server eliminates that friction by streaming audio directly from a microphone or other source over WebSocket, delivering incremental transcription results. Developers can therefore build assistants that listen to users in real time, adjust prompts on the fly, or trigger actions as soon as a keyword is detected—all while keeping the conversation context intact.
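
As a rough illustration of that flow, the client-side sketch below streams a local audio file to the server in small chunks and prints incremental results as they arrive. The endpoint path (/ws/transcribe) and the text/is_final fields in the responses are assumptions made for the example, not the server's documented API:

```python
import asyncio
import json

import websockets  # pip install websockets


async def stream_audio(path: str, url: str = "ws://localhost:8000/ws/transcribe"):
    """Stream an audio file in small chunks and print incremental transcripts."""
    async with websockets.connect(url) as ws:
        with open(path, "rb") as audio:
            # Send audio as a sequence of binary chunks rather than one large upload.
            while chunk := audio.read(4096):
                await ws.send(chunk)
                # Drain any transcription messages that have arrived so far.
                try:
                    while True:
                        msg = json.loads(await asyncio.wait_for(ws.recv(), timeout=0.01))
                        label = "final" if msg.get("is_final") else "partial"
                        print(f"[{label}] {msg.get('text')}")
                except asyncio.TimeoutError:
                    pass  # no new results yet; keep streaming audio
        # Give the server a moment to flush any remaining transcripts.
        try:
            while True:
                msg = json.loads(await asyncio.wait_for(ws.recv(), timeout=2.0))
                print(f"[{'final' if msg.get('is_final') else 'partial'}] {msg.get('text')}")
        except asyncio.TimeoutError:
            pass


if __name__ == "__main__":
    asyncio.run(stream_audio("sample.wav"))
```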

Core Capabilities

  • Bidirectional Streaming: Audio is sent to the server as a continuous stream of messages, and the server responds with partial or final transcription results without waiting for the entire recording to finish.
  • Context Management: Each session carries its own identifier, so the assistant can preserve user intent and previous utterances across multiple requests (the result payloads involved are sketched after this list).
  • Multi‑format Support: The server automatically converts common audio formats (WAV, MP3, OGG) into the format required by ElevenLabs, simplifying client code.
  • Language Detection & Confidence: Every transcription includes a language tag and confidence score, enabling downstream logic to handle multilingual scenarios or prompt for clarification.
  • Event Detection: The API can flag speech vs. non‑speech segments, useful for detecting pauses or background noise.
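
One way to picture the payloads behind these capabilities is the Pydantic sketch below. The field names (session_id, is_final, language, confidence, event) are illustrative guesses at the schema, not the server's actual types:

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel


class AudioEvent(str, Enum):
    """Illustrative flags for the event-detection capability."""
    SPEECH = "speech"
    SILENCE = "silence"
    NOISE = "noise"


class TranscriptionResult(BaseModel):
    """Hypothetical shape of a single transcription message returned to the client."""
    session_id: str                     # ties the result to an ongoing conversation
    text: str                           # partial or final transcript text
    is_final: bool = False              # False for incremental results, True once the segment settles
    language: Optional[str] = None      # detected language tag, e.g. "en" or "de"
    confidence: Optional[float] = None  # detection confidence in the 0.0-1.0 range
    event: AudioEvent = AudioEvent.SPEECH
```

Downstream logic can branch on is_final to decide when to act, and on language and confidence to route multilingual input or ask the user for clarification.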

Real‑World Use Cases

  • Interactive Voice Assistants: Embed the server in a chatbot that can respond to spoken commands while maintaining conversational context.
  • Live Captioning: Provide real‑time captions for webinars or video calls, with the ability to adjust language settings on demand.
  • Transcription‑Driven Workflows: Trigger automated tasks (e.g., creating meeting notes, updating CRM records) as soon as a specific phrase is spoken.
  • Multilingual Support: Detect the user’s language and route the request to the appropriate model or translate on the fly, all within a single MCP session.

Integration with AI Workflows

Because it follows the Model Context Protocol, the server can be invoked by any MCP‑compliant client. The protocol's message exchanges map cleanly onto a conversational state machine, allowing AI assistants to treat transcription as another tool in their arsenal. The WebSocket endpoint integrates seamlessly with event‑driven frameworks, while the REST endpoints provide a fallback for batch processing or health monitoring.
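
For the REST side, a client might look roughly like the following; the /health and /transcribe paths, the multipart upload, and the response fields are assumptions chosen for the sketch rather than confirmed endpoints:

```python
import requests  # pip install requests

BASE_URL = "http://localhost:8000"  # assumes a local Uvicorn instance on the default port


def check_health() -> bool:
    """Poll the assumed health endpoint before routing audio to the server."""
    return requests.get(f"{BASE_URL}/health", timeout=5).ok


def transcribe_file(path: str) -> dict:
    """Upload a complete recording for batch transcription via the assumed REST endpoint."""
    with open(path, "rb") as audio:
        resp = requests.post(f"{BASE_URL}/transcribe", files={"file": audio}, timeout=120)
    resp.raise_for_status()
    return resp.json()  # e.g. {"text": ..., "language": ..., "confidence": ...}


if __name__ == "__main__":
    if check_health():
        print(transcribe_file("meeting.wav"))
```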

Distinctive Advantages

  • Unified Protocol: No need to juggle separate HTTP and WebSocket APIs; everything is expressed through MCP messages.
  • Low Latency: By streaming audio and returning partial results, the server minimizes the time between speaking and seeing text.
  • Extensibility: The modular design (protocol, types, ElevenLabs implementation) makes it straightforward to swap in other ASR backends or add custom processing steps (see the sketch after this list).
  • Developer Friendly: Built on FastAPI and Uvicorn, it offers automatic OpenAPI docs and hot‑reload support, reducing the barrier to experimentation.
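
The extensibility point can be pictured as a small backend interface. The class and method names below are hypothetical and only meant to show how another ASR provider could sit behind the same MCP surface:

```python
from abc import ABC, abstractmethod
from typing import AsyncIterator


class ASRBackend(ABC):
    """Hypothetical interface the MCP layer could target instead of ElevenLabs directly."""

    @abstractmethod
    def transcribe_stream(self, audio_chunks: AsyncIterator[bytes]) -> AsyncIterator[dict]:
        """Consume raw audio chunks and yield partial/final transcription dicts."""
        ...

    @abstractmethod
    async def transcribe_file(self, audio: bytes) -> dict:
        """Return a single transcription result for a complete recording."""
        ...


class ElevenLabsBackend(ASRBackend):
    """Sketch of the default backend; a Whisper- or Vosk-based class could replace it."""

    async def transcribe_stream(self, audio_chunks):
        async for chunk in audio_chunks:
            # Forward each chunk to the ElevenLabs Scribe API and yield its responses here.
            yield {"text": "...", "is_final": False}

    async def transcribe_file(self, audio):
        return {"text": "...", "is_final": True}
```

Under this assumed layout, swapping in a different provider amounts to implementing these two methods and wiring the new class into the server's configuration.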

In summary, the ElevenLabs Scribe MCP Server equips AI assistants with robust, low‑latency speech transcription that is fully integrated into the MCP framework. Its real‑time streaming, context awareness, and rich feature set make it a compelling choice for developers building conversational applications that rely on voice input.