MCPSERV.CLUB
kristofferv98

MCP TTS Server

MCP Server

Unified Text‑to‑Speech for Local and Cloud Engines

5 stars · 1 view
Updated Sep 24, 2025

About

A Model Context Protocol server that offers a single API for multiple TTS engines, including local Kokoro and cloud OpenAI, with real‑time streaming, voice customization, speed control, and playback management.

Capabilities

  • Resources: access data sources
  • Tools: execute functions
  • Prompts: pre-built templates
  • Sampling: AI model interactions

MCP TTS Server Demo

Overview

The MCP TTS Server is a unified text‑to‑speech gateway that exposes both local and cloud‑based TTS engines through the Model Context Protocol. It solves the common pain point of having to manage multiple TTS back‑ends separately—whether you need low‑latency, offline synthesis with Kokoro or high‑quality, expressive voices from OpenAI. By presenting a single tool that accepts engine, speed, voice, and natural‑language instructions, the server lets AI assistants deliver spoken responses without any additional configuration.

Developers using Claude or other LLMs benefit from the server’s seamless MCP integration. The speech tool streams audio directly to the user, while a playback‑control helper gives fine‑grained control over playback, enabling interactive dialogues that can be interrupted or restarted on demand. The choice between a local engine (fast, no API key) and a cloud engine (high‑fidelity voices) gives teams flexibility to balance cost, latency, and voice quality.

Key capabilities include:

  • Multi‑engine support: Switch between Kokoro (local) and OpenAI (cloud) with a single function call.
  • Real‑time streaming: Audio is streamed as it is generated, reducing perceived wait times.
  • Voice customization: OpenAI TTS accepts natural‑language instructions, while Kokoro allows explicit voice selection.
  • Speed control: Playback speed can be tuned from 0.8× to 1.5×, useful for pacing or accessibility.
  • Playback management: Stop and clear the queue instantly, ensuring smooth conversational flow.
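The speed and playback points above can be sketched in a few lines. The clamp follows the 0.8–1.5 bounds stated in the list; the queue class is a hypothetical stand-in for the server's playback manager, not its real implementation.

```python
import queue

SPEED_MIN, SPEED_MAX = 0.8, 1.5  # bounds from the capability list above

def clamp_speed(speed: float) -> float:
    """Keep a requested playback speed inside the supported range."""
    return max(SPEED_MIN, min(SPEED_MAX, speed))

class PlaybackQueue:
    """Minimal stand-in for a playback manager with instant stop/clear."""

    def __init__(self) -> None:
        self._chunks: queue.Queue = queue.Queue()

    def enqueue(self, chunk: bytes) -> None:
        self._chunks.put(chunk)

    def stop(self) -> None:
        # Drop everything still queued so the next utterance starts clean.
        while not self._chunks.empty():
            self._chunks.get_nowait()

    def pending(self) -> int:
        return self._chunks.qsize()
```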

Typical use cases span interactive chatbots that speak responses, educational tools that narrate content, and accessibility solutions that read web pages aloud. In a production pipeline, an LLM can invoke the speech tool after generating text, then stop playback if the user interrupts or changes the topic. The server’s MCP design means it can be dropped into any existing LLM workflow with minimal friction, providing a powerful audio layer that enhances user engagement and accessibility.
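That pipeline can be sketched as a small conversation turn; `speak` and `stop_playback` are placeholder names for the server's speech and playback tools, not its actual identifiers.

```python
# Hypothetical conversation turn: speak the reply, stop if interrupted.
class AssistantTurn:
    def __init__(self, tts) -> None:
        self.tts = tts  # any object exposing speak() / stop_playback()

    def deliver(self, reply_text: str, interrupted: bool = False) -> str:
        self.tts.speak(reply_text)       # stream the generated reply as audio
        if interrupted:
            self.tts.stop_playback()     # clear the queue for the next turn
            return "stopped"
        return "spoken"
```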