MCPSERV.CLUB
demon24ru

Fish Speech MCP

MCP Server

Real‑time TTS for LLMs with voice cloning

Stale (50) · 1 star · 2 views · Updated Apr 8, 2025

About

An MCP server that converts text to speech using FishSpeech, supports saving and reusing voice references, and integrates seamlessly with Dive and other MCP‑compatible LLMs.

Capabilities

  • Resources – Access data sources
  • Tools – Execute functions
  • Prompts – Pre-built templates
  • Sampling – AI model interactions

Fish Speech MCP – Text‑to‑Speech for AI Assistants

Fish Speech MCP is a lightweight Model Context Protocol server that adds real‑time text‑to‑speech (TTS) capabilities to any MCP‑compatible large language model. By exposing two simple tools (one for generating speech, one for managing voice references), the server lets developers turn plain text into natural‑sounding audio and optionally store voice samples for later cloning. The integration is seamless: an LLM can invoke the tools as part of its reasoning pipeline, and the server handles all communication with an external Optivus TTS engine over Socket.IO.

The primary problem this server solves is the lack of built‑in audio output in most conversational AI platforms. While LLMs can generate text, delivering that text as spoken dialogue requires a separate TTS service. Fish Speech MCP bridges this gap by packaging the TTS functionality as an MCP server, so developers can treat speech synthesis like any other tool—invoking it with a JSON payload and receiving an audio URL or binary data in response. This eliminates the need to write custom connectors or manage complex streaming protocols.

Key features include:

  • On‑Demand Speech Generation – Convert arbitrary text to high‑quality speech using the FishSpeech engine with a single API call.
  • Voice Reference Management – Store and retrieve voice samples to enable personalized or consistent speech across sessions.
  • MCP‑Native Integration – Works out of the box with Dive, Claude, and any other MCP‑compatible LLM, leveraging existing tool invocation workflows.
  • Automatic Socket.IO Handling – The server manages connection stability, reconnection logic, and error reporting without developer intervention.
  • Environment‑Based Configuration – Point the server to any Optivus instance via a simple environment variable, making deployment flexible across local or cloud setups.
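The environment‑based configuration from the last bullet can be sketched as follows. The variable name `FISH_SPEECH_URL` and the local default endpoint are assumptions for illustration, not the server's documented settings:

```python
import os

# Read the Optivus endpoint from the environment, falling back to a local
# default. The variable name FISH_SPEECH_URL is a hypothetical placeholder;
# check the server's README for the real setting.
OPTIVUS_URL = os.environ.get("FISH_SPEECH_URL", "http://localhost:8080")

print(f"Connecting to Optivus TTS at {OPTIVUS_URL}")
```

Because the endpoint is resolved at startup, the same server image can point at a local Optivus instance during development and a cloud instance in production without code changes.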

Typical use cases span interactive voice assistants, automated customer support, and educational platforms. For example, a chat interface can ask the LLM to generate a spoken summary of an article; the LLM calls the speech‑generation tool, receives an audio URL, and streams it to the user. In a multilingual bot, developers can save voice references for each language variant and reuse them to maintain consistent speaker identity across conversations.
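The multilingual scenario can be sketched like this; the function names and request shape are illustrative assumptions rather than the server's real interface:

```python
# In-memory store of saved voice references, keyed by language variant.
voice_store: dict[str, str] = {}


def save_voice_reference(lang: str, reference_id: str) -> None:
    """Remember a voice reference so later requests reuse the same speaker."""
    voice_store[lang] = reference_id


def build_synthesis_request(text: str, lang: str) -> dict:
    """Shape of the request a bot would hand to the TTS tool (hypothetical)."""
    request = {"text": text}
    if lang in voice_store:
        request["voice_reference"] = voice_store[lang]
    return request


save_voice_reference("en", "ref-en-001")
save_voice_reference("de", "ref-de-001")

# Both English requests carry the same reference, keeping the speaker
# identity consistent across conversations.
first = build_synthesis_request("Hello!", "en")
second = build_synthesis_request("Goodbye!", "en")
print(first["voice_reference"] == second["voice_reference"])  # prints True
```

The design choice here mirrors the "Voice Reference Management" feature above: the reference is saved once and attached to every later request, rather than re‑uploading a sample per call.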

Because the server treats TTS as a first‑class tool, developers can compose complex workflows that combine reasoning, data retrieval, and speech synthesis in a single prompt. The lightweight nature of the MCP implementation means it can run locally or be containerized for cloud deployment, providing a low‑overhead solution that scales with the needs of modern AI applications.