By reymerekar7

Google ADK Speaker Agent with ElevenLabs MCP

MCP Server

Text-to-Speech agent powered by Google ADK and ElevenLabs

Stale (55) · 0 stars · 2 views · Updated May 6, 2025

About

This MCP server enables a speaker agent to convert text into natural-sounding speech by connecting Google ADK to ElevenLabs’ TTS engine through the ElevenLabs MCP server, launched via uvx. It serves as a quick demo of integrating the Gemini and ElevenLabs APIs in an async agent.
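
Below is a minimal sketch of how such an agent might be wired together, assuming the google-adk Python package and an ElevenLabs MCP server published as elevenlabs-mcp. The exact import paths and toolset API vary between ADK releases, so treat the class names here as assumptions to verify against your installed version.

```python
# Hedged sketch: connecting an ADK agent to the ElevenLabs MCP server via uvx.
# Assumes google-adk exposes MCPToolset as shown and that the server package
# is published as "elevenlabs-mcp"; check the ADK docs for your version.
import os

from google.adk.agents import LlmAgent
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset
from mcp import StdioServerParameters

speaker_agent = LlmAgent(
    model="gemini-2.0-flash",  # any Gemini model supported by your ADK version
    name="speaker_agent",
    instruction="When asked to speak, call the ElevenLabs text-to-speech tool.",
    tools=[
        MCPToolset(
            connection_params=StdioServerParameters(
                command="uvx",              # uvx downloads and runs the MCP server
                args=["elevenlabs-mcp"],
                env={"ELEVENLABS_API_KEY": os.environ["ELEVENLABS_API_KEY"]},
            )
        )
    ],
)
```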

Capabilities

  • Resources: access data sources
  • Tools: execute functions
  • Prompts: pre-built templates
  • Sampling: AI model interactions

Google ADK Speaker Agent with ElevenLabs MCP

The Google ADK Speaker Agent is a ready‑made example of how to combine Google’s Agent Development Kit (ADK) with ElevenLabs’ Model Context Protocol (MCP) server to deliver high‑quality text‑to‑speech (TTS) directly from an AI assistant. By exposing the ElevenLabs TTS service through MCP, the agent lets any MCP‑compatible client request spoken output without managing API keys or network plumbing, streamlining the integration of audio into conversational workflows.

This MCP server solves a common pain point for developers: bridging the gap between language models and real‑world output modalities. Instead of building custom HTTP clients or wiring up authentication by hand, developers simply invoke a text‑to‑speech tool exposed over MCP. The server translates that tool call into a request to ElevenLabs’ TTS API, retrieves the synthesized audio, and returns it in the standard MCP response format. This abstraction lets AI assistants focus on dialogue logic while the server handles the heavy lifting of audio generation.
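
Concretely, the invocation that crosses the wire is an ordinary MCP tools/call request. The sketch below shows its rough shape; the tool name text_to_speech and the argument key are illustrative assumptions, since the real names are whatever the ElevenLabs MCP server advertises through tools/list.

```python
# Rough shape of the JSON-RPC message an MCP client sends for TTS.
# "text_to_speech" and "text" are illustrative; query the server's
# tool list for the names it actually exposes.
tts_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "text_to_speech",
        "arguments": {
            "text": "Your order has shipped and should arrive on Friday.",
        },
    },
}
# The response carries the synthesized speech back as MCP content
# (e.g. encoded audio or a path to the generated file), which the
# client can decode, play, or store.
```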

Key capabilities include:

  • Unified TTS interface: A single tool call that works across any MCP‑compatible client, regardless of programming language or platform.
  • Real‑time streaming: The server can stream audio chunks back to the client, enabling low‑latency playback in web or mobile UIs.
  • Configurable voice parameters: Voice ID, speed, and other ElevenLabs voice settings can be passed as arguments, giving developers fine control over the output (see the sketch after this list).
  • Security and rate‑limiting: The server authenticates its requests to ElevenLabs with an API key read from the environment, keeping sensitive credentials out of client code.
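
As a rough illustration of the configurable parameters, a call can pass voice settings alongside the text. The keys below mirror common ElevenLabs options but are assumptions; the server’s tool schema (via tools/list) is the source of truth.

```python
# Illustrative arguments for a parameterized TTS call. The keys mirror
# common ElevenLabs settings but are assumptions; check the tool's input
# schema for the names the server actually accepts.
tts_arguments = {
    "text": "Welcome back! Here is today's lesson.",
    "voice_id": "EXAMPLE_VOICE_ID",   # which ElevenLabs voice to use
    "stability": 0.5,                 # how consistent the delivery is
    "similarity_boost": 0.75,         # how closely to match the reference voice
    "speed": 1.0,                     # playback rate
}
```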

Typical use cases span a wide spectrum. In customer support bots, the agent can read out answers to users with natural‑sounding voices, improving accessibility and engagement. In educational tools, the TTS service can deliver spoken lessons or pronunciation guides. Voice‑enabled virtual assistants for IoT devices can use the same server to generate spoken notifications or alerts. Because the MCP interface is language‑agnostic, teams can integrate the service into existing Python, JavaScript, or even Rust codebases with minimal effort.

Integration is straightforward within an MCP‑based workflow. An AI assistant constructs a tool invocation payload, for example a text‑to‑speech call carrying the text to synthesize, and sends it to the MCP endpoint. The server processes the request, forwards it to ElevenLabs, streams back the audio bytes, and the client can immediately play or further process the data. This decoupling lets developers swap out the underlying TTS provider without changing assistant logic, fostering modularity and future‑proofing their applications.
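
A hedged end‑to‑end sketch of that flow using the reference mcp Python client, again assuming the server is launched with uvx elevenlabs-mcp and exposes a tool under a name like the one used below:

```python
# Minimal client-side flow: spawn the server over stdio, discover its
# tools, invoke TTS, and hand the returned audio to the application.
# The tool name is an assumption; inspect list_tools() output.
import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def speak(text: str) -> None:
    server = StdioServerParameters(
        command="uvx",
        args=["elevenlabs-mcp"],
        env={"ELEVENLABS_API_KEY": os.environ["ELEVENLABS_API_KEY"]},
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()           # discover what the server offers
            print([tool.name for tool in tools.tools])
            result = await session.call_tool(
                "text_to_speech",                        # assumed tool name
                arguments={"text": text},
            )
            # result.content holds the MCP content blocks (audio data or a
            # path to the generated file), ready to play or post-process.
            print(result.content)

asyncio.run(speak("Hello from the speaker agent."))
```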