
Say MCP Server

Text-to-speech for macOS via the built-in say command

Updated Sep 25, 2025

About

A lightweight MCP server that exposes macOS's native say command as an MCP tool, enabling text-to-speech with customizable voice, rate, volume, pitch, and background playback.

Capabilities

  • Resources: access data sources
  • Tools: execute functions
  • Prompts: pre‑built templates
  • Sampling: AI model interactions

(Screenshot: macOS System Voice Settings)

Overview

The Say MCP Server bridges the gap between AI assistants and macOS’s native text‑to‑speech (TTS) engine. By exposing the familiar say command as an MCP tool, it allows developers to convert generated text into spoken audio directly from Claude or any other compliant AI client. This eliminates the need for external TTS services, reduces latency, and keeps all processing on the local machine, which is ideal for privacy‑sensitive or offline workflows.
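To make the bridge concrete, the sketch below shows one way such a tool can be wired up with the TypeScript MCP SDK, shelling out to say via child_process. The server name, parameter set, and return message here are illustrative assumptions, not the project’s actual implementation.

```typescript
// Minimal sketch: expose the macOS `say` command as an MCP tool (names are illustrative).
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { execFile } from "node:child_process";
import { z } from "zod";

const server = new McpServer({ name: "say-sketch", version: "0.1.0" });

server.tool(
  "say",
  {
    text: z.string().describe("Text to speak (may contain [[rate]]/[[volm]] tags)"),
    voice: z.string().optional().describe('System voice, e.g. "Samantha"'),
    rate: z.number().int().min(1).max(500).optional().describe("Words per minute"),
  },
  async ({ text, voice, rate }) => {
    const args: string[] = [];
    if (voice) args.push("-v", voice);
    if (rate) args.push("-r", String(rate));
    args.push(text);

    // Wait for the utterance to finish before reporting success.
    await new Promise<void>((resolve, reject) =>
      execFile("say", args, (err) => (err ? reject(err) : resolve()))
    );
    return { content: [{ type: "text" as const, text: "Spoke the requested text." }] };
  }
);

await server.connect(new StdioServerTransport());
```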

What problem does it solve?

Many AI applications require auditory feedback—think accessibility features, voice‑guided tutorials, or hands‑free browsing. Traditional approaches rely on cloud TTS APIs that introduce network overhead, billing considerations, and potential privacy concerns. Say MCP Server sidesteps these issues by leveraging the built‑in macOS TTS engine, providing instant, high‑quality speech without external dependencies. It also offers granular control over voice attributes (pitch, rate, volume) through simple markup tokens, giving developers fine‑grained audio tuning without complex code.

Core capabilities

  • Text rendering: Accepts plain text or enriched strings with inline control tags (such as [[rate]], [[volm]], and [[slnc]]) to adjust speech characteristics mid‑utterance.
  • Voice selection: Supports all system voices, including regional variants (e.g., “Rocko (Italian (Italy))”), allowing culturally appropriate or user‑preferred accents.
  • Background execution: A background flag lets a speech task run asynchronously, freeing the AI client to perform additional actions while audio plays, which is essential for multitasking scenarios.
  • Customizable rate and volume: Exposes a wide range of speaking rates (1–500 words per minute) and dynamic volume changes, enabling expressive narration styles (see the sketch after this list).
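As a rough illustration of these knobs, the arguments object below embeds Apple’s [[rate]], [[volm]], and [[slnc]] speech commands directly in the text and requests background playback. The parameter names mirror the text/voice/rate/background set described on this page; the exact schema of the real tool may differ.

```typescript
// Illustrative tool arguments: inline [[...]] tags tune speech mid-utterance,
// and `background: true` returns control to the agent while audio keeps playing.
const sayArguments = {
  voice: "Rocko (Italian (Italy))",
  rate: 180,        // words per minute (1-500)
  background: true, // do not block the agent while speaking
  text:
    "Benvenuti! [[slnc 500]] " +                    // half-second pause
    "[[rate 120]] Questa parte è letta lentamente. " +
    "[[volm 0.3]] E questa a volume ridotto.",
};
```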

Real‑world use cases

  • Accessibility: Read browser content, documents, or chat responses aloud for visually impaired users.
  • Educational tools: Deliver spoken explanations of search results, YouTube transcripts, or note summaries in an engaging format.
  • Hands‑free interaction: Combine with other MCP servers (search, notes, web scraping) to create voice‑controlled assistants that can listen, process, and speak without leaving the local environment.
  • Multilingual narration: Switch voices to match language preferences or cultural contexts, useful in global applications.

Integration into AI workflows

Developers integrate Say MCP Server by adding a simple tool call in their agent’s prompt or code. The server’s tool can be chained after any data‑retrieval MCP, turning textual outputs into audible feedback instantly. Because the tool accepts structured parameters (text, voice, rate, background), agents can programmatically adjust speech on the fly—e.g., lowering volume during a long narration or emphasizing key points with pitch changes. This tight coupling between data generation and audio output streamlines the development of voice‑enabled assistants, reducing boilerplate and keeping the entire pipeline local.
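A minimal client-side sketch of that chaining is shown below, assuming the TypeScript MCP SDK’s Client and StdioClientTransport APIs; the launch command, package name, and tool parameters are placeholders rather than the project’s documented invocation.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Connect to the say server over stdio (command and package name are illustrative).
const transport = new StdioClientTransport({
  command: "npx",
  args: ["-y", "say-mcp-server"],
});
const client = new Client({ name: "voice-agent", version: "0.1.0" });
await client.connect(transport);

// After a data-retrieval step produced `summary`, speak it without blocking the agent.
const summary = "Here are the three key points from today's search results...";
await client.callTool({
  name: "say",
  arguments: { text: summary, voice: "Samantha", rate: 200, background: true },
});
```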

Unique advantages

  • Zero‑cost, zero‑latency: No external API calls or subscription fees; speech is generated locally.
  • Privacy‑first: All text stays on the user’s machine, aligning with strict data‑handling policies.
  • Rich customization: Inline markup offers a lightweight yet powerful way to modulate speech, surpassing many cloud TTS options that require separate configuration files or API parameters.
  • Cross‑tool synergy: Easily paired with other MCP servers (search, notes, transcripts) to build complex, multimodal interactions that feel natural and responsive.

In summary, the Say MCP Server empowers developers to harness macOS’s robust TTS engine within AI applications, delivering instant, customizable spoken output while preserving privacy and reducing operational overhead.