MCP Transcribe Online Videos

MCP Server

Transcribe YouTube and Bilibili videos with timestamped output

Updated Apr 28, 2025

About

A FastMCP server that lets LLMs access and transcribe online videos from YouTube and Bilibili using WhisperX models, with automatic audio extraction, format conversion, and temporary file hosting via 0x0.st.

Overview

The MCP Transcribe Online Videos server addresses a common pain point for developers building AI-powered content analysis tools: turning online videos into structured, timestamped text. It exposes two tools that let a language model request a transcription of a public YouTube or Bilibili video with a single tool call. Behind the scenes, the server downloads the video, extracts and normalizes its audio, and forwards it to a cloud-based WhisperX transcription service. The resulting transcript includes timestamps, making it directly usable for downstream tasks such as summarization, keyword extraction, or subtitle generation.
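The audio extraction and normalization step described above can be sketched as follows. This is a minimal illustration, not the server's actual code: the function name `build_ffmpeg_audio_command`, the file paths, and the choice of 16 kHz mono 16-bit WAV are assumptions made for the example (that format is a common input for Whisper-family models).

```python
def build_ffmpeg_audio_command(video_path: str, audio_path: str,
                               sample_rate: int = 16000) -> list:
    """Return an FFmpeg argument list that extracts a mono WAV track.

    Hypothetical helper: the real server may use different options.
    """
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", video_path,          # input media file
        "-vn",                     # drop the video stream, keep audio only
        "-ac", "1",                # downmix to a single channel
        "-ar", str(sample_rate),   # resample to the target rate
        "-c:a", "pcm_s16le",       # encode as 16-bit little-endian PCM
        audio_path,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg is installed; building the arguments as a list rather than a shell string avoids quoting problems with unusual file names.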

For AI assistants, this capability is invaluable. Instead of relying on external web scraping or manual download steps, the assistant can ask the MCP server to fetch and transcribe a video on demand. The transcription output is returned in JSON, ready for further processing by the assistant or other services. This streamlines workflows where content creators need to analyze large volumes of video, educators want to generate study notes from lectures, or researchers gather data from media archives.

Key features of the server include:

Automatic audio extraction: The server uses FFmpeg to pull the audio track from any supported video URL, converting it into a format suitable for WhisperX.
Timestamped output: Each utterance is paired with start and end times, enabling fine‑grained alignment or subtitle generation.
Cloud transcription: Leveraging Replicate’s WhisperX models offloads compute from the client, allowing even modest hardware to process long videos efficiently.
Temporary file hosting: Extracted audio files are uploaded to a 0x0.st instance for temporary hosting, so long recordings do not have to be kept in local storage.
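To illustrate how timestamped output supports subtitle generation, here is a small sketch that converts transcript segments into SubRip (SRT) text. It assumes each segment is a dictionary with `start` and `end` times in seconds and a `text` field; the server's exact JSON schema is not shown in this document, so those field names and both function names are assumptions.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Render timestamped segments as numbered SRT cues.

    Assumed segment shape: {"start": float, "end": float, "text": str}.
    """
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)
```

Calling `segments_to_srt` on the parsed transcript and writing the result to a `.srt` file yields captions that most video players can load.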

Typical use cases span multiple domains. In education, a virtual teaching assistant could transcribe lecture recordings from YouTube and provide instant summaries or Q&A support. Content creators might generate searchable captions for their channels, improving discoverability and accessibility. Researchers can automate the extraction of spoken data from media archives for linguistic or sociological studies. Because the MCP interface is lightweight, these tools can be chained with other AI services, such as sentiment analysis or topic modeling, to build end-to-end pipelines.

Integration with existing AI workflows is straightforward. A developer can instantiate an MCP client pointing to the server’s URL and invoke the transcription tools as part of a larger prompt or task. The server’s FastMCP foundation handles protocol communication, while the environment configuration allows swapping out file storage backends or adding new media sources. The roadmap mentions future enhancements (metadata extraction, local transcription options, and broader platform support) that would further broaden the server’s applicability.
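As a rough illustration of what adding new media sources could involve, the sketch below maps a video URL to a platform name through a small host table. The table `SUPPORTED_HOSTS`, the function `detect_platform`, and the inclusion of the short-link domains are assumptions for this example and do not describe the server's actual routing logic.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical registry: supporting a new source would mean adding a row here.
SUPPORTED_HOSTS = {
    "youtube.com": "youtube",
    "youtu.be": "youtube",
    "bilibili.com": "bilibili",
    "b23.tv": "bilibili",
}


def detect_platform(url: str) -> Optional[str]:
    """Return the platform name for a supported URL, or None otherwise."""
    host = (urlparse(url).hostname or "").lower()
    for domain, platform in SUPPORTED_HOSTS.items():
        # Match the bare domain and any subdomain such as "www." or "m.".
        if host == domain or host.endswith("." + domain):
            return platform
    return None
```

Checking the host before downloading lets a caller reject unsupported URLs early instead of failing partway through the transcription pipeline.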

In summary, the MCP Transcribe Online Videos server provides a turnkey solution for converting online video content into structured text, empowering AI assistants to offer richer media‑aware services without the overhead of manual data preparation.