Biliscribe MCP Server

Convert Bilibili videos to structured text for LLMs

Updated May 26, 2025

About

Biliscribe is an MCP server that transcribes and formats Bilibili video content into structured text, ready for large‑language‑model processing and analysis. It uses ffmpeg, Cloudflare R2 storage, and Replicate’s WhisperX for transcription.

Overview

The Biliscribe MCP server transforms Bilibili video content into clean, structured text that is ready for large-language-model (LLM) consumption. By using ffmpeg to extract audio, Cloudflare R2 for durable storage, and Replicate’s WhisperX for speech recognition, Biliscribe removes the manual effort of transcribing and formatting video material. Developers can feed the resulting transcripts directly into downstream AI workflows such as summarization, question answering, or knowledge extraction.

Problem Solved

Bilibili hosts a vast array of user-generated videos, many of which are rich in educational or entertainment content. Extracting usable text from these videos traditionally involves downloading the video, extracting the audio, and then transcribing it by hand or with generic services that may not preserve Chinese-language nuances. Biliscribe automates this pipeline, providing consistent audio handling and transcription tailored to the Chinese language. It also normalizes output into JSON-like structures that LLMs can parse without additional preprocessing.
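
As an illustration of what a "JSON-like structure" might look like, here is a minimal sketch in Python. The field names (`video_id`, `segments`, `start`, `end`, `speaker`, `text`) and the `to_structured` helper are assumptions made for this example; the source does not document Biliscribe's actual output schema. The input shape loosely follows the per-segment dictionaries that WhisperX-style transcribers commonly return.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Segment:
    start: float   # segment start, in seconds
    end: float     # segment end, in seconds
    speaker: str   # diarization label, or "UNKNOWN" when absent
    text: str      # transcribed text for this segment


def to_structured(video_id: str, raw_segments: list[dict]) -> dict:
    """Normalize raw transcript segments into one JSON-serializable dict.

    Hypothetical helper: drops empty segments, trims whitespace, and
    rounds timestamps to two decimals.
    """
    segments = []
    for raw in raw_segments:
        text = raw.get("text", "").strip()
        if not text:
            continue  # skip segments with no usable text
        segments.append(
            Segment(
                start=round(float(raw["start"]), 2),
                end=round(float(raw["end"]), 2),
                speaker=raw.get("speaker", "UNKNOWN"),
                text=text,
            )
        )
    return {"video_id": video_id, "segments": [asdict(s) for s in segments]}


if __name__ == "__main__":
    doc = to_structured(
        "BV_example",  # placeholder ID, not a real video
        [{"start": 0.0, "end": 1.234, "text": " 大家好 ", "speaker": "SPEAKER_00"}],
    )
    # ensure_ascii=False keeps Chinese characters readable in the output
    print(json.dumps(doc, ensure_ascii=False, indent=2))
```

Keeping timestamps and speaker labels as separate fields, rather than embedding them in the text, lets a downstream prompt include or omit them as needed.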

Core Functionality

Audio extraction: Uses ffmpeg to pull the audio stream from a Bilibili video, which covers a wide range of codecs.
Cloud storage: Saves raw audio to Cloudflare R2, enabling scalable, cost‑effective persistence and easy retrieval for future processing or auditing.
Transcription: Calls Replicate’s WhisperX, which provides high‑accuracy speech‑to‑text in Chinese and supports speaker diarization if needed.
Formatting: Wraps the raw transcript into a clean, hierarchical structure that includes timestamps and speaker labels, making it immediately usable by LLMs.
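
The audio-extraction step above can be sketched as follows. This is a minimal sketch, not Biliscribe's actual implementation: it assumes the video has already been downloaded to a local file, and the function names and the choice of 16 kHz mono WAV output are assumptions for illustration. The ffmpeg flags themselves (`-vn`, `-ac`, `-ar`, `-c:a`) are standard.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_args(source: Path, target: Path, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command line that extracts mono PCM audio from a video file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the target file if it exists
        "-i", str(source),        # input video
        "-vn",                    # discard the video stream
        "-ac", "1",               # downmix to one channel
        "-ar", str(sample_rate),  # resample to the requested rate
        "-c:a", "pcm_s16le",      # encode as 16-bit PCM (WAV)
        str(target),
    ]


def extract_audio(source: Path, target: Path) -> Path:
    """Run ffmpeg and return the path of the extracted audio file.

    Raises subprocess.CalledProcessError if ffmpeg exits with an error.
    """
    subprocess.run(build_ffmpeg_args(source, target), check=True)
    return target
```

Separating command construction from execution keeps the argument list easy to inspect and test without ffmpeg installed.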

Key Features

Dual‑mode communication: Supports both standard I/O (stdio) and Server‑Sent Events (SSE), allowing integration with a variety of client architectures.
Environment‑driven configuration: All credentials (Replicate token, R2 keys) are supplied via environment variables, keeping secrets out of source code.
Platform-agnostic: Tested on macOS but designed to run wherever ffmpeg and the required environment variables are available.
Scalable storage: By offloading audio to R2, the server can handle large volumes of video without local disk pressure.
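
Environment-driven configuration of this kind might be loaded as in the sketch below. The variable names for R2 (`R2_ACCOUNT_ID`, `R2_ACCESS_KEY_ID`, `R2_SECRET_ACCESS_KEY`, `R2_BUCKET`) and the `Config` and `load_config` names are assumptions for illustration; the source does not list the names Biliscribe actually reads. `REPLICATE_API_TOKEN` is the variable the Replicate client conventionally uses, and the endpoint format is the one Cloudflare documents for R2's S3-compatible API.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names; check the project's own documentation.
REQUIRED = (
    "REPLICATE_API_TOKEN",
    "R2_ACCOUNT_ID",
    "R2_ACCESS_KEY_ID",
    "R2_SECRET_ACCESS_KEY",
    "R2_BUCKET",
)


@dataclass(frozen=True)
class Config:
    replicate_token: str
    r2_account_id: str
    r2_access_key_id: str
    r2_secret_access_key: str
    r2_bucket: str

    @property
    def r2_endpoint(self) -> str:
        # R2 exposes an S3-compatible endpoint per Cloudflare account
        return f"https://{self.r2_account_id}.r2.cloudflarestorage.com"


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    """Read credentials from the environment and fail fast if any are missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Config(
        replicate_token=env["REPLICATE_API_TOKEN"],
        r2_account_id=env["R2_ACCOUNT_ID"],
        r2_access_key_id=env["R2_ACCESS_KEY_ID"],
        r2_secret_access_key=env["R2_SECRET_ACCESS_KEY"],
        r2_bucket=env["R2_BUCKET"],
    )
```

Failing at startup with the full list of missing variables is usually easier to debug than a credential error in the middle of a transcription job.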

Use Cases

Educational content analysis: Automatically transcribe lecture videos to create searchable text corpora for students or educators.
Content moderation: Generate transcripts that can be fed into policy‑checking models to flag inappropriate material.
Data enrichment: Enhance existing video datasets with structured text, enabling multimodal training or retrieval tasks.
Automated summarization: Provide clean input to LLMs that generate concise summaries, subtitles, or highlight reels.

Integration with AI Workflows

Once a transcript is produced, it can be passed to any LLM through the MCP client. Developers can chain Biliscribe with other MCP servers, for example a summarization server, to build end-to-end pipelines that convert raw video into actionable insights. The SSE mode is especially useful for streaming transcription results to real-time applications, such as live captioning or interactive chatbots.
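
To make the streaming idea concrete, the sketch below shows how a single transcript segment could be framed as a Server-Sent Events message. The wire format (`event:` and `data:` lines terminated by a blank line) follows the SSE specification. The `segment` event name, the payload fields, and the `format_sse` helper are assumptions for illustration; in practice the MCP SSE transport carries JSON-RPC messages, and the source does not describe the exact payloads Biliscribe emits.

```python
import json


def format_sse(event: str, payload: dict) -> str:
    """Frame a JSON payload as one Server-Sent Events message.

    json.dumps escapes any newlines inside strings, so the payload
    always fits on a single `data:` line.
    """
    data = json.dumps(payload, ensure_ascii=False)
    return f"event: {event}\ndata: {data}\n\n"


if __name__ == "__main__":
    # Hypothetical segment payload
    print(format_sse("segment", {"start": 0.0, "text": "大家好"}), end="")
```

A client that receives such frames can append each segment to a live caption display as it arrives instead of waiting for the full transcript.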

Unique Advantages

Biliscribe’s tight coupling of audio extraction, cloud storage, and WhisperX transcription in a single MCP server eliminates the need for multiple external services or custom orchestration scripts. Its design prioritizes Chinese language support, a niche often underserved by generic transcription tools. By exposing a simple MCP interface, it lets developers focus on higher‑level logic rather than the intricacies of video processing.