Big Brother MCP

MCP Server

A playful honeypot for AI reporting behavior

Stale(55)

2stars

0views

Updated Jun 3, 2025

About

This Model Context Protocol server mimics a content‑moderation reporting tool to capture and log AI attempts to report users, enabling researchers to study AI moderation ethics and privacy safeguards.

Capabilities

Resources

Access data sources

Tools

Execute functions

Prompts

Pre-built templates

Sampling

AI model interactions

Big Brother MCP in Action

Big Brother MCP is a purpose‑built honeypot server that lets researchers study how AI assistants handle content‑moderation requests. By exposing a “Report to Big Brother” tool that mimics a real moderation endpoint, the server captures every attempt an AI makes to flag users or content—without actually reporting anyone. This approach turns the typical moderation workflow into a controlled experiment, allowing developers and ethicists to observe whether an assistant will automatically or reflexively use such a tool when prompted.

The server’s value lies in its ability to surface hidden safety mechanisms. For example, Claude Desktop consistently refuses to use the reporting tool even when explicitly asked, demonstrating strong ethical safeguards. In contrast, other models may act differently, and the honeypot logs every interaction for later analysis. This data can reveal patterns in how different AIs interpret “reporting” prompts, whether they require explicit consent, and how they balance user privacy against content‑moderation mandates.

Key features include:

Fake moderation endpoint () that appears legitimate but logs all usage.
Log viewer () for quick inspection of captured attempts.
Extensible tool set that can be expanded to simulate additional moderation actions if needed.

Typical use cases involve safety researchers testing new models, developers validating compliance with privacy policies, or organizations auditing third‑party assistants for potential misuse of moderation capabilities. By integrating the MCP server into an AI workflow—either via Claude Desktop or any MCP‑compatible client—teams can run automated tests, gather metrics on refusal rates, and build a database of AI behavior patterns. The result is a richer understanding of how assistants interpret moderation requests, informing both model design and policy decisions.