groundlight

mcp-vision

MCP Server

Zero‑shot object detection for LLMs and vision‑language models

44 stars
Updated 24 days ago

About

mcp-vision is an MCP server that exposes HuggingFace zero‑shot object detection models as tools, enabling large language or vision‑language models to locate and zoom into objects within images.

Capabilities

Resources: Access data sources
Tools: Execute functions
Prompts: Pre-built templates
Sampling: AI model interactions

mcp-vision in action

mcp‑vision is a Model Context Protocol (MCP) server that turns HuggingFace computer‑vision models into first‑class tools for large language or vision‑language assistants. It exposes zero‑shot object detection pipelines as callable commands, allowing an AI assistant to identify and isolate objects in images without requiring a pre‑trained classifier for each category. This capability is especially valuable when the assistant needs to reason about visual content that contains an arbitrary set of objects—something that traditional vision models, which are limited to a fixed label space, struggle with.

At its core, the server offers two primary tools. The first scans an image and returns a list of detected objects, each annotated with a bounding box, a confidence score, and a label drawn from a user‑supplied list of candidate strings. Because it leverages HuggingFace’s zero‑shot object detection pipelines, the assistant can ask “Where is the coffee mug?” or “Show me all bicycles in this photo” and receive precise coordinates without any additional training. The second tool builds on the first by cropping the image around a specified object and returning a new image focused on that element, which is particularly useful for detailed inspection or for feeding the cropped region into a downstream model that expects a smaller, centered input.
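As a rough sketch of what a detect-then-zoom flow looks like under the hood, the snippet below wires HuggingFace’s zero-shot-object-detection pipeline to a crop step. The model choice and the `locate`/`zoom` helper names are illustrative assumptions, not the server’s actual API:

```python
from transformers import pipeline
from PIL import Image

# OWL-ViT is a common zero-shot detection backbone on HuggingFace;
# the server's actual default model may differ.
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

def locate(image_path: str, candidate_labels: list[str]) -> list[dict]:
    """Detect objects. Returns one dict per hit:
    {'score': float, 'label': str, 'box': {'xmin', 'ymin', 'xmax', 'ymax'}}."""
    return detector(Image.open(image_path), candidate_labels=candidate_labels)

def zoom(image_path: str, label: str, candidate_labels: list[str]) -> Image.Image:
    """Crop to the highest-confidence detection of `label`."""
    hits = [d for d in locate(image_path, candidate_labels) if d["label"] == label]
    if not hits:
        raise ValueError(f"no '{label}' detected")
    box = max(hits, key=lambda d: d["score"])["box"]
    return Image.open(image_path).crop(
        (box["xmin"], box["ymin"], box["xmax"], box["ymax"]))
```

The pipeline returns one dict per detection with score, label, and box keys, which maps naturally onto the structured JSON the server hands back to the assistant.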

Developers can integrate mcp‑vision into their AI workflows with little friction. In a typical setup, the MCP server runs in Docker—either locally on a GPU‑enabled machine or from a public image—and is registered in the Claude Desktop configuration. Once registered, the assistant can invoke these tools through standard MCP calls whenever a prompt calls for visual reasoning. It then receives structured JSON results (for detection) or an image payload (for zoom), which it can embed in its response, pass to another model, or use for further analysis.
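As an illustration, a Docker-based registration in Claude Desktop’s claude_desktop_config.json would look something like the following; the image name and flags here are assumptions, so defer to the project’s README for the exact invocation:

```json
{
  "mcpServers": {
    "mcp-vision": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "groundlight/mcp-vision"]
    }
  }
}
```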

Real‑world scenarios include e‑commerce product tagging, automated inspection in manufacturing, or interactive educational tools where students ask questions about diagrams or photos. Because the server relies on zero‑shot detection, adding new object categories is as simple as extending the candidate label list—no model retraining required. This flexibility, combined with the ease of deployment and tight integration with MCP‑compatible assistants, makes mcp‑vision a powerful addition to any developer’s AI toolkit.
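As a concrete illustration of that flexibility, covering a new object category is a one-line change to the request rather than a training job; the file name and labels below are arbitrary examples:

```python
from transformers import pipeline
from PIL import Image

# Illustrative model choice; any HuggingFace zero-shot detector works the same way.
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

image = Image.open("assembly_line.jpg")
# Supporting "dented can" requires only adding the string -- no retraining.
detections = detector(image, candidate_labels=["bottle", "cap", "dented can"])
for d in detections:
    print(f"{d['label']} ({d['score']:.2f}) at {d['box']}")
```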