CVDLT MCP Server

AI vision server for detection, segmentation and pose estimation

Stale (50)

3 stars

3 views

Updated Sep 14, 2025

About

A Python MCP server that uses YOLOv10, YOLOv8 and Ultralytics SAM to detect objects, segment images, and estimate human poses from local or network image inputs via stdio or SSE.

Capabilities

Resources: access data sources

Tools: execute functions

Prompts: pre-built templates

Sampling: AI model interactions

MCP Server for CVDLT (Computer Vision & Deep Learning Tools) is a ready-to-run Model Context Protocol server that exposes computer vision capabilities to AI assistants such as Claude. It packages Ultralytics models (YOLOv10 for detection, YOLOv8 for segmentation and pose estimation, and SAM for whole-image segmentation) behind a single MCP interface, so developers do not have to deploy each model separately or write custom APIs. A conversational agent can ask a user for an image, call the matching tool, and receive structured, machine-readable results without additional glue code.

The server addresses a common pain point in AI-augmented workflows: turning raw image data into actionable results. Traditional vision pipelines require downloading models, managing GPU resources, and writing inference scripts for each task. With MCP Server CVDLT, developers invoke vision operations through tool calls defined in the MCP schema. The server accepts both local file paths and remote URLs, which suits web-based and desktop applications alike. It offers two transport modes: stdio, for a client that launches the server as a local subprocess, and SSE, for a long-running service reached over HTTP.
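As a rough illustration of what a tool call looks like on the wire, the sketch below builds an MCP tools/call request and classifies the image source as a URL or a local path. The JSON-RPC envelope follows the MCP specification; the tool name detect_objects and the argument name image_source are hypothetical placeholders, since the server's actual tool schema is not shown here.

```python
import json
from urllib.parse import urlparse


def classify_image_source(source: str) -> str:
    """Return "url" for http(s) sources, otherwise "path"."""
    scheme = urlparse(source).scheme.lower()
    return "url" if scheme in ("http", "https") else "path"


def build_tool_call(request_id: int, tool_name: str, image_source: str) -> str:
    """Serialize an MCP tools/call request as a single line of JSON.

    The stdio transport delimits messages with newlines, so the payload
    is dumped without indentation to keep it on one line.
    """
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": tool_name,  # hypothetical tool name
            "arguments": {"image_source": image_source},  # hypothetical argument
        },
    }
    return json.dumps(message)


if __name__ == "__main__":
    source = "https://example.com/cat.jpg"  # placeholder URL
    print(classify_image_source(source))
    print(build_tool_call(1, "detect_objects", source))
```

In practice an MCP client library would build and send this message; a real client would also first list the server's tools to discover the actual names and argument schemas.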

Key features include:

Object detection with YOLOv10, returning bounding boxes, confidence scores, and class labels.
Object segmentation via YOLOv8, providing precise masks alongside detection metadata.
Whole‑image segmentation using Ultralytics SAM, ideal for scene understanding or background removal.
Human pose estimation with YOLOv8, delivering keypoint coordinates and confidence for each detected person.
Dual transport protocols (stdio and SSE) that allow seamless integration with both local scripts and networked clients.
Extensible tool set: each vision operation is exposed as an MCP tool, enabling dynamic discovery and invocation by AI assistants.
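To make the "bounding boxes, confidence scores, and class labels" output concrete, here is a minimal sketch of how raw detector output could be shaped into structured records. The Detection schema, the function name, and the 0.25 threshold are assumptions for illustration and not the server's actual response format; the sample values are made up.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Detection:
    label: str
    confidence: float
    box_xyxy: list  # [x1, y1, x2, y2] in pixels


def to_detections(xyxy, conf, cls, names, min_conf=0.25):
    """Combine parallel lists into Detection records, dropping low scores.

    With Ultralytics, for a result object r, these inputs could come from
    r.boxes.xyxy.tolist(), r.boxes.conf.tolist(), r.boxes.cls.tolist()
    and r.names (a dict mapping class id to label).
    """
    records = []
    for box, score, class_id in zip(xyxy, conf, cls):
        if score < min_conf:
            continue  # skip detections below the confidence threshold
        records.append(
            Detection(
                label=names[int(class_id)],
                confidence=round(float(score), 3),
                box_xyxy=[float(v) for v in box],
            )
        )
    return records


if __name__ == "__main__":
    # Made-up sample values standing in for real model output.
    detections = to_detections(
        xyxy=[[12, 30, 180, 240], [200, 50, 260, 120]],
        conf=[0.91, 0.12],
        cls=[0.0, 16.0],
        names={0: "person", 16: "dog"},
    )
    print(json.dumps([asdict(d) for d in detections], indent=2))
```

Returning plain dictionaries like these keeps the result JSON-serializable, which is what an MCP tool response ultimately needs.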

Real-world use cases span a broad spectrum: an e-commerce assistant can automatically tag product images, a security system can detect people in surveillance frames, and a photo-editing chatbot can remove backgrounds or highlight subjects. In research settings, the server can serve as a rapid prototyping backend for multimodal models that need to process visual inputs on demand. With these capabilities behind MCP, developers can focus on higher-level application logic and leave model loading and inference to the server.