Video Analyzer

by guimatheus92

About

MCP server for video analysis — extracts transcripts, key frames with OCR, and annotated timelines from video URLs. Supports Loom and direct video files (.mp4, .webm). Zero auth required.

README

Featured in awesome-mcp-servers.

MCP server for video analysis — extracts transcripts, key frames, and metadata from video URLs and local video files. Supports Loom, direct video URLs (.mp4, .mov, .mkv, .webm, and other common formats), and absolute paths to local video files.

No existing video MCP combines transcripts + visual frames + metadata in one tool. This one does.

Installation

Prerequisites

Node.js 18+ — required to run the server via npx
yt-dlp (optional) — enables frame extraction via ffmpeg. Install with pip install yt-dlp
Chrome/Chromium (optional) — fallback for frame extraction if yt-dlp is unavailable

Without yt-dlp or Chrome, the server still works — you'll get transcripts, metadata, and comments, just no frames.

Claude Code (CLI)

claude mcp add video-analyzer -- npx mcp-video-analyzer@latest

Then restart Claude Code or start a new conversation.

VS Code / Cursor

Add to your MCP settings file:

VS Code: File → Preferences → Settings → search "MCP" or edit ~/.vscode/mcp.json / %APPDATA%\Code\User\mcp.json (Windows)
Cursor: Settings → MCP Servers → Add

{
  "servers": {
    "mcp-video-analyzer": {
      "type": "stdio",
      "command": "npx",
      "args": ["mcp-video-analyzer@latest"]
    }
  }
}

Then reload the window (Ctrl+Shift+P → "Developer: Reload Window").

Claude Desktop

Add to your Claude Desktop config file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "video-analyzer": {
      "command": "npx",
      "args": ["mcp-video-analyzer@latest"]
    }
  }
}

Then restart Claude Desktop.

Verify it works

Once installed, ask your AI assistant:

Analyze this video: https://www.loom.com/share/bdebdfe44b294225ac718bad241a94fe

If the server is connected, it will automatically call the analyze_video tool.

Tools

`analyze_video` — Full video analysis

Extracts everything from a video URL in one call:

> Analyze this video: https://www.loom.com/share/abc123...

Returns:

Transcript with timestamps and speakers
Key frames extracted via scene-change detection and automatically deduplicated. Static clips with no scene cuts (for example, talking-head Reels/Stories where only an on-screen text overlay changes) fall back to uniform temporal sampling, so you still get frames and OCR instead of an empty result.
OCR text extracted from frames (code, error messages, UI text, prices/dates/CTAs visible on screen)
Annotated timeline merging transcript + frames + OCR into a unified "what happened when" view
Metadata (title, duration, platform)
Comments from viewers
Chapters and AI summary (when available)

The AI will automatically call this tool when it sees a video URL — no need to ask.

Options:

detail — analysis depth: "brief" (metadata + truncated transcript, no frames), "standard" (default), "detailed" (dense sampling, more frames)
fields — array of specific fields to return, e.g. ["metadata", "transcript"]. Available: metadata, transcript, frames, comments, chapters, ocrResults, timeline, aiSummary
maxFrames (1-60, default depends on detail level) — cap on extracted frames
threshold (0.0-1.0, default 0.1) — scene-change sensitivity
forceRefresh — bypass cache and re-analyze
skipFrames — skip frame extraction for transcript-only analysis
model / language / initialPrompt — per-call Whisper overrides for the transcription fallback (override WHISPER_MODEL / WHISPER_LANGUAGE / WHISPER_PROMPT for this call only — pick a heavier model or a domain glossary for one hard clip without restarting the server)

`analyze_videos` — Batch analysis

> Analyze every .mp4 in this folder

Runs analyze_video over a list of sources with a concurrency limit (default 2) and returns one structured result per source: counts and warnings on success, or a per-item error on failure, so one bad file never aborts the batch. Frame images are never inlined. The full transcript, OCR, and timeline are returned only when fields is set; otherwise you get counts only. Pair it with MCP_WRITE_SIDECARS=1 (below) so each video's result persists to disk and a re-run resumes instead of recomputing.
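
The batch behaviour described above (a fixed concurrency limit, with failures isolated per item) can be sketched as follows. This is a minimal illustration, not the server's implementation: `runBatch`, `BatchResult`, and the `worker` callback are hypothetical names.

```typescript
// Hypothetical result shape: one entry per source, success or per-item error.
type BatchResult<T> =
  | { source: string; ok: true; value: T }
  | { source: string; ok: false; error: string };

// Run `worker` over every source with at most `limit` calls in flight.
// A rejected worker becomes an error entry; it never aborts the batch.
async function runBatch<T>(
  sources: string[],
  worker: (source: string) => Promise<T>,
  limit = 2,
): Promise<BatchResult<T>[]> {
  const results: BatchResult<T>[] = new Array(sources.length);
  let next = 0; // index of the next unclaimed source

  async function lane(): Promise<void> {
    while (next < sources.length) {
      const i = next++; // claimed synchronously, so no two lanes share an index
      const source = sources[i] as string;
      try {
        results[i] = { source, ok: true, value: await worker(source) };
      } catch (err) {
        const error = err instanceof Error ? err.message : String(err);
        results[i] = { source, ok: false, error };
      }
    }
  }

  const lanes = Math.max(1, Math.min(limit, sources.length));
  await Promise.all(Array.from({ length: lanes }, () => lane()));
  return results;
}
```

Results keep the order of the input list, so a caller can match each entry back to its source even when items finish out of order.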

`get_transcript` — Transcript only

> Get the transcript from this video

Quick transcript extraction. Falls back to Whisper transcription when no native transcript is available. Accepts the same per-call model / language / initialPrompt overrides as analyze_video.

`get_metadata` — Metadata only

> What's this video about?

Returns metadata, comments, chapters, and AI summary without downloading the video.

`get_frames` — Frames only

> Extract frames from this video with dense sampling

Two modes:

Scene-change detection (default) — captures visual transitions
Dense sampling (dense: true) — 1 frame/sec for full coverage

`analyze_moment` — Deep-dive on a time range

> Analyze what happens between 1:30 and 2:00 in this video

Combines burst frame extraction + filtered transcript + OCR + annotated timeline for a focused segment. Use when you need to understand exactly what happens at a specific moment.

`get_frame_at` — Single frame at a timestamp

> Show me the frame at 1:23 in this video

The AI reads the transcript, spots a critical moment, and requests the exact frame to see what's on screen.

`get_frame_burst` — N frames in a time range

> Show me 10 frames between 0:15 and 0:17 of this video

For motion, vibration, animations, or fast scrolling — burst mode captures N frames in a narrow window so the AI can see frame-by-frame changes.
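
As a rough sketch of what "N frames in a narrow window" means in terms of timestamps, the helpers below parse a `M:SS` string and spread N capture points evenly over the range. `parseTimestamp` and `burstTimestamps` are hypothetical names, and including both endpoints is an assumption rather than documented behaviour.

```typescript
// Parse "M:SS" or "H:MM:SS" into seconds. Returns NaN for malformed input.
function parseTimestamp(text: string): number {
  const parts = text.trim().split(":");
  if (parts.length < 2 || parts.length > 3) return NaN;
  let seconds = 0;
  for (let i = 0; i < parts.length; i++) {
    const part = parts[i] ?? "";
    if (!/^\d+(\.\d+)?$/.test(part)) return NaN;
    seconds = seconds * 60 + Number(part);
  }
  return seconds;
}

// Spread `count` timestamps evenly across [from, to], endpoints included.
function burstTimestamps(from: number, to: number, count: number): number[] {
  if (!(to > from) || count < 1) return [];
  if (count === 1) return [from];
  const step = (to - from) / (count - 1);
  const out: number[] = [];
  for (let i = 0; i < count; i++) out.push(from + i * step);
  return out;
}
```

For the prompt above, `burstTimestamps(parseTimestamp("0:15"), parseTimestamp("0:17"), 10)` yields ten points spaced roughly 0.22 seconds apart.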

Detail Levels

Level	Frames	Transcript	OCR	Timeline	Use case
`brief`	None	First 10 entries	No	No	Quick check — what's this video about?
`standard`	Up to 20 (scene-change)	Full	Yes	Yes	Default — full analysis
`detailed`	Up to 60 (1fps dense)	Full	Yes	Yes	Deep analysis — every second captured
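
The table above maps naturally onto a small lookup object. The sketch below is illustrative only: the field names and `resolveMaxFrames` are hypothetical, and clamping an out-of-range `maxFrames` (rather than rejecting it) is an assumption.

```typescript
type DetailLevel = "brief" | "standard" | "detailed";

interface DetailConfig {
  maxFrames: number; // 0 means no frame extraction
  sampling: "none" | "scene" | "dense";
  transcriptEntries: number | null; // null means the full transcript
  ocr: boolean;
  timeline: boolean;
}

// Values taken from the Detail Levels table.
const DETAIL_LEVELS: Record<DetailLevel, DetailConfig> = {
  brief: { maxFrames: 0, sampling: "none", transcriptEntries: 10, ocr: false, timeline: false },
  standard: { maxFrames: 20, sampling: "scene", transcriptEntries: null, ocr: true, timeline: true },
  detailed: { maxFrames: 60, sampling: "dense", transcriptEntries: null, ocr: true, timeline: true },
};

// Resolve the effective frame cap: an explicit maxFrames, clamped to 1-60,
// wins over the level default. "brief" never extracts frames.
function resolveMaxFrames(level: DetailLevel, maxFrames?: number): number {
  const config = DETAIL_LEVELS[level];
  if (config.sampling === "none") return 0;
  if (maxFrames === undefined) return config.maxFrames;
  return Math.min(60, Math.max(1, Math.round(maxFrames)));
}
```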

Caching

Results are cached in memory for 10 minutes. Subsequent calls with the same URL and options return instantly. Use forceRefresh: true to bypass the cache.
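
A minimal sketch of a time-limited in-memory cache keyed by URL plus options, assuming nothing about the server's actual cache beyond the 10-minute lifetime stated above. `TtlCache` and `cacheKey` are hypothetical names, and LRU eviction is left out for brevity.

```typescript
// Minimal TTL cache. `now` is injectable so expiry can be tested without waiting.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number = 10 * 60 * 1000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Build a key from the URL plus options, with option names sorted so that
// { a, b } and { b, a } map to the same entry.
function cacheKey(url: string, options: Record<string, unknown>): string {
  const sorted = Object.keys(options)
    .sort()
    .map((name) => [name, options[name]]);
  return url + "|" + JSON.stringify(sorted);
}
```

In this sketch, a `forceRefresh: true` call would simply skip `get` and overwrite the entry with `set` after re-analysis.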

Persistent sidecars (resumable bulk processing)

The in-memory cache is lost on restart, which makes reprocessing a large local corpus costly. Set MCP_WRITE_SIDECARS=1 to also persist results next to each local video so the work survives restarts and can resume:

<stem>.vtt — the transcript, only when it was generated by the Whisper fallback (an existing <stem>.vtt from your own pipeline is never overwritten). A later call reuses it via the normal sidecar reader and skips Whisper entirely.
<stem>.analysis.json + <stem>.frames/ — the full result (frames + OCR + timeline), keyed by the video's mtime:size and the analysis params. On a later call with a matching stamp + params, the result is returned straight from disk (no extraction, no OCR).

This makes analyze_videos over thousands of files resumable, and lets an external GPU transcription pipeline and this MCP share results through the filesystem: the pipeline writes <stem>.vtt, and the MCP picks it up instead of running Whisper.
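
The reuse rule for `<stem>.analysis.json` (same `mtime:size` stamp and same params) can be sketched as a pure check. The record layout and the names `SidecarRecord`, `fileStamp`, `canonicalParams`, and `matchSidecar` are hypothetical; the real file format may differ.

```typescript
// Hypothetical on-disk record; the real <stem>.analysis.json layout may differ.
interface SidecarRecord {
  stamp: string; // "<mtime>:<size>" of the video when it was analyzed
  params: string; // canonical form of the analysis params
  result: unknown;
}

// Stamp a file by modification time (ms) and byte size.
function fileStamp(mtimeMs: number, size: number): string {
  return Math.floor(mtimeMs) + ":" + size;
}

// Serialize params with sorted names so key order does not matter.
function canonicalParams(params: Record<string, unknown>): string {
  const names = Object.keys(params).sort();
  return JSON.stringify(names.map((name) => [name, params[name]]));
}

// Return the stored record only if the file is unchanged and the params match.
function matchSidecar(
  record: SidecarRecord | null,
  mtimeMs: number,
  size: number,
  params: Record<string, unknown>,
): SidecarRecord | null {
  if (!record) return null;
  if (record.stamp !== fileStamp(mtimeMs, size)) return null;
  if (record.params !== canonicalParams(params)) return null;
  return record;
}
```

Because the stamp changes whenever the file is modified or resized, an edited video is re-analyzed rather than served from a stale sidecar.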

Supported Sources

Source	Transcript	Metadata	Comments	Frames	Auth
Loom	Yes	Yes	Yes	Yes	None
Direct URL (.mp4, .mov, .mkv, .webm, …)	No	Duration only	No	Yes	None
Direct URL + TwelveLabs	Yes (Pegasus, best-effort)	Duration floor + title	No	Yes	`TWELVELABS_API_KEY`
Local file (absolute path or `file://` URI)	Sidecar `.vtt`/`.srt` or Whisper fallback	Probed via ffmpeg (duration, dims, codec, audio presence)	No	Yes	None

Local files: pass an absolute path (e.g., /Users/you/clip.mp4) or a file:// URI as the url argument to any tool. Relative paths are rejected — the server's working directory is unpredictable from the MCP client. Note that any caller of the MCP server can ask it to read any file the server process has access to.

Sidecar transcripts: if a clip.vtt, clip.srt, clip.en.vtt, etc. lives next to clip.mp4, it's used as the transcript automatically — no Whisper roundtrip needed. SRT is converted to VTT in-memory.
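
To illustrate what an in-memory SRT to VTT conversion involves, here is a minimal sketch under the assumption of well-formed SubRip input. `srtToVtt` is a hypothetical name, not the server's function, and it handles only the common cases: the `WEBVTT` header, cue counters, and the millisecond separator.

```typescript
// Convert SubRip text to WebVTT: add the header, drop numeric cue counters,
// and switch the millisecond separator in timing lines from "," to ".".
function srtToVtt(srt: string): string {
  const timing = /^(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})(.*)$/;
  const lines = srt.replace(/^\uFEFF/, "").replace(/\r\n?/g, "\n").split("\n");
  const out: string[] = ["WEBVTT", ""];
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i] ?? "";
    const nextLine = lines[i + 1] ?? "";
    // A bare number directly above a timing line is an SRT cue counter.
    if (/^\d+$/.test(line.trim()) && timing.test(nextLine)) continue;
    out.push(line.replace(timing, "$1.$2 --> $3.$4$5"));
  }
  // Normalize the tail to exactly one trailing newline.
  return out.join("\n").replace(/\s+$/, "") + "\n";
}
```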

Embedded subtitles: if no sidecar is found and the container has an embedded subtitle stream (common in .mkv / .mov / .mp4 from screen recorders), it's transmuxed to VTT via ffmpeg and used as the transcript.

Recognized extensions (local files and direct URLs): .mp4 .mov .mkv .webm .avi .m4v .wmv .flv .mpeg .mpg .m2ts .mts .3gp .ogv. The extension only gates routing — ffmpeg does the actual demuxing, so most common containers work. .ts is excluded to avoid colliding with TypeScript source files.

TwelveLabs Pegasus (optional)

Set the TWELVELABS_API_KEY environment variable to analyze direct video URLs with TwelveLabs Pegasus. Pegasus analyzes the video server-side (visuals and its own audio) and returns an AI-generated, timestamped transcript plus an AI summary as text — capabilities the DirectAdapter can't provide (a raw .mp4 URL has no transcript or summary on its own), and with no Whisper key required.

The transcript is best-effort LLM output, not a deterministic ASR dump: Pegasus is prompted to emit [MM:SS] line rows, and lines that don't match that shape are dropped, so wording and exact timestamps depend on the model's prompt adherence. Failures (bad key, timeout, API error) surface in the tool's warnings[] rather than silently returning an empty transcript.
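
The "lines that don't match that shape are dropped" rule can be sketched as a small parser. This is an illustration under assumptions: `parseTimestampedLines` and `TranscriptEntry` are hypothetical names, and the exact pattern the adapter accepts is not documented here.

```typescript
interface TranscriptEntry {
  start: number; // seconds from the start of the video
  text: string;
}

// Keep only lines shaped like "[MM:SS] some text"; anything else is dropped.
function parseTimestampedLines(raw: string): TranscriptEntry[] {
  const row = /^\[(\d{1,3}):([0-5]\d)\]\s*(.+)$/;
  const entries: TranscriptEntry[] = [];
  const lines = raw.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const match = row.exec((lines[i] ?? "").trim());
    if (!match) continue; // preamble, malformed stamp, or empty line
    const text = (match[3] ?? "").trim();
    if (text === "") continue;
    entries.push({ start: Number(match[1]) * 60 + Number(match[2]), text });
  }
  return entries;
}
```

A strict pattern like this is why wording and timestamps depend on prompt adherence: a line the model formats slightly differently is lost rather than guessed at.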

The biggest win is on the text-only paths: get_transcript and get_metadata return a Pegasus transcript and summary for direct URLs — a few KB of text, no frame images, no per-frame token cost. analyze_video at detail: "standard"/"detailed" still extracts frames in addition (use detail: "brief" to stay text-only).

Long videos: the summary and full transcript share a single capped completion (max_tokens = 16384), so for very long videos the transcript may be truncated. For multi-hour content, chunking by time window is the better approach.

It's fully opt-in and non-breaking: when TWELVELABS_API_KEY is set the TwelveLabsAdapter handles direct video URLs (it registers the public URL with TwelveLabs — no upload); when it's unset, the DirectAdapter handles them exactly as before. Loom URLs are unaffected. Get a key at playground.twelvelabs.io.

Transcription (Whisper fallback)

When a source has no native transcript (no sidecar .vtt/.srt, no embedded subtitles), the audio track is transcribed with Whisper via a graceful fallback chain (in execution order):

1. @huggingface/transformers (JS-native, zero external deps): opt-in only. This strategy runs first, but only when WHISPER_HF_MODEL is explicitly set. When it's unset (the default) the strategy is skipped entirely, so the CLI below wins and its WHISPER_MODEL/WHISPER_LANGUAGE settings are never silently overridden.
2. whisper CLI: used when a whisper executable is found (pip install -U openai-whisper). Point WHISPER_BIN at the executable if it isn't on PATH. Model via WHISPER_MODEL, language via WHISPER_LANGUAGE. The bundled ffmpeg-static is put on the CLI's PATH automatically, so no system ffmpeg is required.
3. OpenAI Whisper API: used when OPENAI_API_KEY is set.
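
The execution order above can be expressed as a function of the environment. This is a sketch, not the server's code: `transcriptionChain` and `Strategy` are hypothetical names, and `cliAvailable` stands in for whatever check the server uses to locate a whisper executable.

```typescript
type Strategy = "hf-transformers" | "whisper-cli" | "openai-api";

type Env = Record<string, string | undefined>;

// Return the strategies to try, in order. `cliAvailable` stands in for the
// check that a whisper executable (on PATH or via WHISPER_BIN) can be run.
function transcriptionChain(env: Env, cliAvailable: boolean): Strategy[] {
  const chain: Strategy[] = [];
  if (env["WHISPER_HF_MODEL"]) chain.push("hf-transformers"); // opt-in only
  if (cliAvailable) chain.push("whisper-cli");
  if (env["OPENAI_API_KEY"]) chain.push("openai-api");
  return chain;
}
```

With nothing set and no CLI installed the chain is empty, which corresponds to a source with no transcript available.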

Env var	Applies to	Default	Example
`WHISPER_MODEL`	`whisper` CLI	`tiny`	`small`, `medium`
`WHISPER_LANGUAGE`	`whisper` CLI / OpenAI API	auto-detect	`pt`, `en`, `es`
`WHISPER_PROMPT`	`whisper` CLI / OpenAI API	—	`Doha, Smiles, Livelo, Latam, milheiro`
`WHISPER_BIN`	`whisper` CLI	`whisper` (on PATH)	`C:/.../Scripts/whisper.exe`
`WHISPER_DEVICE`	`whisper` CLI (sent only if set)	—	`cuda`, `cpu`
`WHISPER_COMPUTE`	`whisper-ctranslate2` only	—	`float16`, `int8_float16`, `int8`
`WHISPER_BEAM_SIZE`	`whisper` CLI (sent only if set)	—	`5`
`WHISPER_WORD_TIMESTAMPS`	`whisper` CLI (sent only if set)	off	`1`
`WHISPER_HF_MODEL`	HF transformers (opt-in)	— (strategy off)	`Xenova/whisper-small`
`OPENAI_API_KEY`	OpenAI API	—	`sk-…`

The default tiny model is fast but weak for non-English audio. For Portuguese (or other non-English) sources, install the CLI and set WHISPER_MODEL=small (or medium) + WHISPER_LANGUAGE=pt for much better accuracy. Add WHISPER_PROMPT with a domain glossary (brand/place names) to fix proper nouns. You can also override model/language/initialPrompt per call on analyze_video / get_transcript / analyze_videos — no restart needed.

GPU (faster-whisper): whisper-ctranslate2 (pip install -U whisper-ctranslate2) is a drop-in CLI with the same flags plus --device cuda / --compute_type / --beam_size. Point WHISPER_BIN at it and set WHISPER_DEVICE=cuda (+ optionally WHISPER_COMPUTE=float16). These GPU flags are env-gated — they're only passed when set, so plain openai-whisper (which rejects --compute_type) keeps working when they're unset.
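
A sketch of what "env-gated" means when building the CLI arguments: each optional flag is appended only when its variable is set. The mapping from variables to flags below is an assumption based on the flag names both CLIs publish, and `whisperArgs` is a hypothetical helper, not the server's code.

```typescript
type Env = Record<string, string | undefined>;

// Build the CLI argument list. Optional flags are appended only when their
// variable is set, so a CLI that rejects a flag never sees it by default.
function whisperArgs(audioPath: string, env: Env): string[] {
  const args = [audioPath, "--model", env["WHISPER_MODEL"] || "tiny", "--output_format", "vtt"];
  const optional: Array<[string, string]> = [
    ["WHISPER_LANGUAGE", "--language"],
    ["WHISPER_PROMPT", "--initial_prompt"],
    ["WHISPER_DEVICE", "--device"],
    ["WHISPER_COMPUTE", "--compute_type"], // whisper-ctranslate2 only
    ["WHISPER_BEAM_SIZE", "--beam_size"],
  ];
  optional.forEach(([name, flag]) => {
    const value = env[name];
    if (value) args.push(flag, value);
  });
  if (env["WHISPER_WORD_TIMESTAMPS"] === "1") args.push("--word_timestamps", "True");
  return args;
}
```

With an empty environment the result contains no GPU flags at all, which is what keeps plain openai-whisper working.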

Windows note: pip installs whisper.exe into the Python Scripts/ dir, which is often not on the PATH that GUI-launched MCP clients inherit. If transcripts come back empty, set WHISPER_BIN to the full path of whisper.exe.

Frame Extraction Strategies

Frame extraction uses a two-strategy fallback chain — no single dependency is required:

Strategy	How it works	Speed	Requirements
yt-dlp + ffmpeg (primary)	Downloads video, extracts frames via scene detection	Fast, precise	yt-dlp (`pip install yt-dlp`)
Browser (fallback)	Opens video in headless Chrome, seeks to timestamps, takes screenshots	Slower, no download needed	Chrome or Chromium installed

The fallback is automatic — if yt-dlp is not available, the server tries browser-based extraction via puppeteer-core. If neither is available, analysis still returns transcript + metadata + comments, just no frames.

Post-Processing Pipeline

After frame extraction, the pipeline automatically applies:

Step	What it does	Why
Frame deduplication	Removes near-identical consecutive frames using perceptual hashing (dHash + Hamming distance)	Screencasts often have long static moments — dedup removes redundant frames, saving tokens
OCR	Extracts text visible on screen from each frame (via tesseract.js). Each frame is first preprocessed — grayscale + 2× upscale + contrast normalization + sharpen — which materially improves accuracy on stylized overlays (prices, dates, coupons, CTAs).	Captures code, error messages, terminal output, UI text that the transcript doesn't cover
Annotated timeline	Merges transcript timestamps + frame timestamps + OCR text into a single chronological view	Gives the AI a unified "what was said, what changed visually, and what text appeared" at each moment

The OCR step requires tesseract.js (included as a dependency). If it fails to load, analysis continues without OCR — no frames or transcript are lost. OCR preprocessing is on by default; set MCP_OCR_PREPROCESS=0 to OCR the raw frames instead.
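
For readers unfamiliar with the deduplication step, here is a minimal sketch of dHash plus Hamming distance on already-downscaled grayscale pixels. The function names and the `maxDistance = 5` threshold are illustrative assumptions; the server's actual thumbnail size and threshold are not stated in this README.

```typescript
// dHash over a grayscale thumbnail 9 pixels wide and 8 tall, stored row by
// row. Each bit records whether a pixel is brighter than its right
// neighbour, giving 8 x 8 = 64 bits.
function dHash(gray9x8: number[]): boolean[] {
  if (gray9x8.length !== 72) throw new Error("expected 9x8 = 72 pixels");
  const bits: boolean[] = [];
  for (let y = 0; y < 8; y++) {
    for (let x = 0; x < 8; x++) {
      const left = gray9x8[y * 9 + x] as number;
      const right = gray9x8[y * 9 + x + 1] as number;
      bits.push(left > right);
    }
  }
  return bits;
}

// Number of bit positions where two hashes differ.
function hammingDistance(a: boolean[], b: boolean[]): number {
  let distance = 0;
  for (let i = 0; i < a.length; i++) if (a[i] !== b[i]) distance++;
  return distance;
}

// Keep a frame only if it differs enough from the last frame that was kept.
// Returns the indexes of the kept frames.
function dedupeConsecutive(hashes: boolean[][], maxDistance = 5): number[] {
  const kept: number[] = [];
  let last: boolean[] | null = null;
  for (let i = 0; i < hashes.length; i++) {
    const hash = hashes[i] as boolean[];
    if (last === null || hammingDistance(last, hash) > maxDistance) {
      kept.push(i);
      last = hash;
    }
  }
  return kept;
}
```

Comparing against the last kept frame, rather than the immediately preceding one, prevents a slow drift of tiny changes from slipping through as a long run of near-duplicates.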

Complementary Tools

Chrome DevTools MCP

For live web debugging alongside video analysis, pair this server with the Chrome DevTools MCP:

claude mcp add chrome-devtools npx @anthropic-ai/mcp-devtools@latest

When to use each:

Scenario	Tool
Bug report recorded as a Loom video	`mcp-video-analyzer` — extract transcript, frames, and error text from the recording
Live debugging a web page	Chrome DevTools MCP — inspect DOM, console, network, take screenshots
Video shows UI issue, need to reproduce it	Use both: analyze the video first, then open the page in Chrome DevTools to reproduce

The two MCPs complement each other: video analyzer understands recorded content, DevTools interacts with live pages.

Example Output

The examples/loom-demo/ folder contains real outputs from analyzing a public Loom video (Boost In-App Demo Video, 2:55).

File	What it shows
metadata.json	Title, duration, platform
transcript.json	42 timestamped entries with speaker IDs
timeline.json	Unified chronological view (transcript + frames merged)
moment-transcript-0m30s-0m45s.json	Filtered transcript for `analyze_moment` (0:30–0:45)
full-analysis.json	Complete `analyze_video` output

Frame images (19 total in examples/loom-demo/frames/):

scene_*.jpg — scene-change detection (key visual transitions)
dense_*.jpg — 1fps dense sampling (every 10th frame saved as sample)
burst_*.jpg — burst extraction for moment analysis (0:30–0:45)

Regenerate after changes: npx tsx examples/generate.ts — requires yt-dlp + network access.

Development

# Install dependencies
npm install

# Run all checks (format, lint, typecheck, knip, tests)
npm run check

# Build
npm run build

# Run E2E tests (requires network)
npm run test:e2e

# Open MCP Inspector for manual testing
npm run inspect

Architecture

src/
├── index.ts                    # Entry point (shebang + stdio)
├── server.ts                   # FastMCP server + tool registration
├── tools/                      # MCP tool definitions (7 tools)
│   ├── analyze-video.ts        # Full analysis with detail levels + caching
│   ├── analyze-moment.ts       # Deep-dive on a time range
│   ├── get-transcript.ts       # Transcript-only with Whisper fallback
│   ├── get-metadata.ts         # Metadata + comments + chapters
│   ├── get-frames.ts           # Frames-only (scene-change or dense)
│   ├── get-frame-at.ts         # Single frame at timestamp
│   └── get-frame-burst.ts      # N frames in a time range
├── adapters/                   # Source-specific logic
│   ├── adapter.interface.ts    # IVideoAdapter interface + registry
│   ├── loom.adapter.ts         # Loom: authless GraphQL
│   ├── local-file.adapter.ts   # Local files: absolute path or file:// URI
│   ├── twelvelabs.adapter.ts   # TwelveLabs Pegasus: transcript + AI summary (opt-in)
│   └── direct.adapter.ts       # Direct URL: any mp4/webm link
├── processors/                 # Shared processing
│   ├── frame-extractor.ts      # ffmpeg scene detection + dense + burst extraction
│   ├── browser-frame-extractor.ts # Headless Chrome fallback for frames
│   ├── audio-transcriber.ts    # Whisper fallback (HF transformers → CLI → OpenAI)
│   ├── image-optimizer.ts      # sharp resize/compress
│   ├── frame-dedup.ts          # Perceptual dedup (dHash + Hamming distance)
│   ├── frame-ocr.ts            # OCR text extraction (tesseract.js)
│   └── annotated-timeline.ts   # Unified timeline (transcript + frames + OCR)
├── config/
│   └── detail-levels.ts        # brief / standard / detailed config
├── utils/
│   ├── cache.ts                # In-memory TTL cache with LRU eviction
│   ├── field-filter.ts         # Selective field filtering for responses
│   ├── url-detector.ts         # Platform detection from URL
│   ├── vtt-parser.ts           # WebVTT → transcript entries
│   └── temp-files.ts           # Temp directory management
└── types.ts                    # Shared TypeScript interfaces

License

MIT

from github.com/guimatheus92/mcp-video-analyzer

FAQ

Is Video Analyzer MCP free?

Yes, Video Analyzer MCP is free — one-click install via Unyly at no cost.

Does Video Analyzer need an API key?

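No. The core features run without API keys or environment variables. TWELVELABS_API_KEY and OPENAI_API_KEY are optional and only enable the TwelveLabs Pegasus and OpenAI Whisper API paths.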

Is Video Analyzer hosted or self-hosted?

Self-hosted: the server runs locally on your machine via the install command above.

How do I install Video Analyzer in Claude Desktop, Claude Code or Cursor?

Open Video Analyzer on unyly.org, pick your client tab (Claude Desktop, Claude Code, Cursor) and press Install — the config is generated automatically, no JSON editing.


Video Analyzer

FreeNot checked

MCP server for video analysis — extracts transcripts, key frames with OCR, and annotated timelines from video URLs. Supports Loom and direct video files (.mp4,

by guimatheus92

GitHub Embed

About

MCP server for video analysis — extracts transcripts, key frames with OCR, and annotated timelines from video URLs. Supports Loom and direct video files (.mp4, .webm). Zero auth required.

README

Featured in awesome-mcp-servers.

No existing video MCP combines transcripts + visual frames + metadata in one tool. This one does.

Installation

Prerequisites

Node.js 18+ — required to run the server via npx
yt-dlp (optional) — enables frame extraction via ffmpeg. Install with pip install yt-dlp
Chrome/Chromium (optional) — fallback for frame extraction if yt-dlp is unavailable

Without yt-dlp or Chrome, the server still works — you'll get transcripts, metadata, and comments, just no frames.

Claude Code (CLI)

claude mcp add video-analyzer -- npx mcp-video-analyzer@latest

Then restart Claude Code or start a new conversation.

VS Code / Cursor

Add to your MCP settings file:

VS Code: File → Preferences → Settings → search "MCP" or edit ~/.vscode/mcp.json / %APPDATA%\Code\User\mcp.json (Windows)
Cursor: Settings → MCP Servers → Add

{
  "servers": {
    "mcp-video-analyzer": {
      "type": "stdio",
      "command": "npx",
      "args": ["mcp-video-analyzer@latest"]
    }
  }
}

Then reload the window (Ctrl+Shift+P → "Developer: Reload Window").

Claude Desktop

Add to your Claude Desktop config file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "video-analyzer": {
      "command": "npx",
      "args": ["mcp-video-analyzer@latest"]
    }
  }
}

Then restart Claude Desktop.

Verify it works

Once installed, ask your AI assistant:

Analyze this video: https://www.loom.com/share/bdebdfe44b294225ac718bad241a94fe

If the server is connected, it will automatically call the analyze_video tool.

Tools

`analyze_video` — Full video analysis

Extracts everything from a video URL in one call:

> Analyze this video: https://www.loom.com/share/abc123...

Returns:

Transcript with timestamps and speakers
Key frames extracted via scene-change detection (automatically deduplicated). For static clips with no scene cuts — e.g. talking-head Reels/Stories where only an on-screen text overlay changes — it automatically falls back to uniform temporal sampling so you still get frames (and OCR) instead of an empty result.
OCR text extracted from frames (code, error messages, UI text, prices/dates/CTAs visible on screen)
Annotated timeline merging transcript + frames + OCR into a unified "what happened when" view
Metadata (title, duration, platform)
Comments from viewers
Chapters and AI summary (when available)

The AI will automatically call this tool when it sees a video URL — no need to ask.

Options:

detail — analysis depth: "brief" (metadata + truncated transcript, no frames), "standard" (default), "detailed" (dense sampling, more frames)
fields — array of specific fields to return, e.g. ["metadata", "transcript"]. Available: metadata, transcript, frames, comments, chapters, ocrResults, timeline, aiSummary
maxFrames (1-60, default depends on detail level) — cap on extracted frames
threshold (0.0-1.0, default 0.1) — scene-change sensitivity
forceRefresh — bypass cache and re-analyze
skipFrames — skip frame extraction for transcript-only analysis
model / language / initialPrompt — per-call Whisper overrides for the transcription fallback (override WHISPER_MODEL / WHISPER_LANGUAGE / WHISPER_PROMPT for this call only — pick a heavier model or a domain glossary for one hard clip without restarting the server)

`analyze_videos` — Batch analysis

> Analyze every .mp4 in this folder

`get_transcript` — Transcript only

> Get the transcript from this video

Quick transcript extraction. Falls back to Whisper transcription when no native transcript is available. Accepts the same per-call model / language / initialPrompt overrides as analyze_video.

`get_metadata` — Metadata only

> What's this video about?

Returns metadata, comments, chapters, and AI summary without downloading the video.

`get_frames` — Frames only

> Extract frames from this video with dense sampling

Two modes:

Scene-change detection (default) — captures visual transitions
Dense sampling (dense: true) — 1 frame/sec for full coverage

`analyze_moment` — Deep-dive on a time range

> Analyze what happens between 1:30 and 2:00 in this video

Combines burst frame extraction + filtered transcript + OCR + annotated timeline for a focused segment. Use when you need to understand exactly what happens at a specific moment.

`get_frame_at` — Single frame at a timestamp

> Show me the frame at 1:23 in this video

The AI reads the transcript, spots a critical moment, and requests the exact frame to see what's on screen.

`get_frame_burst` — N frames in a time range

> Show me 10 frames between 0:15 and 0:17 of this video

For motion, vibration, animations, or fast scrolling — burst mode captures N frames in a narrow window so the AI can see frame-by-frame changes.

Detail Levels

Level	Frames	Transcript	OCR	Timeline	Use case
`brief`	None	First 10 entries	No	No	Quick check — what's this video about?
`standard`	Up to 20 (scene-change)	Full	Yes	Yes	Default — full analysis
`detailed`	Up to 60 (1fps dense)	Full	Yes	Yes	Deep analysis — every second captured

Caching

Results are cached in memory for 10 minutes. Subsequent calls with the same URL and options return instantly. Use forceRefresh: true to bypass the cache.

Persistent sidecars (resumable bulk processing)

<stem>.vtt — the transcript, only when it was generated by the Whisper fallback (an existing <stem>.vtt from your own pipeline is never overwritten). A later call reuses it via the normal sidecar reader and skips Whisper entirely.
<stem>.analysis.json + <stem>.frames/ — the full result (frames + OCR + timeline), keyed by the video's mtime:size and the analysis params. On a later call with a matching stamp + params, the result is returned straight from disk (no extraction, no OCR).

Supported Sources

Source	Transcript	Metadata	Comments	Frames	Auth
Loom	Yes	Yes	Yes	Yes	None
Direct URL (.mp4, .mov, .mkv, .webm, …)	No	Duration only	No	Yes	None
Direct URL + TwelveLabs	Yes (Pegasus, best-effort)	Duration floor + title	No	Yes	`TWELVELABS_API_KEY`
Local file (absolute path or `file://` URI)	Sidecar `.vtt`/`.srt` or Whisper fallback	Probed via ffmpeg (duration, dims, codec, audio presence)	No	Yes	None

Local files: pass an absolute path (e.g., /Users/you/clip.mp4) or a file:// URI as the url argument to any tool. Relative paths are rejected — the server's working directory is unpredictable from the MCP client. Note that any caller of the MCP server can ask it to read any file the server process has access to.

Sidecar transcripts: if a clip.vtt, clip.srt, clip.en.vtt, etc. lives next to clip.mp4, it's used as the transcript automatically — no Whisper roundtrip needed. SRT is converted to VTT in-memory.

Embedded subtitles: if no sidecar is found and the container has an embedded subtitle stream (common in .mkv / .mov / .mp4 from screen recorders), it's transmuxed to VTT via ffmpeg and used as the transcript.

Recognized extensions (local files and direct URLs): .mp4 .mov .mkv .webm .avi .m4v .wmv .flv .mpeg .mpg .m2ts .mts .3gp .ogv. The extension only gates routing — ffmpeg does the actual demuxing, so most common containers work. .ts is excluded to avoid colliding with TypeScript source files.

TwelveLabs Pegasus (optional)

Long videos: the summary and full transcript share a single capped completion (max_tokens = 16384), so for very long videos the transcript may be truncated. For multi-hour content, chunking by time window is the better approach.

Transcription (Whisper fallback)

When a source has no native transcript (no sidecar .vtt/.srt, no embedded subtitles), the audio track is transcribed with Whisper via a graceful fallback chain (in execution order):

@huggingface/transformers (JS-native, zero external deps) — opt-in only: this strategy runs first, but only when WHISPER_HF_MODEL is explicitly set. When it's unset (the default) the strategy is skipped entirely, so the CLI below wins and its WHISPER_MODEL/WHISPER_LANGUAGE settings are never silently overridden.
whisper CLI — used when a whisper executable is found (pip install -U openai-whisper). Point WHISPER_BIN at the executable if it isn't on PATH. Model via WHISPER_MODEL, language via WHISPER_LANGUAGE. The bundled ffmpeg-static is put on the CLI's PATH automatically, so no system ffmpeg is required.
OpenAI Whisper API — used when OPENAI_API_KEY is set.

Env var	Applies to	Default	Example
`WHISPER_MODEL`	`whisper` CLI	`tiny`	`small`, `medium`
`WHISPER_LANGUAGE`	`whisper` CLI / OpenAI API	auto-detect	`pt`, `en`, `es`
`WHISPER_PROMPT`	`whisper` CLI / OpenAI API	—	`Doha, Smiles, Livelo, Latam, milheiro`
`WHISPER_BIN`	`whisper` CLI	`whisper` (on PATH)	`C:/.../Scripts/whisper.exe`
`WHISPER_DEVICE`	`whisper` CLI (sent only if set)	—	`cuda`, `cpu`
`WHISPER_COMPUTE`	`whisper-ctranslate2` only	—	`float16`, `int8_float16`, `int8`
`WHISPER_BEAM_SIZE`	`whisper` CLI (sent only if set)	—	`5`
`WHISPER_WORD_TIMESTAMPS`	`whisper` CLI (sent only if set)	off	`1`
`WHISPER_HF_MODEL`	HF transformers (opt-in)	— (strategy off)	`Xenova/whisper-small`
`OPENAI_API_KEY`	OpenAI API	—	`sk-…`

The default tiny model is fast but weak for non-English audio. For Portuguese (or other non-English) sources, install the CLI and set WHISPER_MODEL=small (or medium) + WHISPER_LANGUAGE=pt for much better accuracy. Add WHISPER_PROMPT with a domain glossary (brand/place names) to fix proper nouns. You can also override model/language/initialPrompt per call on analyze_video / get_transcript / analyze_videos — no restart needed.

GPU (faster-whisper): whisper-ctranslate2 (pip install -U whisper-ctranslate2) is a drop-in CLI with the same flags plus --device cuda / --compute_type / --beam_size. Point WHISPER_BIN at it and set WHISPER_DEVICE=cuda (+ optionally WHISPER_COMPUTE=float16). These GPU flags are env-gated — they're only passed when set, so plain openai-whisper (which rejects --compute_type) keeps working when they're unset.

Windows note: pip installs whisper.exe into the Python Scripts/ dir, which is often not on the PATH that GUI-launched MCP clients inherit. If transcripts come back empty, set WHISPER_BIN to the full path of whisper.exe.

Frame Extraction Strategies

Frame extraction uses a two-strategy fallback chain — no single dependency is required:

Strategy	How it works	Speed	Requirements
yt-dlp + ffmpeg (primary)	Downloads video, extracts frames via scene detection	Fast, precise	yt-dlp (`pip install yt-dlp`)
Browser (fallback)	Opens video in headless Chrome, seeks to timestamps, takes screenshots	Slower, no download needed	Chrome or Chromium installed

Post-Processing Pipeline

After frame extraction, the pipeline automatically applies:

Step	What it does	Why
Frame deduplication	Removes near-identical consecutive frames using perceptual hashing (dHash + Hamming distance)	Screencasts often have long static moments — dedup removes redundant frames, saving tokens
OCR	Extracts text visible on screen from each frame (via tesseract.js). Each frame is first preprocessed — grayscale + 2× upscale + contrast normalization + sharpen — which materially improves accuracy on stylized overlays (prices, dates, coupons, CTAs).	Captures code, error messages, terminal output, UI text that the transcript doesn't cover
Annotated timeline	Merges transcript timestamps + frame timestamps + OCR text into a single chronological view	Gives the AI a unified "what was said, what changed visually, and what text appeared" at each moment
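
To make the deduplication step concrete, here is a minimal sketch of dHash plus Hamming distance. It assumes each frame has already been reduced to a 9×8 grayscale grid (in practice an image library would do that resize), and the function names and the distance threshold are illustrative assumptions rather than the project's actual values.

```typescript
// Difference hash over a 9x8 grayscale grid (row-major, 72 values).
// Each bit records whether a pixel is brighter than its right-hand neighbour.
function dHash(gray: number[]): string {
  const width = 9;
  const height = 8;
  if (gray.length !== width * height) {
    throw new Error("expected 72 grayscale values (9x8)");
  }
  let bits = "";
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width - 1; x++) {
      const left = gray[y * width + x];
      const right = gray[y * width + x + 1];
      bits += left > right ? "1" : "0";
    }
  }
  return bits; // 64 characters of "0" / "1"
}

// Number of bit positions where two equal-length hashes differ.
function hammingDistance(a: string, b: string): number {
  if (a.length !== b.length) throw new Error("hash lengths differ");
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}

// Keeps a frame only if it differs from the last kept frame by more than
// maxDistance bits. Returns the indices of the frames that survive.
function dedupeConsecutive(hashes: string[], maxDistance: number): number[] {
  const kept: number[] = [];
  let lastKept: string | undefined;
  hashes.forEach((hash, index) => {
    if (lastKept === undefined || hammingDistance(lastKept, hash) > maxDistance) {
      kept.push(index);
      lastKept = hash;
    }
  });
  return kept;
}
```

Comparing against the last kept frame, rather than the immediately preceding one, prevents a slow drift of small changes from slipping through unnoticed.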

Complementary Tools

Chrome DevTools MCP

For live web debugging alongside video analysis, pair this server with the Chrome DevTools MCP:

claude mcp add chrome-devtools npx chrome-devtools-mcp@latest

When to use each:

Scenario	Tool
Bug report recorded as a Loom video	`mcp-video-analyzer` — extract transcript, frames, and error text from the recording
Live debugging a web page	Chrome DevTools MCP — inspect DOM, console, network, take screenshots
Video shows UI issue, need to reproduce it	Use both: analyze the video first, then open the page in Chrome DevTools to reproduce

The two MCPs complement each other: video analyzer understands recorded content, DevTools interacts with live pages.

Example Output

The examples/loom-demo/ folder contains real outputs from analyzing a public Loom video (Boost In-App Demo Video, 2:55).

File	What it shows
metadata.json	Title, duration, platform
transcript.json	42 timestamped entries with speaker IDs
timeline.json	Unified chronological view (transcript + frames merged)
moment-transcript-0m30s-0m45s.json	Filtered transcript for `analyze_moment` (0:30–0:45)
full-analysis.json	Complete `analyze_video` output
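
To give a feel for what a unified timeline such as timeline.json represents, the sketch below merges transcript entries and frames into one chronological list. The field names (time, kind, text, path, ocrText) are assumptions chosen for illustration; the real file's schema may differ.

```typescript
// Hypothetical shapes: not the actual schema of timeline.json.
interface TranscriptEntry {
  time: number; // seconds from the start of the video
  text: string;
}

interface FrameEntry {
  time: number; // seconds from the start of the video
  path: string;
  ocrText: string; // empty string when OCR found nothing
}

type TimelineEvent =
  | { time: number; kind: "speech"; text: string }
  | { time: number; kind: "frame"; path: string; ocrText: string };

// Tags each entry with its kind, then sorts everything by timestamp.
function buildTimeline(
  transcript: TranscriptEntry[],
  frames: FrameEntry[],
): TimelineEvent[] {
  const events: TimelineEvent[] = [
    ...transcript.map(
      (t): TimelineEvent => ({ time: t.time, kind: "speech", text: t.text }),
    ),
    ...frames.map(
      (f): TimelineEvent => ({
        time: f.time,
        kind: "frame",
        path: f.path,
        ocrText: f.ocrText,
      }),
    ),
  ];
  return events.sort((a, b) => a.time - b.time);
}
```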

Frame images (19 total in examples/loom-demo/frames/):

scene_*.jpg — scene-change detection (key visual transitions)
dense_*.jpg — 1fps dense sampling (every 10th frame saved as sample)
burst_*.jpg — burst extraction for moment analysis (0:30–0:45)

Regenerate after changes: npx tsx examples/generate.ts — requires yt-dlp + network access.

Development

# Install dependencies
npm install

# Run all checks (format, lint, typecheck, knip, tests)
npm run check

# Build
npm run build

# Run E2E tests (requires network)
npm run test:e2e

# Open MCP Inspector for manual testing
npm run inspect

Architecture

src/
├── index.ts                    # Entry point (shebang + stdio)
├── server.ts                   # FastMCP server + tool registration
├── tools/                      # MCP tool definitions (7 tools)
│   ├── analyze-video.ts        # Full analysis with detail levels + caching
│   ├── analyze-moment.ts       # Deep-dive on a time range
│   ├── get-transcript.ts       # Transcript-only with Whisper fallback
│   ├── get-metadata.ts         # Metadata + comments + chapters
│   ├── get-frames.ts           # Frames-only (scene-change or dense)
│   ├── get-frame-at.ts         # Single frame at timestamp
│   └── get-frame-burst.ts      # N frames in a time range
├── adapters/                   # Source-specific logic
│   ├── adapter.interface.ts    # IVideoAdapter interface + registry
│   ├── loom.adapter.ts         # Loom: authless GraphQL
│   ├── local-file.adapter.ts   # Local files: absolute path or file:// URI
│   ├── twelvelabs.adapter.ts   # TwelveLabs Pegasus: transcript + AI summary (opt-in)
│   └── direct.adapter.ts       # Direct URL: any mp4/webm link
├── processors/                 # Shared processing
│   ├── frame-extractor.ts      # ffmpeg scene detection + dense + burst extraction
│   ├── browser-frame-extractor.ts # Headless Chrome fallback for frames
│   ├── audio-transcriber.ts    # Whisper fallback (HF transformers → CLI → OpenAI)
│   ├── image-optimizer.ts      # sharp resize/compress
│   ├── frame-dedup.ts          # Perceptual dedup (dHash + Hamming distance)
│   ├── frame-ocr.ts            # OCR text extraction (tesseract.js)
│   └── annotated-timeline.ts   # Unified timeline (transcript + frames + OCR)
├── config/
│   └── detail-levels.ts        # brief / standard / detailed config
├── utils/
│   ├── cache.ts                # In-memory TTL cache with LRU eviction
│   ├── field-filter.ts         # Selective field filtering for responses
│   ├── url-detector.ts         # Platform detection from URL
│   ├── vtt-parser.ts           # WebVTT → transcript entries
│   └── temp-files.ts           # Temp directory management
└── types.ts                    # Shared TypeScript interfaces
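
The adapter layer above suggests a registry that picks a source-specific adapter per input. A minimal sketch of that idea follows; the VideoAdapter interface, the canHandle method, and the matching rules are hypothetical and are not taken from adapter.interface.ts or url-detector.ts.

```typescript
// Hypothetical adapter registry: first adapter that claims the source wins.
interface VideoAdapter {
  name: string;
  canHandle(source: string): boolean;
}

const VIDEO_EXTENSION = /\.(mp4|mov|mkv|webm)$/i;

// Drops any query string or fragment before checking the file extension.
function stripQuery(url: string): string {
  return url.split(/[?#]/)[0];
}

const adapters: VideoAdapter[] = [
  {
    name: "loom",
    canHandle: (s) => /^https?:\/\/(www\.)?loom\.com\/share\//i.test(s),
  },
  {
    name: "local-file",
    // file:// URI, POSIX absolute path, or Windows drive path.
    canHandle: (s) =>
      s.startsWith("file://") || s.startsWith("/") || /^[A-Za-z]:[\\/]/.test(s),
  },
  {
    name: "direct",
    canHandle: (s) =>
      /^https?:\/\//i.test(s) && VIDEO_EXTENSION.test(stripQuery(s)),
  },
];

// Order matters: more specific adapters are listed before generic ones.
function findAdapter(source: string): VideoAdapter | undefined {
  return adapters.find((adapter) => adapter.canHandle(source));
}
```

Adding a new source under this design would mean appending one more object to the list, without touching the tools that call findAdapter.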

License

MIT

from github.com/guimatheus92/mcp-video-analyzer

Install Video Analyzer in Claude Desktop, Claude Code & Cursor

Run in your terminal:

claude mcp add mcp-video-analyzer -- npx

FAQ

Is Video Analyzer MCP free?

Yes, Video Analyzer MCP is free — one-click install via Unyly at no cost.

Does Video Analyzer need an API key?

No. The core features run without API keys, and environment variables such as WHISPER_MODEL are optional tuning settings.

Is Video Analyzer hosted or self-hosted?

Self-hosted: the server runs locally on your machine via the install command above.

How do I install Video Analyzer in Claude Desktop, Claude Code or Cursor?

Open Video Analyzer on unyly.org, pick your client tab (Claude Desktop, Claude Code, Cursor) and press Install — the config is generated automatically, no JSON editing.

