I built a tool to draw on screen recordings and hand the frames to Claude

Working with Claude Code on UI bugs is a weird experience. You're looking at a broken layout and you have to describe it with words. "The dropdown is overlapping the modal." "The button in the top right is misaligned." Claude does its best but it's reading a text description of something visual, which is never quite right. You end up going back and forth more than you should.

You can paste screenshots. That helps. But there's no way to circle the thing you mean and say this one, right here. Every time I wanted to point at something specific, I was back to typing coordinates or awkward descriptions.

So I built video-to-claude to fix that.

What it does

You type /vtc in Claude Code. A browser tab opens. You drop a screen recording, scrub to the exact frame you care about, draw on it with arrows and boxes, and hit Send. Claude receives the annotated image as a vision input block, right in the conversation. No copy-paste, no third-party upload, no typing.

The whole thing runs locally. It's a Next.js app wired up as a Claude Code MCP server.

How the MCP side works

Claude Code supports MCP (Model Context Protocol), which is how you give Claude access to custom tools. I wrote a stdio MCP server with two tools:

start_capture_session spins up the Next.js dev server if it's not already running, creates a session, and opens the browser at /capture/{sessionId}. It scans ports 3000 through 3005 first so it doesn't spawn a second process if you already have the app running.

await_capture polls every 1.5 seconds until you click Send in the browser. When you do, it reads the WebP files from disk, base64-encodes them, and returns them as image content blocks back to Claude.
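
A minimal sketch of that polling loop, not the code from the repo. The SessionStatus shape and the injected getStatus function are assumptions made for illustration; the image block fields (type, data, mimeType) follow the MCP image content format.

```typescript
import { Buffer } from "node:buffer";

// MCP image content block: base64 payload plus its MIME type.
interface ImageBlock {
  type: "image";
  data: string;
  mimeType: string;
}

// Hypothetical session status; the real app's shape may differ.
interface SessionStatus {
  submitted: boolean; // true once the user has clicked Send
  captures: Uint8Array[]; // raw WebP bytes, already read from disk
}

const POLL_INTERVAL_MS = 1500;

// Base64-encode each capture and wrap it as an image content block.
function toImageBlocks(captures: Uint8Array[]): ImageBlock[] {
  return captures.map((bytes): ImageBlock => ({
    type: "image",
    data: Buffer.from(bytes).toString("base64"),
    mimeType: "image/webp",
  }));
}

// Poll until the session is submitted, then return one block per capture.
async function awaitCapture(
  getStatus: () => Promise<SessionStatus>,
  intervalMs: number = POLL_INTERVAL_MS,
): Promise<ImageBlock[]> {
  for (;;) {
    const status = await getStatus();
    if (status.submitted) return toImageBlocks(status.captures);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Injecting getStatus keeps the loop independent of where session state lives, so the same function works whether status comes from an HTTP call or a file on disk.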

The /vtc slash command calls both tools in sequence, so from the user's side it's just one word.

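// Ports the dev server may already be listening on.
const SCAN_PORTS = [3000, 3001, 3002, 3003, 3004, 3005];

async function findOurServer(): Promise<string | null> {
  for (const port of SCAN_PORTS) {
    try {
      const res = await fetch(`http://localhost:${port}/api/sessions/_ping`, {
        signal: AbortSignal.timeout(1500),
      });
      if (!res.ok) continue;
      const json = await res.json();
      // Only accept a server that identifies itself as this app.
      if (json.app === "video-to-claude") return `http://localhost:${port}`;
    } catch {
      // Nothing listening, a timeout, or a non-JSON reply: try the next port.
    }
  }
  return null;
}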

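The ping check is there so multiple /vtc calls don't stack up extra dev server processes. Checking the app field in the response also keeps it from mistaking some other project's dev server on the same port for its own.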

The annotation pipeline

When you draw on the canvas and click Capture, here's what actually happens:

The frontend sends the canvas drawing as an SVG string alongside the frame request
ffmpeg seeks to the exact timestamp and pulls one raw frame as PNG
sharp composites the SVG (red arrows, boxes, freehand strokes, text) onto the extracted frame
The result gets encoded as WebP, starting at quality 80 and stepping down by 5 until it's under 2 MB or hits 50
The file lands in data/sessions/{id}/captures/

I was surprised how well sharp handles this. You give it an SVG overlay and it bakes it into the image in under 50ms. No headless browser, no canvas-to-image library. Just sharp.
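
The quality step-down in the pipeline above can be sketched like this. It's an illustration under assumptions, not the repo's code: the Encoder type and encodeUnderLimit name are hypothetical, and the encoder is injected so the loop doesn't depend on any particular image library.

```typescript
// Hypothetical encoder: takes a WebP quality (1-100), returns the encoded bytes.
type Encoder = (quality: number) => Promise<Uint8Array>;

const MAX_BYTES = 2 * 1024 * 1024; // 2 MB ceiling
const START_QUALITY = 80;
const MIN_QUALITY = 50;
const QUALITY_STEP = 5;

// Re-encode at decreasing quality until the output is under maxBytes
// or the quality floor is reached, whichever comes first.
async function encodeUnderLimit(
  encode: Encoder,
  maxBytes: number = MAX_BYTES,
): Promise<{ quality: number; bytes: Uint8Array }> {
  let quality = START_QUALITY;
  let bytes = await encode(quality);
  while (bytes.byteLength >= maxBytes && quality > MIN_QUALITY) {
    quality -= QUALITY_STEP;
    bytes = await encode(quality);
  }
  return { quality, bytes };
}
```

With sharp, the injected encoder would plausibly be something like (q) => sharp(framePng).composite([{ input: Buffer.from(svg) }]).webp({ quality: q }).toBuffer(), where framePng and svg stand in for the extracted frame and the overlay string. If the frame is still over the limit at quality 50, this sketch returns it anyway rather than failing.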

Handling large files

Screen recordings get big. A 20-minute Loom export is easily a few hundred MB. Single-shot multipart uploads choke on that.

So the upload is chunked: 512 KB pieces, each posted to /api/sessions/{id}/source/chunk, appended to a temp file on the server. When the last chunk lands, a separate finalize call moves the temp file, probes it with ffmpeg, and marks the session as ready. The frontend tracks progress as chunks go through.
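
A rough sketch of the client side of that, assuming a hypothetical sendChunk callback that does the actual POST. The names chunkRanges and uploadInChunks are mine for illustration and may not match the repo.

```typescript
const CHUNK_SIZE = 512 * 1024; // 512 KB per chunk

interface ChunkRange {
  index: number;
  start: number; // inclusive byte offset
  end: number; // exclusive byte offset
  isLast: boolean;
}

// Split a file of totalBytes into consecutive byte ranges of at most chunkSize.
function chunkRanges(totalBytes: number, chunkSize: number = CHUNK_SIZE): ChunkRange[] {
  const ranges: ChunkRange[] = [];
  for (let start = 0, index = 0; start < totalBytes; start += chunkSize, index++) {
    const end = Math.min(start + chunkSize, totalBytes);
    ranges.push({ index, start, end, isLast: end === totalBytes });
  }
  return ranges;
}

// Send chunks strictly in order so the server can append them to its temp file.
// In the browser, sendChunk would POST file.slice(range.start, range.end).
async function uploadInChunks(
  totalBytes: number,
  sendChunk: (range: ChunkRange) => Promise<void>,
  onProgress: (fraction: number) => void,
): Promise<void> {
  for (const range of chunkRanges(totalBytes)) {
    await sendChunk(range);
    onProgress(range.end / totalBytes);
  }
}
```

Awaiting each chunk before sending the next matters here: an append-only temp file on the server only works if the pieces arrive in order.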

This adds maybe 50 lines of code and keeps uploads reliable on large files. I tested it up to 2 GB. Frame extraction from those big files still comes back in 200-400ms on my machine because ffmpeg does a direct seek rather than decoding from the start.
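
For reference, a direct seek comes down to where -ss sits on the ffmpeg command line. The sketch below builds one plausible argument list; buildFrameArgs is a hypothetical helper, and the exact flags the tool uses may differ.

```typescript
// Build ffmpeg arguments that extract a single PNG frame to stdout.
// Putting -ss before -i makes ffmpeg seek in the input instead of
// decoding and discarding everything up to the timestamp.
function buildFrameArgs(inputPath: string, seconds: number): string[] {
  const timestamp = Math.max(0, seconds).toFixed(3);
  return [
    "-ss", timestamp, // input seek: must come before -i
    "-i", inputPath,
    "-frames:v", "1", // stop after one video frame
    "-f", "image2pipe",
    "-c:v", "png",
    "pipe:1", // write the frame to stdout
  ];
}
```

You'd pass the result to something like spawn("ffmpeg", buildFrameArgs(path, t)) from node:child_process and collect stdout into a buffer for compositing.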

Why a browser app instead of a CLI

The annotation layer is the whole point. You need a canvas to draw arrows and boxes, you need a video player to scrub through frames, you need thumbnails to review what you've captured. A terminal can't do any of that.

Next.js was the obvious pick here because I wanted server-side ffmpeg and sharp running close to the UI, and the dev server setup is instant. The app stays local, so the recordings themselves never leave your machine. Only the frames you choose to send go to Claude.

Why MCP instead of a CLI flag

MCP lets Claude wait for you. The await_capture tool sits there polling until you finish annotating and click Send. Claude shows a "waiting for frames" message in the conversation and then picks right back up the moment the images arrive. No subprocess coordination, no temp files passed through arguments. It's a clean fit for anything that needs a human in the middle of the loop.

What I cut

The first version had a PySceneDetect pipeline. It would watch a long recording, find scene changes automatically, pull out hundreds of frames, and give you a batch annotation UI. It was a lot of machinery.

I used it a few times and realized I never actually wanted that. I always knew which frame I was looking for. The batch mode just meant more clicking to get to the thing I cared about. So I deleted all of it and rebuilt around the single case that matters: you know the frame, you scrub to it, you annotate it.

The codebase went from about 2000 lines to around 1200. It's a lot more approachable now.

The numbers

Around 50 commits from the initial scaffold to where it is now. Frame extraction runs 200-400ms (ffmpeg seek plus sharp composite) on mid-range hardware. A typical annotated frame at 960px wide and quality 80 comes out 80-200 KB. The MCP server auto-start has a 30-second health check window, which covers slow Windows cold-starts.
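
The health check window amounts to retrying a ping until a deadline passes. Here's a minimal sketch of that idea; waitUntilHealthy and the 500 ms retry interval are assumptions for illustration, and only the 30-second window comes from the actual tool.

```typescript
// Retry an injected health check until it passes or the window closes.
// Returns true if the server became healthy in time, false otherwise.
async function waitUntilHealthy(
  check: () => Promise<boolean>,
  windowMs: number = 30_000,
  intervalMs: number = 500, // assumed retry interval
): Promise<boolean> {
  const deadline = Date.now() + windowMs;
  while (Date.now() < deadline) {
    // A rejected check (e.g. connection refused) counts as "not ready yet".
    const healthy = await check().catch(() => false);
    if (healthy) return true;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}
```

The check itself could be the same _ping request the port scan uses, so a freshly spawned dev server is only treated as ready once it answers as video-to-claude.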