Mac Apps

first-pass

What It Does

Web-based video editor that takes raw footage and gives back a clean edit, with two project types you pick at upload time:

Transcription mode — Whisper transcribes with word-level timestamps, Claude marks filler / false starts / double-takes for cutting, you review the strikethroughs, FFmpeg stitches the kept words into an MP4. Treats editing as a text problem.
Visual Search mode — FFmpeg extracts keyframes on scene changes, claude -p describes each one with Vision, you search natural language (“the moment he picks up the pen”), get back start/end timestamps, and export the clip. Treats editing as a search problem.

For long-form interview / monologue cleanup → Transcription mode. Best when most of the footage is talking heads and you want the cuts to come from what was said.
For B-roll, demo, or event footage → Visual Search mode. Best when you cannot describe the shot in words you said, only words you would use to find it.
As a Descript replacement that runs locally on Max-subscription Claude calls instead of paid API tokens.

Backend is FastAPI on Python, frontend is vanilla JS in a dark Descript-style UI. Both modes share the same upload + project-state shell.

Transcription pipeline:

Visual Search pipeline:

Upload video → FFmpeg detects scene changes and samples a keyframe roughly every 10 seconds.
claude -p --model claude-haiku-4-5 describes each frame in batches of 6 with @path image references — captions, subjects, action, setting, camera, tags.
You type a natural-language query. Claude Sonnet picks matching frame-index ranges and returns [{start_time, end_time, reason, confidence}].
Pick a hit, hit Export, FFmpeg re-encodes that range into an MP4 clip.

Repo: ~/apps/first-pass/
Server: python3 server.py → http://localhost:8910
Tech: Python / FastAPI, OpenAI Whisper (local), claude -p (Max subscription, not paid API), FFmpeg, vanilla JS frontend.

File	Purpose
`server.py`	FastAPI app, all API routes for both project types
`transcribe.py`	Whisper transcription with word-level timestamps
`editor.py`	Claude editorial pass via `claude -p`
`exporter.py`	FFmpeg segment stitching for transcription mode
`visual.py`	Scene detection, keyframe extraction, vision indexing, NL search, single-clip export
`static/`	Frontend (`index.html`, `style.css`, `app.js`)
`data/uploads/`	Source video files
`data/projects/`	Per-project JSON state (each has a `type` field)
`data/frames/{project_id}/`	Extracted keyframes for visual projects
`data/exports/`	Exported MP4s

Endpoint	Purpose
`POST /api/upload?project_type=visual`	Upload + create visual project
`POST /api/visual/extract/{project_id}`	Scene detection + floor sampling, extract JPEGs
`POST /api/visual/analyze/{project_id}`	Batches of 6 frames → Claude Vision → JSON tags per frame
`POST /api/visual/search/{project_id}`	`{query}` → matching frame ranges with confidence
`POST /api/visual/export/{project_id}`	`{start_time, end_time, label}` → re-encoded MP4 clip
`GET /api/frames/{project_id}/{filename}`	Serve a keyframe JPEG

Never edit BuildInfo.swift directly — it is auto-generated by the Xcode build phase “Sync Build Number.” To bump the build:

Every code change = bump + build + relaunch. No exceptions.

Word-level granularity is the entire point of transcription mode — every word has a keep / cut state. There is no traditional timeline view; the transcript IS the timeline.
Uses claude -p instead of the paid Anthropic API per James’s Max-subscription rule. If claude -p is broken (auth issues), the editorial pass and visual indexing both stop working — fall back to retrying the auth, not switching to a paid SDK call.
Visual mode keyframe density is roughly 1 frame per 10 seconds plus scene-change spikes. If you need sub-second hit accuracy, this is the wrong tool.