Mac Apps
first-pass
What It Does
Section titled “What It Does”Web-based video editor that takes raw footage and gives back a clean edit, with two project types you pick at upload time:
- Transcription mode — Whisper transcribes with word-level timestamps, Claude marks filler / false starts / double-takes for cutting, you review the strikethroughs, FFmpeg stitches the kept words into an MP4. Treats editing as a text problem.
- Visual Search mode — FFmpeg extracts keyframes on scene changes,
claude -pdescribes each one with Vision, you search natural language (“the moment he picks up the pen”), get back start/end timestamps, and export the clip. Treats editing as a search problem.
When To Use
Section titled “When To Use”- For long-form interview / monologue cleanup → Transcription mode. Best when most of the footage is talking heads and you want the cuts to come from what was said.
- For B-roll, demo, or event footage → Visual Search mode. Best when you cannot describe the shot in words you said, only words you would use to find it.
- As a Descript replacement that runs locally on Max-subscription Claude calls instead of paid API tokens.
How It Works
Section titled “How It Works”Backend is FastAPI on Python, frontend is vanilla JS in a dark Descript-style UI. Both modes share the same upload + project-state shell.
Transcription pipeline:
- Upload video → Whisper transcribes with word-level timestamps.
claude -preads the transcript and marks index ranges to cut, with reasons.- UI shows the transcript with strikethroughs you can override.
- FFmpeg segment-stitches the kept words into an export MP4.
Visual Search pipeline:
- Upload video → FFmpeg detects scene changes and samples a keyframe roughly every 10 seconds.
claude -p --model claude-haiku-4-5describes each frame in batches of 6 with@pathimage references — captions, subjects, action, setting, camera, tags.- You type a natural-language query. Claude Sonnet picks matching frame-index ranges and returns
[{start_time, end_time, reason, confidence}]. - Pick a hit, hit Export, FFmpeg re-encodes that range into an MP4 clip.
Where It Lives
Section titled “Where It Lives”- Repo:
~/apps/first-pass/ - Server:
python3 server.py→http://localhost:8910 - Tech: Python / FastAPI, OpenAI Whisper (local),
claude -p(Max subscription, not paid API), FFmpeg, vanilla JS frontend.
Architecture
Section titled “Architecture”| File | Purpose |
|---|---|
server.py | FastAPI app, all API routes for both project types |
transcribe.py | Whisper transcription with word-level timestamps |
editor.py | Claude editorial pass via claude -p |
exporter.py | FFmpeg segment stitching for transcription mode |
visual.py | Scene detection, keyframe extraction, vision indexing, NL search, single-clip export |
static/ | Frontend (index.html, style.css, app.js) |
data/uploads/ | Source video files |
data/projects/ | Per-project JSON state (each has a type field) |
data/frames/{project_id}/ | Extracted keyframes for visual projects |
data/exports/ | Exported MP4s |
Visual Mode Endpoints
Section titled “Visual Mode Endpoints”| Endpoint | Purpose |
|---|---|
POST /api/upload?project_type=visual | Upload + create visual project |
POST /api/visual/extract/{project_id} | Scene detection + floor sampling, extract JPEGs |
POST /api/visual/analyze/{project_id} | Batches of 6 frames → Claude Vision → JSON tags per frame |
POST /api/visual/search/{project_id} | {query} → matching frame ranges with confidence |
POST /api/visual/export/{project_id} | {start_time, end_time, label} → re-encoded MP4 clip |
GET /api/frames/{project_id}/{filename} | Serve a keyframe JPEG |
Build Number Convention
Section titled “Build Number Convention”Never edit BuildInfo.swift directly — it is auto-generated by the Xcode build phase “Sync Build Number.” To bump the build:
echo N > build.txt && echo "summary" > build-summary.txt- Kill the running app, rebuild with xcodebuild, relaunch from DerivedData
- Announce “Build N: summary”
Every code change = bump + build + relaunch. No exceptions.
Known Gaps / Quirks
Section titled “Known Gaps / Quirks”- Word-level granularity is the entire point of transcription mode — every word has a keep / cut state. There is no traditional timeline view; the transcript IS the timeline.
- Uses
claude -pinstead of the paid Anthropic API per James’s Max-subscription rule. Ifclaude -pis broken (auth issues), the editorial pass and visual indexing both stop working — fall back to retrying the auth, not switching to a paid SDK call. - Visual mode keyframe density is roughly 1 frame per 10 seconds plus scene-change spikes. If you need sub-second hit accuracy, this is the wrong tool.