Skip to content

Mac Apps

first-pass

Web-based video editor that takes raw footage and gives back a clean edit, with two project types you pick at upload time:

  • Transcription mode — Whisper transcribes with word-level timestamps, Claude marks filler / false starts / double-takes for cutting, you review the strikethroughs, FFmpeg stitches the kept words into an MP4. Treats editing as a text problem.
  • Visual Search mode — FFmpeg extracts keyframes on scene changes, claude -p describes each one with Vision, you search natural language (“the moment he picks up the pen”), get back start/end timestamps, and export the clip. Treats editing as a search problem.
  • For long-form interview / monologue cleanup → Transcription mode. Best when most of the footage is talking heads and you want the cuts to come from what was said.
  • For B-roll, demo, or event footage → Visual Search mode. Best when you cannot describe the shot in words you said, only words you would use to find it.
  • As a Descript replacement that runs locally on Max-subscription Claude calls instead of paid API tokens.

Backend is FastAPI on Python, frontend is vanilla JS in a dark Descript-style UI. Both modes share the same upload + project-state shell.

Transcription pipeline:

  1. Upload video → Whisper transcribes with word-level timestamps.
  2. claude -p reads the transcript and marks index ranges to cut, with reasons.
  3. UI shows the transcript with strikethroughs you can override.
  4. FFmpeg segment-stitches the kept words into an export MP4.

Visual Search pipeline:

  1. Upload video → FFmpeg detects scene changes and samples a keyframe roughly every 10 seconds.
  2. claude -p --model claude-haiku-4-5 describes each frame in batches of 6 with @path image references — captions, subjects, action, setting, camera, tags.
  3. You type a natural-language query. Claude Sonnet picks matching frame-index ranges and returns [{start_time, end_time, reason, confidence}].
  4. Pick a hit, hit Export, FFmpeg re-encodes that range into an MP4 clip.
  • Repo: ~/apps/first-pass/
  • Server: python3 server.pyhttp://localhost:8910
  • Tech: Python / FastAPI, OpenAI Whisper (local), claude -p (Max subscription, not paid API), FFmpeg, vanilla JS frontend.
FilePurpose
server.pyFastAPI app, all API routes for both project types
transcribe.pyWhisper transcription with word-level timestamps
editor.pyClaude editorial pass via claude -p
exporter.pyFFmpeg segment stitching for transcription mode
visual.pyScene detection, keyframe extraction, vision indexing, NL search, single-clip export
static/Frontend (index.html, style.css, app.js)
data/uploads/Source video files
data/projects/Per-project JSON state (each has a type field)
data/frames/{project_id}/Extracted keyframes for visual projects
data/exports/Exported MP4s
EndpointPurpose
POST /api/upload?project_type=visualUpload + create visual project
POST /api/visual/extract/{project_id}Scene detection + floor sampling, extract JPEGs
POST /api/visual/analyze/{project_id}Batches of 6 frames → Claude Vision → JSON tags per frame
POST /api/visual/search/{project_id}{query} → matching frame ranges with confidence
POST /api/visual/export/{project_id}{start_time, end_time, label} → re-encoded MP4 clip
GET /api/frames/{project_id}/{filename}Serve a keyframe JPEG

Never edit BuildInfo.swift directly — it is auto-generated by the Xcode build phase “Sync Build Number.” To bump the build:

  1. echo N > build.txt && echo "summary" > build-summary.txt
  2. Kill the running app, rebuild with xcodebuild, relaunch from DerivedData
  3. Announce “Build N: summary”

Every code change = bump + build + relaunch. No exceptions.

  • Word-level granularity is the entire point of transcription mode — every word has a keep / cut state. There is no traditional timeline view; the transcript IS the timeline.
  • Uses claude -p instead of the paid Anthropic API per James’s Max-subscription rule. If claude -p is broken (auth issues), the editorial pass and visual indexing both stop working — fall back to retrying the auth, not switching to a paid SDK call.
  • Visual mode keyframe density is roughly 1 frame per 10 seconds plus scene-change spikes. If you need sub-second hit accuracy, this is the wrong tool.