Capabilities

Get a YouTube Transcript

What It Does

Takes a YouTube URL and returns plain-text captions, timestamped by paragraph. A 55-minute video comes back in under a minute when captions are available. When they are not — the uploader disabled captions, or the video is brand new and YouTube has not generated them yet — fall back to downloading the audio and running it through Whisper locally.

Six methods exist. Most of the time you only need the first one. The rest are fallbacks for specific failure modes.

Which Method To Reach For

Situation	Method
Video has auto-captions and you want the text	1. yt-dlp
yt-dlp is broken that day by a fresh n-challenge	2. youtube-transcript-api
Client-side (browser), dodging cloud-IP blocks	3-5. Client-side cascade
Captions disabled, or want transcription independent of YouTube	6. Whisper
Video uploaded less than 45 minutes ago	Wait, or skip straight to Whisper

The Six Methods

1. yt-dlp — Auto-Sub Download (Primary)

Downloads the auto-generated VTT file without touching the video itself. Fastest method for anything on YouTube with captions ready.

cd /tmp && yt-dlp --write-auto-subs --sub-lang en --sub-format vtt \
  --skip-download -o "yt-%(id)s" "https://www.youtube.com/watch?v=VIDEO_ID"

Auto-caption VTT has rolling overlap — each caption block repeats text from the previous one so words slide onto the screen during playback. Naive parsing produces heavy repetition. The cleanup in clean_vtt() at publish-episode.py:238 strips inline tags, de-dupes against a seen set, and groups cues into 60-second paragraphs.
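The three cleanup steps described above can be sketched as follows. This is not the code at publish-episode.py:238; the function name clean_vtt_sketch, the regexes, and the return shape are assumptions made for illustration.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # inline tags such as <c> and <00:00:01.500>
CUE_RE = re.compile(r"(\d+):(\d{2}):(\d{2})\.(\d{3})\s+-->")


def clean_vtt_sketch(vtt_text, window=60):
    """Return [(start_seconds, paragraph_text), ...] from auto-caption VTT."""
    seen = set()               # every caption line already emitted
    paragraphs = []            # finished (start, text) pairs
    bucket, bucket_start = [], None
    cue_start = None

    for raw in vtt_text.splitlines():
        line = raw.strip()
        match = CUE_RE.match(line)
        if match:
            hours, minutes, seconds, _ms = (int(g) for g in match.groups())
            cue_start = hours * 3600 + minutes * 60 + seconds
            continue
        if cue_start is None or not line:
            continue  # header lines (WEBVTT, Kind:, Language:) and blanks
        text = TAG_RE.sub("", line).strip()
        if not text or text in seen:
            continue  # rolling overlap: this line was already emitted
        seen.add(text)
        # Close the current paragraph once it spans `window` seconds.
        if bucket and cue_start - bucket_start >= window:
            paragraphs.append((bucket_start, " ".join(bucket)))
            bucket = []
        if not bucket:
            bucket_start = cue_start
        bucket.append(text)

    if bucket:
        paragraphs.append((bucket_start, " ".join(bucket)))
    return paragraphs
```

One trade-off of a global seen set: a line that legitimately recurs later in the video (a repeated "yeah", for instance) is dropped too. Comparing only against the previous cue would keep those at the cost of slightly more bookkeeping.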

Rate limits: YouTube throttles per IP. A few hundred pulls per day before slowdown. Not an issue from a personal machine; very much an issue from cloud IPs.
Auth: None.
Gotchas: YouTube changes its n-challenge periodically, which breaks older yt-dlp builds. Keep yt-dlp current with brew upgrade yt-dlp. Cloud IPs (VPS, Railway) hit blocks sooner than residential IPs.

2. youtube-transcript-api — Python Library Fallback

When yt-dlp is broken on a given day, this library is the next reach.

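from youtube_transcript_api import YouTubeTranscriptApi
# Static call from library releases before 1.0; newer releases expose
# YouTubeTranscriptApi().fetch("VIDEO_ID") instead.
t = YouTubeTranscriptApi.get_transcript("VIDEO_ID")
# t is a list of dicts with keys text, start, duration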

A standalone HTTP server lives at youtube-clipseeker/transcript-fetcher.py. With it running, curl http://localhost:9876/transcript/VIDEO_ID returns the transcript as JSON.

Rate limits: Roughly 400-500 requests per IP per day before YouTube starts blocking.
Auth: None.
Gotchas: Same cloud-IP issue as yt-dlp. The library wraps a parser that occasionally breaks when YouTube changes response shapes.
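The library's entry point has changed between releases, so a thin wrapper can keep callers working on either side of that change. The sketch below assumes the older static get_transcript and the newer instance method fetch (whose result offers to_raw_data()); the wrapper name fetch_entries is hypothetical, and the class is passed in so the function can be exercised without network access.

```python
def fetch_entries(api_cls, video_id):
    """Return [{'text', 'start', 'duration'}, ...] across library versions."""
    if hasattr(api_cls, "fetch"):
        # Newer releases: instance method returning a transcript object
        # that converts to the familiar list of dicts.
        return api_cls().fetch(video_id).to_raw_data()
    # Older releases: static method returning the list of dicts directly.
    return api_cls.get_transcript(video_id)
```

Typical use would be fetch_entries(YouTubeTranscriptApi, "VIDEO_ID") after importing the class from youtube_transcript_api.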

3. Direct timedtext XML

YouTube exposes caption tracks at https://www.youtube.com/api/timedtext?v=VIDEO_ID&lang=en&fmt=srv3. Hit it with fetch or curl, parse the XML.

Used client-side in youtube-clipseeker/src/services/clientTranscript.js so the request comes from the user’s browser IP, dodging cloud blocks entirely.

Gotchas: Sometimes returns empty. Try en, en-US, en-GB, and a.en (auto-generated) in order. CORS blocks pure-browser calls without a proxy (ClipSeeker cycles through corsproxy.io, api.allorigins.win, api.codetabs.com).
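The language fallback order can be sketched in Python, even though the ClipSeeker implementation is JavaScript. The sketch assumes the srv3 response is XML with p elements carrying a millisecond t attribute; the function names are hypothetical, and fetch is any caller-supplied callable that maps a URL to response text (direct or through a CORS proxy).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

LANG_ORDER = ["en", "en-US", "en-GB", "a.en"]  # order taken from the gotcha above


def timedtext_urls(video_id, langs=LANG_ORDER):
    """Build one timedtext URL per candidate language code."""
    base = "https://www.youtube.com/api/timedtext"
    return [
        f"{base}?{urlencode({'v': video_id, 'lang': lang, 'fmt': 'srv3'})}"
        for lang in langs
    ]


def parse_srv3(xml_text):
    """Return [(start_seconds, text), ...]; an empty body yields []."""
    if not xml_text.strip():
        return []
    root = ET.fromstring(xml_text)
    cues = []
    for p in root.iter("p"):
        # itertext() also collects text nested in per-word child elements.
        text = " ".join("".join(p.itertext()).split())
        if text:
            cues.append((int(p.get("t", "0")) / 1000, text))
    return cues


def first_non_empty(video_id, fetch):
    """Try each language in order and return the first non-empty cue list."""
    for url in timedtext_urls(video_id):
        cues = parse_srv3(fetch(url))
        if cues:
            return cues
    return []
```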

4. Page HTML Parsing (ytInitialPlayerResponse)

Fetch the video page HTML, regex-extract the ytInitialPlayerResponse JSON, navigate to captions.playerCaptionsTracklistRenderer.captionTracks[0].baseUrl, then fetch that URL.

Used as a fallback in the same ClipSeeker file above when direct timedtext returns empty.

Gotchas: Regex pattern depends on YouTube’s current page structure. Breaks when they rename or reshape embedded JSON.
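A Python sketch of the extraction step, offered as an illustration and not as a port of the ClipSeeker code. It swaps in a slightly different technique from a pure regex: a regex only locates the ytInitialPlayerResponse assignment, and json.JSONDecoder.raw_decode then parses the object that follows, so the code does not have to guess where the JSON ends. The function name is hypothetical.

```python
import json
import re

MARKER_RE = re.compile(r"ytInitialPlayerResponse\s*=\s*")


def caption_base_url(html):
    """Return the first caption track's baseUrl, or None if none is found."""
    decoder = json.JSONDecoder()
    for match in MARKER_RE.finditer(html):
        try:
            # raw_decode parses one JSON value and ignores what follows it.
            player, _end = decoder.raw_decode(html, match.end())
        except json.JSONDecodeError:
            continue
        if not isinstance(player, dict):
            continue
        tracks = (
            player.get("captions", {})
            .get("playerCaptionsTracklistRenderer", {})
            .get("captionTracks", [])
        )
        if tracks:
            return tracks[0].get("baseUrl")
    return None
```

This still depends on the variable name and the key path staying stable, so the gotcha above applies unchanged.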

5. youtubei Internal POST API

POST https://www.youtube.com/youtubei/v1/get_transcript

Takes a client context and a params token extracted from the video page. Returns a structured transcript response. Most reliable server-side fallback — implemented at youtube-clipseeker/api/transcript/[videoId].js for Vercel Edge Runtime.

Gotchas: Requires extracting valid params from the page first. Client version string in the context must be recent.
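A rough sketch of how such a request might be assembled in Python. The body layout (a context.client object with clientName and clientVersion, plus params) is an assumption based on how this internal API is commonly called, not something confirmed by the ClipSeeker source, and build_request is a hypothetical name. The params token and client version must still come from the video page.

```python
import json
import urllib.request

ENDPOINT = "https://www.youtube.com/youtubei/v1/get_transcript"


def build_request(params_token, client_version):
    """Assemble (but do not send) the POST request for get_transcript."""
    body = {
        # Assumed context shape; clientVersion must be a recent value.
        "context": {
            "client": {"clientName": "WEB", "clientVersion": client_version}
        },
        "params": params_token,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is then a matter of passing the request to urllib.request.urlopen and decoding the JSON response.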

6. Whisper — Local Audio Transcription

When captions are disabled or you want a transcript independent of YouTube, download the audio and run it through OpenAI Whisper locally.

yt-dlp -x --audio-format mp3 -o "/tmp/%(id)s.%(ext)s" "https://www.youtube.com/watch?v=VIDEO_ID"
whisper /tmp/VIDEO_ID.mp3 --model base --output_format txt --output_dir /tmp/

Whisper uses ffmpeg under the hood, so it handles anything ffmpeg can decode: MP3, MP4, M4A, WAV, FLAC, OGG, WebM, MOV, AVI, MKV. The audio-extraction step is for efficiency, not necessity — Whisper can read the MP4 directly, it is just wasteful to decode video you do not need.

Models:

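base: fastest of the models listed here, fine for long videos. A 55-minute audio file finishes in a few minutes on the Mac Studio GPU.
small / medium: higher accuracy, slower.
large-v3: highest accuracy, slowest.

Auth: None. Runs locally, no API calls, no rate limits.
Gotchas: Timestamps follow Whisper's own segments rather than YouTube's caption blocks, and --output_format txt drops them entirely. Use vtt or srt to keep them, or add --word_timestamps True for word-level timing. Requires ffmpeg installed.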
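The same transcription can be driven from Python when timestamps are wanted alongside the text. The sketch below uses the openai-whisper package's load_model and transcribe calls and reads the start time from each entry in the result's segments list; the helper names and the [MM:SS] layout are choices made for illustration.

```python
def format_segments(segments):
    """Render Whisper segments as '[MM:SS] text' lines."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text'].strip()}")
    return "\n".join(lines)


def transcribe(path, model_name="base"):
    """Transcribe an audio file locally and return timestamped lines."""
    import whisper  # imported lazily: the package is heavy and optional

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return format_segments(result["segments"])
```

Calling transcribe("/tmp/VIDEO_ID.mp3") mirrors the CLI command above, with one line per segment instead of a bare text file.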

Waiting for Auto-Captions

For long videos (45+ minutes), YouTube takes roughly 30 to 45 minutes after upload to generate auto-captions. Checking too early gets an empty VTT.

Poll until available:

until yt-dlp --list-subs "https://www.youtube.com/watch?v=VIDEO_ID" 2>&1 | grep -q "^en "; do
  echo "Not ready, waiting 5 minutes..."
  sleep 300
done

Or skip the wait entirely and run Whisper on the audio immediately. Transcript quality is comparable, with no dependency on YouTube finishing its job.
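The shell loop above never gives up. A bounded version in Python might look like the sketch below, which stops after max_wait seconds so a caller can fall back to Whisper. The function names and default values are assumptions; the check is passed in as a callable so the timing logic can be tested without calling yt-dlp.

```python
import subprocess
import time


def has_english_subs(url):
    """True when `yt-dlp --list-subs` prints a line starting with 'en '."""
    out = subprocess.run(
        ["yt-dlp", "--list-subs", url],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as 2>&1 in the shell loop
        text=True,
    ).stdout
    return any(line.startswith("en ") for line in out.splitlines())


def wait_for_captions(check, interval=300, max_wait=3600, sleep=time.sleep):
    """Call check() every `interval` seconds until true or time runs out."""
    waited = 0
    while True:
        if check():
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval
```

A caller could run wait_for_captions(lambda: has_english_subs(url)) and switch to method 6 when it returns False.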

Who Uses These

tms-ops/publish-episode.py — Uses method 1 (yt-dlp) as the primary transcript step for every TMS episode. Cleans the VTT into timestamped paragraphs with clean_vtt().

youtube-clipseeker — Uses methods 2 through 5 in a cascade: client-side timedtext (browser IP), then client-side HTML parse, then client-side youtubei POST, then server-side youtube-transcript-api as the last resort. The cascade exists because Vercel and Railway IPs get blocked faster than residential connections.

One-off capability — When James says “pull the transcript from this YouTube URL,” methods 1, 2, or 6 all work from a personal machine. Default is method 1.

Known Gaps / TODOs

No automated polling in publish-episode.py. If captions are not ready when cmd_init runs, the VTT lands empty and the pipeline continues without a transcript. Requires manual re-run.
No Whisper fallback in publish-episode.py. When auto-captions fail, the script warns but does not fall back to audio transcription automatically. Would need ffmpeg extraction and a Whisper subprocess call.
Rate-limit detection is shallow. youtube-clipseeker checks for error strings like "IP", "blocked", "too many requests" but does not back off intelligently across repeated failures.
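For the rate-limit gap, one possible direction is exponential backoff with jitter. The sketch below is in Python for consistency with the other examples here, although youtube-clipseeker itself is JavaScript. The matched strings come from the list above; the function names, retry count, and delays are assumptions.

```python
import random
import time


def looks_rate_limited(message):
    """Match the error strings the current check already looks for."""
    lowered = message.lower()
    return (
        "IP" in message  # case-sensitive, so words like "script" do not match
        or "blocked" in lowered
        or "too many requests" in lowered
    )


def call_with_backoff(fn, attempts=5, base_delay=2.0, max_delay=300.0,
                      sleep=time.sleep, rng=random.random):
    """Retry fn() on rate-limit errors with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_try = attempt == attempts - 1
            if last_try or not looks_rate_limited(str(exc)):
                raise  # unrelated error, or retries exhausted
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * (0.5 + rng() / 2))  # jitter: 50-100% of the delay
```

Doubling the delay spaces retries out as failures repeat, and the jitter keeps several clients behind one IP from retrying in lockstep.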

Related

Create a Blog Post: consumes transcripts to generate articles.
Public-facing version of this page: Three Ways to Pull a Transcript from Any YouTube Video (same methods, teaching tone, no internal paths).