Get a YouTube Transcript
What It Does
Takes a YouTube URL and returns plain-text captions, timestamped by paragraph. A 55-minute video comes back in under a minute when captions are available. When they are not (the uploader disabled captions, or the video is brand new and YouTube has not generated them yet), fall back to downloading the audio and running it through Whisper locally.
Six methods exist. Most of the time you only need the first one. The rest are fallbacks for specific failure modes.
Which Method To Reach For
| Situation | Method |
|---|---|
| Video has auto-captions and you want the text | 1. yt-dlp |
| yt-dlp’s n-challenge breaks that day | 2. youtube-transcript-api |
| Client-side (browser), dodging cloud-IP blocks | 3-5. Client-side cascade |
| Captions disabled, or want transcription independent of YouTube | 6. Whisper |
| Video uploaded less than 45 minutes ago | Wait, or skip straight to Whisper |
The Six Methods
1. yt-dlp — Auto-Sub Download (Primary)
Downloads the auto-generated VTT file without touching the video itself. Fastest method for anything on YouTube with captions ready.

```bash
cd /tmp && yt-dlp --write-auto-subs --sub-lang en --sub-format vtt \
  --skip-download -o "yt-%(id)s" "https://www.youtube.com/watch?v=VIDEO_ID"
```

Auto-caption VTT has rolling overlap: each caption block repeats text from the previous one so words slide onto the screen during playback. Naive parsing produces heavy repetition. The cleanup in `clean_vtt()` at publish-episode.py:238 strips inline tags, de-dupes against a seen set, and groups cues into 60-second paragraphs.
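The cleanup steps described above (strip inline tags, de-dupe against a seen set, group into 60-second paragraphs) can be sketched as follows. This is a minimal illustration and not the actual `clean_vtt()` from publish-episode.py; the function name `clean_vtt_sketch`, the regexes, and the assumption that cue timings use the `HH:MM:SS.mmm` form are all hypothetical.

```python
import re

# Assumption: cue timing lines start with HH:MM:SS.mmm followed by "-->".
CUE_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})\.(\d{3})\s+-->")
INLINE_TAG = re.compile(r"<[^>]+>")  # e.g. <c>, </c>, <00:00:01.200>


def clean_vtt_sketch(vtt_text, window=60):
    """Collapse rolling auto-caption cues into timestamped paragraphs."""
    seen = set()      # caption lines already emitted
    paragraphs = []   # list of (window_start_seconds, [lines])
    start = None      # start time of the cue currently being read
    for raw in vtt_text.splitlines():
        m = CUE_TIME.match(raw)
        if m:
            hours, minutes, seconds, _ms = map(int, m.groups())
            start = hours * 3600 + minutes * 60 + seconds
            continue
        text = INLINE_TAG.sub("", raw).strip()
        if start is None or not text or text in seen:
            continue  # header lines, blank lines, and rolling repeats
        seen.add(text)
        bucket = start - start % window
        if not paragraphs or paragraphs[-1][0] != bucket:
            paragraphs.append((bucket, []))
        paragraphs[-1][1].append(text)
    return [
        f"[{t // 3600:02d}:{t % 3600 // 60:02d}:{t % 60:02d}] {' '.join(lines)}"
        for t, lines in paragraphs
    ]
```

One trade-off of a single global seen set: a line that is legitimately spoken twice in the video is kept only the first time.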
- Rate limits: YouTube throttles per IP. A few hundred pulls per day before slowdown. Not an issue from a personal machine; very much an issue from cloud IPs.
- Auth: None.
- Gotchas: YouTube rolls n-challenges periodically. Keep yt-dlp current with `brew upgrade yt-dlp`. Cloud IPs (VPS, Railway) hit blocks sooner than residential IPs.
2. youtube-transcript-api — Python Library Fallback
When yt-dlp is broken on a given day, this library is the next reach.

```python
from youtube_transcript_api import YouTubeTranscriptApi

# Pre-1.0 call. Releases from 1.0 on deprecate it in favor of
# YouTubeTranscriptApi().fetch("VIDEO_ID").
t = YouTubeTranscriptApi.get_transcript("VIDEO_ID")
# list of {text, start, duration}
```

A standalone HTTP server lives at youtube-clipseeker/transcript-fetcher.py; `curl http://localhost:9876/transcript/VIDEO_ID` returns JSON.
- Rate limits: Roughly 400-500 requests per IP per day before YouTube starts blocking.
- Auth: None.
- Gotchas: Same cloud-IP issue as yt-dlp. The library wraps a parser that occasionally breaks when YouTube changes response shapes.
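Because both methods so far are throttled per IP, repeated failures are better met with growing delays than with immediate retries. The sketch below assumes that error text containing markers such as "IP", "blocked", or "too many requests" signals throttling; the marker list, the delay values, and the `fetch` callable are illustrative and not taken from any existing script.

```python
import re
import time

# Illustrative markers. "IP" is matched as a whole, case-sensitive word so
# that ordinary words such as "transcript" do not trigger it.
MARKERS = [
    re.compile(r"\bIP\b"),
    re.compile(r"blocked", re.I),
    re.compile(r"too many requests", re.I),
]


def looks_rate_limited(message):
    """True if the error text contains any throttling marker."""
    return any(p.search(message) for p in MARKERS)


def backoff_delays(base=30, factor=2, cap=900, attempts=5):
    """Seconds to wait before each retry, growing by `factor` up to `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def fetch_with_backoff(fetch, sleep=time.sleep, **delay_options):
    """Call fetch(); on throttling errors wait and retry, else re-raise."""
    for delay in backoff_delays(**delay_options):
        try:
            return fetch()
        except Exception as exc:
            if not looks_rate_limited(str(exc)):
                raise
            sleep(delay)
    return fetch()  # final attempt; any error propagates to the caller
```

Passing `sleep` as a parameter keeps the retry logic testable without real waiting.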
3. Direct timedtext XML
YouTube exposes caption tracks at https://www.youtube.com/api/timedtext?v=VIDEO_ID&lang=en&fmt=srv3. Hit it with fetch or curl, parse the XML.
Used client-side in youtube-clipseeker/src/services/clientTranscript.js so the request comes from the user’s browser IP, dodging cloud blocks entirely.
- Gotchas: Sometimes returns empty. Try `en`, `en-US`, `en-GB`, and `a.en` (auto-generated) in order. CORS blocks pure-browser calls without a proxy (ClipSeeker cycles through `corsproxy.io`, `api.allorigins.win`, `api.codetabs.com`).
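The language fallback described above can be sketched in Python. This is an illustration rather than the ClipSeeker implementation (which is JavaScript): the helper names are hypothetical, `fetch` is any caller-supplied function that maps a URL to response text, and the parser assumes srv3 responses carry cues as `<p t="milliseconds">` elements.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

TIMEDTEXT = "https://www.youtube.com/api/timedtext"
LANG_ORDER = ["en", "en-US", "en-GB", "a.en"]  # fallback order from the text


def timedtext_urls(video_id, langs=LANG_ORDER):
    """Build one candidate URL per language code, in fallback order."""
    urls = []
    for lang in langs:
        query = urlencode({"v": video_id, "lang": lang, "fmt": "srv3"})
        urls.append(f"{TIMEDTEXT}?{query}")
    return urls


def parse_srv3(xml_text):
    """Return [(start_seconds, text)]; an empty body yields an empty list."""
    if not xml_text.strip():
        return []  # the endpoint sometimes answers with nothing
    cues = []
    for p in ET.fromstring(xml_text).iter("p"):
        text = "".join(p.itertext()).strip()
        if text:
            cues.append((int(p.get("t", "0")) / 1000, text))
    return cues


def first_transcript(video_id, fetch):
    """Try each language in order and return the first non-empty result."""
    for url in timedtext_urls(video_id):
        cues = parse_srv3(fetch(url))
        if cues:
            return cues
    return []
```

In a server-side script `fetch` could wrap `urllib.request.urlopen`; in tests it can be a stub that returns canned XML.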
4. Page HTML Parsing (ytInitialPlayerResponse)
Fetch the video page HTML, regex-extract the ytInitialPlayerResponse JSON, navigate to captions.playerCaptionsTracklistRenderer.captionTracks[0].baseUrl, then fetch that URL.
Used as a fallback in the same ClipSeeker file above when direct timedtext returns empty.
- Gotchas: Regex pattern depends on YouTube’s current page structure. Breaks when they rename or reshape embedded JSON.
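One way to make the extraction less sensitive to the exact page layout is to use the regex only to find the assignment and let a JSON decoder find the end of the object. The Python sketch below illustrates that idea; it is not the ClipSeeker code, the function name is hypothetical, and it assumes the page embeds the object as `ytInitialPlayerResponse = {...}`.

```python
import json
import re

# Assumption: the watch page assigns the object with "ytInitialPlayerResponse = {".
MARKER = re.compile(r"ytInitialPlayerResponse\s*=\s*")


def caption_base_url(html):
    """Return captionTracks[0].baseUrl from watch-page HTML, or None."""
    m = MARKER.search(html)
    if not m:
        return None
    try:
        # raw_decode parses one JSON value starting at m.end() and ignores
        # whatever script text follows it.
        player, _end = json.JSONDecoder().raw_decode(html, m.end())
    except json.JSONDecodeError:
        return None
    if not isinstance(player, dict):
        return None
    tracks = (
        player.get("captions", {})
        .get("playerCaptionsTracklistRenderer", {})
        .get("captionTracks", [])
    )
    return tracks[0].get("baseUrl") if tracks else None
```

Returning `None` on every failure path lets a cascade move on to the next method without special-casing exceptions.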
5. youtubei Internal POST API
```
POST https://www.youtube.com/youtubei/v1/get_transcript
```

Takes a client context and a params token extracted from the video page. Returns a structured transcript response. Most reliable server-side fallback; implemented at youtube-clipseeker/api/transcript/[videoId].js for Vercel Edge Runtime.
- Gotchas: Requires extracting valid `params` from the page first. Client version string in the context must be recent.
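For orientation, a request to this endpoint might be assembled as below. This is a hedged sketch: the JSON field names (`context.client.clientName`, `clientVersion`) reflect the commonly observed shape of this undocumented internal API and are assumptions, as are the function name and the example version string. The request is only built here, not sent.

```python
import json
import urllib.request

ENDPOINT = "https://www.youtube.com/youtubei/v1/get_transcript"


def build_transcript_request(params_token, client_version):
    """Build (but do not send) the POST request for get_transcript."""
    body = {
        # Assumed body shape: a client context plus the page-derived token.
        "context": {
            "client": {"clientName": "WEB", "clientVersion": client_version}
        },
        "params": params_token,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A caller would pass the token scraped from the video page and a current client version, for example `build_transcript_request(token, "2.20240101.00.00")`, then hand the result to `urllib.request.urlopen`.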
6. Whisper — Local Audio Transcription
When captions are disabled or you want a transcript independent of YouTube, download the audio and run it through OpenAI Whisper locally.

```bash
yt-dlp -x --audio-format mp3 -o "/tmp/%(id)s.%(ext)s" "https://www.youtube.com/watch?v=VIDEO_ID"
whisper /tmp/VIDEO_ID.mp3 --model base --output_format txt --output_dir /tmp/
```

Whisper uses ffmpeg under the hood, so it handles anything ffmpeg can decode: MP3, MP4, M4A, WAV, FLAC, OGG, WebM, MOV, AVI, MKV. The audio-extraction step is for efficiency, not necessity: Whisper can read the MP4 directly, it is just wasteful to decode video you do not need.
Models:
- `base`: fastest, fine for long videos. A 55-minute audio file finishes in a few minutes on the Mac Studio GPU.
- `small` / `medium`: higher accuracy, slower.
- `large-v3`: highest accuracy, slowest.

- Auth: None. Runs locally, no API calls, no rate limits.
- Gotchas: Timestamps are per segment rather than per caption block; word-level timing needs `--word_timestamps True`, and the `txt` output format drops timestamps entirely (use `vtt` or `srt` to keep them). Requires `ffmpeg` installed.
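When timestamps matter, the Whisper Python API returns segments with start times that can be rendered directly. The sketch below assumes the `openai-whisper` package is installed; `format_segments` and `transcribe` are hypothetical helper names, and the import is deferred so the formatting helper can be used without the package.

```python
def format_segments(segments):
    """Render Whisper-style segments as '[HH:MM:SS] text' lines."""
    lines = []
    for seg in segments:
        t = int(seg["start"])  # segment start in whole seconds
        stamp = f"{t // 3600:02d}:{t % 3600 // 60:02d}:{t % 60:02d}"
        lines.append(f"[{stamp}] {seg['text'].strip()}")
    return lines


def transcribe(audio_path, model_name="base"):
    """Run Whisper locally; needs the openai-whisper package and ffmpeg."""
    import whisper  # imported lazily so format_segments works without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return format_segments(result["segments"])
```

For example, `transcribe("/tmp/VIDEO_ID.mp3")` would return one timestamped line per segment.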
Waiting for Auto-Captions
For long videos (45+ minutes), YouTube takes roughly 30 to 45 minutes after upload to generate auto-captions. Checking too early gets an empty VTT.
Poll until available:
```bash
until yt-dlp --list-subs "https://www.youtube.com/watch?v=VIDEO_ID" 2>&1 | grep -q "^en "; do
  echo "Not ready, waiting 5 minutes..."
  sleep 300
done
```

Or skip the wait entirely and run Whisper on the audio immediately. Same transcript quality, no dependency on YouTube finishing its job.
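The shell loop above never gives up if captions are never generated. A bounded version could look like the Python sketch below, which mirrors the same `yt-dlp --list-subs` check; the function names, the 12-attempt limit, and the line-prefix test are assumptions made for illustration.

```python
import subprocess
import time


def captions_listed(url, lang="en"):
    """True if `yt-dlp --list-subs` prints a row starting with the language code."""
    out = subprocess.run(
        ["yt-dlp", "--list-subs", url],
        capture_output=True, text=True, check=False,
    )
    listing = out.stdout + out.stderr
    return any(line.startswith(lang + " ") for line in listing.splitlines())


def wait_for_captions(check, interval=300, max_attempts=12, sleep=time.sleep):
    """Call check() until it returns True or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        if check():
            return True
        if attempt < max_attempts:
            sleep(interval)  # no pointless wait after the final attempt
    return False
```

Usage might be `wait_for_captions(lambda: captions_listed(url))`, with a `False` result triggering the Whisper path instead.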
Who Uses These
tms-ops/publish-episode.py: Uses method 1 (yt-dlp) as the primary transcript step for every TMS episode. Cleans the VTT into timestamped paragraphs with `clean_vtt()`.
youtube-clipseeker: Runs methods 2 through 5 as a cascade. It tries client-side timedtext (browser IP) first, then client-side HTML parse, then client-side youtubei POST, with server-side youtube-transcript-api as the last resort. The cascade exists because Vercel and Railway IPs get blocked faster than residential connections.
One-off capability: When James says “pull the transcript from this YouTube URL,” methods 1, 2, and 6 all work from a personal machine. Default is method 1.
Known Gaps / TODOs
- No automated polling in `publish-episode.py`. If captions are not ready when `cmd_init` runs, the VTT lands empty and the pipeline continues without a transcript. Requires manual re-run.
- No Whisper fallback in `publish-episode.py`. When auto-captions fail, the script warns but does not fall back to audio transcription automatically. Would need `ffmpeg` extraction and a Whisper subprocess call.
- Rate-limit detection is shallow. youtube-clipseeker checks for error strings like `"IP"`, `"blocked"`, `"too many requests"` but does not back off intelligently across repeated failures.
Related
- Create a Blog Post: consumes transcripts to generate articles.
- Public-facing version of this page: Three Ways to Pull a Transcript from Any YouTube Video (same methods, teaching tone, no internal paths).