TransKai · YT Transcribe → Translate → Watch
Multi-Lingual Video Intelligence · Brand & On-Screen Detection
An AI agent that downloads any YouTube video, auto-detects the spoken language, transcribes it with Whisper, translates it with Gemini, and at the same time watches the video to detect on-screen text, brand mentions, sponsor placements, and speaker turns. It outputs a multi-column spreadsheet for downstream review.
The Brief
Problem
A content review team needed to triage long-form video content in multiple Indian and global languages. Manual transcription took days. Catching on-screen brand placements (e.g. "MobiKwik Pocket UPI", sponsor logos) was completely manual and error-prone.
The Architecture
Decision
Built TransKai, a 4-stage agent pipeline: (1) video download with yt-dlp, (2) Whisper transcription with automatic language detection, (3) Gemini translation that preserves timestamps, and (4) an "AI is watching" agent that runs in parallel and detects on-screen text, brand mentions, and speaker turns frame by frame. The output is a structured multi-column XLSX or PDF.
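A minimal sketch of how the four stages could be wired together, assuming each stage is supplied as a plain callable. The names `Segment`, `Detection`, and `run_pipeline` are hypothetical and not taken from TransKai itself. In a real build the callables would wrap yt-dlp, Whisper, Gemini, and a frame-analysis model. Running translation and watching on a thread pool is one assumed way to get the parallelism described above.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    start: float       # seconds from the start of the video
    end: float
    text: str          # transcript text in the source language
    english: str = ""  # filled in by the translation stage


@dataclass
class Detection:
    timestamp: float   # seconds from the start of the video
    kind: str          # e.g. "on_screen_text", "brand", "speaker_turn"
    value: str


def run_pipeline(
    url: str,
    download: Callable[[str], str],
    transcribe: Callable[[str], List[Segment]],
    translate: Callable[[List[Segment]], List[Segment]],
    watch: Callable[[str], List[Detection]],
) -> Tuple[List[Segment], List[Detection]]:
    """Run the four stages; translation and watching run concurrently."""
    media_path = download(url)         # stage 1: fetch the video
    segments = transcribe(media_path)  # stage 2: timestamped transcript
    with ThreadPoolExecutor(max_workers=2) as pool:
        translated = pool.submit(translate, segments)  # stage 3
        detections = pool.submit(watch, media_path)    # stage 4
        return translated.result(), detections.result()
```

Passing the stages in as callables keeps the orchestration testable with stubs and lets any single stage be swapped without touching the others.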
The Outcome
Result
Review throughput improved because reviewers now spend their time on judgment rather than transcription. Brand-placement detection that took hours per video now runs automatically as part of the same pipeline.
How it actually works in production.
Acquire
YouTube URL: reviewer submits
Download: yt-dlp
Language detect: any world language
Transcribe & Translate
Whisper transcribe: in source language
Translate → English: Gemini
Contextual explanation: culture · slang · entities
Review
Reviewer dashboard: orig + EN + explanation
Flag / clear: human decision
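The Review stage can be pictured as one record per transcript line that waits for a human decision. The sketch below is illustrative only: `ReviewRow`, `Status`, and `decide` are hypothetical names, and the fields simply mirror the dashboard columns listed above (original, English, explanation).

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"   # produced by the pipeline, not yet reviewed
    FLAGGED = "flagged"   # reviewer marked the line for follow-up
    CLEARED = "cleared"   # reviewer found nothing to act on


@dataclass
class ReviewRow:
    timestamp: str
    original: str         # transcript in the source language
    english: str          # translation
    explanation: str      # contextual note: culture, slang, entities
    status: Status = Status.PENDING
    reviewer: str = ""


def decide(row: ReviewRow, flag: bool, reviewer: str) -> ReviewRow:
    """Record a human decision; only this function changes the status."""
    row.status = Status.FLAGGED if flag else Status.CLEARED
    row.reviewer = reviewer
    return row
```

Keeping the status change in a single function reflects the flow above, where the pipeline prepares the rows and the flag or clear call stays with the reviewer.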
Animated · Built in code · No GIFs
See TransKai · YT Transcribe → Translate → Watch in action.
TransKai live demo — YouTube URL → download → transcribe → translate → AI watches for brand mentions and on-screen text
Video preview modal — auto language detection, file metadata, translate-from / translate-to picker
Audio extraction stage — preparing for Whisper transcription
Transcription complete — Whisper extracted Hindi speech with timestamps
Translating with Gemini at 85% — "AI is watching" panel detecting on-screen text + speaker turns in parallel
Real-time AI detection of brand mentions, on-screen graphics, and speaker handoffs
Final XLSX output — timestamps, original Hindi, English translation, brand placements (Haier, MobiKwik Pocket UPI, THE LALLANTOP) detected automatically
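The final sheet described above pairs each transcript line with whatever was detected on screen during it. A minimal sketch of that merge follows, with several assumptions: detections are attached to a segment when their timestamp falls inside its half-open `[start, end)` window, the helper names `format_ts`, `build_rows`, and `to_csv` are hypothetical, and CSV from the standard library stands in for the XLSX export.

```python
import csv
import io
from typing import Dict, List, Tuple

COLUMNS = ["start", "end", "original", "english", "detections"]


def format_ts(seconds: float) -> str:
    """Render seconds as HH:MM:SS, dropping any fractional part."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def build_rows(
    segments: List[Tuple[float, float, str, str]],
    detections: List[Tuple[float, str]],
) -> List[Dict[str, str]]:
    """Attach each detection to the segment whose time window contains it."""
    rows = []
    for start, end, original, english in segments:
        labels = [label for ts, label in detections if start <= ts < end]
        rows.append({
            "start": format_ts(start),
            "end": format_ts(end),
            "original": original,
            "english": english,
            "detections": "; ".join(labels),
        })
    return rows


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize rows to CSV text (a stand-in for the XLSX writer)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

A detection that falls in a gap between segments is dropped by this rule; a production version would need to decide whether to snap it to the nearest segment or list it on its own row.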
Stack