06 · 2025 · Government / Content Review Client

TransKai · YT Transcribe → Translate → Watch

Multi-Lingual Video Intelligence · Brand & On-Screen Detection

PRIVATE REPO · NDA · Government · Content Review

An AI agent that downloads any YouTube video, auto-detects its spoken language, transcribes it with Whisper, and translates it with Gemini, while simultaneously watching the video to detect on-screen text, brand mentions, sponsor placements, and speaker turns. It outputs a multi-column spreadsheet for downstream review.

№ 01 Any language → English
№ 02 Whisper + Gemini
№ 03 On-screen brand detection
№ 04 XLSX / PDF export

The Brief

Problem

A content review team needed to triage long-form video content in multiple Indian and global languages. Manual transcription took days. Catching on-screen brand placements (e.g. "MobiKwik Pocket UPI", sponsor logos) was completely manual and error-prone.

The Architecture

Decision

Built TransKai — a 4-stage agent pipeline: (1) yt-dlp video download, (2) Whisper transcription with language auto-detect, (3) Gemini translation with timestamp preservation, (4) "AI is watching" parallel agent that detects on-screen text, brand mentions, and speaker turns frame-by-frame. Outputs a structured multi-column XLSX/PDF.
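A minimal sketch of how the four stages could be wired together, assuming each stage is injected as a plain Python callable. The function and field names are hypothetical, not TransKai's code. The watcher (stage 4) is submitted to a thread pool so it runs while transcription and translation proceed, mirroring the parallel "AI is watching" agent described above.

```python
# Minimal sketch: four injected stages, with stage 4 (the watcher) running in
# parallel with stages 2-3. Names are hypothetical, not TransKai's own code.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PipelineResult:
    video_path: str
    language: str                   # language code reported by transcription
    segments: list[dict]            # timestamped segments with translations
    detections: list[dict] = field(default_factory=list)  # watcher output


def run_pipeline(
    url: str,
    download: Callable[[str], str],
    transcribe: Callable[[str], tuple[str, list[dict]]],
    translate: Callable[[list[dict]], list[dict]],
    watch: Callable[[str], list[dict]],
) -> PipelineResult:
    # Stage 1: download once; every later stage reads the same local file.
    video_path = download(url)
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Stage 4 starts first and runs in a worker thread.
        watch_future = pool.submit(watch, video_path)
        # Stages 2-3 run on the calling thread in the meantime.
        language, segments = transcribe(video_path)
        translated = translate(segments)
        # Wait for the watcher; re-raises any exception it hit.
        detections = watch_future.result()
    return PipelineResult(video_path, language, translated, detections)
```

Injecting the stages keeps the orchestration testable with stubs. Threads are adequate here on the assumption that the heavy work in each stage happens in subprocesses, native code, or remote API calls rather than in Python bytecode.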

The Outcome

Result

Review throughput dramatically improved. Reviewers now focus on judgment, not transcription. Brand-placement detection that took hours per video now runs automatically as part of the same pipeline.

The Workflow

How it actually works in production.

01 · Acquire

YouTube URL: reviewer submits
Download: yt-dlp
Language detect: automatic (any language Whisper supports)
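A sketch of the Acquire stage under stated assumptions. The helper names are hypothetical. The yt-dlp options (`--no-playlist`, `-o` with an `%(id)s.%(ext)s` output template) and the FFmpeg options (`-vn`, `-ac`, `-ar`, `-c:a pcm_s16le`) are standard flags of those tools. Extracting a mono 16 kHz WAV is one reasonable way to prepare audio for Whisper, not necessarily what TransKai does.

```python
# Sketch of the Acquire stage. Helper names are hypothetical; the yt-dlp and
# FFmpeg options used are standard flags of those command-line tools.
import subprocess
from pathlib import Path
from urllib.parse import parse_qs, urlparse


def youtube_video_id(url: str) -> str:
    """Extract the video ID from common YouTube URL forms or raise ValueError."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    video_id = ""
    if host == "youtu.be":
        video_id = parsed.path.lstrip("/").split("/")[0]
    elif host in {"youtube.com", "www.youtube.com", "m.youtube.com"}:
        if parsed.path == "/watch":
            video_id = parse_qs(parsed.query).get("v", [""])[0]
        elif parsed.path.startswith("/shorts/"):
            video_id = parsed.path.split("/")[2]
    if not video_id:
        raise ValueError(f"not a recognised YouTube video URL: {url}")
    return video_id


def download_cmd(url: str, out_dir: Path) -> list[str]:
    # --no-playlist: fetch only this video even if the URL has a list= parameter.
    # -o: name the file after the video ID, i.e. out_dir/<id>.<ext>.
    return ["yt-dlp", "--no-playlist", "-o", str(out_dir / "%(id)s.%(ext)s"), url]


def extract_audio_cmd(video: Path, wav: Path) -> list[str]:
    # -vn drops the video stream; output is mono (-ac 1), 16 kHz (-ar 16000),
    # 16-bit PCM WAV, the sample rate Whisper works at internally.
    return ["ffmpeg", "-y", "-i", str(video), "-vn", "-ac", "1", "-ar", "16000",
            "-c:a", "pcm_s16le", str(wav)]


def run(cmd: list[str]) -> None:
    # check=True raises CalledProcessError if the tool exits non-zero.
    subprocess.run(cmd, check=True)
```

Validating the URL before shelling out keeps obviously bad input away from the downloader, and passing argument lists (not shell strings) to `subprocess.run` avoids quoting problems with user-supplied URLs.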

02 · Transcribe & Translate

Whisper transcribe: in source language
Translate → English: Gemini
Contextual explanation: culture · slang · entities
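A sketch of timestamp-preserving translation with contextual notes. It assumes segments shaped like the `segments` list returned by openai-whisper's `transcribe()` (dicts with `start`, `end`, `text`) and a hypothetical `llm` callable that sends a prompt to Gemini and returns the raw reply text. The prompt wording and the JSON reply format are illustrative assumptions. Numbering the lines lets the reply be matched back by ID, and timestamps are always copied from Whisper rather than taken from the model.

```python
# Sketch: translate Whisper-style segments and keep their timestamps.
# `llm` is a hypothetical callable wrapping a Gemini request.
import json
from typing import Callable

PROMPT = (
    "Translate each numbered line below from {lang} into English. For every "
    "line also write a short note explaining any cultural reference, slang, or "
    "named entity (empty string if none). Reply with only a JSON array of "
    'objects with the keys "id", "en" and "note".\n\n{lines}'
)


def translate_segments(segments: list[dict], lang: str,
                       llm: Callable[[str], str]) -> list[dict]:
    # Number each line so the reply is matched by id, not by position.
    lines = "\n".join(f"{i}: {s['text'].strip()}" for i, s in enumerate(segments))
    reply = json.loads(llm(PROMPT.format(lang=lang, lines=lines)))
    by_id = {int(item["id"]): item for item in reply}
    out = []
    for i, seg in enumerate(segments):
        item = by_id.get(i)
        if item is None:
            raise ValueError(f"model reply is missing line {i}")
        # Timestamps come from Whisper; only text fields come from the model.
        out.append({
            "start": seg["start"],
            "end": seg["end"],
            "original": seg["text"],
            "en": item["en"],
            "note": item.get("note", ""),
        })
    return out
```

Gemini can also be asked for JSON output directly through a JSON response MIME type in its generation config, which makes the `json.loads` step less brittle. Long transcripts would need to be sent in batches to stay within the model's context window.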

03 · Review

Reviewer dashboard: original + English + explanation
Flag / clear: human decision
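A hypothetical sketch of the human-decision gate in the Review stage. Each row carries the original text, the English translation, and the explanation. Only a named reviewer can mark a row flag or clear, and export is refused while any row is undecided. The class and function names and the refuse-until-decided rule are assumptions for illustration.

```python
# Hypothetical review gate: every row needs an explicit human decision.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    FLAG = "flag"
    CLEAR = "clear"


@dataclass
class ReviewRow:
    start: float                        # segment start time in seconds
    original: str
    english: str
    explanation: str
    decision: Optional[Decision] = None  # None until a reviewer acts
    reviewer: Optional[str] = None


def decide(row: ReviewRow, decision: Decision, reviewer: str) -> None:
    # Record who made the call so every decision is attributable to a person.
    row.decision = decision
    row.reviewer = reviewer


def ready_for_export(rows: list[ReviewRow]) -> bool:
    # Export only once no row is left undecided.
    return all(r.decision is not None for r in rows)
```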


Live in production
Visual proof
6 images · 1 video

See TransKai · YT Transcribe → Translate → Watch in action.

TransKai live demo (~2:25): YouTube URL → download → transcribe → translate → AI watches for brand mentions and on-screen text

Video preview modal — auto language detection, file metadata, translate-from / translate-to picker

Audio extraction stage — preparing for Whisper transcription

Transcription complete — Whisper extracted Hindi speech with timestamps

Gemini translation 85% complete, while the "AI is watching" panel detects on-screen text and speaker turns in parallel

Real-time AI detection of brand mentions, on-screen graphics, and speaker handoffs

Final XLSX output — timestamps, original Hindi, English translation, brand placements (Haier, MobiKwik Pocket UPI, THE LALLANTOP) detected automatically
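A sketch of how brand placements like those in the final sheet could be derived and written out. It assumes the watcher yields `(seconds, text)` pairs, for example OCR text from sampled frames, and that brands come from a reviewer-supplied watchlist. Matching is plain whole-word search on normalized text, a deliberately simple stand-in for whatever detection TransKai actually uses. `save_xlsx` assumes the third-party openpyxl package. All names are hypothetical.

```python
# Sketch: per-frame on-screen text -> brand intervals -> spreadsheet rows.
import re


def norm(text: str) -> str:
    # Case-fold and keep only word characters: "Pocket-UPI!" -> "pocket upi".
    return " ".join(re.findall(r"\w+", text.casefold()))


def brand_intervals(frames: list[tuple[float, str]], watchlist: list[str],
                    gap: float = 2.0) -> dict[str, list[list[float]]]:
    """Group each brand's hits into [start, end] intervals; hits <= gap s apart merge."""
    intervals: dict[str, list[list[float]]] = {}
    for t, text in sorted(frames):
        seen = f" {norm(text)} "
        for brand in watchlist:
            if f" {norm(brand)} " in seen:  # whole-word match, not mid-word
                spans = intervals.setdefault(brand, [])
                if spans and t - spans[-1][1] <= gap:
                    spans[-1][1] = t          # same placement, still on screen
                else:
                    spans.append([t, t])      # a new placement starts here
    return intervals


def hms(seconds: float) -> str:
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"


def build_rows(segments: list[dict],
               intervals: dict[str, list[list[float]]]) -> list[list[str]]:
    """One row per translated segment, plus a header row."""
    rows = [["Timestamp", "Original", "English", "Brand placements"]]
    for seg in segments:
        # List a brand if any of its intervals overlaps this segment.
        brands = sorted(
            brand for brand, spans in intervals.items()
            if any(a <= seg["end"] and b >= seg["start"] for a, b in spans)
        )
        rows.append([f"{hms(seg['start'])} - {hms(seg['end'])}",
                     seg["original"], seg["en"], ", ".join(brands)])
    return rows


def save_xlsx(rows: list[list[str]], path: str) -> None:
    from openpyxl import Workbook  # third-party; imported only when saving
    wb = Workbook()
    ws = wb.active
    for row in rows:
        ws.append(row)
    wb.save(path)
```

Merging hits that are at most `gap` seconds apart turns a logo that stays on screen across many sampled frames into a single placement, and every transcript segment overlapping that placement gets the brand in its column.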

Stack

Built with

Python · Whisper · Gemini · FFmpeg · yt-dlp · FastAPI