
What if your gameplay footage could narrate itself — like a real commentator watching live?
That’s exactly what this project does. I built a pipeline that takes any gameplay video and automatically generates AI-powered commentary, complete with a human-sounding voiceover, synced to the video.
I first tested it on an Age of Empires II gameplay video — the AI picked up on unit movements, battles, and base-building and narrated them like a sports commentator. Watch the original demo here:
The pipeline runs fully sequentially:
1. ffmpeg slices the video into frames at regular intervals
2. The frames are stitched into a 3×3 grid, nine frames per combined image
3. GPT-4 Vision generates commentary for each grid
4. The commentary is voiced with OpenAI TTS (tts-1-hd, voice: Nova) or Voxtral — a fully local TTS that runs on Apple Silicon with no API calls
5. ffmpeg merges the voiceover back onto the original video

Sending 9 frames as a single combined image rather than 9 separate API calls dramatically reduces cost and latency while still giving the model enough temporal context to understand what is happening in the scene.
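The frame-stitching step can be sketched roughly like this. This is a minimal illustration using Pillow, not code from the repo; the function name, tile size, and layout are my own assumptions:

```python
from PIL import Image

def stitch_grid(frames, tile_size=(320, 180)):
    """Combine up to 9 frames into a single 3x3 grid image.

    One stitched image replaces 9 separate vision API calls while
    preserving temporal order (left-to-right, top-to-bottom).
    """
    w, h = tile_size
    grid = Image.new("RGB", (w * 3, h * 3))
    for i, frame in enumerate(frames[:9]):
        tile = frame.resize(tile_size)
        grid.paste(tile, ((i % 3) * w, (i // 3) * h))
    return grid

# Example: 9 solid-color placeholder frames standing in for video frames
frames = [Image.new("RGB", (1280, 720), (i * 20, 40, 80)) for i in range(9)]
combined = stitch_grid(frames)
print(combined.size)  # (960, 540)
```

The combined image can then be base64-encoded and sent in a single vision request.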
The newer version supports Voxtral — an on-device TTS model for Apple Silicon. No extra API key, no cloud calls, fully private. Just set tts.method: voxtral in config.yaml.
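Assuming config.yaml uses a simple nested layout (the exact schema isn't shown in the post), the switch might look like:

```yaml
tts:
  method: voxtral   # or the OpenAI backend (tts-1-hd, voice: Nova)
```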
This project was built almost immediately after OpenAI released the GPT-4 Vision API on November 6, 2023 — making it one of the earliest known implementations of a GPT-4 Vision + TTS pipeline applied to real-time gameplay narration.
Prior work on automated commentary existed, but almost exclusively focused on sports — both real-world sports and sports video games. Commentary for non-sports titles (strategy, FPS, etc.) had no equivalent before this project.
What makes this project distinct beyond timing: it combines frame stitching into 3×3 grids, conversation history for context continuity, and audio duration-aware prompting — none of which appeared in the earliest comparable public demos from that period.
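Audio duration-aware prompting presumably means constraining each commentary segment so the spoken audio fits its video slice. A rough sketch of that idea — the speaking rate, function names, and prompt wording are my assumptions, not the repo's implementation:

```python
def target_word_count(clip_seconds: float, words_per_minute: int = 150) -> int:
    """Estimate how many words of commentary fit in a clip,
    assuming a typical TTS speaking rate (~150 wpm)."""
    return round(clip_seconds * words_per_minute / 60)

def build_prompt(clip_seconds: float, history: list[str]) -> str:
    """Combine recent conversation history (for continuity) with a
    length constraint so the narration fits the clip duration."""
    limit = target_word_count(clip_seconds)
    context = "\n".join(history[-3:])  # keep only the last few segments
    return (
        f"Previous commentary:\n{context}\n\n"
        f"Narrate the next scene like a sports commentator "
        f"in at most {limit} words."
    )

print(target_word_count(12))  # → 30 words for a 12-second clip
```

Feeding the prior segments back in is what keeps the commentator from reintroducing the same battle every nine frames.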
The project uses uv for dependency management. Install dependencies and launch the notebook:

uv sync
uv run jupyter notebook
Set your OpenAI API key in constants.py, configure config.yaml, and run cells top to bottom. Output lands at experiments/<folder>/output.mp4.
The full source code is available on GitHub:
🐙 github.com/mathewvarghesemanu/Content-based-video-narration-using-deep-learning