AI-powered game commentary pipeline — GPT-4o Vision + TTS

Content-Based Video Narration Using Deep Learning

What if your gameplay footage could narrate itself — like a real commentator watching live?

That’s exactly what this project does. I built a pipeline that takes any gameplay video and automatically generates AI-powered commentary, complete with a human-sounding voiceover, synced to the video.

I first tested it on an Age of Empires II gameplay video — the AI picked up on unit movements, battles, and base-building and narrated them like a sports commentator. Watch the original demo here:

🎮 See it on LinkedIn


How It Works

The pipeline runs fully sequentially:

  1. Frame extraction — ffmpeg slices the video into frames at regular intervals
  2. Frame stitching — Up to 9 consecutive frames are combined into a 3×3 grid image using Pillow
  3. Vision analysis — The grid is sent to OpenAI GPT-4o, which generates commentary. Previous narrations are passed as conversation history so it does not repeat itself
  4. Text-to-Speech — Commentary is converted to natural speech via OpenAI TTS (tts-1-hd, voice: Nova) or Voxtral — a fully local TTS that runs on Apple Silicon with no API calls
  5. Audio sync — Each clip is speed-adjusted or padded with silence to match the segment duration exactly
  6. Final merge — All clips are concatenated and merged back into the video with ffmpeg
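Step 1 can be sketched as a thin wrapper around ffmpeg's fps filter. This is a minimal illustration, not the project's actual code; the function names and the output filename pattern are assumptions:

```python
import subprocess

def ffmpeg_extract_cmd(video_path: str, out_dir: str, fps: float = 1.0) -> list[str]:
    """Build an ffmpeg command that slices a video into JPEG frames
    at a fixed sampling rate (one frame every 1/fps seconds)."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",              # ffmpeg's fps filter resamples the frame rate
        f"{out_dir}/frame_%05d.jpg",      # zero-padded, ordered frame filenames
    ]

def extract_frames(video_path: str, out_dir: str, fps: float = 1.0) -> None:
    """Run the extraction; requires ffmpeg on PATH."""
    subprocess.run(ffmpeg_extract_cmd(video_path, out_dir, fps), check=True)
```

Keeping the sampling rate low (e.g. one frame per second) keeps the downstream vision calls cheap while still capturing the flow of the match.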

Why 3×3 Grid Images?

Sending 9 frames as a single combined image rather than 9 separate API calls dramatically reduces cost and latency while still giving the model enough temporal context to understand what is happening in the scene.
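The stitching step can be sketched with Pillow as follows (a simplified illustration assuming all frames share the same resolution; the function name is mine, not the project's):

```python
from PIL import Image

def stitch_grid(frames: list[Image.Image], cols: int = 3, rows: int = 3) -> Image.Image:
    """Paste up to cols*rows consecutive frames into one grid image,
    left to right, top to bottom."""
    w, h = frames[0].size
    grid = Image.new("RGB", (cols * w, rows * h))
    for i, frame in enumerate(frames[: cols * rows]):
        grid.paste(frame, ((i % cols) * w, (i // cols) * h))
    return grid
```

Reading the grid row by row gives the model an ordered "filmstrip" of the segment in a single image, which is what makes one API call per segment sufficient.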


Local TTS with Voxtral

The newer version supports Voxtral — an on-device TTS model for Apple Silicon. No extra API key, no cloud calls, fully private. Just set tts.method: voxtral in config.yaml.
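A minimal config.yaml fragment for this switch — only the tts.method: voxtral key is stated in the text above; the surrounding structure and the alternative method name are assumptions:

```yaml
tts:
  method: voxtral   # local, Apple Silicon; the cloud alternative ("openai" here) is an assumed key
```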


Pioneering AI Game Commentary: Where This Stands

This project was built almost immediately after OpenAI released the GPT-4 Vision API on November 6, 2023, making it one of the earliest known implementations of a GPT-4 Vision + TTS pipeline applied to gameplay narration (the pipeline has since moved to GPT-4o).

Prior work on automated commentary existed, but it focused almost exclusively on sports — both real-world sports and sports video games. No equivalent existed for non-sports titles (strategy, FPS, etc.) before this project.

  • [Sports] IBM Research (2022–2023) — Auto-commentary for tennis matches using structured event data and generative AI, not raw video frames (IBM Research)
  • [Sports] DeepGameAI / Chintan Trivedi (2021) — End-to-end transformer commentary for football, trained on domain-specific sports datasets — not a general vision-language model (Medium)
  • [General Video] OpenAI Cookbook (2024) — OpenAI’s own example of video narration with GPT-4.1-mini + TTS was published after this project and is not game-specific (OpenAI Cookbook)
  • [Survey] Academic survey (2025) — A comprehensive arXiv survey “From Multimodal Perception to Strategic Reasoning” covers both sports and video game commentary as separate subfields — and identifies the GPT-4 Vision + TTS pipeline used here as the modern standard (arXiv)

What makes this project distinct beyond timing: it combines frame stitching into 3×3 grids, conversation history for context continuity, and audio duration-aware prompting — none of which appeared in the earliest comparable public demos from that period.


Tech Stack

  • Python + Jupyter Notebook
  • OpenAI GPT-4o Vision API
  • OpenAI TTS / Voxtral (local, Apple Silicon)
  • ffmpeg, Pillow, pydub
  • uv for dependency management

Try It Yourself

uv sync
uv run jupyter notebook

Set your OpenAI API key in constants.py, configure config.yaml, and run cells top to bottom. Output lands at experiments/<folder>/output.mp4.


Source Code

The full source code is available on GitHub:

🐙 github.com/mathewvarghesemanu/Content-based-video-narration-using-deep-learning