I Labeled 1,000 IPL Broadcast Frames So You Can Train Your Own Cricket AI

Dataset: 1,005 IPL broadcast images, 8×8 grid team annotations, 10 IPL teams, 793 train / 212 test. Free on Kaggle: kaggle.com/datasets/goyaljai0207/ipl-player-detection-iitb-pml

This started as a Practical Machine Learning project at IIT Bombay. The brief was to work with real data and build something around probabilistic ML. I could have grabbed a Kaggle starter dataset and been done in an hour. Instead I spent two weekends making something I actually wanted to exist.

The problem with cricket computer vision datasets: the good ones are proprietary. Broadcasters and analytics companies own them and don’t share. The public ones tend to be small (200–300 images), curated from press photos rather than actual match footage, and labeled inconsistently. If you wanted to train something real on IPL broadcast data, there wasn’t much to work with.

What I built

1,005 IPL broadcast images at 800×600px. Actual match footage — real lighting variation, motion blur, partial occlusions, broadcast overlays, crowd in the background. Not press photos. 793 for training, 212 for test.

Each image is annotated with an 8×8 grid — 64 cells per image, each labeled 0 (empty) or 1–10 (one of the 10 IPL teams: CSK, DC, GT, KKR, LSG, MI, PBKS, RR, RCB, SRH). Plus a player count per image ranging from 0 to 20.
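
To make the grid concrete, here is a minimal sketch of how a pixel position in an 800×600 frame maps to one of the 64 cells. It assumes the cells are uniform (100×75 px each), that row 0 is the top of the image, and that c01 is the top-left cell; the function name cell_for_pixel is made up for this example, so check those assumptions against the actual annotations before relying on them.

```python
IMG_W, IMG_H = 800, 600   # image size stated in the post
GRID = 8                  # 8x8 grid stated in the post


def cell_for_pixel(x, y):
    """Return (row, col, column_name) for a pixel.

    Assumes uniform cells, row 0 at the top, and c01 at the top-left.
    """
    if not (0 <= x < IMG_W and 0 <= y < IMG_H):
        raise ValueError("pixel outside the 800x600 frame")
    col = x * GRID // IMG_W          # each cell is 100 px wide
    row = y * GRID // IMG_H          # each cell is 75 px tall
    index = row * GRID + col + 1     # 1-based, row-major
    return row, col, f"c{index:02d}"


print(cell_for_pixel(0, 0))
print(cell_for_pixel(450, 300))
print(cell_for_pixel(799, 599))
```

Integer arithmetic keeps the cell boundaries exact, which matters for the 75 px row height.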

Why a grid instead of bounding boxes?

Bounding boxes per player are the obvious annotation choice, but they break down quickly with broadcast footage. Players overlap, get partially cut off by the frame edge, and occlude each other in field shots. Annotation quality becomes inconsistent across annotators and ambiguous at the edges.

The grid approach is coarser — you’re labeling spatial regions, not individual players — but it’s consistent. Two annotators labeling the same frame will agree on grid cells even when they’d disagree on where exactly a bounding box should go. For spatial distribution modeling, it’s actually more useful: you can ask “where are MI players concentrated relative to RCB players in this formation?” which bounding boxes don’t naturally answer.
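
One rough way to ask that kind of question is to compare the centroids of the cells each team occupies. The sketch below is only an illustration: the grid is a toy example invented here, the helper team_centroid is hypothetical, and the numeric labels 6 and 9 are placeholders, since the post does not say which number maps to which team.

```python
def team_centroid(grid, team_id):
    """Mean (row, col) of the cells labeled team_id, or None if the team is absent."""
    cells = [
        (r, c)
        for r, row in enumerate(grid)
        for c, label in enumerate(row)
        if label == team_id
    ]
    if not cells:
        return None
    n = len(cells)
    return (sum(r for r, _ in cells) / n, sum(c for _, c in cells) / n)


# Toy 8x8 grid, invented for illustration (not taken from the dataset).
grid = [[0] * 8 for _ in range(8)]
grid[2][1] = grid[2][2] = grid[3][1] = 6   # placeholder label for one team
grid[5][6] = grid[6][6] = 9                # placeholder label for another team

a = team_centroid(grid, 6)
b = team_centroid(grid, 9)
# Offset from team 6 to team 9, in grid cells: (rows down, columns right)
offset = (b[0] - a[0], b[1] - a[1])
print(a, b, offset)
```

A centroid throws away the shape of the formation, so treat it as a first summary, not a formation detector.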

What you can do with it

  • Multi-label classification: which teams are present in this frame?
  • Player count regression: how many players are visible in this shot?
  • Spatial distribution: where on the field does each team tend to cluster?
  • Formation detection: can you identify fielding patterns from the grid?
  • Broadcast shot classification: close-up vs. wide-angle vs. aerial
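
As a starting point for the multi-label task in the list above, the grid can be collapsed into a 10-element team-presence vector. This is a sketch under assumptions: the helper names are hypothetical, the grid is a toy example, and labels 6 and 9 are placeholders because the label-to-team mapping is not spelled out in the post.

```python
NUM_TEAMS = 10  # labels 1-10, per the post; 0 means empty


def team_presence(grid):
    """Multi-hot vector: entry i is 1 if label i + 1 appears anywhere in the grid."""
    seen = {label for row in grid for label in row if label != 0}
    return [1 if team in seen else 0 for team in range(1, NUM_TEAMS + 1)]


def occupied_cells(grid):
    """Number of non-empty cells. This is not the player count, which is a separate column."""
    return sum(1 for row in grid for label in row if label != 0)


# Toy 8x8 grid, invented for illustration (not taken from the dataset).
grid = [[0] * 8 for _ in range(8)]
grid[2][1] = grid[2][2] = 6   # placeholder label for one team
grid[5][6] = 9                # placeholder label for another team

print(team_presence(grid))
print(occupied_cells(grid))
```

The presence vector is the target for multi-label classification; the per-image player count from the annotations would be the target for regression.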

I trained a basic CNN on it for the coursework and got reasonable accuracy on team identification from a single frame. The dataset is not enormous, but it’s real broadcast footage labeled consistently.

Getting the data

Free on Kaggle, with no competition entry required to download. The annotations.csv file has one row per image, with columns for filename, train/test split, player count, and all 64 grid labels (c01–c64 in row-major order).
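
A minimal sketch of turning one CSV row into an 8×8 grid, using only the standard library. The cell columns c01–c64 and their row-major order come from the description above; the other header names used here (filename, split, player_count) are guesses, and the one-row CSV is synthetic, so check the real header before adapting this.

```python
import csv
import io

GRID = 8
CELL_COLUMNS = [f"c{i:02d}" for i in range(1, GRID * GRID + 1)]  # c01..c64


def row_to_grid(row):
    """Reshape the 64 row-major cell labels of one CSV row into an 8x8 list of lists."""
    flat = [int(row[name]) for name in CELL_COLUMNS]
    return [flat[r * GRID:(r + 1) * GRID] for r in range(GRID)]


# Synthetic one-row stand-in for annotations.csv.
# The first three header names are assumptions, not the confirmed schema.
header = ["filename", "split", "player_count"] + CELL_COLUMNS
values = ["example.jpg", "train", "2"] + ["0"] * 64
values[3 + 9] = "6"    # c10 -> row 1, col 1 (placeholder team label)
values[3 + 63] = "9"   # c64 -> row 7, col 7 (placeholder team label)
sample_csv = ",".join(header) + "\n" + ",".join(values) + "\n"

records = list(csv.DictReader(io.StringIO(sample_csv)))
grid = row_to_grid(records[0])
for line in grid:
    print(line)
```

To read the downloaded file instead, swap io.StringIO(sample_csv) for open("annotations.csv", newline="") and adjust the column names to match.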

If you use it for a project — notebook, paper, weekend experiment — I’d like to know. Dataset at kaggle.com/datasets/goyaljai0207/ipl-player-detection-iitb-pml.

Frequently Asked Questions

What is the IPL player detection dataset?

It’s 1,005 IPL broadcast images at 800×600px, each annotated with an 8×8 grid of team labels (10 IPL teams: CSK, DC, GT, KKR, LSG, MI, PBKS, RR, RCB, SRH) and a player count. 793 training, 212 test. Free on Kaggle.

Why use a grid annotation instead of bounding boxes?

Bounding boxes become ambiguous when players overlap or are partially occluded, which is common in broadcast footage. Grid cells are more consistent: two annotators will agree on a cell even when they’d disagree on exact bounding box edges.

What ML tasks can I do with this dataset?

Multi-label team classification, player count regression, spatial distribution modeling, formation detection, and broadcast shot classification. The grid annotations support spatial questions that bounding-box datasets don’t naturally answer.

How do I download the IPL dataset?

It’s free on Kaggle at kaggle.com/datasets/goyaljai0207/ipl-player-detection-iitb-pml — no competition entry required. Download annotations.csv plus the images folder to get started.
