
Tennis Analysis with AI: Object Detection for Ball Tracking

This blog post is by Cory Hoi, Engineer in the MathWorks Engineering Development Group.
 
In our previous blog post, we used the Video Labeler app to segment both the tennis ball and the court, creating ground truth data for training a neural network. This ground truth forms the foundation for the next step: building and evaluating object detectors.
Pretrained networks offer faster development, require less labeled data, and are well-suited for common tasks. However, they might not generalize well to highly specific applications. Custom networks, on the other hand, offer greater flexibility and control but typically require more effort to design, train, and validate. In this post, we’ll explore both approaches—building a detector from scratch and using a pretrained model—while making use of the labeled ground truth data to improve performance through retraining.
 

Data Preprocessing

The first step is to organize the labeled data into a format suitable for training. Since the annotations were created using the Video Labeler app, MATLAB makes this process straightforward. We begin by loading the ground truth data and storing the image files using an imageDatastore object. This object efficiently manages large collections of images, supports batch processing, and integrates well with deep learning workflows such as training.
load("trainingData\tennisData\gTruth.mat")

inputSize = [540 960 3];

[imds, pxds] = pixelLabelTrainingData(gTruth,"WriteLocation","folder\to\write\images");

masks = imageDatastore(pxds.Files);
We split the dataset into training and validation sets using the partitionData function to help prevent overfitting. During training, the network updates its weights to minimize the loss on the training data. If stopping criteria are based on the training set, the model may overfit; that is, it will perform well on seen data but poorly on unseen data. By using the validation set to define stopping criteria, we improve generalization and overall model performance.
The transform function is used to preprocess the input images. After the data is processed, we combine the training and validation datastore objects by using the combine function. (partitionData and preProcessImages are small helper functions specific to this post; a sketch of each follows the code below.)
[trainingImages, trainingMasks, validationImages, validationMasks] = partitionData(imds, masks);

resizedTrainingImages = transform(trainingImages, @(x) preProcessImages(x, inputSize));
resizedTrainingMasks = transform(trainingMasks, @(x) preProcessImages(x, inputSize));

resizedValidationImages = transform(validationImages, @(x) preProcessImages(x, inputSize));
resizedValidationMasks = transform(validationMasks, @(x) preProcessImages(x, inputSize));

dsTraining = combine(resizedTrainingImages, resizedTrainingMasks);
dsValidation = combine(resizedValidationImages, resizedValidationMasks);
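Since partitionData and preProcessImages are helpers specific to this post and their definitions aren’t shown, here is a minimal sketch of what they might look like. The 80/20 split, the shuffling, and the use of imresize are assumptions, not necessarily what the original helpers do.
function [trainImds, trainMasks, valImds, valMasks] = partitionData(imds, masks)
% Split matched image and mask datastores into training and validation sets.
% Assumes an 80/20 split with a reproducible shuffle (illustrative choice).
rng(0);
n = numel(imds.Files);
idx = randperm(n);
nTrain = round(0.8*n);
trainImds  = subset(imds,  idx(1:nTrain));
trainMasks = subset(masks, idx(1:nTrain));
valImds    = subset(imds,  idx(nTrain+1:end));
valMasks   = subset(masks, idx(nTrain+1:end));
end

function out = preProcessImages(img, inputSize)
% Resize an image (or mask) to the network's expected spatial size.
out = imresize(img, inputSize(1:2));
end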
 

Designing an Object Detector

To design a simple object detector from scratch, we are using five main building blocks: input layer, downsampling layers, bottleneck layers, upsampling layers, and output layer.
Neural network layout
 
The downsampling block reduces the spatial dimensions of the input features, keeping only the larger, higher-level information. This reduces the computational cost, because smaller feature maps mean fewer operations in subsequent layers. The downsampling block in our network consists of a convolutional layer, a batch normalization layer, a ReLU activation function, and a 2-D max pooling layer.
Convolutional neural network structure
 
The convolutional layer applies learnable kernels to the input by sliding them over local regions of the image, performing element-wise multiplication and summation to generate feature maps that capture spatial patterns. The ReLU activation introduces non-linearity by zeroing out negative values, allowing the network to learn complex features. The max pooling layer reduces spatial dimensions by sliding a 2×2 window over the feature map and keeping only the maximum value from each region, which decreases computation and adds robustness to small spatial shifts. Stride and padding settings in both convolution and pooling control how the filters move and how edges are handled.
Example of the max pooling layer
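To make the pooling arithmetic concrete, here is a small numeric check using the maxpool function from Deep Learning Toolbox; the matrix values are made up for illustration.
% A 4x4 input with two spatial dimensions, labeled "SS" for maxpool
X = dlarray([1 3 2 4; 5 6 1 2; 7 2 9 1; 3 8 4 6], "SS");

% Slide a 2x2 window with stride 2; each output value is the max of one block
Y = maxpool(X, 2, Stride=2)
% Y = [6 4; 8 9]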
 
In the middle of the neural network are the bottleneck layers, which capture and compress the most relevant features from the downsampling block. This encoded information is then passed to the upsampling block, a sequence of transposed convolutional layers that reconstruct the spatial dimensions. In this architecture, the number of transposed convolutional layers mirrors the number of convolutional layers in the downsampling block, ensuring that the output image matches the input size.
To create the architecture for the described object detector, use the following code.
layers = [
    imageInputLayer(inputSize)

    % Downsampling
    convolution2dLayer(3, 16, 'Padding', 'same', 'Name', 'conv1')
    batchNormalizationLayer
    reluLayer('Name', 'relu1')
    maxPooling2dLayer(2, 'Stride', 2, 'Name', 'maxpool1')

    convolution2dLayer(3, 32, 'Padding', 'same', 'Name', 'conv2')
    batchNormalizationLayer
    reluLayer('Name', 'relu2')
    maxPooling2dLayer(2, 'Stride', 2, 'Name', 'maxpool2')

    % Bottleneck
    convolution2dLayer(3, 64, 'Padding', 'same', 'Name', 'conv3')
    batchNormalizationLayer
    reluLayer('Name', 'relu3')

    % Upsampling
    transposedConv2dLayer(3, 32, 'Stride', 2, 'Cropping', 'same', 'Name', 'upsample1')
    reluLayer('Name', 'relu4')

    transposedConv2dLayer(3, 16, 'Stride', 2, 'Cropping', 'same', 'Name', 'upsample2')
    reluLayer('Name', 'relu5')

    % Mask output layer
    convolution2dLayer(1, 1, 'Padding', 'same', 'Name', 'convOut')
    sigmoidLayer('Name', 'sigmoid')
    ];
Constructing a neural network from scratch involves careful architecture design, layer configuration, and parameter initialization. Before proceeding with training, it’s useful to compare this custom approach with leveraging a pretrained network, which can significantly reduce development time and improve performance on certain tasks.
 

Pretrained Object Detector

Instead of creating a network from scratch, you can use a pretrained object detector. In MATLAB, there are many available options for pretrained models depending on your task. For this task, we will use the You Only Look Once X (YOLOX) object detector from the Automated Visual Inspection Library for Computer Vision Toolbox support package.
The YOLOX detector is a pretrained neural network that consists of three parts: the backbone, the neck, and the head.
  • The backbone is a pretrained CNN, trained on the COCO data set. Its purpose is to extract features and compute feature maps from the input images. This is similar to the downsampling block of the detector we designed.
  • The neck concatenates the feature maps from the backbone and feeds them as inputs into the head at three different scales.
  • The head outputs classification scores, regression scores, and objectness scores.
Load the pretrained YOLOX-large deep learning network by using the yoloxObjectDetector function. This variant has the largest number of filters and convolutional layers, achieving the highest accuracy of the available YOLOX models at the expense of computational cost and speed.
networkName = "large-coco";
detector = yoloxObjectDetector(networkName,{'ball'},InputSize=inputSize);
That’s all it takes: just two lines of code to set up an accurate object detector. The detector was trained on the COCO data set and can recognize 80 object categories. However, tennis balls are not one of them. To enable the detector to recognize tennis balls, we will perform transfer learning, retraining the network on our own labeled ground truth data.
 

Training Object Detectors

To train each neural network, start by defining a set of training options that specify the optimization algorithm, learning rate, and stopping criteria. Once the options are configured, you can train each network by calling the appropriate function.
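The exact options aren’t shown in this post, but a minimal sketch might look like the following; the solver, learning rate, epoch count, and patience values are illustrative assumptions, not the settings used here.
options = trainingOptions("adam", ...
    InitialLearnRate=1e-3, ...          % step size for the Adam solver
    MaxEpochs=30, ...                   % upper bound on passes through the data
    MiniBatchSize=8, ...
    ValidationData=dsValidation, ...    % drives the stopping criteria
    ValidationPatience=5, ...           % stop when validation loss stops improving
    Plots="training-progress", ...
    Verbose=true);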
To train the network that we designed:
net = trainnet(dsTraining, layers, 'binary-crossentropy', options);
To retrain the YOLOX network:
[trainedDetector, info] = trainYOLOXObjectDetector(dsTraining,detector,options);
During training, two sets of outputs are generated, allowing you to observe the training progress. The first is a plot in which the blue line represents the loss on the training data set and the orange line represents the loss on the validation data set.
YOLOX training and validation loss vs iteration
 
The loss is also printed in the Command Window.
YOLOX loss at Command Window
 

Object Detection Performance

After training the networks, you can evaluate their performance by reading an input frame, passing it through the network, and overlaying any detected predictions. To visualize predictions using the YOLOX network, use the following code.
I = read(testingImages);

[bboxes,scores,labels] = detect(trainedDetector,I{1},Threshold=0.25);
detectedImg = insertObjectAnnotation(I{1},"Rectangle",bboxes,labels);

imshow(detectedImg)
YOLOX predicting the tennis ball’s bounding box
coryNet predicting the tennis ball’s location
One decision we must make about the prediction is the value of the detection threshold. Setting the threshold low (less than 0.5) allows detections with lower confidence scores. Setting it higher results in fewer detections, but those detections are more precise and more likely to be true positives.
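For instance, you can compare a permissive and a strict threshold on the same frame; the values here are illustrative.
% A lower threshold keeps more, noisier detections; a higher one keeps
% fewer, higher-confidence detections
[bLow,sLow] = detect(trainedDetector,I{1},Threshold=0.25);
[bHigh,sHigh] = detect(trainedDetector,I{1},Threshold=0.6);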
To evaluate the overall performance of each network, we need to determine how accurately each network marks the ball. To do so, we can compare the pixels of the predicted tennis ball mask against the pixels of the ground truth mask. For example, see the following image.
Ground truth (green) vs. coryNet prediction (red)
In the above image, the green marker shows the ground truth and the red marker shows coryNet’s prediction. One method for calculating the total error is to take the mean absolute difference between the two masks. Looping through all images and averaging, the network we designed achieves an accuracy of 66%, while the YOLOX object detector achieves 87%.
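As a sketch of that computation, assume an imageDatastore of held-out ground truth masks (testingMasks, hypothetical) and a hypothetical predictMask helper that thresholds the network output into a binary mask.
numFrames = numel(testingMasks.Files);
err = zeros(numFrames, 1);
for k = 1:numFrames
    pred = predictMask(net, readimage(testingImages, k));  % predicted binary ball mask
    gt = readimage(testingMasks, k) > 0;                   % ground truth binary mask
    err(k) = mean(abs(double(pred(:)) - double(gt(:))));   % per-pixel absolute difference
end
accuracy = 1 - mean(err)   % average accuracy across all frames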
 

Improving Ball Tracking

In the previous section, we trained a neural network to detect a tennis ball. While neither object detector achieved perfect accuracy, the main objective was to demonstrate the overall workflow rather than to develop a fully optimized model. That said, there are several potential areas for improvement that remain unexplored. One of the key advantages of using MATLAB is the flexibility to go beyond object detection and integrate additional techniques for enhanced performance. We can leverage the object detector and combine it with object tracking in Sensor Fusion and Tracking Toolbox.
Functions in the toolbox pair easily with our trained object detector. For a comprehensive tutorial, refer to the Visual Tracking of Occluded and Unresolved Objects example. To use the object tracker with the object detector that we have trained, set the example’s detectorObjects variable to our trainedDetector.
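In code, that is a single assignment (the variable name comes from the example; the surrounding setup may differ):
detectorObjects = trainedDetector;   % point the example's detection step at our retrained network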
We can then call the runTracker function to set up the individual object tracks. In our case, there will be only a single track, since we are only interested in detecting the tennis ball. In this example, we create an object tracker by calling the trackerGNN function.
tracker = trackerGNN(MaxNumSensors=1,MaxNumTracks=1);

tracker.FilterInitializationFcn = @initcvkf;
tracker.TrackLogic = "History";
tracker.ConfirmationThreshold = [2 2];
tracker.DeletionThreshold = [2 2];
This is more powerful than using object detection alone, because we can set additional constraints on an object track. In the above code, the FilterInitializationFcn is set to @initcvkf, meaning that the tracker uses a constant-velocity linear Kalman filter.
To run the tracker, call the following function.
frames = runTracker(vidReader, tracker, detectionHistory);
Improving object detection with tracker
The tracker properties defined here use relatively strict thresholds for confirming and deleting tracks. While this can improve precision by reducing false positives, it also increases the risk of discarding valid tracks. Ideally, we want to establish a track and maintain it as long as it remains accurate. This becomes more challenging in cases like a tennis ball, where the trajectory is curved and fast-moving.
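One way to make the tracker more forgiving, sketched with illustrative values: require 2 hits out of the last 3 updates to confirm a track, and 5 misses out of the last 5 to delete it.
tracker.ConfirmationThreshold = [2 3];   % confirm after 2 detections within 3 updates
tracker.DeletionThreshold = [5 5];       % delete only after 5 consecutive misses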
 

Discussion

This blog post only scratches the surface. MATLAB offers a wide range of capabilities for designing, training, and deploying neural networks. Whether you’re interested in building custom models or working with pretrained detectors, this is a good starting point for deeper exploration. If you’re looking to further improve the performance of the neural networks covered here, consider experimenting with the following:
  1. Use Experiment Manager to systematically train and compare models under different conditions. Try varying the solver, learning rate, or mini-batch size to see which combination yields the best results.
  2. Modify the YOLOX starting network to evaluate how different backbone architectures affect detection accuracy and training speed.
  3. Retrain on a new dataset, perhaps for another sport, using the pretrained model. Assess how well the network generalizes and what adjustments may be needed.
