# Depth Estimation in the Wild

Hello all! We at MathWorks, in collaboration with DrivenData, are excited to bring you the data science challenge Deep Chimpact: Depth Estimation for Wildlife Conservation.

Through this challenge, you will get real-world experience of working with monocular camera-trap video datasets, learn new skills, and have the chance to win prize money, all while working from home. You will also assist wildlife conservationists in monitoring species population sizes and population change, which helps them protect Earth’s natural resources.

We encourage you to use MATLAB to train your model and are providing complimentary MATLAB licenses for the challenge.

Your goal in this challenge is to automatically estimate the distance between a camera trap and an animal for selected frames in camera trap videos. Automated distance estimation can rapidly accelerate access to population monitoring estimates for conservation. The primary data consists of camera trap videos from two parks in West Africa, along with hand-labelled distances and estimated bounding box coordinates for animals. Check out this page for a detailed overview of the data and the problem.

In this two-part blog series, we will provide detailed resources to help you solve this challenge. This first post covers monocular depth estimation, working with videos as a dataset, and different approaches you can implement to solve the problem with the lowest error. The second post will walk through detailed starter code for one of these approaches in MATLAB.

# Monocular Depth Estimation

Monocular depth estimation is an inverse problem: given the resulting image, we are trying to recover the arrangement of the scene that produced it. Further, it is an ill-posed problem, as there is no unique solution. For example, a monkey that appears smaller in the image could simply be further away, or it could actually be smaller. A great example of this ambiguity is the Ames room:

Traditionally, depth estimation is achieved by matching features across multiple viewpoints, for example with structure from motion or stereo matching. However, in this challenge these approaches are not feasible: whilst the subjects move in the video, their displacement between frames is not known, and so traditional multi-view methods are unlikely to be suitable.

Despite the difficulties outlined above, monocular depth estimation is a widely studied area with increasing successes over recent years. Hopefully, through this challenge, you can continue this progress.

Resources:

Depth Estimation: Basics and Intuition

# Working with Data

## Video data

Working with videos is an extension of traditional image processing – a video is simply a stack of images, also called frames, arranged in a specific order. Each individual frame provides spatial information about the scene but considered together, the dynamic nature of video offers an additional, temporal dimension.

The first step to consider is extracting the necessary frames from the videos, making sure to maintain the correct sequencing. These can then be processed before performing machine learning.
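As a minimal sketch of this extraction step, you could use MATLAB's `VideoReader` to seek to each labelled timestamp and save the corresponding frame. The file name and timestamps below are illustrative placeholders, not values from the challenge data:

```matlab
% Hypothetical video file and labelled timestamps (in seconds)
v = VideoReader("chimp_clip.mp4");
timestamps = [0 1 2 5];

for k = 1:numel(timestamps)
    v.CurrentTime = timestamps(k);   % seek to the labelled time
    frame = readFrame(v);            % read the frame at/after that time
    % Save with a sequential name so the original ordering is preserved
    imwrite(frame, sprintf("frame_%03d.png", k));
end
```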

Resources:

Image processing

## Processing Data

Another challenge in working with videos is the large size of the dataset. In MATLAB, you can use a datastore to create a repository for collections of data that are too large to fit in memory. A datastore allows you to read and process data stored in multiple files on a disk, a remote location, or a database as a single entity.
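For instance, an `imageDatastore` over a folder of extracted frames lets you iterate through the files one at a time rather than loading everything into memory. The folder name here is a hypothetical example:

```matlab
% Hypothetical folder of frames extracted from the videos
imds = imageDatastore("extracted_frames", ...
    "IncludeSubfolders", true, "FileExtensions", ".png");

% Stream images one at a time; the full set never sits in memory
while hasdata(imds)
    img = read(imds);
    % ... preprocess img here ...
end
reset(imds);   % rewind to the first file when done
```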

Resources:

Understand the concept of datastore: Getting Started with Datastore

Create different datastores for images, text, audio, files, etc.: Datastore for different File Format or Application

Use built-in datastores directly as input for a deep learning network: Datastores for Deep Learning

Implement a custom datastore for file-based data: Develop Custom Datastore

The challenge data is stored in AWS, so learn how to access data from an S3 bucket.
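MATLAB datastores accept `s3://` URLs directly, with credentials supplied through the standard AWS environment variables. The bucket path below is a placeholder; substitute the actual challenge location:

```matlab
% AWS credentials are read from environment variables
setenv("AWS_ACCESS_KEY_ID", "your-access-key");
setenv("AWS_SECRET_ACCESS_KEY", "your-secret-key");
setenv("AWS_DEFAULT_REGION", "us-east-1");

% Hypothetical bucket/path for illustration only
ds = imageDatastore("s3://my-bucket/deep-chimpact/frames/");
img = read(ds);   % files are streamed from S3 on demand
```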

# Getting started

Once the data is ready, the next step is to think about your approach. As with any problem in data science, there are a number of possible avenues to take here. The following paper provides a good starting point for research. It evaluates many of the existing approaches and gives some ideas for further development: Monocular Depth Estimation Based On Deep Learning: An Overview. Further to this, we are providing below some starting pointers for two possible methods.

## Method 1: Optical Flow + CNN

This first approach uses optical flow to detect the animals against the background and then utilizes a pre-trained image classification network to perform regression.

For each labelled frame in the videos, calculate the optical flow relative to the frame one second earlier. Given that the animals are moving against a stationary background, the optical flow highlights where they are and provides some context as to their movement. To improve the signal-to-noise ratio, the provided bounding boxes are used to generate a binary mask for the region of interest. This is used in place of simply cropping the images, so that spatial context is retained.
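A rough sketch of this step using the Farneback method is shown below. The frame variables and box coordinates (`framePrev`, `frameCurr`, `x1:x2`, `y1:y2`) are placeholders standing in for a labelled frame, the frame one second earlier, and the provided bounding box:

```matlab
% Assumes framePrev and frameCurr are RGB frames one second apart,
% and (x1,y1)-(x2,y2) is the provided bounding box for the animal.
flowModel = opticalFlowFarneback;
estimateFlow(flowModel, rgb2gray(framePrev));        % prime with the earlier frame
flow = estimateFlow(flowModel, rgb2gray(frameCurr)); % flow between the two frames

% Binary mask from the bounding box, rather than cropping,
% so the animal keeps its spatial context within the full frame
mask = false(size(frameCurr, 1), size(frameCurr, 2));
mask(y1:y2, x1:x2) = true;
maskedMag = flow.Magnitude .* mask;   % zero out flow outside the box
```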

This generates a new dataset of image frames to serve as input to the next step: training. Here, to simplify training, we can use a pre-trained image classification CNN with a couple of small adaptations: converting the input layer to match our dataset and replacing the last few layers to perform regression down to a single value, the depth estimate.
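One way to sketch this adaptation, assuming ResNet-18 as the pre-trained backbone (any pre-trained classifier would do; the layer names below are specific to ResNet-18, so check yours with `analyzeNetwork`):

```matlab
% Requires the Deep Learning Toolbox Model for ResNet-18 support package
net = resnet18;
lgraph = layerGraph(net);

% Swap the 1000-class head for a single-output regression head
lgraph = replaceLayer(lgraph, "fc1000", ...
    fullyConnectedLayer(1, "Name", "fc_depth"));
lgraph = removeLayers(lgraph, ["prob" "ClassificationLayer_predictions"]);
lgraph = addLayers(lgraph, regressionLayer("Name", "depth_output"));
lgraph = connectLayers(lgraph, "fc_depth", "depth_output");
```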

In this method, whilst each input is considered in isolation, the optical flow itself includes temporal information from the original video.

Resources:

To learn how to implement optical flow using the Horn-Schunck, Farneback, and Lucas-Kanade methods, check out this tutorial video: Computer Vision Training, Motion Estimation

Deep Learning with Images

Introduction to Convolutional Neural Networks

## Method 2: Importing existing models to MATLAB

In their paper “Digging Into Self-Supervised Monocular Depth Estimation”, Godard et al. present a self-supervised model called Monodepth2 for depth estimation from a single image. Their depth prediction network takes a single colour image as input and produces a depth map for the scene. Additionally, they provide their pre-trained network in a GitHub repository, which we can import into MATLAB and retrain for our new scenario.

In order to import the PyTorch model into MATLAB, it first needs to be exported to the Open Neural Network Exchange (ONNX) format. Fortunately, PyTorch provides a simple workflow for this process, as outlined in this example.

You can also run the Python script (“pytorchToOnnx.py”) we used for testing this approach, available in this GitHub repo.

Once in ONNX format, the model can easily be imported into MATLAB using the importONNXNetwork and importONNXLayers functions.
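A minimal import sketch is shown below; the ONNX file name is illustrative. `importONNXNetwork` builds the full network in one call, while `importONNXLayers` is useful as a fallback when some operators are unsupported, since it returns a layer graph with placeholder layers you can replace by hand:

```matlab
% Import the exported model as a complete network with a regression output
net = importONNXNetwork("monodepth2.onnx", "OutputLayerType", "regression");

% Alternative: import as a layer graph for manual editing
lgraph = importONNXLayers("monodepth2.onnx", "ImportWeights", true);
```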

Resources:

Import Pretrained Deep Learning Networks into MATLAB

Deep Learning Import, Export, and Customization

Get Started with Transfer Learning