{"id":51,"date":"2017-11-03T08:00:26","date_gmt":"2017-11-03T08:00:26","guid":{"rendered":"https:\/\/blogs.mathworks.com\/deep-learning\/?p=51"},"modified":"2021-04-06T15:52:44","modified_gmt":"2021-04-06T19:52:44","slug":"deep-learning-for-automated-driving-part-1-vehicle-detection","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/deep-learning\/2017\/11\/03\/deep-learning-for-automated-driving-part-1-vehicle-detection\/","title":{"rendered":"Deep Learning for Automated Driving (Part 1) &#8211; Vehicle Detection"},"content":{"rendered":"This is a guest post from\u00a0<a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/4291457-avi-nehemiah\">Avinash Nehemiah<\/a>, Avi is a product manager for computer vision and automated driving.\r\n\r\n<a href=\"https:\/\/blogs.mathworks.com\/wp-content\/uploads\/2017\/10\/Portrait.jpg\"><img decoding=\"async\" class=\"aligncenter size-thumbnail wp-image-850\" src=\"https:\/\/blogs.mathworks.com\/wp-content\/uploads\/2017\/10\/Portrait-150x150.jpg\" alt=\"\" \/><\/a>\r\n\r\n<span style=\"font-weight: 400\">\u00a0I often get questions from friends and colleagues on how automated driving systems perceive their environment and make \u201chuman-like\u201d decisions and how MATLAB is used in these systems.<\/span>\r\n\r\nOver the next two blog posts I\u2019ll explain how deep learning and MATLAB are used to solve two common perception tasks for automated driving:\r\n<ol>\r\n \t<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Vehicle detection (this post)\u00a0<\/span><\/li>\r\n \t<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Lane detection (next post)\u00a0<\/span><\/li>\r\n<\/ol>\r\n<h2>Vehicle Detection<\/h2>\r\nObject detection is the process of locating and classifying objects in images and video. In this section I\u2019ll use a vehicle detection example to walk you through how to use deep learning to create an object detector. The same steps can be used to create any object detector.\u00a0The figure below shows the output of a three class vehicle detector, where the detector locates \u00a0and classifies each type of vehicle.\r\n\r\n<a href=\"https:\/\/blogs.mathworks.com\/wp-content\/uploads\/2017\/10\/CarDetection.png\"><img decoding=\"async\" class=\"aligncenter size-full wp-image-823\" src=\"https:\/\/blogs.mathworks.com\/wp-content\/uploads\/2017\/10\/CarDetection.png\" alt=\"\" \/><\/a>\r\n<p style=\"text-align: center\"><i><span style=\"font-weight: 400\">Output of a vehicle detector that locates and classifies different types of vehicles.<\/span><\/i><\/p>\r\n<span style=\"font-weight: 400\">Before I can start creating a vehicle detector I need a set of labeled training data, which is a set of images annotated with the locations and labels of objects of interest. More specifically, someone needs to sift through every image or frame of video and label the locations of all objects of interest. This process is known as\u00a0<\/span><i><span style=\"font-weight: 400\">ground truth labeling<\/span><\/i><span style=\"font-weight: 400\">. 
<img src="https://blogs.mathworks.com/wp-content/uploads/2017/10/ground_truth_labeling_workflow.png" alt="Ground truth labeling workflow" />
<p style="text-align: center"><em>Process of automating ground truth labeling using MATLAB.</em></p>

<p>The labeled data is stored as a table that lists the locations of the vehicles in each frame of the training video. With the ground truth labeling complete, I can start training a vehicle detector. In our case, I estimate the labeling process was sped up by up to 119x: the training video was captured at 30 frames per second and we labeled objects manually every 4 seconds, so each hand-labeled frame saved us from labeling the 119 frames in between (30 frames per second × 4 seconds = 120 frames, minus the one we label by hand). This 119x savings is a best case, as we sometimes had to correct the output of the automated labeling.</p>
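<p>If you export the labels from the app as a groundTruth object, the objectDetectorTrainingData function in Computer Vision System Toolbox can build that table for you. Below is a minimal sketch, assuming the exported object is named gTruth; the variable name, sampling factor, and output folder are illustrative, not the settings used for this post.</p>
<pre>% A sketch, assuming the app session was exported as a groundTruth
% object named gTruth. Video frames are written to disk so the training
% function can read them back as images.
trainingData = objectDetectorTrainingData(gTruth, ...
    'SamplingFactor', 4, ...                 % keep every 4th labeled frame
    'WriteLocation', 'vehicleTrainingImages');

% Each row pairs an image file name with the boxes for each vehicle class.
head(trainingData)</pre>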
<p>For our vehicle detector, I use a <a href="https://www.mathworks.com/help/vision/ref/fasterrcnnobjectdetector-class.html">Faster R-CNN</a> network. Let's start by defining the network architecture, as illustrated in the MATLAB code snippets below. The Faster R-CNN algorithm analyzes regions of an image, so the input layer is smaller than the expected size of an input image; in our case I choose a 32-by-32 pixel window. The input size is a balance between execution time and the amount of spatial detail you want the detector to resolve.</p>
<pre>% Create the image input layer.
inputLayer = imageInputLayer([32 32 3]);</pre>

<p>The middle layers are the core building blocks of the network: repeated sets of convolution, ReLU, and pooling layers. For this example, I'll use just a couple of layers. You can always create a deeper network by repeating these layers, to improve accuracy or to incorporate more classes into the detector. You can learn more about the available layer types in the Neural Network Toolbox <a href="https://www.mathworks.com/help/nnet/deep-learning-basics.html">documentation</a>.</p>
<pre>% Define the convolutional layer parameters.
filterSize = [3 3];
numFilters = 32;

% Create the middle layers.
middleLayers = [
    convolution2dLayer(filterSize, numFilters, 'Padding', 1)
    reluLayer()
    convolution2dLayer(filterSize, numFilters, 'Padding', 1)
    reluLayer()
    maxPooling2dLayer(3, 'Stride', 2)
    ];</pre>

<p>The final layers of a CNN are typically a set of fully connected layers and a softmax loss layer. Here I've added a ReLU nonlinearity between the fully connected layers to improve detector performance, since our training set wasn't as large as I would like.</p>
<pre>finalLayers = [
    % Add a fully connected layer with 64 output neurons. The output of
    % this layer is an array with a length of 64.
    fullyConnectedLayer(64)
    % Add a ReLU nonlinearity.
    reluLayer()
    % Add the last fully connected layer. At this point, the network must
    % produce outputs that can be used to measure whether the input image
    % belongs to one of the object classes or to the background. This
    % measurement is made by the subsequent loss layers. vehicleDataset is
    % the table of labeled training data created earlier.
    fullyConnectedLayer(width(vehicleDataset))
    % Add the softmax loss layer and the classification layer.
    softmaxLayer()
    classificationLayer()
    ];

layers = [
    inputLayer
    middleLayers
    finalLayers
    ];</pre>

<p>To train the object detector, I pass the layers network structure to the <a href="https://www.mathworks.com/help/vision/ref/trainfasterrcnnobjectdetector.html">trainFasterRCNNObjectDetector</a> function. If you have a GPU installed, the algorithm defaults to using it. If you want to train without a GPU, or with multiple GPUs, adjust the 'ExecutionEnvironment' parameter in <a href="https://www.mathworks.com/help/nnet/ref/trainingoptions.html">trainingOptions</a>.</p>
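<p>The options variable passed to the training call below comes from trainingOptions. The solver settings in this sketch are illustrative starting points, not the values used for this post.</p>
<pre>% Illustrative training options; tune these for your own data.
options = trainingOptions('sgdm', ...
    'MiniBatchSize', 128, ...
    'InitialLearnRate', 1e-3, ...
    'MaxEpochs', 10, ...
    'ExecutionEnvironment', 'auto'); % or 'cpu', 'gpu', 'multi-gpu', 'parallel'</pre>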
<pre>detector = trainFasterRCNNObjectDetector(trainingData, layers, options, ...
    'NegativeOverlapRange', [0 0.3], ...
    'PositiveOverlapRange', [0.6 1], ...
    'BoxPyramidScale', 1.2);</pre>

<p>Once training is done, try the detector on a few test images to see if it is working properly. I used the following code to test it on a single image.</p>
<pre>% Read a test image.
I = imread('highway.png');

% Run the detector.
[bboxes, scores] = detect(detector, I);

% Annotate detections in the image.
I = insertObjectAnnotation(I, 'rectangle', bboxes, scores);
figure
imshow(I)</pre>

<img src="https://blogs.mathworks.com/wp-content/uploads/2017/10/DetectionOutput.png" alt="Detection output" />
<p style="text-align: center"><em>Detected bounding boxes and scores from the Faster R-CNN vehicle detector.</em></p>

<p>Once you are confident that your detector is working, I highly recommend testing it on a larger set of validation images using a statistical metric such as average precision, which combines into a single score the detector's ability to make correct classifications (precision) and its ability to find all relevant objects (recall). <a href="https://www.mathworks.com/help/vision/ref/evaluatedetectionprecision.html">This page</a> provides more information on how you can evaluate a detector.</p>
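<p>As a rough sketch of that evaluation, the loop below runs the detector over a held-out test set and scores it with evaluateDetectionPrecision. The testData table and its imageFilename column are assumptions here, following the same format as the training data.</p>
<pre>% Run the detector over every test image and collect the results.
% testData is assumed to match the training table format: an
% imageFilename column followed by a column of ground truth boxes.
numImages = height(testData);
boxes  = cell(numImages, 1);
scores = cell(numImages, 1);
for i = 1:numImages
    I = imread(testData.imageFilename{i});
    [bboxes, bscores] = detect(detector, I);
    boxes{i}  = bboxes;
    scores{i} = bscores;
end
results = table(boxes, scores, 'VariableNames', {'Boxes', 'Scores'});

% Compare detections against one class of ground truth boxes. With a
% single class, ap is a scalar and recall/precision are vectors.
[ap, recall, precision] = evaluateDetectionPrecision(results, testData(:, 2));
plot(recall, precision)
xlabel('Recall'); ylabel('Precision')
title(sprintf('Average Precision = %.2f', ap))</pre>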
<p>To solve the problems described in this post I used MATLAB R2017b along with <a href="https://www.mathworks.com/products/neural-network.html">Neural Network Toolbox</a>, <a href="https://www.mathworks.com/products/parallel-computing.html">Parallel Computing Toolbox</a>, <a href="https://www.mathworks.com/products/computer-vision.html">Computer Vision System Toolbox</a>, and <a href="https://www.mathworks.com/products/automated-driving.html">Automated Driving System Toolbox</a>.</p>