{"id":3821,"date":"2020-03-06T06:07:22","date_gmt":"2020-03-06T06:07:22","guid":{"rendered":"https:\/\/blogs.mathworks.com\/deep-learning\/?p=3821"},"modified":"2021-04-06T15:48:54","modified_gmt":"2021-04-06T19:48:54","slug":"advanced-deep-learning-key-terms","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/deep-learning\/2020\/03\/06\/advanced-deep-learning-key-terms\/","title":{"rendered":"Advanced Deep Learning: Key Terms"},"content":{"rendered":"<h1>Key terms in custom training loops<\/h1>\r\n<h6><\/h6>\r\n<span style=\"font-size: 14px;\">In this post, I would like to go into detail on <strong>Loss<\/strong>, <strong>Model Gradients<\/strong>, and <strong>Automatic Differentiation<\/strong><\/span>\r\n<h6><\/h6>\r\n\r\n<table>\r\n<tbody>\r\n<td style=\"border-left: 2px solid #b245ad;padding: 10px;\"><span style=\"font-size: 14px;\">This is Part 2 in a series of Advanced Deep Learning Posts. To read the series, please see the following links:<\/span>\r\n<h6><\/h6>\r\n<ul>\r\n\r\n\t<li>Post 1: <a href=\"https:\/\/blogs.mathworks.com\/deep-learning\/2020\/02\/28\/advanced-deep-learning-part-1\/\">Introduction<\/a><\/li>\r\n\r\n\t<li>Post 2: Custom Training: Key Terms (This post!)<\/li>\r\n<\/ul>\r\n<\/td>\r\n<\/tbody>\r\n<\/table>\r\n\r\n<h6><\/h6>\r\n\r\n\r\n<span style=\"font-size: 14px;\">In Part 1, we left off talking about the custom training loop that you need to write in order to tap into the power of the extended framework. If you have a simple network, it\u2019s likely <span style=\"font-family:courier;\">TrainNetwork<\/span> will do the trick. For everything else, we can write the training loop ourselves. 
<\/span>\r\n<h6><\/h6>\r\n<span style=\"font-size: 14px;\">At a high level, the training loop looks something like this:<\/span>\r\n<h6><\/h6>\r\n<img decoding=\"async\" loading=\"lazy\" width=\"500\" height=\"233\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/03\/trainingLoopVisual_resized.png\" alt=\"\" class=\"alignnone size-full wp-image-3903\" \/>\r\n<h6><\/h6>\r\n\r\n<span style=\"font-size: 15px; color: #f27d1d;\"><strong>Key Steps in the Loop<\/strong><\/span>\r\n<h6><\/h6>\r\n<span style=\"font-size: 14px;\">I want to take a very simple problem to highlight the important parts of this loop, focusing on the non-optional portions in our diagram above. Our model has 2 learnable parameters, x1 and x2 and our goal is to optimize these parameters such that the output of our function is 0: <\/span>\r\n\r\n<pre>y = (x2 - x1.^2).^2 + (1 - x1).^2;<\/pre>\r\n<span style=\"font-size: 14px;\">We will optimize our model till y = 0; This is a fairly classic equation (Rosenbrock) used in statistics problems. (Hint, there is a solution at x1 = 1 and x2 = 1). <\/span>\r\n<h6><\/h6>\r\n\r\n<span style=\"font-size: 14px;\">We will start by guessing at the optimal solution x1 = 2, x2 = 2. This is just a starting point, and this code will show how we change these parameters to improve our model and eventually arrive at a solution. 
<\/span>\r\n\r\n\r\n<h6><\/h6>\r\n\r\n<pre><span class=\"comment\">% Define the learnable parameters, with initial guess 2,2<\/span>\r\nmy_x1 = 2;\r\nmy_x2 = 2;\r\n\r\nlearn_rate = 0.1;\r\n\r\n<span class=\"comment\">% Setup (convert to dlarray) <\/span>\r\nx1 = dlarray(my_x1);\r\nx2 = dlarray(my_x2);\r\n\r\n<span class=\"comment\">% Call dlfeval which uses the function my_loss to calculate model gradients & loss  <\/span>\r\n[loss,dydx1,dydx2] = dlfeval(@my_loss,x1,x2);\r\n\r\n<span class=\"comment\">% update the model  <\/span>\r\n[new_x1,new_x2] = updateModel(x1,x2,dydx1,dydx2,learn_rate);\r\n\r\n<span class=\"comment\">% plot our current values  <\/span>\r\nplot(extractdata(new_x1),extractdata(new_x2),'rx');<\/pre>\r\n\r\n<h6><\/h6>\r\n<em>*Do you see the correlation between this loop and deep learning? Our equation or \u201cmodel\u201d is much simpler, but the concepts are the same. We have \u201clearnables\u201d in deep learning such as the weights and biases. Here we have two learnable parameters here (x1 and x2). Changing the learnables in the training is what increases the accuracy of the model over time. The following training loop does the same thing as deep learning training, just a bit simpler to understand.<\/em>\r\n<h6><\/h6>\r\nLet's walk through all the steps of this training loop, and point out key terms along the way. \r\n<h6><\/h6>\r\n\r\n<h3>1. Setup: Read data, convert to <span style=\"font-family:courier;\">dlarray<\/span><\/h3>\r\n<h6><\/h6>\r\n<span style=\"font-size: 14px;\"><span style=\"font-family:courier;\">dlarray<\/span> is the structure designed to contain deep learning problems. This means for any \"dl\" functions (<span style=\"font-family:courier;\">dlfeval, dlgradient<\/span>) to work, we need to convert our data to a <span style=\"font-family:courier;\">dlarray<\/span> and then convert back using <span style=\"font-family:courier;\">extractdata<\/span>.<\/span>\r\n<h6><\/h6>\r\n\r\n<h3>2. 
Calculate model gradients and loss<\/h3>\r\n<h6><\/h6>\r\n<span style=\"font-size: 14px;\">This happens in the function called <span style=\"font-family:courier;\">my_loss<\/span>:<\/span>\r\n<pre>function [y,dydx1,dydx2] = my_loss(x1,x2)\r\n  y = (x2 - x1.^2).^2 + (1 - x1).^2; <span class=\"comment\">% calculates the loss (or how close to zero) <\/span>\r\n  [dydx1,dydx2] = dlgradient(y,x1,x2); <span class=\"comment\">% this calculates the derivatives<\/span>\r\nend<\/pre>\r\n\r\n<span style=\"font-size: 14px;\">Plugging in our values for x2 and x1, we get:<\/span>\r\n\r\n<a href=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/03\/initialyvalue.png\"><img decoding=\"async\" loading=\"lazy\" width=\"438\" height=\"104\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/03\/initialyvalue.png\" alt=\"\" class=\"alignnone size-full wp-image-3837\" \/><\/a>\r\n<h6><\/h6>\r\n<span style=\"font-size: 14px;\">This isn\u2019t 0, which means we must find a new guess. The <strong>loss<\/strong> is the error, which we can keep track of to understand how far away we are from a good answer. 
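To double-check the numbers, here is a quick sketch of the same loss calculation in Python (an illustrative translation, not part of the MATLAB example). The gradient formulas are written out by hand here, whereas dlgradient derives them automatically:

```python
# Python sketch of my_loss; the gradient formulas are hand-derived,
# standing in for what dlgradient computes via automatic differentiation.
def my_loss(x1, x2):
    y = (x2 - x1**2)**2 + (1 - x1)**2           # loss: how far from 0
    dydx1 = -4*x1*(x2 - x1**2) - 2*(1 - x1)     # dy/dx1, by hand
    dydx2 = 2*(x2 - x1**2)                      # dy/dx2, by hand
    return y, dydx1, dydx2

# Initial guess from the post: x1 = 2, x2 = 2
loss, dydx1, dydx2 = my_loss(2.0, 2.0)
print(loss, dydx1, dydx2)  # 5.0 18.0 -4.0
```

At the known solution (1, 1), the same function returns a loss of 0 and zero gradients.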
Using <strong>gradient descent<\/strong>, a popular method to strategically update parameters, we calculate the gradient\/derivative and then move in the direction opposite the slope.<\/span>\r\n<h6><\/h6>\r\n<span style=\"font-size: 14px;\">To better visualize this, let's look at the quiver plot:<\/span>\r\n\r\n<a href=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/03\/quiver1.png\"><img decoding=\"async\" loading=\"lazy\" width=\"560\" height=\"420\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/03\/quiver1.png\" alt=\"\" class=\"alignnone size-full wp-image-3875\" \/><\/a>\r\n<h6><\/h6>\r\n\r\n<span style=\"font-size: 14px;\">By visualizing the gradient at different points, we can see that following it eventually leads us to the correct location on the plot.<\/span>\r\n\r\n<span style=\"font-size: 14px;\">The documentation for <a href=\"https:\/\/www.mathworks.com\/help\/deeplearning\/ref\/dlfeval.html\">dlfeval<\/a> does a great job explaining how dlgradient works, which is where I stole the quiver plot idea.<\/span>\r\n\r\n\r\n<h6><\/h6> \r\n<span style=\"font-size: 14px;\">We calculate the gradient with <span style=\"font-family:courier;\">dlgradient<\/span>, which uses <strong>automatic differentiation<\/strong>. This is the moment I realized <strong>automatic differentiation simply means this function is going to automatically differentiate the model, and tell us the model gradients.<\/strong> It's not as scary as I thought.<\/span>\r\n<h6><\/h6>\r\n\r\n<span style=\"font-size: 14px;\">So from our function <span style=\"font-family:courier;\">my_loss<\/span>, we get the <strong>model gradients<\/strong> and <strong>loss<\/strong>, which are used to calculate new learnable parameters to improve the model.<\/span>\r\n<h6><\/h6>\r\n<h3>3. Update the model<\/h3>\r\n<span style=\"font-size: 14px;\">In our loop, the next step is to update the model using the gradient to find new model parameters. 
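Concretely, one gradient-descent update from the initial guess works out like this (a hedged Python sketch with hand-derived gradients; the MATLAB code gets them from dlgradient):

```python
# One hand-worked gradient-descent step on the Rosenbrock loss.
learn_rate = 0.1
x1, x2 = 2.0, 2.0                          # initial guess

dydx1 = -4*x1*(x2 - x1**2) - 2*(1 - x1)    # = 18 at (2, 2)
dydx2 = 2*(x2 - x1**2)                     # = -4 at (2, 2)

# Step opposite the slope, scaled by the learning rate:
new_x1 = x1 - learn_rate * dydx1           # 2 - 0.1*18 = 0.2
new_x2 = x2 - learn_rate * dydx2           # 2 - 0.1*(-4) = 2.4
```

Note that a single step can even increase the loss briefly; it is the repeated steps that carry the parameters into the curved valley toward (1, 1).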
<\/span>\r\n\r\n<pre><span class=\"comment\">% update the model  <\/span>\r\n[new_x1,new_x2] = updateModel(x1,x2,dydx1,dydx2,learn_rate);<\/pre>\r\n\r\n<span style=\"font-size: 14px;\">Using gradient descent, we will update these parameters, moving in the opposite direction of the slope:<\/span>\r\n\r\n<pre>\r\nfunction [new_x1,new_x2] = updateModel(x1,x2,dydx1,dydx2,learn_rate);\r\n  newx1 = x1 + -dydx1*learn_rate;\r\n  newx2 = x2 + -dydx2*learn_rate;\r\nend\r\n<\/pre>\r\n<h6><\/h6>\r\n<em>Please note, this is where this example and deep learning differ: <strong>you are not responsible for determining new learnables<\/strong>. This is done in the real training loop through the optimizer, ADAM or SGDM or whichever optimizer you choose. In deep learning, replace this updateModel function with SGDMUpdate or ADAMUpdate <\/em>\r\n\r\n<h6><\/h6>\r\n<span style=\"font-size: 14px;\">The <strong>learning rate<\/strong> is how quickly you want to move in a certain direction. Higher means faster or larger leaps, but we\u2019ll see in a second this has strong implications on how your training is going to go.<\/span>\r\n<h6><\/h6>\r\n<span style=\"font-size: 15px; color: #f27d1d;\"><strong>Training Loop<\/strong><\/span>\r\n<h6><\/h6>\r\n\r\n<pre><span class=\"comment\">% Define the learnable parameters, with initial guess 2,2<\/span>\r\nmy_x1 = 2;\r\nmy_x2 = 2;\r\n\r\nlearn_rate = 0.1;\r\n\r\n<span class=\"comment\">% Setup (convert to dlarray) <\/span>\r\nx1 = dlarray(my_x1);\r\nx2 = dlarray(my_x2);\r\n\r\n<span class=\"comment\">% loop starts here:<\/span>\r\nfor ii = 1:100\r\n    \r\n    <span class=\"comment\">% Call dlfeval which uses the function my_loss to calculate model gradients and loss<\/span>\r\n    [loss,dydx1,dydx2] = dlfeval(@my_loss,x1,x2);\r\n    \r\n    <span class=\"comment\">% update the model<\/span>\r\n    [x1,x2] = updateModel(x1,x2,dydx1,dydx2,learn_rate);\r\n        \r\n   <span class=\"comment\"> % plot our current values <\/span>\r\n    
plot(extractdata(x1),extractdata(x2),'bx');\r\nend<\/pre>\r\n\r\n<span style=\"font-size: 14px;\">Here is a plot of the training:<\/span>\r\n<h6><\/h6>\r\n<a href=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/03\/FirstLoop1.gif\"><img decoding=\"async\" loading=\"lazy\" width=\"564\" height=\"504\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/03\/FirstLoop1.gif\" alt=\"\" class=\"alignnone size-large wp-image-3867\" \/><\/a>\r\n\r\n<h6><\/h6>\r\nOver 100 iterations, we can watch the training move closer and closer to the optimal value. \r\n\r\n<h6><\/h6>\r\n\r\n<span style=\"font-size: 15px; color: #f27d1d;\"><strong>Importance of Learning Rate<\/strong><\/span>\r\n<h6><\/h6>\r\n<span style=\"font-size: 14px;\">As noted before, the <strong>learning rate<\/strong> controls how large a leap we take toward the optimal solution. A larger step could mean you arrive at your solution faster, but you could also jump right over it and never converge.<\/span>\r\n<h6><\/h6>\r\n<span style=\"font-size: 14px;\">Here is the same training loop with learn_rate = 0.2 instead of 0.1: too high a learning rate means the training can diverge, bouncing around a sub-optimal solution forever. 
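For readers who want to experiment outside MATLAB, the 100-iteration loop can be sketched in plain Python (hand-derived gradients stand in for dlgradient; this is an illustration, not the MATLAB API):

```python
# Hedged Python sketch of the training loop: 100 gradient-descent
# iterations on the Rosenbrock loss, starting from the guess (2, 2).
def my_loss(x1, x2):
    y = (x2 - x1**2)**2 + (1 - x1)**2
    dydx1 = -4*x1*(x2 - x1**2) - 2*(1 - x1)   # hand-derived dy/dx1
    dydx2 = 2*(x2 - x1**2)                    # hand-derived dy/dx2
    return y, dydx1, dydx2

learn_rate = 0.1
x1, x2 = 2.0, 2.0
for ii in range(100):
    loss, dydx1, dydx2 = my_loss(x1, x2)
    x1 -= learn_rate * dydx1                  # move opposite the slope
    x2 -= learn_rate * dydx2

# (x1, x2) ends up creeping along the valley toward the solution (1, 1);
# swap in learn_rate = 0.2 and the iterates bounce around instead.
```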
<\/span>\r\n<h6><\/h6>\r\n\r\n<a href=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/03\/LR_secondround.gif\"><img decoding=\"async\" loading=\"lazy\" width=\"564\" height=\"504\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/03\/LR_secondround.gif\" alt=\"\" class=\"alignnone size-full wp-image-3869\" \/><\/a>\r\n<h6><\/h6>\r\nNotice this training doesn't converge toward an optimal solution, caused by a higher learning rate.\r\n<h6><\/h6>\r\n\r\n<span style=\"font-size: 14px;\">So a large learning rate can be problematic, but a smaller step could mean you are moving way too slowly (yawn), or we could end up in a local minimum, or it could take a lifetime to finish.<\/span>\r\n<h6><\/h6>\r\n<span style=\"font-size: 14px;\">So, what do you do? One suggestion would be to start with a higher learning rate in the beginning, and then move to a lower learning rate as you get closer to a solution, which we can be fancy and call this \"time-based decay.\" <\/span>\r\n\r\n<span style=\"font-size: 14px;\">Enter <strong>custom learning rates<\/strong>, which are simple to implement* in our custom training plot.<\/span>\r\n\r\n<pre><span class=\"comment\">% Setup custom learn rate<\/span>\r\ninitialLearnRate = 0.2;\r\ndecay = 0.01;\r\nlearn_rate = initialLearnRate;\r\n\r\nfor ii = 1:100\r\n   <span class=\"comment\">%   . . 
.<\/span>\r\n   <span class=\"comment\"> % update custom learning rate<\/span>\r\n    learn_rate = initialLearnRate \/(1 + decay*iteration);\r\n    \r\nend<\/pre>\r\n\r\n\r\n<em>*of course, the learning rate function could be complex, but adding it into our training loop is straightforward.<\/em>\r\n\r\n<h6><\/h6>\r\n\r\n<h6><\/h6>\r\n\r\n<table>\r\n<tbody>\r\n<tr>\r\n<td><span style=\"font-size: 14px;\">The learning rate over the 100 iterations looks like this:<\/span><\/td>\r\n<td><span style=\"font-size: 14px;\">Which causes our model to converge much more cleanly:<\/span><\/td>\r\n<\/tr>\r\n<tr>\r\n<td><a href=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/03\/learn-rate-decay.png\"><img decoding=\"async\" loading=\"lazy\" width=\"562\" height=\"506\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/03\/learn-rate-decay.png\" alt=\"\" class=\"alignnone size-full wp-image-3851\" \/><\/a><\/td>\r\n<td><a href=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/03\/custom-learn-rate2.gif\"><img decoding=\"async\" loading=\"lazy\" width=\"564\" height=\"504\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/03\/custom-learn-rate2.gif\" alt=\"\" class=\"alignnone size-large wp-image-3865\" \/><\/a>\r\n<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n\r\n<h6><\/h6>\r\n\r\n<h6><\/h6>\r\n\r\n\r\n<h6><\/h6>\r\n<span style=\"font-size: 14px;\">That's it! 
You are now fully prepared to discuss the following terms in deep learning conversations.<\/span>\r\n<h6><\/h6>\r\n<span style=\"font-size: 14px;\">MATLAB functions:<\/span>\r\n\r\n<ul>\r\n\t<li><span style=\"font-size: 14px;\">dlfeval<\/span><\/li>\r\n\r\n\t<li><span style=\"font-size: 14px;\">dlgradient<\/span><\/li>\r\n\r\n\t<li><span style=\"font-size: 14px;\">dlarray & extractdata<\/span><\/li>\r\n\r\n<\/ul>\r\n<span style=\"font-size: 14px;\">Terms:<\/span>\r\n<ul>\r\n\r\n<li><span style=\"font-size: 14px;\">Automatic Differentiation<\/span><\/li>\r\n<li><span style=\"font-size: 14px;\">Loss<\/span><\/li>\r\n<li><span style=\"font-size: 14px;\">Model Gradients<\/span><\/li>\r\n<li><span style=\"font-size: 14px;\">Learning Rate<\/span><\/li>\r\n\r\n\r\n<\/ul>\r\n\r\n<span style=\"font-size: 14px;\">Resources: To prepare for this post, I went through quite a few articles in our documentation. Please look over the following links for the source material:<\/span>\r\n<h6><\/h6>\r\n<ul>\r\n\r\n<li><a href=\"https:\/\/www.mathworks.com\/help\/deeplearning\/ref\/dlfeval.html\">dlfeval<\/a>: Walks through setting up the Rosenbrock equation too. 
<\/li> \r\n\r\n<li>More on <a href=\"https:\/\/www.mathworks.com\/help\/deeplearning\/ug\/include-automatic-differentiation.html\">Automatic Differentiation<\/a><\/li>\r\n\r\n<li>More on <a href=\"https:\/\/www.mathworks.com\/help\/optim\/ug\/banana-function-minimization.html\">Rosenbrock solution<\/a>, sometimes called the banana function, which was the optimization function used in the example<\/li>\r\n\r\n<li><a href=\"https:\/\/www.mathworks.com\/help\/deeplearning\/ug\/train-network-using-custom-training-loop.html\">Full Deep Learning Example<\/a><\/li>\r\n\r\n<li><a href=\"https:\/\/www.mathworks.com\/help\/deeplearning\/ug\/define-custom-deep-learning-layers.html\">create custom deep learning layers<\/a><\/li>\r\n\r\n<\/ul>\r\n\r\n","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/03\/trainingLoopVisual_resized.png\" onError=\"this.style.display ='none';\" \/><\/div><p>Key terms in custom training loops\r\n\r\nIn this post, I would like to go into detail on Loss, Model Gradients, and Automatic Differentiation\r\n\r\n\r\n\r\n\r\nThis is Part 2 in a series of Advanced Deep... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/deep-learning\/2020\/03\/06\/advanced-deep-learning-key-terms\/\">read more >><\/a><\/p>","protected":false},"author":156,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[9],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/3821"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/users\/156"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/comments?post=3821"}],"version-history":[{"count":56,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/3821\/revisions"}],"predecessor-version":[{"id":6117,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/3821\/revisions\/6117"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/media?parent=3821"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/categories?post=3821"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/tags?post=3821"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}