{"id":6234,"date":"2021-03-12T08:28:10","date_gmt":"2021-03-12T13:28:10","guid":{"rendered":"https:\/\/blogs.mathworks.com\/deep-learning\/?p=6234"},"modified":"2021-04-06T15:45:19","modified_gmt":"2021-04-06T19:45:19","slug":"playing-pong-using-reinforcement-learning","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/deep-learning\/2021\/03\/12\/playing-pong-using-reinforcement-learning\/","title":{"rendered":"Playing Pong using Reinforcement Learning"},"content":{"rendered":"<em>The following post is from Christoph Stockhammer, here today to show how to use Reinforcement Learning for a very serious task: playing games. If you would like to learn more about Reinforcement Learning, check out a <strong>free<\/strong>, 2hr training called <span style=\"text-decoration: underline;\"><a href=\"https:\/\/www.mathworks.com\/learn\/tutorials\/reinforcement-learning-onramp.html\">Reinforcement Learning Onramp<\/a><\/span>.\u00a0<\/em>\r\n<h6><\/h6>\r\n<p style=\"font-size: 14px;\">In the 1970s,\u00a0Pong\u00a0was a very popular video arcade game. It is a 2D video game emulating table tennis, i.e. 
you have a bat (a rectangle) that you can move vertically to hit a \"ball\" (a moving square). If the ball hits the bounding box of the game, it bounces back like a billiard ball. If you miss the ball, the opponent scores.<\/p>\r\n<p style=\"text-align: center;\"><img decoding=\"async\" loading=\"lazy\" width=\"612\" height=\"457\" class=\"alignnone size-full wp-image-6456\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2021\/03\/original-game.gif\" alt=\"\" \/><\/p>\r\n\r\n<h6><\/h6>\r\n<p style=\"font-size: 14px;\">A single-player adaptation, Breakout, came out later, in which the ball could destroy blocks at the top of the screen and the bat moved to the bottom of the screen. As a consequence, the bat now moved horizontally rather than vertically.<\/p>\r\n<p style=\"text-align: center;\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-6414 size-medium\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2021\/02\/Atari-breakout-game-wallpaper-300x186.gif\" alt=\"\" width=\"300\" height=\"186\" \/><\/p>\r\n\r\n<h6><\/h6>\r\n<p style=\"font-size: 14px;\">In this post, I want to describe how you can teach an AI to play a variation of Pong with a ceiling off which the ball bounces back.<\/p>\r\n<p style=\"font-size: 14px;\">As the title suggests, we will use reinforcement learning for this task. 
It is definitely overkill in our scenario, but who cares! Simple arcade games are a beautiful playground for first steps in reinforcement learning. If you are not familiar with RL, take a look at <a href=\"https:\/\/www.mathworks.com\/discovery\/reinforcement-learning.html\">this brief guide<\/a> or <a href=\"https:\/\/www.mathworks.com\/videos\/series\/reinforcement-learning.html\">this video series<\/a> that explains basic concepts.<\/p>\r\n<p style=\"font-size: 14px;\">Roughly speaking, implementing reinforcement learning involves four steps:<\/p>\r\n\r\n<ol>\r\n \t<li><span style=\"font-size: 14px;\">Modelling the environment<\/span><\/li>\r\n \t<li><span style=\"font-size: 14px;\">Defining the training method<\/span><\/li>\r\n \t<li><span style=\"font-size: 14px;\">Coming up with a reward function<\/span><\/li>\r\n \t<li><span style=\"font-size: 14px;\">Training the agent<\/span><\/li>\r\n<\/ol>\r\n<p style=\"font-size: 14px;\">For the implementation, we will use <a href=\"https:\/\/www.mathworks.com\/products\/reinforcement-learning.html\">Reinforcement Learning Toolbox<\/a>, which was first released in MATLAB R2019a. The complete source code can be found here: <a href=\"https:\/\/github.com\/matlab-deep-learning\/playing-Pong-with-deep-reinforcement-learning\">https:\/\/github.com\/matlab-deep-learning\/playing-Pong-with-deep-reinforcement-learning<\/a>. So let's get started.<\/p>\r\n\r\n<h2>Modelling the environment<\/h2>\r\n<p style=\"font-size: 14px;\">This actually requires the most work of all four steps: you have to implement the underlying physics, i.e. what happens when the ball hits the boundary of the game or the bat, or simply moves across the screen. 
In addition, you want to visualize the current state (ok \u2013 that one is pretty simple in MATLAB). Reinforcement Learning Toolbox offers a way to define custom environments based on MATLAB code or Simulink models, which we can leverage to model the Pong environment. For this, we inherit from rl.env.MATLABEnvironment and implement the system's behavior.<\/p>\r\n\r\n<pre>classdef Environment &lt; rl.env.MATLABEnvironment\r\n    \r\n    <span class=\"comment\">% Properties (set properties' attributes accordingly)<\/span>\r\n    properties\r\n        <span class=\"comment\">% Specify and initialize the environment's necessary properties<\/span>\r\n        \r\n        <span class=\"comment\">% X limit for ball movement<\/span>\r\n        XLim = [-1 1]\r\n        \r\n        <span class=\"comment\">% Y limit for ball movement<\/span>\r\n        YLim = [-1.5 1.5]\r\n        \r\n        <span class=\"comment\">% Radius of the ball<\/span>\r\n        BallRadius = 0.04\r\n        \r\n        <span class=\"comment\">% Constant ball velocity<\/span>\r\n        BallVelocity = [2 2]\r\n<\/pre>\r\n<p style=\"font-size: 14px;\">The whole source code can be found at the end of this post. While we can visualize the environment easily, we did not use game screenshots as the input to the reinforcement learning algorithm. Doing so would be another option, and it would be closer to a human player relying on visual information alone. But it would also necessitate convolutional neural networks, requiring more training effort. 
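<\/p>\r\n<p style=\"font-size: 14px;\">To give a flavor of the dynamics, a simplified bounce check inside such an environment could look like the following sketch (the method name and the BallPosition property are illustrative assumptions, not the actual implementation; since rl.env.MATLABEnvironment is a handle class, the properties can be updated in place):<\/p>\r\n<pre>function bounceOffWalls(this)\r\n    <span class=\"comment\">% Reflect the horizontal velocity when the ball leaves the x limits<\/span>\r\n    if this.BallPosition(1) - this.BallRadius &lt; this.XLim(1) || ...\r\n       this.BallPosition(1) + this.BallRadius &gt; this.XLim(2)\r\n        this.BallVelocity(1) = -this.BallVelocity(1);\r\n    end\r\n    <span class=\"comment\">% Reflect at the ceiling (the Pong variation described above has one)<\/span>\r\n    if this.BallPosition(2) + this.BallRadius &gt; this.YLim(2)\r\n        this.BallVelocity(2) = -this.BallVelocity(2);\r\n    end\r\nend<\/pre>\r\n<p style=\"font-size: 14px;\">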
Instead, we encode the current state of the game in a vector of seven elements, which we call observations:<\/p>\r\n\r\n<h6><\/h6>\r\n<table>\r\n<tbody>\r\n<tr>\r\n<td style=\"border: 1px solid black;\">1<\/td>\r\n<td style=\"border: 1px solid black;\">2<\/td>\r\n<td style=\"border: 1px solid black;\">3<\/td>\r\n<td style=\"border: 1px solid black;\">4<\/td>\r\n<td style=\"border: 1px solid black;\">5<\/td>\r\n<td style=\"border: 1px solid black;\">6<\/td>\r\n<td style=\"border: 1px solid black;\">7<\/td>\r\n<\/tr>\r\n<tr>\r\n<td style=\"border: 1px solid black;\">Current x-position of the ball<\/td>\r\n<td style=\"border: 1px solid black;\">Current y-position of the ball<\/td>\r\n<td style=\"border: 1px solid black;\">Change in x-position of the ball<\/td>\r\n<td style=\"border: 1px solid black;\">Change in y-position of the ball<\/td>\r\n<td style=\"border: 1px solid black;\">x-position of the bat<\/td>\r\n<td style=\"border: 1px solid black;\">Change in x-position of the bat<\/td>\r\n<td style=\"border: 1px solid black;\">Force applied to the bat<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n<h6><\/h6>\r\n<p style=\"font-size: 14px;\">It is easy to see how this captures all relevant information about the current state of the game. The seventh element (\"Force\") probably warrants a more detailed explanation:<\/p>\r\n<p style=\"font-size: 14px;\">Force is basically a scaled version of the action, i.e. moving the bat right or left. This means we feed back the agent's last action as part of the observations, introducing some notion of memory, as the agent has access to its previous decision this way.<\/p>\r\n<p style=\"font-size: 14px;\">We can start with a random initial direction of the ball and simulate until the ball hits the floor. 
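<\/p>\r\n<p style=\"font-size: 14px;\">Assembling such an observation vector inside the environment might look like this sketch (the property names BallPosition, BatPosition, BatVelocity and Force are illustrative assumptions):<\/p>\r\n<pre>function Observation = getObservation(this)\r\n    <span class=\"comment\">% Seven elements, in the order of the table above<\/span>\r\n    Observation = [this.BallPosition(1);  <span class=\"comment\">% x-position of the ball<\/span>\r\n                   this.BallPosition(2);  <span class=\"comment\">% y-position of the ball<\/span>\r\n                   this.BallVelocity(1);  <span class=\"comment\">% change in x-position of the ball<\/span>\r\n                   this.BallVelocity(2);  <span class=\"comment\">% change in y-position of the ball<\/span>\r\n                   this.BatPosition;      <span class=\"comment\">% x-position of the bat<\/span>\r\n                   this.BatVelocity;      <span class=\"comment\">% change in x-position of the bat<\/span>\r\n                   this.Force];           <span class=\"comment\">% force applied to the bat<\/span>\r\nend<\/pre>\r\n<p style=\"font-size: 14px;\">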
If the ball hits the floor, the game is over.<\/p>\r\n\r\n<h2>Defining the training method<\/h2>\r\n<p style=\"font-size: 14px;\">In general, the choice of the training algorithm is influenced by the action and observation spaces. In our case, both the observation space (a vector with seven elements) and the action space (a scalar) are continuous, which means they can assume any floating-point value in a specific range. For instance, we restricted the action to the range [-1,1]. Similarly, the x and y positions of the ball are not allowed to exceed certain thresholds, as the ball must stay within the boundaries of the game.<\/p>\r\n<p style=\"font-size: 14px;\">For this example, we use a DDPG (<em>deep deterministic policy gradient<\/em>) agent. The name refers to a specific training approach; there are <a href=\"https:\/\/mathworks.com\/help\/reinforcement-learning\/ug\/create-agents-for-reinforcement-learning.html\">other choices<\/a> as well. For the training itself, we need two components: an <em>actor<\/em> and a <em>critic<\/em>.<\/p>\r\n\r\n<h2>The actor<\/h2>\r\n<p style=\"font-size: 14px;\">The actor decides which action to take in any given situation. At each time step, it receives seven observations from the environment (as listed in the table above) and outputs an action, a number between -1 and 1. This action corresponds to moving the bat left at full speed (-1), right at full speed (+1), or not at all (0), with all intermediate speeds possible. The actor is a neural network with three fully connected layers and ReLU activations, initialized with random values sampled from a normal distribution. The output is scaled so that its values range between -1 and 1.<\/p>\r\n\r\n<h2>The critic<\/h2>\r\n<p style=\"font-size: 14px;\">The critic estimates the expected long-run reward of the actor's actions, based on both the last action and the last observation. As a consequence, the critic takes two inputs. 
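<\/p>\r\n<p style=\"font-size: 14px;\">The actor network described above can also be built programmatically; the following is only a sketch with illustrative layer sizes (the exact architecture is in the linked repository):<\/p>\r\n<pre><span class=\"comment\">% Sketch of the actor network (layer sizes illustrative)<\/span>\r\nactorNetwork = [\r\n    featureInputLayer(7, 'Name', 'observation')  <span class=\"comment\">% the seven observations<\/span>\r\n    fullyConnectedLayer(128)\r\n    reluLayer\r\n    fullyConnectedLayer(128)\r\n    reluLayer\r\n    fullyConnectedLayer(1)\r\n    tanhLayer                                    <span class=\"comment\">% squash the output into [-1, 1]<\/span>\r\n    scalingLayer('Name', 'action')];             <span class=\"comment\">% scale to the action range<\/span>\r\n<\/pre>\r\n<p style=\"font-size: 14px;\">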
Similar to the actor, the critic comprises several fully connected layers, followed by ReLU layers. Now, we have not talked about rewards yet, but it is pretty clear that the actor's performance (no pun intended) can be very good (it does not miss a single ball) or bad (it does not manage to hit a single ball). The critic is supposed to predict the long-term outcome of the actor's decisions.<\/p>\r\n<p style=\"font-size: 14px;\">We can use the <em>Deep Network Designer<\/em> app to define the actor and critic networks by dragging and connecting layers from a library. In the screenshot, you can see both the actor network (left) and the critic network (right; note the two input paths). You can also export the code from the app to build a network programmatically.<\/p>\r\n<p style=\"text-align: center;\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"555\" class=\"alignnone size-large wp-image-6420\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2021\/03\/actor_critic-1024x555.png\" alt=\"\" \/><\/p>\r\n\r\n<h6><\/h6>\r\n<h2>Coming up with a reward function<\/h2>\r\n<p style=\"font-size: 14px;\">In general, finding appropriate reward functions (a process called <em>reward shaping<\/em>) can be rather tricky. My personal experience is that this can easily become a time sink. There have been some recent results that propose ways to automate the process. 
In principle, it is all about nudging the agent to behave as you want: you reward \"good\" behavior (such as hitting the ball) and punish \"bad\" behavior (such as missing the ball).<\/p>\r\n<p style=\"font-size: 14px;\">In essence, the agent tries to accumulate as much reward as possible, and the underlying neural networks' parameters are updated accordingly, as long as the game does not reach a terminal state (ball dropped).<\/p>\r\n<p style=\"font-size: 14px;\">To reinforce the 'hit' behavior, we reward the agent with a large positive value when the paddle strikes the ball. On the other hand, we penalize the 'miss' behavior by providing a negative reward. We also shape this negative value by making it proportional to the distance between the ball and the paddle at the time of the miss. This incentivizes the agent to move closer to the ball (and eventually strike it) when it is about to miss!<\/p>\r\n\r\n<h2>Training the agent<\/h2>\r\n<p style=\"font-size: 14px;\">This last step is pretty simple, as it boils down to a single function in MATLAB, train. However, the training itself can take some time, depending on your termination criteria and available hardware (in many cases, training can be accelerated with the help of a GPU). So grab a cup of coffee, sit back and relax while MATLAB shows a progress display including the sum of the rewards of the last completed episode (i.e. playing the game until the ball hits the floor). You can interrupt at any time and use the intermediate result or let it run to completion. 
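<\/p>\r\n<p style=\"font-size: 14px;\">In code, the training step might look like the following sketch, where agent and env are the DDPG agent and the environment from the previous steps (the option values are illustrative, not tuned):<\/p>\r\n<pre><span class=\"comment\">% Sketch of the training call (option values illustrative)<\/span>\r\ntrainOpts = rlTrainingOptions( ...\r\n    'MaxEpisodes', 5000, ...\r\n    'MaxStepsPerEpisode', 1000, ...\r\n    'StopTrainingCriteria', 'AverageReward', ...\r\n    'StopTrainingValue', 800, ...\r\n    'Plots', 'training-progress');  <span class=\"comment\">% show the episode-reward display<\/span>\r\n\r\ntrainingStats = train(agent, env, trainOpts);<\/pre>\r\n<p style=\"font-size: 14px;\">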
Training will stop when termination criteria such as a maximum number of episodes or a specific reward value are met.<\/p>\r\n<p style=\"text-align: center;\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-6452 size-full\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2021\/03\/RL_progress-_resized.png\" alt=\"\" width=\"600\" height=\"372\" \/><\/p>\r\n\r\n<h6><\/h6>\r\n<p style=\"font-size: 14px;\">If we manually terminate the training in its early stages, the agent still behaves quite clumsily:<\/p>\r\n<p style=\"text-align: center;\"><img decoding=\"async\" loading=\"lazy\" width=\"600\" height=\"317\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2021\/03\/gif_training_half.gif\" alt=\"\" class=\"alignnone size-full wp-image-6496\" \/><\/p>\r\n\r\n<h6><\/h6>\r\n<p style=\"font-size: 14px;\">And finally, here is the trained agent in action:<\/p>\r\n<p style=\"text-align: center;\"><img decoding=\"async\" loading=\"lazy\" width=\"595\" height=\"325\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2021\/03\/gif_training_full.gif\" alt=\"\" class=\"alignnone size-full wp-image-6494\" \/><\/p>\r\n\r\n<h6><\/h6>\r\n<h2>Conclusion<\/h2>\r\n<p style=\"font-size: 14px;\">Obviously, you can design efficient algorithms for playing <em>Pong<\/em> without any reinforcement learning; using it here is really using a sledgehammer to crack a nut - which is fun. However, people are solving real problems with reinforcement learning these days, problems that were barely tractable, if at all, with traditional approaches.<\/p>\r\n<p style=\"font-size: 14px;\">Also, many people have existing Simulink models that describe complex environments very precisely and can easily be repurposed for step one \u2013 modelling the environment. 
A video of a customer application using Reinforcement Learning can be found <a href=\"https:\/\/de.mathworks.com\/company\/user_stories\/case-studies\/vitesco-technologies-applies-deep-reinforcement-learning-in-powertrain-control.html\"><strong>here<\/strong><\/a>. Of course, those problems are far more complex and typically require more time for every single step in the above workflow.<\/p>\r\n<p style=\"font-size: 14px;\">However, the good news is that the principles remain the same, and you can still leverage the same techniques we used above!<\/p>\r\n<p style=\"font-size: 14px;\"><strong><a href=\"https:\/\/github.com\/matlab-deep-learning\/playing-Pong-with-deep-reinforcement-learning\">Download the full code here<\/a><\/strong>.<\/p>\r\n\r\n<p style=\"font-size: 14px;\">So now it's your turn: Are there any areas where you are considering applying reinforcement learning? Please let me know in the comments.<\/p>","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2021\/03\/original-game.gif\" onError=\"this.style.display ='none';\" \/><\/div><p>The following post is from Christoph Stockhammer, here today to show how to use Reinforcement Learning for a very serious task: playing games. If you would like to learn more about Reinforcement... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/deep-learning\/2021\/03\/12\/playing-pong-using-reinforcement-learning\/\">read more >><\/a><\/p>","protected":false},"author":156,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[9],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/6234"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/users\/156"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/comments?post=6234"}],"version-history":[{"count":37,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/6234\/revisions"}],"predecessor-version":[{"id":6580,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/6234\/revisions\/6580"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/media?parent=6234"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/categories?post=6234"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/tags?post=6234"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}