{"id":912,"date":"2013-11-03T19:20:25","date_gmt":"2013-11-04T00:20:25","guid":{"rendered":"https:\/\/blogs.mathworks.com\/steve\/?p=912"},"modified":"2019-11-01T09:34:08","modified_gmt":"2019-11-01T13:34:08","slug":"chess-and-a-little-text-file-manipulation","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/steve\/2013\/11\/03\/chess-and-a-little-text-file-manipulation\/","title":{"rendered":"Chess and a little text file manipulation"},"content":{"rendered":"\r\n<div class=\"content\"><p>Here's an image of a chess position:<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/steve\/2013\/position1.png\" alt=\"\"> <\/p><p>And that's about as close to image processing as today's blog post will come. Because this post is really about text processing.<\/p><p>It seems like a lot of computational tasks in engineering and science involve manipulating data in text files. This weekend I had such a task, although I must admit that it had nothing to do with engineering or science. Even so, I thought the task would be a good illustration of some basic text processing techniques.<\/p><p>I have a text file, tactics.pgn, that is a database of chess tactics puzzles. Here are lines 2,759 through 2,784 from the file.<\/p><pre>[Event \"?\"]\r\n[Site \"?\"]\r\n[Date \"????.??.??\"]\r\n[Round \"?\"]\r\n[White \"Diagram 211\"]\r\n[Black \"\"]\r\n[Result \"*\"]\r\n[EventDate \"2006.05.20\"]\r\n[FEN \"2b1R3\/ppk4p\/8\/2q2p2\/2Br2n1\/2QP2N1\/P4PPP\/6K1 w - - 0 1\"]\r\n[SetUp \"1\"]\r\n[SourceDate \"2011.02.22\"]<\/pre><pre>1.Rxc8+ Kxc8 2.Be6+ Kd8 3.Qxc5 *<\/pre><pre>[Event \"?\"]\r\n[Site \"?\"]\r\n[Date \"????.??.??\"]\r\n[Round \"?\"]\r\n[White \"Diagram 212\"]\r\n[Black \"\"]\r\n[Result \"*\"]\r\n[FEN \"3rr1k1\/p1p2ppp\/Q2b1q2\/8\/3Np3\/4P3\/PP3PPP\/R1B2RK1 b - - 0 1\"]\r\n[SetUp \"1\"]\r\n[SourceDate \"2011.02.22\"]<\/pre><pre>1...Bxh2+ 2.Kxh2 Qxa6 *<\/pre><p>These lines store two chess positions. 
The first is the position I showed above, with White to move. Can you figure out how White wins by capturing the bishop on c8 with the rook? (If you're interested in how the position is encoded in the text, see the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Forsyth%E2%80%93Edwards_Notation\">Wikipedia article on Forsyth-Edwards Notation<\/a>, or FEN.)<\/p><p>I wanted to shuffle the positions in this file randomly. None of the chess software programs that I have can do this, so I decided to tackle it with MATLAB. (Fun thing to do on a Saturday morning, right?)<\/p><p>OK, so here's the basic procedure:<\/p><div><ol><li>Read the entire file into MATLAB.<\/li><li>Split the data into chunks, one chunk for each position.<\/li><li>Rearrange the chunks randomly.<\/li><li>Write out the rearranged chunks to a new text file.<\/li><\/ol><\/div><p>There are a lot of different ways to do this. Here's what I came up with.<\/p><p>First, read in the entire file. The function <tt>fileread<\/tt> is just the ticket.<\/p><pre class=\"codeinput\">characters = fileread(<span class=\"string\">'tactics.pgn'<\/span>);\r\nsize(characters)\r\n<\/pre><pre class=\"codeoutput\">\r\nans =\r\n\r\n           1      106394\r\n\r\n<\/pre><p>You can see that there are about 106,000 characters in the file. Let's split the data into lines using <tt>strsplit<\/tt>.<\/p><pre class=\"codeinput\">lines = strsplit(characters,<span class=\"string\">'\\n'<\/span>)';\r\nsize(lines)\r\n<\/pre><pre class=\"codeoutput\">\r\nans =\r\n\r\n        4838           1\r\n\r\n<\/pre><p>There are about 4,800 lines of text. But how many positions are there? I'm going to find the starting line of each position by searching for the string \"[Event \" at the beginning of a line. 
It's time for <tt>regexp<\/tt>.<\/p><pre class=\"codeinput\">idx = regexp(lines,<span class=\"string\">'^\\[Event '<\/span>);\r\nidx(1:15)\r\n<\/pre><pre class=\"codeoutput\">\r\nans = \r\n\r\n    [1]\r\n    []\r\n    []\r\n    []\r\n    []\r\n    []\r\n    []\r\n    []\r\n    []\r\n    []\r\n    []\r\n    [1]\r\n    []\r\n    []\r\n    []\r\n\r\n<\/pre><p>This shows us that the string is found twice in the first 15 lines of the file. <tt>idx<\/tt> is a cell array, so I'll use <tt>cellfun<\/tt> and <tt>find<\/tt> to identify all the lines that contain the matching string. Each of these lines is the start of an entry for one chess position.<\/p><pre class=\"codeinput\">first_lines = find(~cellfun(@isempty,idx));\r\nfirst_lines(1:3)\r\n<\/pre><pre class=\"codeoutput\">\r\nans =\r\n\r\n     1\r\n    12\r\n    23\r\n\r\n<\/pre><p>So there are positions starting on lines 1, 12, and 23.<\/p><p>Next, I'll make a cell array such that each cell contains all the lines for one position. To make the for-loop work, I'm going to add an \"extra\" value to the first_lines vector that points to a nonexistent line just past the end of the file.<\/p><pre class=\"codeinput\">first_lines(end+1) = length(lines) + 1;\r\n<span class=\"keyword\">for<\/span> k = 1:length(first_lines)-1\r\n    positions{k} = lines(first_lines(k):first_lines(k+1)-1);\r\n<span class=\"keyword\">end<\/span>\r\n<\/pre><p>Let's take a look at what we have now.<\/p><pre class=\"codeinput\">size(positions)\r\n<\/pre><pre class=\"codeoutput\">\r\nans =\r\n\r\n     1   421\r\n\r\n<\/pre><p>There are 421 positions in the file. 
For example:<\/p><pre class=\"codeinput\">positions{205}\r\n<\/pre><pre class=\"codeoutput\">\r\nans = \r\n\r\n    '[Event \"?\"]'\r\n    '[Site \"?\"]'\r\n    '[Date \"????.??.??\"]'\r\n    '[Round \"?\"]'\r\n    '[White \"Diagram 211\"]'\r\n    '[Black \"Bain\"]'\r\n    '[Result \"*\"]'\r\n    '[EventDate \"2006.05.20\"]'\r\n    '[FEN \"2b1R3\/ppk4p\/8\/2q2p2\/2Br2n1\/2QP2N1\/P4PPP\/6K1 w - - 0 1\"]'\r\n    '[SetUp \"1\"]'\r\n    '[SourceDate \"2011.02.22\"]'\r\n    '1.Rxc8+ Kxc8 2.Be6+ Kd8 3.Qxc5 *'\r\n\r\n<\/pre><p>Again, this is the position shown at the top of this post.<\/p><p>Getting near the end, now. It's time to rearrange the positions. Before I do that, though, I'll shuffle the random number generator. I only do this so that if I repeat these steps in a new MATLAB session, I'll be sure to get a different result. After shuffling the random number generator using <tt>rng<\/tt>, a quick call to <tt>randperm<\/tt> randomly rearranges the positions.<\/p><pre class=\"codeinput\">rng <span class=\"string\">shuffle<\/span>\r\nshuffled_positions = positions(randperm(length(positions)));\r\n<\/pre><p>We've arrived at the last step: writing out the shuffled positions to a new file.<\/p><pre class=\"codeinput\">fid = fopen(<span class=\"string\">'shuffled_tactics.pgn'<\/span>,<span class=\"string\">'w'<\/span>);\r\n<span class=\"keyword\">for<\/span> k = 1:length(shuffled_positions)\r\n    position = shuffled_positions{k};\r\n    <span class=\"keyword\">for<\/span> p = 1:length(position)\r\n        fprintf(fid,<span class=\"string\">'%s\\n'<\/span>,position{p});\r\n    <span class=\"keyword\">end<\/span>\r\n<span class=\"keyword\">end<\/span>\r\nfclose(fid);\r\n<\/pre><p>And that's it!<\/p><p>Reading text, manipulating it in some useful way, and writing the results back out -- a common computing task accomplished using several basic MATLAB functions.<\/p><p class=\"footer\"><br>\r\n      Published with MATLAB&reg; R2013b<br><\/p><\/div>","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/steve\/2013\/position1.png\" onError=\"this.style.display ='none';\" \/><\/div><p>\r\nHere's an image of a chess position: And that's about as close to 
image processing as today's blog post will come. Because this post is really about text processing. It seems like a lot of... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/steve\/2013\/11\/03\/chess-and-a-little-text-file-manipulation\/\">read more >><\/a><\/p>","protected":false},"author":42,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[611,681,999,348,677,466,472,705,1037,623,807,190,1003],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/posts\/912"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/users\/42"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/comments?post=912"}],"version-history":[{"count":3,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/posts\/912\/revisions"}],"predecessor-version":[{"id":915,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/posts\/912\/revisions\/915"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/media?parent=912"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/categories?post=912"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/tags?post=912"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}