{"id":8767,"date":"2017-07-21T09:00:04","date_gmt":"2017-07-21T13:00:04","guid":{"rendered":"https:\/\/blogs.mathworks.com\/pick\/?p=8767"},"modified":"2017-07-21T00:36:58","modified_gmt":"2017-07-21T04:36:58","slug":"extract-text-from-pdf-documents","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/pick\/2017\/07\/21\/extract-text-from-pdf-documents\/","title":{"rendered":"Extract text from PDF documents"},"content":{"rendered":"\r\n\r\n<div class=\"content\"><p><a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/869871\">Jiro<\/a>&#8217;s pick this week is <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/63615-read-text-from-a-pdf-document\">&#8220;Read text from a PDF document&#8221;<\/a> by <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/9380516\">Derek Wood<\/a>.<\/p><p>Ah, this is a nice entry. I was hoping for something like this. I keep track of my household expenses using MATLAB. I know, I know. Online banking now makes it easy to manage your expenses, but I like using MATLAB to give me various views into my finances. One of the tasks I&#8217;m currently doing manually is entering the expenses into my program. Some bank statements can be downloaded as CSV files, but one of my financial institutions provides only PDF files for the statements. For those statements, I enter the data manually.<\/p><p>Derek&#8217;s <tt>pdfRead<\/tt> lets me automate this! His function reads in any text information found in the PDF file. 
For a structured PDF file, like a bank statement, it&#8217;s fairly easy to extract out the necessary information from that text.<\/p><p>Just to show you how it works, I saved our <a href=\"https:\/\/blogs.mathworks.com\/\">MathWorks Blogs<\/a> top page as a PDF file.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/pick\/jiro\/potw_pdf2text\/blog_toppage.png\" alt=\"\"> <\/p><p>Then, I simply called <tt>pdfRead<\/tt>.<\/p><pre class=\"codeinput\">p = pdfRead(<span class=\"string\">'blogs.pdf'<\/span>);\r\np{1}\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n    'Get the inside view on MATLAB &amp; Simulink!\r\n     Cleve&#8217;s Corner: Cleve Moler \r\n     on Mathematics and \r\n     Computing\r\n     Scientific computing, math &amp; more\r\n     Loren on the Art of MATLAB\r\n     Turn ideas into MATLAB\r\n     Guy on Simulink\r\n     Simulink &amp; Model-Based Design\r\n     Steve on Image Processing\r\n     Concepts, algorithms &amp; MATLAB\r\n     File Exchange Pick of the \r\n     Week\r\n     Our best user submissions\r\n     Stuart&#8217;s MATLAB Videos\r\n     Watch and Learn\r\n     Developer Zone\r\n     Advanced Software Development with \r\n     MATLAB\r\n     Behind the Headlines\r\n     MATLAB and Simulink behind today&#8217;s \r\n     news and trends\r\n     Hans on IoT\r\n     ThingSpeak, MATLAB, and the \r\n     Internet of Things\r\n     Racing Lounge\r\n     Best practices and teamwork for \r\n     student competitions\r\n     MATLAB Community\r\n     MATLAB, community &amp; more\r\n     Recent Posts\r\n     JUL 20 Send Bulk Sensor Data to ThingSpeak for Analysis by Hans Scharler\r\n     JUL 18 MIT&#8217;s new robot can 3D print a building... by Lisa Harvey\r\n     JUL 17 What is the Condition Number of a Matrix? by Cleve Moler\r\n     JUL 14 Juno Delivers by Steve Eddins (1)\r\n     JUL 14 What are the functional inputs and outputs of... 
by Guest Picker\r\n     JUL 12 Developing a Function that Replicates an Excel Worksheet... by Stuart McGarrity\r\n     JUL 10 Web Scraping and Mining Unstructured Data with MATLAB by Loren Shure\r\n     JUL 7 Watering my Plants with Simscape Fluids by Guy Rouleau\r\n     JUL 6 Don&#8217;t Mock Me! by Andy Campbell\r\n     JUL 5 Building practical skills through student competitions by Christoph Hahn\r\n     JUN 30 Cody Turns One Million by Ned Gulley (2)'\r\n<\/pre><p><b>Comments<\/b><\/p><p>Give it a try and let us know what you think <a href=\"https:\/\/blogs.mathworks.com\/pick\/?p=8767#respond\">here<\/a> or leave a <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/63615-read-text-from-a-pdf-document#comment\">comment<\/a> for Derek.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_20d4d850ab184a5b9c301c03a390f5b9() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='20d4d850ab184a5b9c301c03a390f5b9 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 20d4d850ab184a5b9c301c03a390f5b9';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2017 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_20d4d850ab184a5b9c301c03a390f5b9()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2017a<br><\/p><\/div><!--\r\n20d4d850ab184a5b9c301c03a390f5b9 ##### SOURCE BEGIN #####\r\n%%\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/869871 Jiro>'s\r\n% pick this week is\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/63615-read-text-from-a-pdf-document \"Read text\r\n% from a PDF document\"> by\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/9380516 Derek\r\n% Wood>.\r\n%\r\n% Ah, this is a nice entry. I was hoping for something like this. I keep\r\n% track of my household expenses using MATLAB. I know, I know. Online\r\n% banking now makes it easy to manage your expenses, but I like using MATLAB\r\n% to give me various views into my finances. One of the tasks I'm currently\r\n% doing manually is entering the expenses into my program. 
Some bank\r\n% statements can be downloaded as CSV files, but one of my financial\r\n% institutions provides only PDF files for the statements. For those\r\n% statements, I enter the data manually.\r\n%\r\n% Derek's |pdfRead| lets me automate this! His function reads in any text\r\n% information found in the PDF file. For a structured PDF file, like a bank\r\n% statement, it's fairly easy to extract out the necessary information from\r\n% that text.\r\n%\r\n% Just to show you how it works, I saved our <https:\/\/blogs.mathworks.com\/\r\n% MathWorks Blogs> top page as a PDF file.\r\n%\r\n% <<blog_toppage.png>>\r\n%\r\n% Then, I simply called |pdfRead|.\r\n\r\np = pdfRead('blogs.pdf');\r\np{1}\r\n\r\n%%\r\n% *Comments*\r\n%\r\n% Give it a try and let us know what you think\r\n% <https:\/\/blogs.mathworks.com\/pick\/?p=8767#respond here> or leave a\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/63615-read-text-from-a-pdf-document#comment\r\n% comment> for Derek.\r\n\r\n##### SOURCE END ##### 20d4d850ab184a5b9c301c03a390f5b9\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/pick\/jiro\/potw_pdf2text\/blog_toppage.png\" onError=\"this.style.display ='none';\" \/><\/div><p>\r\n\r\nJiro&#8217;s pick this week is &#8220;Read text from a PDF document&#8221; by Derek Wood. Ah, this is a nice entry. I was hoping for something like this. 
I keep track of my household expenses&#8230; <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/pick\/2017\/07\/21\/extract-text-from-pdf-documents\/\">read more >><\/a><\/p>","protected":false},"author":35,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[16],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/8767"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/users\/35"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/comments?post=8767"}],"version-history":[{"count":2,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/8767\/revisions"}],"predecessor-version":[{"id":8769,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/8767\/revisions\/8769"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/media?parent=8767"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/categories?post=8767"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/tags?post=8767"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}