File Exchange Pick of the Week

Our best user submissions

Extract text from PDF documents

Posted by Jiro Doke

Jiro’s pick this week is “Read text from a PDF document” by Derek Wood.

Ah, this is a nice entry. I was hoping for something like this. I keep track of my household expenses using MATLAB. I know, I know. Online banking now makes it easy to manage your expenses, but I like using MATLAB to give me various views into my finances. One of the tasks I currently do by hand is entering the expenses into my program. Some bank statements can be downloaded as CSV files, but one of my financial institutions only provides its statements as PDF files. For those statements, I have been typing in the entries manually.

Derek’s pdfRead lets me automate this! His function reads in any text information found in the PDF file. For a structured PDF file, like a bank statement, it’s fairly easy to extract the necessary information from that text.

Just to show you how it works, I saved our MathWorks Blogs top page as a PDF file.

Then, I simply called pdfRead.

p = pdfRead('blogs.pdf');
p{1}
ans =
    'Get the inside view on MATLAB & Simulink!
     Cleve’s Corner: Cleve Moler 
     on Mathematics and 
     Computing
     Scientific computing, math & more
     Loren on the Art of MATLAB
     Turn ideas into MATLAB
     Guy on Simulink
     Simulink & Model-Based Design
     Steve on Image Processing
     Concepts, algorithms & MATLAB
     File Exchange Pick of the 
     Week
     Our best user submissions
     Stuart’s MATLAB Videos
     Watch and Learn
     Developer Zone
     Advanced Software Development with 
     MATLAB
     Behind the Headlines
     MATLAB and Simulink behind today’s 
     news and trends
     Hans on IoT
     ThingSpeak, MATLAB, and the 
     Internet of Things
     Racing Lounge
     Best practices and teamwork for 
     student competitions
     MATLAB Community
     MATLAB, community & more
     Recent Posts
     JUL 20 Send Bulk Sensor Data to ThingSpeak for Analysis by Hans Scharler
     JUL 18 MIT’s new robot can 3D print a building... by Lisa Harvey
     JUL 17 What is the Condition Number of a Matrix? by Cleve Moler
     JUL 14 Juno Delivers by Steve Eddins (1)
     JUL 14 What are the functional inputs and outputs of... by Guest Picker
     JUL 12 Developing a Function that Replicates an Excel Worksheet... by Stuart McGarrity
     JUL 10 Web Scraping and Mining Unstructured Data with MATLAB by Loren Shure
     JUL 7 Watering my Plants with Simscape Fluids by Guy Rouleau
     JUL 6 Don’t Mock Me! by Andy Campbell
     JUL 5 Building practical skills through student competitions by Christoph Hahn
     JUN 30 Cody Turns One Million by Ned Gulley (2)'
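To give a feel for the "fairly easy to extract" part, here is a minimal sketch that pulls the date, title, and author out of the Recent Posts lines with a regular expression. It works on a few lines copied from the output above rather than on the PDF itself, so it runs without pdfRead. The variable names and the pattern are illustrative assumptions about this particular layout, not something pdfRead provides.

```matlab
% Sample lines copied from the output above. With the real data you
% would start from txt = p{1} instead (assumption: p{1} holds the text
% of the first page as a character vector, as the display suggests).
sampleLines = { ...
    'Recent Posts', ...
    'JUL 20 Send Bulk Sensor Data to ThingSpeak for Analysis by Hans Scharler', ...
    'JUL 14 Juno Delivers by Steve Eddins (1)', ...
    'JUL 7 Watering my Plants with Simscape Fluids by Guy Rouleau', ...
    'JUN 30 Cody Turns One Million by Ned Gulley (2)'};
txt = sprintf('%s\n', sampleLines{:});

% Split the text into individual lines and trim surrounding whitespace.
textLines = strtrim(regexp(txt, '\r?\n', 'split'));

% Expected layout: MONTH DAY TITLE by AUTHOR, optionally followed by a
% comment count in parentheses, which is matched but not captured.
pattern = ['^(?<month>[A-Z]{3}) (?<day>\d{1,2}) ', ...
           '(?<title>.+) by (?<author>.+?)(?: \(\d+\))?$'];
posts = regexp(textLines, pattern, 'names', 'once');

% Keep only the lines that matched and collect them into a struct array.
posts = [posts{~cellfun(@isempty, posts)}];

for k = 1:numel(posts)
    fprintf('%s %s | %s | %s\n', ...
        posts(k).month, posts(k).day, posts(k).title, posts(k).author);
end
```

Lines that do not fit the pattern, such as the "Recent Posts" heading, are simply skipped. A bank statement would need its own pattern (date, description, amount, and so on), but the split-then-match approach should carry over.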

Comments

Give it a try and let us know what you think here or leave a comment for Derek.


Published with MATLAB® R2017a

4 Comments

gm4704 replied on : 1 of 4

I tried using this, but it doesn’t work. Should we include any other library for this functionality to work?

gm4704 replied on : 3 of 4

Yeah, I did try that, but it says:

Warning: Invalid file or directory 'E:\xxxxxx\iText-4.2.0-com.itextpdf.jar'.
> In javaclasspath>local_validate_dynamic_path (line 266)
  In javaclasspath>local_javapath (line 182)
  In javaclasspath (line 119)
  In javaaddpath (line 71)

Are you sure the file “iText-4.2.0-com.itextpdf.jar” exists in your current folder?

I also suggest contacting the author of this File Exchange entry.
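The existence check suggested above can be sketched as follows. This is only an illustration: the JAR file name is taken from the warning message, and the assumption that it should sit in the current folder comes from the question above, not from a confirmed description of how pdfRead locates it.

```matlab
% File name copied from the warning message above.
jarName = 'iText-4.2.0-com.itextpdf.jar';

% Assumption: the JAR is expected in the current folder.
jarPath = fullfile(pwd, jarName);

if exist(jarPath, 'file') == 2
    % Add the JAR to the dynamic Java class path only if it is not
    % already there.
    if ~any(strcmp(javaclasspath('-dynamic'), jarPath))
        javaaddpath(jarPath);
    end
    fprintf('Found and loaded %s\n', jarPath);
else
    % This is the situation the "Invalid file or directory" warning
    % points to: nothing exists at the path handed to javaaddpath.
    fprintf('Could not find %s\n', jarPath);
end
```

If the second branch runs, the JAR is not where MATLAB is looking, so copying it into that folder (or pointing jarPath at its actual location) would be the first thing to try.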
