I have recently been using webread to scrape websites and mine information for data analysis.
This requires a lot of regular expressions. Regular expressions are one of those things that are incredibly frustrating but fun at the same time. I'm usually looking for some text pattern inside HTML tags, which means I'm going to be grabbing 'tokens': the unknown parts of the text that match a parenthesized piece of the expression.
Let's do a simple example where we grab the list of MathWorks' products from the website https://www.mathworks.com/products.html.
First, let's identify the pattern we'll be looking for. I like to do this in a web browser thanks to the syntax highlighting and other editor features. Here's a first view:
There are two patterns we need to parameterize. The first, in yellow, is the product reference. The second, in green, is the product name, which is the token we want to capture.
Now let's code this up. First, we'll grab the html.
html = webread('https://www.mathworks.com/products.html');
Next, we'll build the regular expression. It's always nice to keep the doc page open for this.
- Match the string literal "/products/"
- Match any word characters "\w" or hyphens "\-"
- As many times as possible "*"
- Match the next forward slash, closing double quote, and greater-than sign (the literal text /">)
- Start the token with an opening parenthesis (
- Match any word characters, hyphens, or whitespace "\s" as many times as possible.
- Close the token )
expr = '/products/[\w\-]*/">([\w\-\s]*)';
Run the regular expression capturing tokens.
tokens = regexp(html, expr, 'tokens');
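As a quick offline check of the expression, here is a minimal sketch that runs it on a small hand-written HTML fragment. The fragment and the variable names sample and names are made up for illustration; the markup of the real page may differ.

```matlab
% Hypothetical HTML fragment, written by hand for illustration only.
sample = ['<a href="/products/matlab/">MATLAB</a>' ...
          '<a href="/products/curve-fitting/">Curve Fitting Toolbox</a>'];

% Same expression as above: a product reference followed by the captured name.
expr = '/products/[\w\-]*/">([\w\-\s]*)';

% With 'tokens', regexp returns one cell per match, each holding that match's tokens.
names = regexp(sample, expr, 'tokens');

disp(names{1}{1})   % first captured product name
disp(names{2}{1})   % second captured product name
```

Each match comes back as its own one-element cell inside the outer cell array, which is the nesting that the next step deals with.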
This is where Yung-Yeh's file comes in. The output from regexp with the 'tokens' option is a nested cell array whose depth depends on the number of tokens and how they are nested. cellflat lets me flatten it, through as many levels as necessary, into a cell array of strings.
tokens = cellflat(tokens);
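cellflat is a File Exchange submission, so it has to be downloaded and on the path before the line above will run. For the simple two-level case, base MATLAB can do a similar job with a concatenation. This is a minimal sketch of that alternative on made-up data, not a description of how cellflat works internally, and it only removes one level of nesting.

```matlab
% Made-up stand-in for regexp(..., 'tokens') output: a cell array of 1-by-1 cells.
nested = {{'MATLAB'}, {'Simulink'}, {'Stateflow'}};

% Expanding with {:} and concatenating joins the inner cells side by side,
% which removes exactly one level of nesting.
flat = [nested{:}];

% flat is now a 1-by-3 cell array of character vectors.
disp(flat)
```

For deeper or irregular nesting this one-liner is not enough, which is where a general-purpose flattener is more convenient.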
Now we can look at the unique products with white space at the end removed.
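One way to do this, assuming tokens is already a flat cell array of character vectors, is to combine strtrim and unique. The data below is made up so the sketch runs on its own, and the variable name products is an assumption. Note that strtrim strips leading whitespace as well as trailing whitespace.

```matlab
% Made-up stand-in for the flattened tokens, with a duplicate and trailing whitespace.
tokens = {'Simulink ', 'MATLAB', sprintf('MATLAB\n'), 'Stateflow'};

% strtrim removes leading and trailing whitespace from every element;
% unique then sorts the names and drops the duplicates.
products = unique(strtrim(tokens));

disp(products')
```

Applied to the flattened tokens from the products page, a call of this shape would yield a list like the one below.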
    '...'
    'Aerospace Blockset'
    'Aerospace Toolbox'
    'Antenna Toolbox'
    'Audio System Toolbox'
    'Bioinformatics Toolbox'
    'Communications System Toolbox'
    'Computer Vision System Toolbox'
    'Control System Toolbox'
    'Curve Fitting Toolbox'
    'DO Qualification Kit'
    'DSP System Toolbox'
    'Data Acquisition Toolbox'
    'Database Toolbox'
    'Datafeed Toolbox'
    'Econometrics Toolbox'
    'Embedded Coder'
    'Filter Design HDL Coder'
    'Financial Instruments Toolbox'
    'Financial Toolbox'
    'Fixed-Point Designer'
    'Fuzzy Logic Toolbox'
    'Global Optimization Toolbox'
    'HDL Coder'
    'HDL Verifier'
    'IEC Certification Kit'
    'Image Acquisition Toolbox'
    'Image Processing Toolbox'
    'Instrument Control Toolbox'
    'LTE System Toolbox'
    'MATLAB'
    'MATLAB Coder'
    'MATLAB Compiler'
    'MATLAB Compiler SDK'
    'MATLAB Distributed Computing Server'
    'MATLAB Mobile'
    'MATLAB Production Server'
    'MATLAB Report Generator'
    'MATLAB for Home Use'
    'Mapping Toolbox'
    'Model Predictive Control Toolbox'
    'Model-Based Calibration Toolbox'
    'Neural Network Toolbox'
    'OPC Toolbox'
    'Optimization Toolbox'
    'Parallel Computing Toolbox'
    'Partial Differential Equation Toolbox'
    'Phased Array System Toolbox'
    'Polyspace Bug Finder'
    'Polyspace Code Prover'
    'RF Toolbox'
    'Robotics System Toolbox'
    'Robust Control Toolbox'
    'Signal Processing Toolbox'
    'SimBiology'
    'SimEvents'
    'SimRF'
    'Simscape'
    'Simscape Driveline'
    'Simscape Electronics'
    'Simscape Fluids'
    'Simscape Multibody'
    'Simscape Power Systems'
    'Simulink'
    'Simulink 3D Animation'
    'Simulink Code Inspector'
    'Simulink Coder'
    'Simulink Control Design'
    'Simulink Design Optimization'
    'Simulink Design Verifier'
    'Simulink Desktop Real-Time'
    'Simulink PLC Coder'
    'Simulink Real-Time'
    'Simulink Report Generator'
    'Simulink Test'
    'Simulink Verification and Validation'
    'Spreadsheet Link'
    'Stateflow'
    'Statistics and Machine Learning Toolbox'
    'Symbolic Math Toolbox'
    'System Identification Toolbox'
    'Trading Toolbox'
    'Vehicle Network Toolbox'
    'Vision HDL Toolbox'
    'WLAN System Toolbox'
    'Wavelet Toolbox'