XML and MATLAB: Navigating a Tree

This week I’m posting the third part in my series on using XML. Since I’ve had a request to cover this topic, I’ve moved it up in the schedule. We’ll be back to the new MATLAB R2010b features next week.

Last time in my XML in MATLAB series I explained the steps needed to create an XML DOM structure and build up an XML tree. This week I answer the question: “Now that I have a tree, how can I extract data from it?” I’ll continue to use the AddressBook example from the last post. Remember, you can create a new tree as described last time or read one into MATLAB using the xmlread function.

There are at least two ways to navigate the tree in MATLAB. Both of the ways I describe here once again take advantage of the Java environment that runs with MATLAB. The first makes use of the structure of the tree and the relationships between the nodes; the second uses the XPath language to precisely pick out a node. Once again, here is the example tree:

<?xml version="1.0" encoding="utf-8"?>
<AddressBook>
   <Entry>
      <Name>Friendly J. Mathworker</Name>
      <PhoneNumber>(508) 647-7000</PhoneNumber>
   </Entry>
</AddressBook>
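
If you don’t have the tree from the last post handy, here is a minimal sketch that rebuilds it so the snippets in this post have something to work on. It assumes the com.mathworks.xml.XMLUtils.createDocument helper that the xmlwrite documentation uses. The variable names (docNode, entry, fields) are only illustrative.

```matlab
% Create a document whose root element is "AddressBook"
docNode = com.mathworks.xml.XMLUtils.createDocument('AddressBook');

% Add a single "Entry" under the root
entry = docNode.createElement('Entry');
docNode.getDocumentElement.appendChild(entry);

% Add the Name and PhoneNumber children, each holding one text node
fields = {'Name', 'Friendly J. Mathworker'; 'PhoneNumber', '(508) 647-7000'};
for k = 1:size(fields, 1)
    child = docNode.createElement(fields{k, 1});
    child.appendChild(docNode.createTextNode(fields{k, 2}));
    entry.appendChild(child);
end

% Show the resulting XML as text
disp(xmlwrite(docNode));
```

A tree built this way contains no whitespace-only text nodes, which keeps the simple navigation code in this post short.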


Let’s say I want to find Friendly’s phone number. To do this I’m going to start at the root node, “AddressBook.” From there I will walk down the tree to AddressBook/Entry/PhoneNumber and get the text of the PhoneNumber node.

% Get the "AddressBook" node (docNode is the Document object for the tree)
addressBook = docNode.getDocumentElement;
% Get all the "Entry" nodes
entries = addressBook.getChildNodes;
% Get the first "Entry"'s children
% Remember that Java arrays are zero-based
friendlyInfo = entries.item(0).getChildNodes;
% Iterate over the nodes to find the "PhoneNumber"
% Once there are no more siblings, "node" will be empty
node = friendlyInfo.getFirstChild;
while ~isempty(node)
    if strcmpi(node.getNodeName, 'PhoneNumber')
        break;
    else
        node = node.getNextSibling;
    end
end
phoneNumber = node.getTextContent

phoneNumber =

(508) 647-7000



The getChildNodes() method returns a list of nodes. There are several ways to navigate the returned node list. In the above example I used getFirstChild(), which returns the first child (in this case, the Name node). Then, using the getNextSibling() method, I can walk through all the other child nodes to find the one I’m looking for, which in this case is PhoneNumber. I used the getNodeName() method to get the name of the node as a string in order to compare it with “PhoneNumber.” If you’re looking at the methods of a node, note that for element nodes getNodeName() returns the same string as getTagName().

Once I have the desired node, I use the getTextContent() method to get the text inside the <PhoneNumber></PhoneNumber> tags. Note that if there are multiple PhoneNumber child nodes of the Entry, the loop will stop after finding the first one.
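
The sibling walk above assumes that every child is an element and that a PhoneNumber exists. Here is a hedged sketch of a more defensive version, for a tree read from an indented file with xmlread, where whitespace between tags typically shows up as extra text nodes. The temporary file and the names ELEMENT_NODE and xmlFile are my own for illustration; the value 1 is the DOM constant Node.ELEMENT_NODE.

```matlab
% Write a small indented AddressBook file and read it back with xmlread
xmlFile = [tempname '.xml'];
fid = fopen(xmlFile, 'w');
fprintf(fid, '%s\n', ...
    '<?xml version="1.0" encoding="utf-8"?>', ...
    '<AddressBook>', ...
    '  <Entry>', ...
    '    <Name>Friendly J. Mathworker</Name>', ...
    '    <PhoneNumber>(508) 647-7000</PhoneNumber>', ...
    '  </Entry>', ...
    '</AddressBook>');
fclose(fid);
docNode = xmlread(xmlFile);
delete(xmlFile);

ELEMENT_NODE = 1;  % value of org.w3c.dom.Node.ELEMENT_NODE

% Walk the children of the first Entry, skipping anything that is not an element
entry = docNode.getElementsByTagName('Entry').item(0);
node = entry.getFirstChild;
phoneNumber = '';
while ~isempty(node)
    if node.getNodeType == ELEMENT_NODE && strcmp(char(node.getNodeName), 'PhoneNumber')
        phoneNumber = char(node.getTextContent);
        break;
    end
    node = node.getNextSibling;
end

% Report the result without erroring when nothing was found
if isempty(phoneNumber)
    disp('No PhoneNumber element found.');
else
    disp(phoneNumber);
end
```

Checking the node type first means the loop never depends on how the file happened to be indented.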

Another way to iterate over the children is to use the item() method. Note that since this is a Java node list, the indices go from 0 to getLength - 1.

for i = 0:friendlyInfo.getLength - 1
    if strcmpi(friendlyInfo.item(i).getTagName, 'PhoneNumber')
        phoneNumber = friendlyInfo.item(i).getTextContent
    end
end

phoneNumber =

(508) 647-7000
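
The same item() loop can collect every child into a MATLAB struct in one pass. This is only a sketch: it assumes each element name (Name, PhoneNumber) is a valid MATLAB field name, and the names info, children, and ELEMENT_NODE are illustrative rather than part of any API.

```matlab
% Rebuild the one-entry AddressBook tree
docNode = com.mathworks.xml.XMLUtils.createDocument('AddressBook');
entry = docNode.createElement('Entry');
docNode.getDocumentElement.appendChild(entry);
fields = {'Name', 'Friendly J. Mathworker'; 'PhoneNumber', '(508) 647-7000'};
for k = 1:size(fields, 1)
    child = docNode.createElement(fields{k, 1});
    child.appendChild(docNode.createTextNode(fields{k, 2}));
    entry.appendChild(child);
end

ELEMENT_NODE = 1;  % value of org.w3c.dom.Node.ELEMENT_NODE

% Copy each element child of the Entry into a struct field named after its tag
children = entry.getChildNodes;
info = struct();
for i = 0:children.getLength - 1   % Java indices are zero-based
    child = children.item(i);
    if child.getNodeType == ELEMENT_NODE
        info.(char(child.getNodeName)) = char(child.getTextContent);
    end
end
disp(info);
```

After the loop, info.PhoneNumber and info.Name hold ordinary MATLAB character arrays, so the rest of your code no longer needs to touch the Java objects.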



Instead of iterating to find the PhoneNumber node, we can use the getElementsByTagName() method to find all the elements in the subtree that have a certain name. This returns a list of matching nodes, which we can iterate over, but since I know there’s only one PhoneNumber I just grab the 0th element:

phoneNumber = friendlyInfo.getElementsByTagName('PhoneNumber').item(0).getTextContent

phoneNumber =

(508) 647-7000
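
When there is more than one match, the list returned by getElementsByTagName() can be walked with the same item() pattern. In this sketch the second entry (“Example Person” and its number) is made up purely for illustration, as are the variable names.

```matlab
% Build an AddressBook with two entries; the second entry is invented for the example
docNode = com.mathworks.xml.XMLUtils.createDocument('AddressBook');
addressBook = docNode.getDocumentElement;
names  = {'Friendly J. Mathworker', 'Example Person'};
phones = {'(508) 647-7000', '(555) 010-0000'};
for k = 1:numel(names)
    entry = docNode.createElement('Entry');
    nameNode = docNode.createElement('Name');
    nameNode.appendChild(docNode.createTextNode(names{k}));
    entry.appendChild(nameNode);
    phoneNode = docNode.createElement('PhoneNumber');
    phoneNode.appendChild(docNode.createTextNode(phones{k}));
    entry.appendChild(phoneNode);
    addressBook.appendChild(entry);
end

% Find every PhoneNumber element anywhere under the document
phoneNodes = docNode.getElementsByTagName('PhoneNumber');

% Copy the text of each match into a cell array (Java index i, MATLAB index i+1)
allPhones = cell(1, phoneNodes.getLength);
for i = 0:phoneNodes.getLength - 1
    allPhones{i + 1} = char(phoneNodes.item(i).getTextContent);
end
disp(allPhones);
```

The matches come back in document order, so allPhones lines up with the order of the entries in the tree.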



Using XPath
XPath is a language for finding nodes in an XML document, and an implementation comes with Java. It works much like Java’s regular expression engine: you create a string that represents the nodes you want to match, compile it to an internal representation, and then evaluate it against your document. It’s an advanced step, and I can’t think of anything in regular MATLAB that works the same way. XPath expressions can start either from the top of the tree or from anywhere within a document or document fragment. Node paths are written like directory paths: “..” goes up a level, “.” is the current level, and nodes are separated by forward slashes, “/”. In our example, the path to an entry’s phone number is “AddressBook/Entry/PhoneNumber.” “//” matches anywhere in the document, so “//PhoneNumber” would match the same nodes.

To use XPath, you first need to create an XPath object from the XPath factory. In the example below, I’ve first imported the xpath package to make it easier to type out the various Java class names. Once you have an XPath object, you can then compile and evaluate the expression.

% Get the XPath mechanism into the workspace
import javax.xml.xpath.*
factory = XPathFactory.newInstance;
xpath = factory.newXPath;

% Compile and evaluate the XPath expression
expression = xpath.compile('AddressBook/Entry/PhoneNumber');
phoneNumberNode = expression.evaluate(docNode, XPathConstants.NODE);
phoneNumber = phoneNumberNode.getTextContent

phoneNumber =

(508) 647-7000
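
Because an expression can be evaluated from any node, not just the document, the “..” and “//” syntax described earlier can be tried directly. The following is a rough sketch under the same one-entry tree. The names phoneNode and nameFromPhone are mine, and the char() calls are only there to guarantee MATLAB character arrays.

```matlab
import javax.xml.xpath.*

% Rebuild the one-entry AddressBook tree
docNode = com.mathworks.xml.XMLUtils.createDocument('AddressBook');
entry = docNode.createElement('Entry');
docNode.getDocumentElement.appendChild(entry);
fields = {'Name', 'Friendly J. Mathworker'; 'PhoneNumber', '(508) 647-7000'};
for k = 1:size(fields, 1)
    child = docNode.createElement(fields{k, 1});
    child.appendChild(docNode.createTextNode(fields{k, 2}));
    entry.appendChild(child);
end

factory = XPathFactory.newInstance;
xpath = factory.newXPath;

% "//" finds PhoneNumber anywhere in the document, starting from the document node
anywhere = xpath.compile('//PhoneNumber');
phoneNode = anywhere.evaluate(docNode, XPathConstants.NODE);

% Starting from the PhoneNumber node, ".." steps up to Entry, then down to Name
upAndOver = xpath.compile('../Name');
nameFromPhone = char(upAndOver.evaluate(phoneNode, XPathConstants.STRING));

% "." is the current node, so this returns the phone number text itself
here = xpath.compile('.');
phoneText = char(here.evaluate(phoneNode, XPathConstants.STRING));

disp(nameFromPhone);
disp(phoneText);
```

The first argument to evaluate() is the context node, which is what makes the relative path “../Name” meaningful here.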



In the above example, the evaluate() method of the compiled XPath expression takes the node to start from and an XPathConstants value. This constant tells the expression what type of result to return. In this case, we’ve asked for a NODE, and so we get back the matching node object. But if we change the constant to STRING, we get back the text of the matched node directly, as in the next example. You can also ask for NODESETs, NUMBERs, and BOOLEANs.

phoneNumber = expression.evaluate(docNode, XPathConstants.STRING)

phoneNumber =

(508) 647-7000
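
The other result types work the same way. Below is a small sketch of NODESET, NUMBER, and BOOLEAN on a two-entry tree. The second entry is invented for illustration, and count() is a standard XPath 1.0 function. I’m assuming MATLAB’s usual conversion of Java return values (Double to double, Boolean to logical), so check the class of the results on your release.

```matlab
import javax.xml.xpath.*

% Build an AddressBook with two entries; the second entry is invented for the example
docNode = com.mathworks.xml.XMLUtils.createDocument('AddressBook');
people = {'Friendly J. Mathworker', '(508) 647-7000'; 'Example Person', '(555) 010-0000'};
for k = 1:size(people, 1)
    entry = docNode.createElement('Entry');
    nameNode = docNode.createElement('Name');
    nameNode.appendChild(docNode.createTextNode(people{k, 1}));
    entry.appendChild(nameNode);
    phoneNode = docNode.createElement('PhoneNumber');
    phoneNode.appendChild(docNode.createTextNode(people{k, 2}));
    entry.appendChild(phoneNode);
    docNode.getDocumentElement.appendChild(entry);
end

factory = XPathFactory.newInstance;
xpath = factory.newXPath;

% NODESET: every matching node, returned as a zero-based node list
nameNodes = xpath.compile('//Entry/Name').evaluate(docNode, XPathConstants.NODESET);
allNames = cell(1, nameNodes.getLength);
for i = 0:nameNodes.getLength - 1
    allNames{i + 1} = char(nameNodes.item(i).getTextContent);
end

% NUMBER: the numeric value of the expression, here a count of Entry elements
numEntries = xpath.compile('count(//Entry)').evaluate(docNode, XPathConstants.NUMBER);

% BOOLEAN: true when the expression matches at least one node
hasPhone = xpath.compile('//PhoneNumber').evaluate(docNode, XPathConstants.BOOLEAN);

disp(allNames);
disp(numEntries);
disp(hasPhone);
```

NODESET is the one to reach for when an expression can match several nodes, since NODE and STRING only give you the first match.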



XPath is a complicated topic and probably worthy of its own follow-up post. The language is rich enough to precisely pick out any node, entity, attribute, or other piece of data from an XML document, starting anywhere in the tree.

This has been a meatier post than most for me, so please ask lots of follow-up questions or leave comments.
