MATLAB Spoken Here

XML and MATLAB: Navigating a Tree

Posted by Michael Katz,

This week I’m posting the third part in my series on using XML. Since I’ve had a request to cover this topic, I’ve moved it up in the schedule. We’ll be back to the new MATLAB R2010b features next week.

Last time in my XML in MATLAB series I explained the steps needed to create an XML DOM structure and build up an XML tree. This week I answer the question: “Now that I have a tree, how can I extract data from it?” I’ll continue to use the AddressBook example from the last post. Remember, you can create a new tree as shown last time, or read an existing one into MATLAB using the xmlread function.


There are at least two ways to navigate the tree in MATLAB. Both of the ways I describe here once again take advantage of the Java environment that runs with MATLAB. The first way makes use of the structure of the tree and the relationships between the nodes; the second uses the XPath language to precisely pick out a node. Once again, here is the example tree:

<?xml version="1.0" encoding="utf-8"?>
<AddressBook>
   <Entry>
      <Name>Friendly J. Mathworker</Name>
      <PhoneNumber>(508) 647-7000</PhoneNumber>
      <Address hasZip="no" type="work">3 Apple Hill Dr, Natick MA</Address>
   </Entry>
</AddressBook>

 

Let’s say I want to find Friendly’s phone number. To do this I’m going to start at the root node, “AddressBook.” From there I will walk down the tree to AddressBook/Entry/PhoneNumber and get the text of the PhoneNumber node.

% Get the "AddressBook" node
addressBookNode = docNode.getDocumentElement;
% Get all the "Entry" nodes
entries = addressBookNode.getChildNodes;
% Get the first "Entry"'s children
% Remember that java arrays are zero-based
friendlyInfo = entries.item(0).getChildNodes;
% Iterate over the nodes to find the "PhoneNumber"
% once there are no more siblings, "node" will be empty
node = friendlyInfo.getFirstChild;
while ~isempty(node)
    if strcmpi(node.getNodeName, 'PhoneNumber')
        break;
    else
        node = node.getNextSibling;
    end
end
phoneNumber = node.getTextContent
 
phoneNumber =
 
(508) 647-7000
 

 

The getChildNodes() method returns a list of nodes, and there are several ways to navigate it. In the above example I used getFirstChild(), which returns the first child (in this case, the Name node). Then, using the getNextSibling() method, I walked through the other child nodes to find the one I was looking for, in this case PhoneNumber. I used the getNodeName() method to get the name of each node as a string in order to compare it with “PhoneNumber.” If you’re looking at the methods of an element node, getNodeName() returns the same value as getTagName().

Once I had the desired node, I used the getTextContent() method to get the text inside the <PhoneNumber></PhoneNumber> tags. Note that if there are multiple PhoneNumber child nodes of the Entry, this will stop after finding the first one.
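One caveat if you read the tree from a file with xmlread instead of building it in memory: the whitespace between tags typically shows up as text nodes, so the first child of a node may be a "#text" node rather than the element you expect. The following is a minimal sketch of the same walk that skips non-element nodes by checking getNodeType. It assumes the example XML above (shortened here) and writes it to a temporary file only so that the snippet is self-contained.

```matlab
% Write the example XML to a temporary file so that xmlread can parse it.
xmlFile = [tempname '.xml'];
fid = fopen(xmlFile, 'w');
fprintf(fid, '%s\n', ...
    '<?xml version="1.0" encoding="utf-8"?>', ...
    '<AddressBook>', ...
    '   <Entry>', ...
    '      <Name>Friendly J. Mathworker</Name>', ...
    '      <PhoneNumber>(508) 647-7000</PhoneNumber>', ...
    '   </Entry>', ...
    '</AddressBook>');
fclose(fid);
docNode = xmlread(xmlFile);
delete(xmlFile);

% Find the first element child of the root, skipping whitespace text nodes.
addressBookNode = docNode.getDocumentElement;
entryNode = addressBookNode.getFirstChild;
while ~isempty(entryNode) && entryNode.getNodeType ~= entryNode.ELEMENT_NODE
    entryNode = entryNode.getNextSibling;
end

% Walk the children of the Entry, looking only at element nodes.
phoneNumber = '';
node = entryNode.getFirstChild;
while ~isempty(node)
    if node.getNodeType == node.ELEMENT_NODE && ...
            strcmp(char(node.getNodeName), 'PhoneNumber')
        phoneNumber = char(node.getTextContent);
        break;
    end
    node = node.getNextSibling;
end
```

The char() calls convert the returned Java strings into ordinary MATLAB character arrays, which is usually what you want before passing the value on to other functions.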

Another way to iterate over the children is to use the item() method. Note that since this is a Java node list, the indices go from 0 to length-1.

for i=0:friendlyInfo.getLength - 1
    if strcmpi(friendlyInfo.item(i).getTagName, 'PhoneNumber')
        phoneNumber = friendlyInfo.item(i).getTextContent
    end
end
 
phoneNumber =
 
(508) 647-7000
 

 

Instead of iterating to find the PhoneNumber node, we can use the getElementsByTagName() method to find all the elements in the subtree that have a certain name. This returns a list of matching nodes, which we can iterate over, but since I know there’s only one PhoneNumber I just grabbed the 0th element:

phoneNumber = friendlyInfo.getElementsByTagName('PhoneNumber').item(0).getTextContent
 
phoneNumber =
 
(508) 647-7000
 

 

Using XPath
XPath is a language for finding nodes in an XML document, and an implementation comes with Java. It works similarly to Java’s regular expression engine, in that you create a string that represents the nodes you want to match, compile that to an internal representation, and then evaluate it on your document. It’s an advanced step, and I can’t think of anything in regular MATLAB that works the same way. XPath expressions can start either from the top of the tree or anywhere within a document or document fragment. Node paths are represented like directory paths, in that “..” goes up a level, “.” is the same level, and nodes are separated by forward slashes, “/”. In our example, the phone number of the first entry would be “AddressBook/Entry/PhoneNumber.” “//” represents anywhere in the document, so “//PhoneNumber” would also match the same nodes.

To use XPath, you first need to create an XPath object from the XPath factory. In the example below, I’ve first imported the xpath package to make it easier to type out all these various Java classes. Once you have an XPath object, you can then compile and evaluate the expression.

% get the xpath mechanism into the workspace
import javax.xml.xpath.*
factory = XPathFactory.newInstance;
xpath = factory.newXPath;

% compile and evaluate the XPath Expression
expression = xpath.compile('AddressBook/Entry/PhoneNumber');
phoneNumberNode = expression.evaluate(docNode, XPathConstants.NODE);
phoneNumber = phoneNumberNode.getTextContent
 
phoneNumber =
 
(508) 647-7000
 

 

In the above example, the evaluate() method of the compiled XPath expression takes the node to start from and one of the XPathConstants. This constant tells the expression what type of result to return. In this case, we’ve asked for a NODE, and so we get back the matching node object. But if we change the constant to STRING, we get back the text of the matched node directly, as in the next example. You can also ask for NODESETs, NUMBERs, and BOOLEANs.

phoneNumber = expression.evaluate(docNode, XPathConstants.STRING)
phoneNumber =

(508) 647-7000
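For completeness, here is a hedged sketch of the NODESET case, which returns a list of all matching nodes rather than just the first. To keep it self-contained it builds a small AddressBook tree in memory; the second entry is made up purely for illustration.

```matlab
import javax.xml.xpath.*

% Build a small AddressBook tree with two entries (the second one is made up).
docNode = com.mathworks.xml.XMLUtils.createDocument('AddressBook');
root = docNode.getDocumentElement;
people = {'Friendly J. Mathworker',    '(508) 647-7000'; ...
          'Hypothetical Q. Colleague', '(508) 555-0100'};
for k = 1:size(people, 1)
    entry = docNode.createElement('Entry');
    nameNode = docNode.createElement('Name');
    nameNode.appendChild(docNode.createTextNode(people{k, 1}));
    phoneNode = docNode.createElement('PhoneNumber');
    phoneNode.appendChild(docNode.createTextNode(people{k, 2}));
    entry.appendChild(nameNode);
    entry.appendChild(phoneNode);
    root.appendChild(entry);
end

% Compile the expression and ask for every matching node.
factory = XPathFactory.newInstance;
xpath = factory.newXPath;
expression = xpath.compile('//PhoneNumber');
nodeList = expression.evaluate(docNode, XPathConstants.NODESET);

% The result is a zero-based node list; copy the text into a cell array.
phoneNumbers = cell(1, nodeList.getLength);
for i = 0:nodeList.getLength - 1
    phoneNumbers{i + 1} = char(nodeList.item(i).getTextContent);
end
```

Note that getTextContent is a method of an individual node, not of the list, so you need to pull out each item() before asking for its text.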

 

XPath is a complicated topic and probably worthy of its own follow-up post. The language is rich enough to precisely pick out any node, entity, attribute, or other piece of data from an XML document starting anywhere in the tree.
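As a small taste of that, the sketch below reads an attribute value with the “@” syntax and uses a predicate in square brackets to select an element based on an attribute of its sibling. It assumes a single-entry tree modeled on the example above, built in memory so the snippet stands alone; the variable names are just for illustration.

```matlab
import javax.xml.xpath.*

% Build a one-entry AddressBook tree modeled on the example above.
docNode = com.mathworks.xml.XMLUtils.createDocument('AddressBook');
root = docNode.getDocumentElement;
entry = docNode.createElement('Entry');
root.appendChild(entry);
phoneNode = docNode.createElement('PhoneNumber');
phoneNode.appendChild(docNode.createTextNode('(508) 647-7000'));
entry.appendChild(phoneNode);
addressNode = docNode.createElement('Address');
addressNode.setAttribute('hasZip', 'no');
addressNode.setAttribute('type', 'work');
addressNode.appendChild(docNode.createTextNode('3 Apple Hill Dr, Natick MA'));
entry.appendChild(addressNode);

factory = XPathFactory.newInstance;
xpath = factory.newXPath;

% "@type" selects the attribute named "type" on the Address element.
typeExpression = xpath.compile('//Address/@type');
addressType = char(typeExpression.evaluate(docNode, XPathConstants.STRING));

% A predicate in [] filters nodes: the PhoneNumber of any Entry
% whose Address has type="work".
workExpression = xpath.compile('//Entry[Address/@type="work"]/PhoneNumber');
workPhone = char(workExpression.evaluate(docNode, XPathConstants.STRING));
```

If no node matches, asking for a STRING gives back an empty string rather than an error, so it is worth checking the result before using it.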

This has been a meatier post than most for me, so please ask lots of follow-up questions or leave comments.


27 Comments

Hi Michael,
Thanks for your information on getting text from XML files. However, being a beginner MATLAB user, I am unsure how to export the information into a text or Excel file once I have obtained the text.
I look forward to hearing from you!
Cassandra

Hi Michael,

Sorry to be a pain, I just worked it out by using char(), which converts it from a java.lang.String to a character array…

This has been bugging me and I’m so thankful to have worked it through!!

Thanks
Cassandra

Hello,
thanks for this great help.
But I have a question concerning this topic. I have an XML file that contains nodes at the same level with the same name. Only the attribute 'ID' is different. In your example this would be a second 'Entry' node. The nodes have the attributes ID="1" and ID="2", respectively ( ….). Is there a way to navigate through the XML by this attribute?

Thanks,
Thomas

Thomas,
If you’re trying to retrieve an element, say “<PhoneNumber>”, with a specific value in its ID attribute, say “work”, something like this might work:

% compile and evaluate the XPath Expression
expression = xpath.compile('//PhoneNumber[@ID=''work'']');
phoneNumberNode = expression.evaluate(docNode, XPathConstants.NODE);
phoneNumber = phoneNumberNode.getTextContent

There’s more info on the XPath syntax here: http://www.w3schools.com/xpath/xpath_syntax.asp

There are a number of XML-related entries on the File Exchange, including one by Matthew Simoneau that shows an example using a NODESET in XPath, which you probably would use if you wanted to retrieve ALL of the nodes that had an “ID” attribute for processing: http://www.mathworks.com/matlabcentral/fileexchange/31382-using-xpath-from-matlab/content/html/xpath.html

Hope that helps.
Rich

Let’s say that I am trying to get a specific number out of the xml file for a specific variable to be used in another function. I have been able to extract it as a text so that I can see the number, but not in a way that I can use it as an input in another function. Please let me know how I can accomplish this.

@MM,

Do you just need to convert from string to a numeric type? Take a look at the STR2NUM function.

E.g.

heightString = heightNode.getTextContent;
heightNum = str2num(char(heightString));

Epic fail.
xpath and Matlab – great for basic structures but completely fails when you introduce namespaces to the xml!

Hi Michael!

I wonder what ‘docNode’ is and how do you get it? It is not obvious from your example!

Thank you in advance,
Regards
/Nasser Hosseini

@Dave,
I’m not sure what you mean. Can you provide an example? Namespaces should be addable to the nodes.

@Nasser,
Thanks for that oversight. I explained it in the previous part about creating nodes, but for reference here it is for the example:

docNode = com.mathworks.xml.XMLUtils.createDocument('AddressBook');

Hi,

I am having problems with retrieving data from an xml file. I think I have followed the instructions here.

When I try the following code, the coordinate node returns an empty element. Why is the node not identified?

I’m a beginner so I don’t know if I’ve imported and set up the factory correctly, or if there is something else I don’t understand.

Thanks,
Charlie

filename='/home/raid/itg/carh5/data_drive/iseo_field_data/bathymetry/Canale_centreline_googleearth.kml';
docNode = xmlread(filename);
documentNode = docNode.getDocumentElement
%%

import javax.xml.xpath.*
factory = XPathFactory.newInstance;
xpath = factory.newXPath;

% compile and evaluate the XPath Expression
expression = xpath.compile('Document/Placemark/LineString/coordinates');
coordinateNode = expression.evaluate(documentNode, XPathConstants.NODE)
data = coordinateNode.getTextContent

Here is also the xml description

Canale centreline.kml

Canale centreline

10.09984114068684,45.80687219300302,0 10.10001647099475,45.80695514950003,0

That xml code didn’t come out right.

Let me try again

Canale centreline.kml

Canale centreline

#m_ylw-pushpin

1

10.09984114068684,45.80687219300302,0 10.10001647099475,45.80695514950003,0

I couldn’t get to the bottom of how to use xpath unfortunately, so I read the xml data using the example shown in the xmlread documentation. This is not ideal, but at least I could get it to work. I couldn’t work out how to learn how the java language worked.

Suggestions would be very welcome.

@Charlie,

It’s hard to say what is going on without understanding the XML. Unfortunately, with our blog software you’d have to replace <’s and >’s with &lt; and &gt; to get them to show up here. One thing that you might try is replacing the expression with something like

 //Placemark/LineString/coordinates

since you’re applying the expression on the document node. I’m not sure what the expectation is with your document, or what its structure is. Also, you might want to try using XPathConstants.NODESET instead of XPathConstants.NODE, to get back the set of all the matching nodes.

Thanks Michael.

I tried your suggestions. Using // made no difference: I still got the output

coordinateNode =
[]
??? Attempt to reference field of non-structure array.

I also used different nodes (for example //coordinates) but this gave no improvement.

I tried your second suggestion of using NODESET, and got the following response:

>> coordinateNode = expression.evaluate(docNode, XPathConstants.NODESET)
coordinateNode =
net.sf.saxon.dom.DOMNodeList@c3e000
>> data = coordinateNode.getTextContent
??? No appropriate method, property, or field getTextContent for class
net.sf.saxon.dom.DOMNodeList.

>> data = coordinateNode.getLength
data =
0

This is essentially the same response: the node coordinates are not being picked up by the function.

I am struggling to debug this because I don’t really know how the java classes and methods work. Presumably I need a single node to be able to use the getTextContent method.

Here are the essential parts of the xml tree with your suggested replacement. Hopefully this will help you see what is going on here.

To give some context, this data is produced by google earth when exporting a set of locations as a .kml file.

<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2">

<Document>

<name>Canale centreline.kml</name>

<Placemark>

<name>Canale centreline</name>

<styleUrl>#m_ylw-pushpin</styleUrl>

<LineString>

<tessellate>1</tessellate>

<coordinates>
10.09984114068684,45.80687219300302,0 10.10001647099475,45.80695514950003,0 10.10009815060466,45.80700378862793,0 10.10014860229519,45.80703578631482,0 10.10022811504785,45.80709873377793,0 10.10031010039278,45.80713198033737,0 10.10039209567001,45.80716523014203,0 10.10043060939518,45.80721155298124,0 10.10059165048882,45.80731526414375,0 10.10067084675329,45.80738580284769,0 10.10077437654077,45.80742018705158,0 10.10087407286891,45.80749903339635,0 10.10094496915137,45.80753930965439,0 10.10105627273914,45.8076036518462,0 10.10113576682277,45.80766657410192,0 10.10119471893764,45.80772115465624,0 10.10126359962793,45.80778359269938,0 10.10136426386506,45.80784743536302,0 10.1014112702849,45.80791630897777,0 10.10148214037166,45.80795656414318,0 10.10154185964934,45.80800378862803,0 10.10161282555237,45.80804408470648,0 10.10169305938987,45.8080996247984,0 10.10176464185466,45.8081325076318,0 10.10183422692097,45.80818754113409,0 10.10190302533531,45.80824989951307,0 10.1019724151388,45.80830484378664,0 10.10207356450982,45.808361196677,0 10.10214093307033,45.80843827920521,0 10.10218914996636,45.80849228359421,0 10.10229299490651,45.80851911841853,0 10.10237499281356,45.8085523932947,0 10.10242320989466,45.80860639765308,0 10.10253629440558,45.80864846068262,0 10.10261559800962,45.80871125354677,0 10.10273070367636,45.80873117766292,0 10.10279152902868,45.80876351326417,0 10.10288478779878,45.80878987758641,0 10.1029859390649,45.80884622983381,0 10.10308911157812,45.80888044325135,0 10.10321278369886,45.80892297497288,0 10.10330469593155,45.80896409855487,0 10.10338602169821,45.80900475219077,0 10.1035831279132,45.80905795013989,0 10.10369554080581,45.8091073915239,0 10.10380930078051,45.80914207396828,0 10.1039243924581,45.80916198901878,0 10.10410953675727,45.80922946173238,0 10.1042655887779,45.80926599718044,0 10.10448377732788,45.80932009535048,0 10.10464581905896,45.80937169685852,0 10.10519473741717,45.80950956495738,0 10.10612207616,45.80976848013496,0 
10.10668667174301,45.80991931642902,0 10.10735617652176,45.81011904635547,0 10.10755079422222,45.81019061814287,0 10.10780348654802,45.81027639187093,0 10.10803597017285,45.81037553650121,0 10.10821951016236,45.81046710032802,0 10.10838297427577,45.8105858071117,0 10.10848814678001,45.81069003563514,0 10.10861187605076,45.8108356424888,0 10.10876378535779,45.81102960623957,0 10.10879264406763,45.81105054198222,0 10.10880175134389,45.81107809248236,0 10.10883039778439,45.81110587300541,0 10.10885887582951,45.81114051062271,0 10.10886831282799,45.81115434247666,0 10.10890672473159,45.8111822327723,0 10.10893519903836,45.8112168662463,0 10.1089540707507,45.81124452682191,0 10.10897294201804,45.8112721866723,0 10.10899181299001,45.81129984645426,0 10.10901068384033,45.81132750526101,0 10.10903931941305,45.81135527398825,0 10.1090678193733,45.81138299496511,0 10.10908672014586,45.81140374559718,0 10.10910543183352,45.8114313447435,0 10.1091438470374,45.81145922269521,0 10.10916306040502,45.8114869991964,0
</coordinates>

</LineString>

</Placemark>

</Document>

</kml>

Ah, I get it now. This is because you have an xmlns declaration in your document. The Java XPath implementation doesn’t know how to handle this on its own, so you need to supply a NamespaceContext.

Try the following. Save this code as KMLNamesspaceContext.java:

import java.util.*;
import javax.xml.*;
import javax.xml.namespace.NamespaceContext;

public class KMLNamesspaceContext implements NamespaceContext {

    public String getNamespaceURI(String prefix) {
        if (prefix == null) throw new NullPointerException("Null prefix");
        else if ("kml".equals(prefix)) return "http://www.opengis.net/kml/2.2";
        else if ("xml".equals(prefix)) return XMLConstants.XML_NS_URI;
        return XMLConstants.NULL_NS_URI;
    }

    public String getPrefix(String uri) {
        throw new UnsupportedOperationException();
    }

    public Iterator getPrefixes(String uri) {
        throw new UnsupportedOperationException();
    }

}

Then in MATLAB, compile this to a class:

!javac KMLNamesspaceContext.java
javaaddpath(pwd)
nc = KMLNamesspaceContext

Now use that with your expression:

factory = XPathFactory.newInstance;
xpath = factory.newXPath;
xpath.setNamespaceContext(nc);
expression = xpath.compile('//kml:Document')
expression.evaluate(docNode, XPathConstants.NODE)

Hi Michael,

Thanks again for the help.

I still get the same error of

coordinateNode =
[]
??? Attempt to reference field of non-structure array.

Error in ==> xpath_setup at 34
data = coordinateNode.getTextContent .

I have implemented your suggestion as

!/opt/sunjava-native/jdk/bin/javac KMLNamesspaceContext.java
javaaddpath(pwd)
nc = KMLNamesspaceContext

filename='/home/raid/itg/carh5/data_drive/iseo_field_data/bathymetry/Canale_centreline_googleearth.kml';
docNode = xmlread(filename);
documentNode = docNode.getDocumentElement

import javax.xml.xpath.*
factory = XPathFactory.newInstance;
xpath = factory.newXPath;
xpath.setNamespaceContext(nc);
expression = xpath.compile('//kml:Documents')
coordinateNode=expression.evaluate(docNode, XPathConstants.NODE)
data = coordinateNode.getTextContent

I have a work around based on http://www.mathworks.co.uk/help/techdoc/ref/xmlread.html

I use this example to turn the xml file into a struct from which I can extract the data I want. This is not ideal, and xpath looks much smoother, if I could get it to work.

@Charlie,

Sorry my example did not fully cover your need. You’ll probably need something like:

expression = xpath.compile('//kml:Placemark/kml:LineString/kml:coordinates');

But I’m not too sure. If you’re still having trouble, please contact Technical Support. They will be able to give you more assistance than I am able to do in the comments, here.

Thanks Michael. Maybe this function was a bit of a leap for me at the moment.

Could you suggest somewhere else that I could find documentation and examples of how to use the xpath functions? The http://xerces.apache.org/xerces-j/apiDocs/index.html pages are pretty much unintelligible to me. For example, how could I find out what arguments expression.evaluate() requires? There is no help in the matlab documentation I can see.

@Charlie,

Unfortunately there’s not, mostly because this is a third-party Java package that we make available. Even the documentation at http://www.oxygenxml.com/apidoc/saxon-8.7.1/index.html is not very helpful. There are other sources of xpath tutorials on the net, like http://w3schools.com/xpath/xpath_syntax.asp, but they are not specific to the Java implementation.

You might also want to try the File Exchange and MATLAB Answers to see if there is more specific advice available. This also sounds like a good idea for a follow-up post. Let me know how it goes.

Hi Mike,
Thanks for the article. Really helped as I could easily create a new xml file. However, navigating through an existing xml file, I was wondering if we could edit the text content of a node, or the node name itself for that matter.

@Nikhil,
You can edit the text content with the setTextContent method. To make minor edits, use getTextContent to copy the String, modify it, and then write it back. As for changing the node name, the DOM API does not specify a way to do this. You’ll have to create a new element with the desired name, move the first node’s children to the new element, and then replace the first node with the new one.
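A minimal sketch of both steps, assuming a small tree built in memory; the new element name "Telephone" and the edited number are made up for illustration. Note that this sketch only moves child nodes, so any attributes on the old element would need to be copied separately.

```matlab
% Build a tiny tree: AddressBook/Entry/PhoneNumber.
docNode = com.mathworks.xml.XMLUtils.createDocument('AddressBook');
root = docNode.getDocumentElement;
entry = docNode.createElement('Entry');
root.appendChild(entry);
phoneNode = docNode.createElement('PhoneNumber');
phoneNode.appendChild(docNode.createTextNode('(508) 647-7000'));
entry.appendChild(phoneNode);

% Edit the text: read it, modify it in MATLAB, and write it back.
oldText = char(phoneNode.getTextContent);
newText = strrep(oldText, '7000', '7001');
phoneNode.setTextContent(newText);

% "Rename" the element: create a replacement with the new name,
% move the children across, then swap it in for the old node.
% appendChild moves a node that already belongs to the tree.
renamedNode = docNode.createElement('Telephone');
while phoneNode.hasChildNodes
    renamedNode.appendChild(phoneNode.getFirstChild);
end
entry.replaceChild(renamedNode, phoneNode);
```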

Hey thanks a lot Mike! It worked out just well. I was initially editing with the tree structure, but XPath was so much smoother. Cheers.

Hello, I have this:

<TestStep comp="GELE" datatype="Number" group="[010] Seq_Short Test" limhi="0.002" limlo="-0.002" measid="010_272" measname="PinCheck ST_O_OUT_1" start="88079.574242300005" status="Passed" stepid="ID#:MOsSkDksp0KkG29Tfvr4MC" stepname="[010_272] PinCheck ST_O_OUT_1" steptype="NumericLimitTest" time="0.0034241" unit="ampere" value="-0.0000037545442"/>

How I can get data from this with xpath???

Thanks for the series Michael.
I have trouble duplicating your code in this piece.
I start with the file saved in the working folder, then:

clear; clc
xmlFileName = 'phoneBook.xml';
docNode = xmlread(xmlFileName);
addressBookNode = docNode.getDocumentElement;
entries = addressBookNode.getChildNodes;
friendlyInfo = entries.item(0).getChildNodes;
node = friendlyInfo.getFirstChild;
while ~isempty(node)
    if strcmpi(node.getNodeName, 'PhoneNumber')
        break;
    else
        node = node.getNextSibling;
    end
end
phoneNumber = node.getTextContent

except for the first 3 lines, this is identical to your sample.
matlab reports this error on line 7:
??? No appropriate method, property, or field getFirstChild for class
org.apache.xerces.dom.CharacterDataImpl$1.

if I run through line 5 and evaluate the first part of line 6, I get:
>> friendlyInfo = entries.item(0)
friendlyInfo =
[#text:
]

but evaluating all of line 6 produces:
org.apache.xerces.dom.CharacterDataImpl$1@6231ed

What is going on?
Thanks,
John

These postings are the author's and don't necessarily represent the opinions of MathWorks.