Querying an XML document using XMLStarlet
September 26, 2007
I use Maven to manage my Java build process. Maven, like Ant, uses XML to store information about your project and how to build it.
Yesterday, a thought occurred to me: what if I wanted to write a script that would do post-packaging of the Maven-built artifact(s)?
For example, what if I wanted to take the Maven-built artifact and distribute it via an OS X disk image? Or even something as simple as packaging the source into a
A simple shell script would be easy enough to cook up – but without repeating myself, how could I do it such that the subsequent packages are named according to the same version declared in the Maven
This got me thinking about how to query XML, possibly using XPath, from a shell script or the command line. I thought I might have to roll my own XML-XPath command line processor, but fortunately somebody else beat me to it (thank God for all the wonderful people on teh internets) with XMLStarlet.
To illustrate, here’s an excerpt from a hypothetical Maven
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi=…>
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.foo</groupId>
  <artifactId>iwidget</artifactId>
  <packaging>jar</packaging>
  <version>0.9.1b</version>
  <name>iwidget</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  …
I snipped off the rest of the root element’s xmlns* attributes and the rest of the file for brevity.
Let’s say I want to extract the value of the version element, and use that to, say, distribute the entire project source as
At first glance, some clever stream text processing or pattern matching could do it. The easiest command I can come up with is:
grep "version" pom.xml | sed -e "s,\( *<version>\)\(.*\)\(</version>\),\2,"
On closer inspection, though, two lines match the search pattern, and we’re only interested in the first one (the artifact version, not the version of the dependency).
I could probably wrestle some more with grep and come up with a working solution, or I could just bite the bullet and process XML as XML.
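To make the problem concrete, here is a minimal sketch that runs the same grep and sed pipeline against a trimmed-down copy of the excerpt above. The temporary file and the `matches` variable are my own additions for illustration.

```shell
#!/bin/sh
# Hypothetical sample file: a trimmed-down copy of the POM excerpt.
pom=$(mktemp)
cat > "$pom" <<'EOF'
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.foo</groupId>
  <artifactId>iwidget</artifactId>
  <version>0.9.1b</version>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
    </dependency>
  </dependencies>
</project>
EOF

# Keep only the lines containing "version", then strip the surrounding tags.
matches=$(grep "version" "$pom" | sed -e 's,\( *<version>\)\(.*\)\(</version>\),\2,')

# Two values come back: the artifact version and the junit dependency version.
printf '%s\n' "$matches"

rm -f "$pom"
```

The pipeline has no notion of where in the document each version element lives, so it prints both 0.9.1b and 3.8.1.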
The XMLStarlet home page says it all:
XMLStarlet is a set of command line utilities (tools) which can be used to transform, query, validate, and edit XML documents and files using simple set of shell commands in similar way it is done for plain text files using UNIX grep, sed, awk, diff, patch, join, etc commands.
This set of command line utilities can be used by those who deal with many XML documents on UNIX shell command prompt as well as for automated XML processing with shell scripts.
Because I have MacPorts (I just can’t say enough about how incredibly useful MacPorts is!), installing XMLStarlet on my Mac couldn’t have been any easier:
sudo port install xmlstarlet
Now all that’s left is to figure out the tool.
man xmlstarlet tells us that the command to select or query an XML document is
The XMLStarlet one-liner
To solve an XML problem, throw more XML at it.
Which is what XMLStarlet does behind the scenes, actually. When we use the xmlstarlet sel … command to select/query our XML document, XMLStarlet actually creates an XSLT for us (which you can see by using the
xmlstarlet sel lets us use the more convenient XPath syntax rather than XSLT. According to the docs, the path to the node we’re interested in is
Of course, that didn’t work.
After mucking around for a good half hour trying to figure if my XPath was correct, or if my command line syntax was correct, I created a simple XML document just to isolate things. I finally discovered that because the Maven POM uses a default XML namespace this was throwing my query off. This is apparently a well-known problem but it’s listed toward the end of the XMLStarlet User’s Guide where it’s very easy to miss.
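The isolation experiment can be reproduced with something like the sketch below. Both documents are made-up test files rather than real POMs, and the script skips the queries if xmlstarlet is not on the PATH. The same unprefixed XPath is run against each file; only the one without a default namespace should return a value.

```shell
#!/bin/sh
# Sketch only: two made-up test documents that differ in a single attribute.
plain_file=$(mktemp)
namespaced_file=$(mktemp)

cat > "$plain_file" <<'EOF'
<project>
  <version>0.9.1b</version>
</project>
EOF

cat > "$namespaced_file" <<'EOF'
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <version>0.9.1b</version>
</project>
EOF

if command -v xmlstarlet >/dev/null 2>&1; then
  # Identical query against both files; unprefixed names mean "no namespace".
  without_ns=$(xmlstarlet sel -t -v "/project/version" "$plain_file")
  with_ns=$(xmlstarlet sel -t -v "/project/version" "$namespaced_file")
  echo "no namespace:      [$without_ns]"
  echo "default namespace: [$with_ns]"
else
  echo "xmlstarlet not found; skipping" >&2
fi

rm -f "$plain_file" "$namespaced_file"
```

In XPath 1.0 an unprefixed element name only matches elements that are in no namespace, which is why the second query comes back empty even though the two documents look almost identical.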
The final solution
To extract the artifact version from Maven’s pom.xml using XMLStarlet:
xmlstarlet sel -N x=http://maven.apache.org/POM/4.0.0 \
  -t -v "/x:project/x:version" pom.xml
It’s easy enough to see how the above can be modified to extract whatever other values we might need from any XML file. Armed with all of the above, it’s also straightforward to write more sophisticated scripts that use the information already contained in Maven’s pom.xml, keeping everything nice and DRY.
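As a rough sketch of what such a script might look like, the example below names a source archive after the artifactId and version read from the POM. The `pom_value` helper, the `-src.tar.gz` naming scheme, the sample project layout and the temporary working directory are all assumptions of mine for illustration, not anything Maven or XMLStarlet prescribes. The script skips itself if xmlstarlet is not installed.

```shell
#!/bin/sh
# Sketch: build a source archive whose name comes from pom.xml.
NS=http://maven.apache.org/POM/4.0.0

# Hypothetical helper: print the value selected by an XPath ($1) in a POM ($2),
# with the x: prefix bound to the Maven POM namespace.
pom_value() {
  xmlstarlet sel -N x="$NS" -t -v "$1" "$2"
}

if command -v xmlstarlet >/dev/null 2>&1; then
  # Made-up project layout in a scratch directory.
  workdir=$(mktemp -d)
  mkdir -p "$workdir/src"
  echo "placeholder" > "$workdir/src/README.txt"
  cat > "$workdir/pom.xml" <<'EOF'
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.foo</groupId>
  <artifactId>iwidget</artifactId>
  <version>0.9.1b</version>
</project>
EOF

  artifact=$(pom_value "/x:project/x:artifactId" "$workdir/pom.xml")
  version=$(pom_value "/x:project/x:version" "$workdir/pom.xml")

  # The version is declared once, in the POM, and reused here.
  archive="$artifact-$version-src.tar.gz"
  tar -czf "$workdir/$archive" -C "$workdir" pom.xml src
  echo "created $workdir/$archive"
else
  echo "xmlstarlet not found; skipping" >&2
fi
```

In a real project the script would point at the existing pom.xml instead of generating one, and the scratch directory would be cleaned up once the archive has been copied wherever it needs to go.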