Querying an XML document using XMLStarlet

September 26, 2007

I use Maven to manage my Java build process. Maven, like Ant, uses XML to store information about your project and how to build it.

Yesterday, a thought occurred to me – What if I wanted to write a script that would do post-packaging of the Maven-built artifact(s)?

For example, what if I wanted to take the Maven-built artifact and distribute it via an OS X disk image? Or even something as simple as packaging the source into a tar.gz?

A simple shell script would be easy enough to cook up – but without repeating myself, how could I do it such that the subsequent packages are named according to the same version declared in the Maven pom.xml file?

This got me thinking about how to query XML, possibly using XPath, from a shell script or the command line. I thought I might have to roll my own XML-XPath command line processor, but fortunately somebody else beat me to it (thank God for all the wonderful people on teh internets) with XMLStarlet.

Maven pom.xml excerpt

To illustrate, here’s an excerpt from a hypothetical Maven pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi=…>
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.foo</groupId>
  <artifactId>iwidget</artifactId>
  <packaging>jar</packaging>
  <version>0.9.1b</version>
  <name>iwidget</name>
  <url>http://maven.apache.org</url>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

…

I snipped off the rest of the root element’s xmlns* attributes and the rest of the file for brevity.

Let’s say I want to extract the value of the version element, and use that to, say, distribute the entire project source as iwidget-0.9.1b-src.tar.gz.

At first it looks like clever stream text processing or pattern matching could do it (the easiest I can come up with is cat pom.xml | grep "version" | sed -e "s,\( *<version>\)\(.*\)\(</version>\),\2,"). On closer inspection, though, two lines match the search pattern, and we’re only interested in the first one (the artifact version, not the version of the dependency).
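Here’s a quick sketch of that double match. It rebuilds the relevant part of the excerpt in a scratch file (the temp directory and the closing </project> tag are my additions for the demo, not part of the real POM) and runs the same grep and sed over it:

```shell
# Recreate the relevant part of the excerpt in a scratch file.
tmp=$(mktemp -d)
cat > "$tmp/pom.xml" <<'EOF'
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.foo</groupId>
  <artifactId>iwidget</artifactId>
  <version>0.9.1b</version>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
    </dependency>
  </dependencies>
</project>
EOF

# The pipeline from the text, with grep reading the file directly.
matches=$(grep "version" "$tmp/pom.xml" \
  | sed -e "s,\( *<version>\)\(.*\)\(</version>\),\2,")

# Prints two lines: 0.9.1b (the artifact) and 3.8.1 (the junit dependency).
printf '%s\n' "$matches"
```

Tacking on head -n 1 would happen to work for this file, but only because the artifact version comes before the dependencies. Nothing in the text-matching approach knows which element it is actually looking at.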

I could probably wrestle some more with sed, awk or grep and come up with a working solution – or I could just bite the bullet and process XML as XML.

XMLStarlet

The XMLStarlet home page says it all:

XMLStarlet is a set of command line utilities (tools) which can be used to transform, query, validate, and edit XML documents and files using simple set of shell commands in similar way it is done for plain text files using UNIX grep, sed, awk, diff, patch, join, etc commands.

This set of command line utilities can be used by those who deal with many XML documents on UNIX shell command prompt as well as for automated XML processing with shell scripts.

Because I have MacPorts (I just can’t say enough about how incredibly useful MacPorts is!), installing XMLStarlet on my Mac couldn’t have been any easier:

sudo port install xmlstarlet

Now all that’s left is to figure out the tool. man xmlstarlet tells us that the command to select or query an XML document is xmlstarlet sel.

The XMLStarlet one-liner

To solve an XML problem, throw more XML at it.

Which is what XMLStarlet does behind the scenes, actually. When we use the xmlstarlet sel … command to select/query our XML document, XMLStarlet actually creates an XSLT for us (which you can see by using the -C option).
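If you’re curious, here’s a small sketch of how to peek at that generated stylesheet. It assumes xmlstarlet is on your PATH under that name (the xslt variable is just mine, for illustration):

```shell
# -C prints the XSLT that "sel" would run instead of running it,
# so no input document is needed for this step.
xslt=$(xmlstarlet sel -C -t -v "/project/version")

# Show the generated stylesheet.
printf '%s\n' "$xslt"
```

The exact stylesheet text may differ between XMLStarlet versions, so I’m not reproducing it here.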

Fortunately xmlstarlet sel lets us use the more convenient XPath syntax rather than XSLT. According to the docs, the path to the node we’re interested in is /project/version.

Of course, that didn’t work.

After mucking around for a good half hour trying to figure out whether my XPath was correct, or whether my command line syntax was correct, I created a simple XML document just to isolate things. I finally discovered that the default XML namespace used by the Maven POM was throwing my query off: an unprefixed name in an XPath expression only matches elements that are in no namespace, so /project/version never matches the namespaced project element. This is apparently a well-known problem but it’s listed toward the end of the XMLStarlet User’s Guide where it’s very easy to miss.

The final solution

To extract the artifact version from Maven’s pom.xml using XMLStarlet:

xmlstarlet sel -N x=http://maven.apache.org/POM/4.0.0 \
-t -v "/x:project/x:version" pom.xml
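To see the namespace effect in isolation, here’s a sketch along the lines of the simple test document I mentioned. The two scratch files and the variable names are made up for the demo; the only difference between the files is the default namespace on the root element:

```shell
tmp=$(mktemp -d)

# Same element, once without and once with a default namespace.
cat > "$tmp/plain.xml" <<'EOF'
<project>
  <version>0.9.1b</version>
</project>
EOF

cat > "$tmp/namespaced.xml" <<'EOF'
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <version>0.9.1b</version>
</project>
EOF

# No namespace in the document: the plain path matches.
plain=$(xmlstarlet sel -t -v "/project/version" "$tmp/plain.xml")

# Default namespace in the document: the same path matches nothing.
unbound=$(xmlstarlet sel -t -v "/project/version" "$tmp/namespaced.xml")

# Bind a prefix to the namespace URI with -N and use it in the path.
bound=$(xmlstarlet sel -N x=http://maven.apache.org/POM/4.0.0 \
  -t -v "/x:project/x:version" "$tmp/namespaced.xml")

printf 'plain=[%s] unbound=[%s] bound=[%s]\n' "$plain" "$unbound" "$bound"
```

The prefix x is arbitrary. What has to match is the namespace URI, not whatever prefix (or lack of one) the document itself uses.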

It’s easy enough to see how the above can be modified to extract whatever other values we might need from any XML file. Armed with all of the above, it’s also academic to write more sophisticated scripts that use the information already contained in Maven’s pom.xml – keeping everything nice and DRY.
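As a rough sketch of where this could go, here’s the source tarball idea from the start of the post. The helper names pom_value and src_tarball_name are hypothetical, and I’m assuming the script runs from the project root with the sources under src, which is the usual Maven layout but not something the POM excerpt tells us:

```shell
# Maven POM namespace, as declared on the <project> element.
POM_NS="http://maven.apache.org/POM/4.0.0"

# pom_value FILE ELEMENT: print the text of a direct child of <project>.
pom_value() {
  xmlstarlet sel -N x="$POM_NS" -t -v "/x:project/x:$2" "$1"
}

# src_tarball_name ARTIFACT VERSION: build the archive file name.
src_tarball_name() {
  printf '%s-%s-src.tar.gz' "$1" "$2"
}

# Only package when the tool and the POM are actually there.
if command -v xmlstarlet >/dev/null 2>&1 && [ -f pom.xml ]; then
  artifact=$(pom_value pom.xml artifactId)
  version=$(pom_value pom.xml version)
  # For the excerpt above this would be iwidget-0.9.1b-src.tar.gz.
  tar -czf "$(src_tarball_name "$artifact" "$version")" pom.xml src
fi
```

The version number lives in exactly one place, pom.xml, and the script just reads it from there.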


8 Responses to “Querying an XML document using XMLStarlet”

  1. aquabot Says:

    Thanks for this indepth analysis; all I know if it is that this set of command line utilities can be used by those who deal with many XML documents on UNIX shell command prompt as well as for automated XML processing with shell scripts. I bookmark it..It really helps novices as me to understand the concept better….

  2. Ankur Says:

    Just wasted a lot of time myself and basically did the same steps as you did, but didn’t know how to get xmlstarlet to actually recognize the pom.xml format (It said it was valid!!)

    Thanks for this post!

  3. Bart Says:

    Thanks for this enlightening post, I was struggling to get xmlstarlet to process my poms!

  4. prad Says:

    what if i have multiple namespaces and want to query the id in this xml file:

    Jan 14, 2010 2:33:46 PM
    Jan 14, 2010 2:33:46 PM

    100

  5. Ben Tyger Says:

    Thanks for the post. I beat my head on my desk for almost two days on this damn issue.

  6. Oliver Says:

    Great minds think alike 🙂 Thanks for solving the last piece of my maven xmlstarlet puzzle!

  7. robles Says:

    Thanks a ton! the “None of the XPaths matched” error was turning me mad. I also checked the section 5.1 of the manual, but wasn’t building my query correctly apparently. Yours helped me intensely. Thanks!

