Libxml2 XmlTextReader Interface tutorial

This document describes the use of the XmlTextReader streaming API added to libxml2 in version 2.5.0. This API is closely modeled after the XmlTextReader and XmlReader classes of the C# language.

This tutorial presents the key points of this API, with working examples using both C and the Python bindings.

Introduction: why a new API

The main libxml2 API is tree based: parsing loads the complete document in memory and exposes it as a tree of nodes that are all available at the same time. This is very simple and quite powerful, but it has a major limitation: the size of the documents that can be handled is bounded by the available memory. Libxml2 also provides a SAX based API, but it was modeled on one of the early expat versions of SAX, and SAX is not formally defined for C. SAX works by registering callbacks which the parser calls directly as it progresses through the document stream. The problem is that this programming model is relatively complex and not well standardized, cannot provide validation directly, and makes entity, namespace and base URI processing relatively hard.

The XmlTextReader API from C# provides a far simpler programming model. The API acts as a cursor moving forward over the document stream and stopping at each node on the way. The user code keeps control of the progress and simply calls a Read() function repeatedly to advance to each node in document order. There is direct support for namespaces, xml:base and entity handling, and adding DTD validation on top of it was relatively simple. This API is really close to the DOM Core specification. This provides a far more standard, easy to use and powerful API than the existing SAX. Moreover, integrating extension features based on the tree seems relatively easy.

In a nutshell, the XmlTextReader API provides a simpler, more standard and more extensible interface for handling large documents than the existing SAX version.

Walking a simple tree

Basically the XmlTextReader API is a forward only tree walking interface. The basic steps are:

  1. prepare a reader context operating on some input
  2. run a loop iterating over all nodes in the document
  3. free up the reader context

Here is a basic C sample doing this:

#include <stdio.h>
#include <libxml/xmlreader.h>

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

int streamFile(char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
        }
    } else {
        printf("Unable to open %s\n", filename);
        ret = -1;
    }
    return ret;  /* 0 on success, non-zero on failure */
}

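A few things to notice: the reader API is declared in libxml/xmlreader.h; the reader is created directly from a filename; xmlTextReaderRead() is called repeatedly and any return value other than 1 must stop the loop; a negative return value means a parsing error; and xmlFreeTextReader() must be called to free the resources used by the reader.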

Here is similar code in Python doing exactly the same processing:

import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print("%s : failed to parse" % (filename))

The only things worth adding are that the xmlTextReader is abstracted as a class, as in C#, with the same method names (though the properties are currently accessed through methods), and that there is no need to free the reader at the end of the processing: it gets garbage collected once all references to it have disappeared.
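
The return-value convention shared by both loops (1 while a node is available, 0 at the end of the document, anything else on a parse error) can be wrapped in a small generator. The following is only a sketch in Python 3, not part of libxml2: iter_nodes and ReaderError are hypothetical names, and FakeReader is a stand-in that replays canned Read() results so the snippet runs without the bindings installed. With the real bindings, the assumption is that the object returned by libxml2.newTextReaderFilename() would be passed instead.

```python
class ReaderError(Exception):
    """Raised when Read() reports a parse error (hypothetical helper)."""


def iter_nodes(reader):
    """Yield the reader once per node for as long as Read() returns 1."""
    ret = reader.Read()
    while ret == 1:
        yield reader
        ret = reader.Read()
    if ret != 0:
        # 0 is a clean end of document; any other value is a failure.
        raise ReaderError("failed to parse (Read() returned %d)" % ret)


class FakeReader:
    """Stand-in for a libxml2 reader: replays a fixed list of Read() results."""

    def __init__(self, results):
        self._results = list(results)

    def Read(self):
        return self._results.pop(0)


# Two nodes followed by a clean end of document.
print(sum(1 for _ in iter_nodes(FakeReader([1, 1, 0]))))
```

Written this way, the calling code becomes a plain for loop, and a parse error can no longer be confused with the normal end of the document.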

Extracting information for the current node

So far the example code did not show how information is extracted from the reader; it was abstracted as a call to the processNode() routine, with the reader as the argument. At each invocation, the parser is stopped on a given node and the reader can be used to query that node's properties. Each Property is available at the C level as a function named xmlTextReaderProperty taking a single xmlTextReaderPtr argument; if the return type is an xmlChar * string, it must be deallocated with xmlFree() to avoid leaks. In the Python interface, there is a Property method on the reader class that can be called on the instance. The list of properties is based on the C# XmlTextReader class set of properties and methods; the ones used in this tutorial are Depth, NodeType, Name, IsEmptyElement, Value and HasAttributes.

Let's look first at a small example to put this in practice, by redefining the processNode() function in the Python example:

def processNode(reader):
    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement()))

and look at the result of calling streamFile("tst.xml") for various content of the XML test file.

For the minimal document "<doc/>" we get:

0 1 doc 1

Only one node is found: its depth is 0, type 1 indicates an element start, its name is "doc" and it is empty. Trying now with "<doc></doc>" instead leads to:

0 1 doc 0
0 15 doc 0

The document root node is not flagged as empty anymore and both a start and an end of element are detected. The following document shows how character data are reported:

<doc><a/><b>some text</b>
<c/></doc>

We modify the processNode() function to also report the node Value:

def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))

The result of the test is:

0 1 doc 0 None
1 1 a 1 None
1 1 b 0 None
2 3 #text 0 some text
1 15 b 0 None
1 3 #text 0

1 1 c 1 None
0 15 doc 0 None

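There are a few things to note: the depth value (first column) increases as child nodes are explored; the text node child of the b element has type 3 and carries the content; the line break between the b and c elements is reported as a text node too; and elements have the Value None (NULL in C).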
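
The numeric node types in these traces are values of the xmlReaderTypes enum declared in libxml/xmlreader.h. As a reading aid, here is a small Python 3 sketch that turns a trace line into a more readable one. The table follows that enum, with names borrowed from the C# XmlNodeType enumeration; NODE_TYPE_NAMES and describe are hypothetical helpers and not part of the bindings.

```python
# Values of the xmlReaderTypes enum, keyed by the number NodeType() returns.
NODE_TYPE_NAMES = {
    0: "None",
    1: "Element",
    2: "Attribute",
    3: "Text",
    4: "CDATA",
    5: "EntityReference",
    6: "Entity",
    7: "ProcessingInstruction",
    8: "Comment",
    9: "Document",
    10: "DocumentType",
    11: "DocumentFragment",
    12: "Notation",
    13: "Whitespace",
    14: "SignificantWhitespace",
    15: "EndElement",
    16: "EndEntity",
    17: "XmlDeclaration",
}


def describe(trace_line):
    """Replace the numeric type in a 'depth type name ...' line by its name."""
    depth, node_type, rest = trace_line.split(" ", 2)
    name = NODE_TYPE_NAMES.get(int(node_type), "Unknown")
    return "%s %s %s" % (depth, name, rest)


for line in ["0 1 doc 0 None", "2 3 #text 0 some text", "0 15 doc 0 None"]:
    print(describe(line))
# prints:
# 0 Element doc 0 None
# 2 Text #text 0 some text
# 0 EndElement doc 0 None
```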

The equivalent routine for processNode() as used by xmllint --stream --debug is the following and can be found in the xmllint.c module in the source distribution:

static void processNode(xmlTextReaderPtr reader) {
    xmlChar *name, *value;

    name = xmlTextReaderName(reader);
    if (name == NULL)
        name = xmlStrdup(BAD_CAST "--");
    value = xmlTextReaderValue(reader);

    printf("%d %d %s %d",
            xmlTextReaderDepth(reader),
            xmlTextReaderNodeType(reader),
            name,
            xmlTextReaderIsEmptyElement(reader));
    xmlFree(name);
    if (value == NULL)
        printf("\n");
    else {
        printf(" %s\n", value);
        xmlFree(value);
    }
}

Extracting information for the attributes

The previous examples don't show how attributes are processed. The simple test "<doc a="b"/>" gives the following result:

0 1 doc 1 None

This proves that attribute nodes are not traversed by default. The HasAttributes property allows detecting their presence. To check their content, the API provides two kinds of operations:

  1. move the reader to the attribute nodes of the current element, in which case the cursor is positioned on the attribute node
  2. directly query the element node for the attribute value

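In both cases the attribute can be designated either by its position in the list of attributes (MoveToAttributeNo or GetAttributeNo) or by its name (MoveToAttribute or GetAttribute), optionally with its namespace (MoveToAttributeNs or GetAttributeNs). MoveToFirstAttribute and MoveToNextAttribute iterate over the attributes, and MoveToElement moves the cursor back to the element carrying them.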

After modifying the processNode() function to show attributes:

def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
    if reader.NodeType() == 1: # Element
        while reader.MoveToNextAttribute():
            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                          reader.Name(), reader.Value()))

the output for the same input document reflects the attribute:

0 1 doc 1 None
-- 1 2 (a) [b]

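There are a couple of things to note about attribute processing: the depth of an attribute is that of the element carrying it plus one, and namespace declarations are seen as attributes, as in DOM.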
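
When all attributes of an element are wanted at once, the cursor-moving calls can be hidden behind a helper that collects them and then puts the cursor back on the element. The Python 3 sketch below assumes a reader object exposing MoveToNextAttribute(), MoveToElement(), Name() and Value() as in the bindings; attributes_of is a hypothetical helper, and FakeElementReader is only a stand-in simulating a reader stopped on an element, so that the snippet runs without libxml2.

```python
def attributes_of(reader):
    """Return the attributes of the current element as a dict."""
    attrs = {}
    # MoveToNextAttribute() returns 1 each time it lands on an attribute.
    while reader.MoveToNextAttribute() == 1:
        attrs[reader.Name()] = reader.Value()
    # Put the cursor back on the element carrying the attributes.
    reader.MoveToElement()
    return attrs


class FakeElementReader:
    """Stand-in for a reader stopped on an element (not a libxml2 class)."""

    def __init__(self, attrs):
        self._attrs = list(attrs)
        self._pos = -1  # -1 means the cursor is on the element itself

    def MoveToNextAttribute(self):
        if self._pos + 1 < len(self._attrs):
            self._pos += 1
            return 1
        return 0

    def MoveToElement(self):
        moved = 1 if self._pos >= 0 else 0
        self._pos = -1
        return moved

    def Name(self):
        return "doc" if self._pos < 0 else self._attrs[self._pos][0]

    def Value(self):
        return None if self._pos < 0 else self._attrs[self._pos][1]


reader = FakeElementReader([("a", "b"), ("c", "d")])
print(attributes_of(reader), reader.Name())
# prints: {'a': 'b', 'c': 'd'} doc
```

Restoring the cursor with MoveToElement() keeps the helper free of side effects: code running after it still sees the element, not its last attribute.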

Validating a document

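The libxml2 implementation adds some extra features on top of the XmlTextReader API; the main one is the ability to DTD validate the parsed document progressively. This is simply the activation of the associated feature of the parser used by the reader structure. A few options are available, defined as the enum xmlParserProperties in the libxml/xmlreader.h header file: XML_PARSER_LOADDTD forces loading the DTD without validating, XML_PARSER_DEFAULTATTRS forces attribute defaulting, XML_PARSER_VALIDATE activates DTD validation, and XML_PARSER_SUBST_ENTITIES substitutes entities on the fly.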

The GetParserProp() and SetParserProp() methods can then be used to get and set the values of those parser properties of the reader. For example:

def parseAndValidate(file):
    reader = libxml2.newTextReaderFilename(file)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing and validating %s" % (file))

This routine will parse and validate the file. Error messages can be captured by registering an error handler. See python/tests/reader2.py for more complete Python examples. At the C level the equivalent call to activate the validation feature is just:

ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);

and a return value of 0 indicates success.

Entities substitution

By default the xmlReader reports entities as such and does not replace them with their content. This default behaviour can however be overridden using:

reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES, 1)

Relax-NG Validation

Introduced in version 2.5.7

Libxml2 can now validate the document being read with the xmlReader against Relax-NG schemas. The Relax-NG validator can't always work in a streaming mode, but only the subsets of the schema which cannot be reduced to regular expressions need to have their subtree expanded for validation. In practice it means that, unless the content of the top level element cannot be expressed as a regexp, only chunks of the document need to be expanded in memory while validating.

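The steps to do so are:

  1. create a reader working on the document as usual
  2. before any call to Read(), associate it with the Relax-NG schema, for example a preparsed one through RelaxNGSetSchema()
  3. check for errors the usual way and obtain the validity status with the IsValid() method of the reader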

Example, assuming the reader has already been created and that the schema string contains the Relax-NG schema:

rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
rngs = rngp.relaxNGParse()
reader.RelaxNGSetSchema(rngs)
ret = reader.Read()
while ret == 1:
    ret = reader.Read()
if ret != 0:
    print("Error parsing the document")
if reader.IsValid() != 1:
    print("Document failed to validate")

See reader6.py in the sources or documentation for a complete example.

Mixing the reader and tree or XPath operations

Introduced in version 2.5.7

While the reader is a streaming interface, its underlying implementation is based on the DOM builder of libxml2. As a result it is relatively simple to mix operations based on both models, under some constraints. To do so the reader has an Expand() operation which grows the subtree under the current node. It returns a pointer to a standard node which can be manipulated in the usual ways. The node has all its ancestors and its full subtree available. Usual operations like XPath queries can be used on that reduced view of the document. Here is an example extracted from reader5.py in the sources, which extracts the bibliography entry for the "Dragon" compiler book from the XML 1.0 recommendation:

f = open('../../test/valid/REC-xml-19980210.xml')
input = libxml2.inputBuffer(f)
reader = input.newTextReader("REC")
res=""
while reader.Read() == 1:                 # stop at the end or on error
    while reader.Name() == 'bibl':
        node = reader.Expand()            # expand the subtree
        if node.xpathEval("@id = 'Aho'"): # use XPath on it
            res = res + node.serialize()
        if reader.Next() != 1:            # skip the subtree
            break

Note however that the node instance returned by the Expand() call is only valid until the next Read() operation. The Expand() operation does not affect the Read() ones. However, once processed, the full subtree is usually not useful anymore, and the Next() operation allows skipping it completely and proceeding to its successor; it returns 0 if the end of the document is reached.
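
The same "stream, then expand only the interesting subtree" pattern can be tried without libxml2. The sketch below deliberately swaps in Python's standard xml.dom.pulldom module, whose expandNode() plays a role comparable to Expand(); it is an analogy written for Python 3, not the libxml2 API. SAMPLE and find_bibl are made up for the illustration.

```python
from xml.dom import pulldom

# Made-up sample data standing in for a bibliography section.
SAMPLE = """<refs>
  <bibl id="Aho">Aho, Sethi, Ullman. Compilers.</bibl>
  <bibl id="Clark">James Clark. Comparison of SGML and XML.</bibl>
</refs>"""


def find_bibl(xml_text, wanted_id):
    """Stream over the document and serialize the matching bibl elements."""
    matches = []
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "bibl":
            # Build the subtree of this element only; the stream then
            # resumes after the end of the element.
            events.expandNode(node)
            if node.getAttribute("id") == wanted_id:
                matches.append(node.toxml())
    return matches


print(find_bibl(SAMPLE, "Aho"))
```

As with the reader, only the expanded subtree is held as a tree; the rest of the document is consumed as a stream of events.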

Daniel Veillard
