The XML C library for Gnome

Validation & DTDs

Table of Contents:

  1. General overview
  2. The definition
  3. Simple rules
    1. How to reference a DTD from a document
    2. Declaring elements
    3. Declaring attributes
  4. Some examples
  5. How to validate
  6. Other resources

General overview

So what is validation, and what is a DTD?

DTD is the acronym for Document Type Definition. It is a description of the content of a family of XML files. DTDs are part of the XML 1.0 specification, and they make it possible to describe, and to check, that a given document instance conforms to a set of rules detailing its structure and content.

Validation is the process of checking a document against a DTD (more generally against a set of construction rules).

The validation process and building DTDs are the two most difficult parts of the XML life cycle. Briefly, a DTD defines all the possible elements to be found within your document and the formal shape of your document tree, by defining the allowed content of each element: either text, a regular expression for the allowed list of children, or mixed content (i.e. both text and children). The DTD also defines the allowed attributes for all elements and the types of those attributes.

The definition

The W3C XML Recommendation (Tim Bray's annotated version of Rev1):

Unfortunately, all of this is inherited from the SGML world, and the syntax is ancient.

Simple rules

A DTD can be written in several ways, and the rules for building one can be radically different depending on whether you need something fixed or something that can evolve over time. Really complex DTDs such as the DocBook ones are flexible but considerably harder to design. I will focus only on DTDs for formats with a fixed, simple structure. What follows is just a set of basic rules; it is definitely not exhaustive, nor usable for complex DTD design.

How to reference a DTD from a document:

Assuming the top element of the document is spec and the DTD is placed in the file mydtd in the subdirectory dtds of the directory from which the document was loaded:

<!DOCTYPE spec SYSTEM "dtds/mydtd">

Notes:

  • The system string is actually a URI reference (as defined in RFC 2396), so you can use a full URL string indicating the location of your DTD on the Web. This is a really good thing to do if you want others to be able to validate your document.
  • It is also possible to associate a PUBLIC identifier (a magic string), so that the DTD is looked up in catalogs on the client side without having to locate it on the Web.
  • A DTD contains a set of element and attribute declarations, but these do not define what the root of the document should be. The root is explicitly given to the parser/validator as the first name in the DOCTYPE declaration.
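
As a rough illustration of the notes above, the following Python sketch reads the root element name and the system identifier out of a DOCTYPE declaration, then resolves the identifier relative to the document's own location. It uses the expat binding from the Python standard library rather than libxml, and the base URL http://example.org/docs/spec.xml is a made-up location chosen for the example.

```python
import xml.parsers.expat
from urllib.parse import urljoin


def read_doctype(xml_text):
    """Return the root name and identifiers announced by the DOCTYPE."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # 'name' is the first token of the DOCTYPE: the expected root element
        found.update(root=name, system=system_id, public=public_id)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return found


DOCUMENT = '<!DOCTYPE spec SYSTEM "dtds/mydtd">\n<spec/>'
BASE_URL = "http://example.org/docs/spec.xml"  # hypothetical document location

info = read_doctype(DOCUMENT)
# A relative system identifier is resolved against the document location
dtd_url = urljoin(BASE_URL, info["system"])
print(info["root"], dtd_url)
```

Note that expat is a non-validating parser: in this sketch it only reports the declaration and never fetches or applies the DTD itself.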

Declaring elements:

The following declares an element spec:

<!ELEMENT spec (front, body, back?)>

It also expresses that the spec element contains one front, one body and one optional back child element, in this order. An element of the structure and its content are thus declared together in a single declaration. Similarly, the following declares div1 elements:

<!ELEMENT div1 (head, (p | list | note)*, div2?)>

This means that div1 contains one head, then any number of p, list and note elements in any order, and then an optional div2. Last but not least, an element can contain text:

<!ELEMENT b (#PCDATA)>

Here b contains only text. An element can also have mixed content (text and elements in no particular order):

<!ELEMENT p (#PCDATA|a|ul|b|i|em)*>

p can contain text or a, ul, b, i or em elements in no particular order.
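
Since an element content model is essentially a regular expression over the list of children, the idea can be sketched in a few lines of Python. This is only an illustration, not how libxml implements validation: the helper names content_model_to_regex and children_match are invented for this example, and the sketch handles element content only (no #PCDATA, EMPTY or ANY).

```python
import re

# Matches an XML element name inside a content model (simplified)
NAME = re.compile(r"[A-Za-z_:][\w.:-]*")


def content_model_to_regex(model):
    """Translate a DTD element content model into a compiled regex."""
    compact = re.sub(r"\s+", "", model)
    # Each child name becomes a token terminated by ';' so that names
    # such as 'p' and 'para' can never be confused with one another
    pattern = NAME.sub(lambda m: "(?:%s;)" % re.escape(m.group(0)), compact)
    # ',' means "followed by", which is plain concatenation in a regex;
    # '|', '?', '*' and '+' already have the same meaning in both syntaxes
    return re.compile(pattern.replace(",", ""))


def children_match(model, children):
    """Check an ordered list of child element names against a model."""
    sequence = "".join(name + ";" for name in children)
    return content_model_to_regex(model).fullmatch(sequence) is not None


SPEC = "(front, body, back?)"
DIV1 = "(head, (p | list | note)*, div2?)"

print(children_match(SPEC, ["front", "body"]))
print(children_match(SPEC, ["body", "front"]))
print(children_match(DIV1, ["head", "p", "list", "p", "div2"]))
```

The first and third checks succeed, while the second fails because front must come before body.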

Declaring attributes:

Again, the declaration of an attribute includes the definition of its content:

<!ATTLIST termdef name CDATA #IMPLIED>

This means that the element termdef can have a name attribute containing text (CDATA), and that this attribute is optional (#IMPLIED). The attribute value can also be restricted to a set of values:

<!ATTLIST list type (bullets|ordered|glossary) "ordered">

This means that list elements have a type attribute with three allowed values, "bullets", "ordered" or "glossary", and that the value defaults to "ordered" if the attribute is not explicitly specified.

The content type of an attribute can be text (CDATA), anchor/reference/references (ID/IDREF/IDREFS), entity(ies) (ENTITY/ENTITIES) or name(s) (NMTOKEN/NMTOKENS). The following defines that a chapter element can have an optional id attribute of type ID, which can then be referenced from attributes of type IDREF:

<!ATTLIST chapter id ID #IMPLIED>

The last part of an attribute definition can be #REQUIRED, meaning that the attribute has to be given; #IMPLIED, meaning that it is optional; or a default value (possibly prefixed by #FIXED if it is the only value allowed).

Notes:

  • Usually the attributes pertaining to a given element are declared in a single expression, but this is just a convention adopted by a lot of DTD writers:
    <!ATTLIST termdef
              id      ID      #REQUIRED
              name    CDATA   #IMPLIED>

    The previous construct defines both the id and name attributes for the element termdef.
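
To see how such declarations are read by a parser, here is a small Python sketch using the expat binding from the standard library (not libxml). It records every attribute declaration found in an internal DTD subset and shows the default value of type being supplied for a list element that omits it. The wrapper element doc and the value t1 are invented for this example. Expat does not validate, so a missing #REQUIRED attribute or a value outside the enumerated set would not be reported here.

```python
import xml.parsers.expat

DOCUMENT = """<!DOCTYPE doc [
<!ELEMENT doc (termdef, list)>
<!ELEMENT termdef (#PCDATA)>
<!ELEMENT list EMPTY>
<!ATTLIST termdef
          id      ID      #REQUIRED
          name    CDATA   #IMPLIED>
<!ATTLIST list type (bullets|ordered|glossary) "ordered">
]>
<doc><termdef id="t1">some text</termdef><list/></doc>"""

declared = {}    # (element, attribute) -> details of the declaration
attributes = {}  # element -> attributes reported by the parser


def on_attlist(elname, attname, att_type, default, required):
    # Called once per attribute, even when several share one ATTLIST
    declared[(elname, attname)] = {
        "type": att_type,
        "default": default,
        "required": bool(required),
    }


def on_start(name, attrs):
    attributes[name] = dict(attrs)


parser = xml.parsers.expat.ParserCreate()
parser.AttlistDeclHandler = on_attlist
parser.StartElementHandler = on_start
parser.Parse(DOCUMENT, True)

for key in sorted(declared):
    print(key, declared[key])
print(attributes["list"])  # the default from the DTD is filled in
```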

Some examples

The directory test/valid/dtds/ in the libxml distribution contains some complex DTD examples. The test/valid/dia.xml example shows an XML file where the simple DTD is directly included within the document.

How to validate

The simplest way is to use the xmllint program that comes with libxml. The --valid option turns on validation of the files given as input. For example, the following validates a copy of the first revision of the XML 1.0 specification:

xmllint --valid --noout test/valid/REC-xml-19980210.xml

The --noout option is used to avoid printing the resulting tree.

The --dtdvalid dtd option validates the document(s) against a given DTD.

Libxml exports an API to handle DTDs and validation; check the associated description.
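
If you want to run these checks from a script, the xmllint options described above can be wrapped as in the following Python sketch. The function names xmllint_command and is_valid are made up for this example, and the sketch simply assumes that xmllint exits with status 0 when the document parses and validates.

```python
import shutil
import subprocess


def xmllint_command(document, dtd=None):
    """Build the xmllint command line for validating one document."""
    command = ["xmllint", "--noout"]  # --noout: do not print the tree
    if dtd is None:
        command.append("--valid")  # use the DTD named in the DOCTYPE
    else:
        command += ["--dtdvalid", dtd]  # validate against the given DTD
    command.append(document)
    return command


def is_valid(document, dtd=None):
    """Return True or False, or None when xmllint is not installed."""
    if shutil.which("xmllint") is None:
        return None
    result = subprocess.run(
        xmllint_command(document, dtd), capture_output=True, text=True
    )
    return result.returncode == 0


print(" ".join(xmllint_command("test/valid/REC-xml-19980210.xml")))
```

Calling is_valid("test/valid/REC-xml-19980210.xml") from the top of the libxml source tree would then run the same check as the command shown earlier.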

Other resources

DTDs are as old as SGML, so there may be a number of examples online. I will just list one for now; other pointers are welcome:

I suggest looking at the examples found under test/valid/dtds and at any of the large number of books available on XML. The dia example in test/valid should be both simple and complete enough to allow you to build your own.

Daniel Veillard