Generating XML With Genx

Author: Tim Bray
Date: Jan-May, 2004
Locations: Vancouver, Melbourne, Mooloolaba

Status of Genx

This document describes release beta5 of Genx.

Genx may not remain hosted at wherever you got this file from, and is quite likely to change and grow based on community feedback. You’ve been warned!

Genx is copyright © Tim Bray and Sun Microsystems, 2004. It is licensed for re-use under the terms described in the file COPYING.

Introduction

Genx is a software library written in the C programming language. You can use it to generate XML output, suitable for saving into a file or sending in a message to another computer program. Genx does several things at once:

Table of Contents

  1. Hello World!

  2. API Overview

  3. Errors

  4. Limitations

  5. Canonical XML

  6. Performance

  7. Datatype Index

  8. API Index

  9. Acknowledgments

Hello World!

Here’s the program:

hello.c

Compile it with something like cc -o hello hello.c libgenx.a and the output should look like this:

hello

Of course, useful XML documents have attributes as well as elements, so let’s add one:

helloAttr.c

This generates:

helloAttr

Another common XML idiom is namespaces, so let’s put our element and attribute into two separate namespaces.

helloNS.c

This makes the output quite a bit uglier:

helloNS

Passing all these literal strings for element types and attribute names and so on is inefficient, particularly since they usually don’t change much. So if you wanted to generate a million random year/month combinations efficiently as in the example below, you’d use the predeclared versions of the Genx calls. Also, if something goes wrong, you’d like to hear about it before looping a million times uselessly; so this version has error-checking.

Also, I’ve put the root element in a namespace so you can see how that works.

helloMillion.c

Here are the first 10 lines of output:

helloMillion | head -10

API Overview

genxWriter

Before you do anything, you need to create a genxWriter with genxNew. A genxWriter can be used to generate as many XML documents as you want (one at a time). It’s a bit expensive to create, so if you’re going to be writing multiple XML documents, particularly if they all have the same elements and attributes, do re-use a genxWriter.

Predeclaration

Declaring your elements and attributes is much more efficient than using the Literal versions of the calls. This is because Genx only needs to check the names once for well-formedness, and because it can pre-arrange the sorting of attributes in canonical order. Also, Genx makes its own copy of the element, attribute and namespace names and prefixes and so on, so you don’t have to keep them around. For any production application, predeclaration is the way to go.

Files and Senders

Once you’ve got a genxWriter, you set up to write a document either with genxStartDocFile or genxStartDocSender. The first is easiest to understand; you provide a FILE *, and Genx writes into it.

Alternatively, you can provide your own set of routines to do output, for example into a relational database or a socket, in a package called a genxSender, and Genx uses that instead.

Sequencing

Once you’ve got your elements, attributes, and namespaces declared, you start new documents with genxStartDocFile or genxStartDocSender, then you can just bang away with genxStartElement, genxAddAttribute, genxAddText, genxEndElement, and so on, and end each document with genxEndDocument.

UTF-8

Genx expects you to provide all strings in UTF-8 format, and checks each one to make sure that it’s real UTF-8 and that each character is a legal XML character. It doesn’t know about &lt; and &amp; and so on; that is, it knows how to generate them, but it won’t interpret them in the input. So if you want to say if(a<b&&c<d), don’t fool with any escaping, just use genxAddText(w,"if(a<b&&c<d)") and Genx will sort it all out.

If there is some “difficult” character that you want to get into your XML output, say a mathematical integral symbol “∫”, and you’d really like the equivalent of &int; or &#x222b;, just use the Unicode value: genxAddCharacter(w,0x222b).

Namespace Prefixes

You can control your namespace prefixes if you use the predeclared version. But you can always leave out the prefix and Genx will generate one; the first will be g1:, the second g2:, and so on.

Errors

Mechanics

Genx provides a set of status codes in an enum called genxStatus. The value for success, GENX_SUCCESS, is guaranteed to be zero, so it’s easy to check errors in Genx calls along the lines of:

if (genxAddAttribute(id, idValue)) { /* oops! */ }

Well, except when it isn’t. The routines that declare things return the things they declare (NULL on error) and write the genxStatus into a variable whose address you provide, for example genxElement genxDeclareElement(genxWriter w, genxNamespace ns, constUtf8 type, genxStatus * statusP);

There are a couple of routines, genxGetErrorMessage and genxLastErrorMessage, which retrieve English-language descriptions of what went wrong.

Kinds of Errors

There are three kinds of errors you can encounter with Genx.

Stupidity

We all have reduced-mental-function days, and Genx will sneer pityingly at you if you try to genxStartElement without having previously called genxStartDocFile or genxStartDocSender, or do a genxAddAttribute any time but after a genxStartElement. And so on.

Bad Data

This is the kind of problem that you’re most likely to run across. If you’re trying to wrap XML tags around input data you don’t control (common enough), Genx will be unhappy if the data has malformed UTF-8 or contains Unicode characters that XML doesn’t allow.

To help out with these situations, there are the genxCheckText and genxScrubText calls. Appropriate use of these ensures that you never hurt any feelings, either in the Genx software or, more important, with whoever’s going to be receiving your XML. See the write-up on utility routines for some specific suggestions.

System Problems

Genx throws up its hands in despair if it can’t allocate memory or it gets an I/O error writing data. The first is unlikely to happen, since Genx doesn’t use much memory. However, it does store up attribute values per element, so if you did a thousand or so genxAddAttribute calls for a single element, each with an attribute value ten megabytes long, some pain would ensue.

Utility Routines

To make sure you never hand Genx an illegal name or malformed XML, there are the handy utility routines genxCheckText and genxCheckName. If you’re including someone else’s data in your XML and you can’t control whether it contains proper XML characters properly UTF-8 encoded, give serious thought to using genxScrubText, which brutally discards any bytes that aren’t well-formed UTF-8 or don’t encode legal XML characters.

Since genxAddText does the checking anyhow, there’s no need for you to do it first. Consider an idiom like:


/* Add text safely */
status = genxAddText(w, text);
if (status == GENX_BAD_UTF8 || status == GENX_NON_XML_CHARACTER)
{
  /* scrubbed output is never longer than the input */
  utf8 newText = (utf8) alloca(strlen((const char *) text) + 1);
  genxScrubText(w, text, newText);
  status = genxAddText(w, newText); /* Can't fail */
}
if (status) { /* something SERIOUSLY wrong */ }

Limitations

There are a bunch of things that people often do in creating XML but that Genx doesn’t support. In some cases, Doing These Things Would Be Wrong. In others, they might be handy but don’t feel essential for this kind of a low-rent package.

The things that Genx can’t do include:

Canonical XML

By design, Genx writes Canonical XML. This means that there are no XML or <!DOCTYPE> declarations, that the attributes are sorted in a particular order, that all instances of > and carriage-return (U+000D) are escaped, and that there is no whitespace outside the root element except newlines that precede and follow comments and PIs.

Normally, this should cause no surprises or difficulties, except that Canonical XML documents don’t have a closing new-line character, which may irritate some applications such as text editors.

As noted above, if you want extra declarations or closing newlines, you can put them in yourself before and after doing your Genx calls; but be aware that your output will no longer be Canonical XML.

Performance

The design of Genx takes some care to achieve good performance. However, there are some things you can do to help, and others which will slow it down; one function in particular can be used in optimizing or pessimizing performance.

The genxAddNamespace call instructs Genx to insert a namespace declaration; it must be called after starting an element and before any genxAddAttribute calls. You don’t ever need to call it; Genx will figure out when it needs to add namespace declarations on its own. However, if you have a bunch of elements or attributes, all in the same namespace, scattered all around your document, if you do a genxAddNamespace for that namespace on the root element, Genx won’t ever have to add another declaration, and your document will end up smaller, more readable, and quicker to transmit and parse.

On the other hand, genxAddNamespace can be called with an extra argument, a prefix to use, which need not be the same as the default prefix for that namespace. If you do this, performance will suffer grievously, as it makes a bunch of internal optimizations impossible and Genx has to laboriously examine its whole internal stack any time you use that namespace again to make sure the right prefixes are in scope. (By the way, it’s good practice anyhow to use the same prefix for the same namespace throughout an XML document, so Genx rewards good practice with good performance.)

Genx also has a genxUnsetDefaultNamespace call, which does what its name suggests. If you use this, however, you will defeat a bunch of optimizations and make the namespace that used to be the default much slower to process.

Datatype Index

This section documents all the datatypes that appear in Genx’s published interface, found in the file genx.h.

genxStatus

typedef enum
{
  GENX_SUCCESS = 0,
  GENX_BAD_UTF8,
  GENX_NON_XML_CHARACTER,
  GENX_BAD_NAME,
  GENX_ALLOC_FAILED,
  GENX_BAD_NAMESPACE_NAME,
  GENX_INTERNAL_ERROR,
  GENX_DUPLICATE_PREFIX,
  GENX_SEQUENCE_ERROR,
  GENX_NO_START_TAG,
  GENX_IO_ERROR,
  GENX_MISSING_VALUE,
  GENX_MALFORMED_COMMENT,
  GENX_XML_PI_TARGET,
  GENX_MALFORMED_PI,
  GENX_DUPLICATE_ATTRIBUTE,
  GENX_ATTRIBUTE_IN_DEFAULT_NAMESPACE,
  GENX_DUPLICATE_NAMESPACE,
  GENX_BAD_DEFAULT_DECLARATION
} genxStatus;

This documents all the things that can go wrong. You can use the functions genxGetErrorMessage and genxLastErrorMessage to associate English-language messages with these codes. Here are some further notes on the ones that are actually used in the implementation:

GENX_BAD_UTF8

A violation of the UTF-8 encoding rules, as documented in Chapter 3.10 of The Unicode Specification. That’s the chapter reference for Version Four of Unicode, anyhow, which is what I used to help me write Genx. The explanation of UTF-8 in Version Four is quite a bit better than in any of the earlier releases.

GENX_NON_XML_CHARACTER

The rule for what characters are legal in XML comes from the production labeled Char in the XML 1.0 specification.
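To make that production concrete, here is a rough sketch of the check as a handful of range tests. It follows the Char production from XML 1.0 and is not Genx’s own code; the name isXmlChar is made up for this illustration.

```c
/* Sketch: returns 1 if c is allowed by the XML 1.0 Char production,
 * 0 otherwise. isXmlChar is a hypothetical name, not a Genx call. */
int isXmlChar(int c)
{
  if (c == 0x9 || c == 0xA || c == 0xD)
    return 1;                        /* tab, newline, carriage return */
  if (c >= 0x20 && c <= 0xD7FF)
    return 1;                        /* everything below the surrogates */
  if (c >= 0xE000 && c <= 0xFFFD)
    return 1;                        /* skips surrogates, U+FFFE, U+FFFF */
  if (c >= 0x10000 && c <= 0x10FFFF)
    return 1;                        /* supplementary planes */
  return 0;
}
```

Note that most control characters below U+0020 fail this test, which is why raw binary data tends to provoke this error.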

GENX_BAD_NAME

The rule that applies here is the production labeled NCName in Namespaces in XML. The bad name could be an element type, an attribute name, a PI target, or a namespace prefix.

GENX_ALLOC_FAILED

This means that Genx failed to allocate memory for some reason that it has no hope of understanding and you probably have no hope of fixing, but at least you know.

GENX_BAD_NAMESPACE_NAME

This means that you tried to genxDeclareNamespace and passed NULL as a namespace name, which pretty well defeats the purpose. Or, you passed the empty string "", which would undeclare a default namespace, except that Genx doesn’t do those.

GENX_INTERNAL_ERROR

Something is terribly wrong inside Genx; send mail to the bozo who wrote it. I think he’s named Ibrahim and lives in Singapore.

GENX_DUPLICATE_PREFIX

You tried to declare two namespaces with the same default prefix.

GENX_SEQUENCE_ERROR

Genx functions have to be called in a particular order, which is reasonably self-evident: You can only call genxAddNamespace and genxUnsetDefaultNamespace after a genxStartElement call and before any genxAddAttribute calls. Turning it around, genxAddAttribute can only be called after genxStartElement and possibly one or more genxAddNamespace/genxUnsetDefaultNamespace calls. This code means you got that order wrong.

GENX_NO_START_TAG

You called genxEndElement, but there was no corresponding genxStartElement call.

GENX_IO_ERROR

An I/O routine has complained to Genx, which is passing the complaint on to you, so it’s your problem now. If you used genxStartDocFile, the error comes from down in the stdio library, which probably means something is terribly wrong at a level too low for you to fix. If on the other hand you’re doing your own I/O via genxStartDocSender, you may be able to do something useful.

GENX_MISSING_VALUE

You called genxAddAttribute but used NULL for the attribute value; if you want it to be empty, use "" instead.

GENX_MALFORMED_COMMENT

A comment’s text isn’t allowed to either begin or end with -, nor is it allowed to contain --. You called genxComment with text exhibiting one of these problems.

GENX_XML_PI_TARGET

You tried to create a PI whose target was xml (in any combination of upper and lower case). XML 1.0 says you can’t do that.

GENX_MALFORMED_PI

You called genxPI with a body which included an illegal ?>.

GENX_DUPLICATE_ATTRIBUTE

You tried to add the same attribute to some element more than once. There’s no check whether you provided the same value or not; this is evidence of breakage.

GENX_ATTRIBUTE_IN_DEFAULT_NAMESPACE

You either tried to declare an attribute in a namespace whose default prefix is empty (i.e. it’s the default namespace), or tried to add an attribute which is in a namespace, and the currently-effective declaration for that namespace has an empty prefix, i.e. it’s the default namespace.

GENX_DUPLICATE_NAMESPACE

You tried to add two namespace declarations for the same namespace on the same element, but with different prefixes.

GENX_BAD_DEFAULT_DECLARATION

You tried to declare some namespace to be the default on an element which is in no namespace.

Character Types

#define GENX_XML_CHAR 1
#define GENX_LETTER 2
#define GENX_NAMECHAR 4

These are mostly used internally, but the utility function genxCharClass returns the OR of any that apply.

utf8

typedef unsigned char * utf8;

This is the flavor of text string that all Genx functions expect.

constUtf8

typedef const unsigned char * constUtf8;

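You’d think that this would be the same as const utf8, but it’s not: const applies to the typedef as a whole, so const utf8 is a constant pointer to modifiable characters (unsigned char * const), whereas constUtf8 is a pointer to characters that can’t be modified.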

genxWriter

Opaque pointer type which identifies a writer object and is the first argument to most Genx calls; created with genxNew.

genxNamespace

Opaque pointer identifying a namespace; created with genxDeclareNamespace.

genxElement

Opaque pointer identifying an element; created with genxDeclareElement.

genxAttribute

Opaque pointer identifying an attribute; created with genxDeclareAttribute.

genxSender

typedef struct
{
  genxStatus (* send)(void * userData, constUtf8 s);
  genxStatus (* sendBounded)(void * userData, constUtf8 start, constUtf8 end);
  genxStatus (* flush)(void * userData);
} genxSender;

A user-provided package of I/O routines, to be passed via genxStartDocSender. Their names should be self-explanatory; for sendBounded, if you have s = "abcdef"; and you want to send abc, you’d call sendBounded(userData, s, s + 3);
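As an illustration, here is a minimal sketch of a sender that collects output in a fixed-size memory buffer. The type declarations are copied from this document so that the sketch compiles without genx.h (the enum is abridged, so its numeric values will not match the real header); memBuffer and the mem* routines are hypothetical names invented for the example.

```c
#include <string.h>

/* Stand-ins copied from the declarations in this document; with the
 * real library you would include genx.h instead. The enum is abridged. */
typedef const unsigned char * constUtf8;
typedef enum { GENX_SUCCESS = 0, GENX_IO_ERROR } genxStatus;
typedef struct
{
  genxStatus (* send)(void * userData, constUtf8 s);
  genxStatus (* sendBounded)(void * userData, constUtf8 start, constUtf8 end);
  genxStatus (* flush)(void * userData);
} genxSender;

/* Hypothetical output buffer; must start out zeroed. */
typedef struct
{
  char data[256];
  size_t used;
} memBuffer;

/* Appends the bytes in [start, end) to the buffer. */
genxStatus memSendBounded(void * userData, constUtf8 start, constUtf8 end)
{
  memBuffer * b = (memBuffer *) userData;
  size_t n = (size_t) (end - start);

  if (n >= sizeof(b->data) - b->used)
    return GENX_IO_ERROR;            /* no room left; report an I/O error */
  memcpy(b->data + b->used, start, n);
  b->used += n;
  b->data[b->used] = '\0';           /* keep the buffer zero-terminated */
  return GENX_SUCCESS;
}

/* Appends a zero-terminated string by delegating to memSendBounded. */
genxStatus memSend(void * userData, constUtf8 s)
{
  return memSendBounded(userData, s, s + strlen((const char *) s));
}

/* Nothing is buffered elsewhere, so flushing has nothing to do. */
genxStatus memFlush(void * userData)
{
  (void) userData;
  return GENX_SUCCESS;
}

genxSender memSender = { memSend, memSendBounded, memFlush };
```

With the real library, the buffer would presumably be registered through genxSetUserData and &memSender handed to genxStartDocSender; returning GENX_IO_ERROR is one plausible way to report a full buffer.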

API Index

This section documents all the function calls that appear in Genx’s published interface, found in the file genx.h.

genxNew

genxWriter genxNew(void * (*alloc)(void * userData, int bytes),
		   void (* dealloc)(void * userData, void * data),
		   void * userData);

Creates a new instance of genxWriter. The three arguments are a memory allocator and deallocator (see genxSetAlloc and genxSetDealloc), and a userData value (see genxSetUserData).

genxDispose

void genxDispose(genxWriter w);

Frees all the memory associated with a genxWriter.

genxSetUserData

void genxSetUserData(genxWriter w, void * userData);

The value passed in userData is passed as the first argument to memory-allocation (see genxSetAlloc) and I/O (see genxStartDocSender) callbacks. If not provided, NULL is passed.

genxGetUserData

void * genxGetUserData(genxWriter w);

Retrieves the value set with genxSetUserData, or NULL if none was set.

genxSetAlloc

void genxSetAlloc(genxWriter w,
		  void * (* alloc)(void * userData, int bytes));

The subroutine identified by alloc is used by Genx to allocate memory. Otherwise, Genx uses malloc.

genxSetDealloc

void genxSetDealloc(genxWriter w,
		    void (* dealloc)(void * userData, void * data));

The subroutine identified by dealloc is used by Genx to deallocate memory, but only if you called genxSetAlloc with a non-NULL argument.

If you set a non-NULL allocator with genxSetAlloc but no deallocator, Genx will never deallocate memory.
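Here is a rough sketch of what a matched allocator/deallocator pair could look like, using the callback shapes shown above. It simply wraps malloc and free and counts calls through the userData pointer; allocStats, countingAlloc and countingDealloc are hypothetical names, not part of Genx.

```c
#include <stdlib.h>

/* Hypothetical bookkeeping structure, reached through userData. */
typedef struct
{
  long allocations;
  long releases;
} allocStats;

/* Has the shape of the alloc callback:
 * void * (*)(void * userData, int bytes) */
void * countingAlloc(void * userData, int bytes)
{
  allocStats * stats = (allocStats *) userData;
  void * p;

  if (bytes < 0)
    return NULL;                     /* refuse nonsensical sizes */
  p = malloc((size_t) bytes);
  if (p != NULL && stats != NULL)
    stats->allocations++;
  return p;
}

/* Has the shape of the dealloc callback:
 * void (*)(void * userData, void * data) */
void countingDealloc(void * userData, void * data)
{
  allocStats * stats = (allocStats *) userData;

  if (data != NULL && stats != NULL)
    stats->releases++;
  free(data);
}
```

Going by the genxNew signature, the pair could be installed with something like genxNew(countingAlloc, countingDealloc, &stats). Always supply the two together, since an allocator without a deallocator means memory is never released.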

genxGetAlloc

void * (* genxGetAlloc(genxWriter w))(void * userData, int bytes);

Retrieves the allocator routine pointer (if any) set with genxSetAlloc.

genxGetDealloc

void (* genxGetDealloc(genxWriter w))(void * userData, void * data);

Retrieves the deallocator routine pointer (if any) set with genxSetDealloc.

genxDeclareNamespace

genxNamespace genxDeclareNamespace(genxWriter w,
				   constUtf8 uri, constUtf8 prefix,
				   genxStatus * statusP);

Declares a namespace. If successful, the genxNamespace object is returned and the genxStatus variable indicated by statusP is set to GENX_SUCCESS.

The prefix, if provided, is the default prefix which will be used when Genx has to insert its own xmlns:whatever attribute when you insert an element or attribute in a namespace that you haven’t previously done a genxAddNamespace call on; the default is also used when you call genxAddNamespace with a NULL second argument.

You can pass "" as the prefix, in which case this namespace is declared as the default namespace (xmlns=) whenever the default prefix is used.

If the prefix argument is NULL and you haven’t previously declared this namespace, Genx generates a default prefix; the first is g1:, the second g2:, and so on.

If the prefix argument is NULL but you had previously declared a default prefix for this namespace, this is a no-op.

You can declare the same namespace multiple times with no ill effect.

Things can go wrong, signaled by a return value of NULL and a genxStatus code written into *statusP:

genxGetNamespacePrefix

utf8 genxGetNamespacePrefix(genxNamespace ns);

Returns the prefix associated with a namespace; particularly useful where the prefix has been generated for the caller by Genx.

genxDeclareElement

genxElement genxDeclareElement(genxWriter w,
			       genxNamespace ns, constUtf8 type,
			       genxStatus * statusP);

Declares an element. If successful, the genxElement object is returned and the genxStatus variable indicated by statusP is set to GENX_SUCCESS. You can declare the same element multiple times.

If the ns is NULL, the element is not in any namespace.

The only likely error is the type not being an NCName, in which case NULL is returned and *statusP is set appropriately.

genxDeclareAttribute

genxAttribute genxDeclareAttribute(genxWriter w,
				   genxNamespace ns,
				   constUtf8 name, genxStatus * statusP);

Declares an attribute. If successful, the genxAttribute object is returned and the genxStatus variable indicated by statusP is set to GENX_SUCCESS. You can declare the same attribute multiple times.

If the ns is NULL, the attribute is not in any namespace.

The only likely error is the name not being an NCName, in which case NULL is returned and *statusP is set appropriately.

genxStartDocFile

genxStatus genxStartDocFile(genxWriter w, FILE * file);

Prepares to start writing an XML document, using the provided FILE * stream for output.

genxStartDocSender

genxStatus genxStartDocSender(genxWriter w, genxSender * sender);

Prepares to start writing an XML document, using the provided genxSender structure for output.

genxEndDocument

genxStatus genxEndDocument(genxWriter w);

Signals the end of a document. Actually does very little aside from calling fflush if writing to a FILE *, the flush method of genxSender otherwise. Since Genx can detect when the root element has ended, perhaps this should be removed?

genxComment

genxStatus genxComment(genxWriter w, constUtf8 text);

Inserts a comment with the text provided. Can provoke an error if the text fails to follow the XML 1.0 rules for comment text: no leading or trailing -, and no embedded --.

Per Canonical XML, if the comment appears before the root element, it will be followed by a newline; if after the root element, it will be preceded by a newline.
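If you want to test comment text yourself before handing it over, the rules as stated above are easy to sketch. This is an illustration of those rules only, not Genx’s implementation, and commentTextOk is a made-up name.

```c
#include <string.h>

/* Sketch: returns 1 if text obeys the comment rules described above
 * (no leading or trailing '-', no embedded "--"), 0 otherwise. */
int commentTextOk(const char * text)
{
  size_t len = strlen(text);

  if (len == 0)
    return 1;                        /* an empty comment breaks no rule */
  if (text[0] == '-' || text[len - 1] == '-')
    return 0;
  if (strstr(text, "--") != NULL)
    return 0;
  return 1;
}
```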

genxPI

genxStatus genxPI(genxWriter w, constUtf8 target, constUtf8 text);

Inserts a Processing Instruction. Can provoke an error if the target is xml in any combination of upper and lower case; or if the text contains ?>.

PIs outside the root element are equipped with newlines exactly as with comments.
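The two PI checks described above can be sketched in a few lines. This is only an illustration of the stated rules, with hypothetical function names; the target must additionally be a legal name, which this sketch does not verify.

```c
#include <ctype.h>
#include <string.h>

/* Sketch: returns 1 if target spells "xml" in any mix of upper and
 * lower case, which is the reserved case, 0 otherwise. */
int piTargetIsXml(const char * target)
{
  return strlen(target) == 3
    && tolower((unsigned char) target[0]) == 'x'
    && tolower((unsigned char) target[1]) == 'm'
    && tolower((unsigned char) target[2]) == 'l';
}

/* Sketch: returns 1 if the PI body does not contain "?>". */
int piTextOk(const char * text)
{
  return strstr(text, "?>") == NULL;
}
```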

genxStartElementLiteral

genxStatus genxStartElementLiteral(genxWriter w,
				   constUtf8 xmlns, constUtf8 type);

Start writing an element. The xmlns argument, if non-NULL, is the namespace name, a URI. Genx generates a prefix. If xmlns is NULL, the element will be in no namespace.

If you have previously declared a namespace for the namespace name, the prefix associated with that declaration will be used.

Errors can occur if the xmlns contains broken UTF-8 or non-XML characters, or the type is not an NCName.

This call is much less efficient than genxStartElement.

genxStartElement

genxStatus genxStartElement(genxElement e);

Start writing an element using a predeclared genxElement and (optionally) genxNamespace.

There is very little that can go wrong with this call, unless you neglect to start the document or have already called genxEndDocument.

genxAddAttributeLiteral

genxStatus genxAddAttributeLiteral(genxWriter w, constUtf8 xmlns,
				   constUtf8 name, constUtf8 value);

Adds an attribute to a just-opened element; i.e. it must be called immediately after one of the start-element calls.

The xmlns argument, if non-NULL, is the namespace name, a URI. Genx generates a prefix. If xmlns is NULL, the attribute will be in no namespace.

Errors can occur if the xmlns or value contains broken UTF-8 or non-XML characters, the name is not an NCName, or if you try to add the same attribute to an element more than once.

Since there is no DTD available, Genx does not do any attribute normalization. However, it does escape the characters <, &, >, carriage-return (U+000D), and " in the attribute value.

This call is much less efficient than genxAddAttribute.

genxAddAttribute

genxStatus genxAddAttribute(genxAttribute a, constUtf8 value);

Adds a predeclared attribute with an (optional) predeclared namespace to a just-opened element; i.e. it must be called immediately after one of the start-element calls.

Errors can occur if the provided value contains broken UTF-8 or non-XML characters, or if you try to add the same attribute to an element more than once.

Since there is no DTD available, Genx does not do any attribute normalization. However, it does escape the characters <, &, >, carriage-return (U+000D), and " in the attribute value.

genxAddNamespace

genxStatus genxAddNamespace(genxNamespace ns, constUtf8 prefix);

Inserts a declaration for a namespace, with the requested prefix, or with the default prefix if the second argument is NULL. If the requested prefix is not the default, this will have a significant impact on the performance of subsequent Genx calls involving this namespace. This is a no-op if a declaration of this namespace/prefix combination is already in effect.

You can’t use the same prefix for two different namespaces within a single start-tag, and you can’t use two different prefixes for the same namespace in the same scope.

This must be called after a genxStartElement call and before any genxAddAttribute calls or a GENX_SEQUENCE_ERROR will ensue.

genxUnsetDefaultNamespace

genxStatus genxUnsetDefaultNamespace(genxWriter w);

Inserts an xmlns="" declaration to unset the default namespace declaration. This is a no-op if no default namespace is in effect.

genxEndElement

genxStatus genxEndElement(genxWriter w);

Close an element, writing out its end-tag. The only error that can normally arise is if this is called without a corresponding start-element call.

genxAddText

genxStatus genxAddText(genxWriter w, constUtf8 start);
genxStatus genxAddCountedText(genxWriter w, constUtf8 start, int byteCount);
genxStatus genxAddBoundedText(genxWriter w, constUtf8 start, constUtf8 end);

Write some text into the XML document. This can only be called between start-element and end-element calls.

The text is processed by escaping <, &, >, and carriage-return (U+000D) characters.

In the first version, the text is zero-terminated; the Counted and Bounded versions allow the caller to avoid the zero-termination.
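To show what that escaping amounts to, here is a small stand-alone sketch that applies the same four substitutions to a zero-terminated string. It is not how Genx does it internally; escapeText is a hypothetical helper, and &#xD; is the carriage-return form Canonical XML calls for.

```c
#include <stddef.h>
#include <string.h>

/* Sketch: copies in to out, escaping <, &, > and carriage return.
 * Returns the number of bytes written, not counting the terminator,
 * or -1 if out (outSize bytes long) is too small. */
int escapeText(const char * in, char * out, size_t outSize)
{
  size_t used = 0;

  if (outSize == 0)
    return -1;                       /* no room even for the terminator */
  for (; *in != '\0'; in++)
  {
    char single[2];
    const char * rep;
    size_t n;

    switch (*in)
    {
    case '<':  rep = "&lt;";  break;
    case '&':  rep = "&amp;"; break;
    case '>':  rep = "&gt;";  break;
    case '\r': rep = "&#xD;"; break;
    default:                         /* everything else is copied as-is */
      single[0] = *in;
      single[1] = '\0';
      rep = single;
      break;
    }
    n = strlen(rep);
    if (n >= outSize - used)
      return -1;                     /* keep one byte for the terminator */
    memcpy(out + used, rep, n);
    used += n;
  }
  out[used] = '\0';
  return (int) used;
}
```

For example, feeding it if(a<b&&c<d) produces if(a&lt;b&amp;&amp;c&lt;d), which is why you should pass genxAddText the raw text rather than escaping it yourself.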

genxAddCharacter

genxStatus genxAddCharacter(genxWriter w, int c);

Add a single character to the XML document. The value passed is the Unicode scalar as normally expressed in the U+XXXX notation. Like genxAddText, this can only be called between start-element and end-element calls. This should not normally provoke an error unless the character provided is not a legal XML character.

genxNextUnicodeChar

int genxNextUnicodeChar(utf8 * sp);

Returns the Unicode character encoded by the UTF-8 pointed-to by the argument, and advances the argument to point at the first byte past the encoding of the character. Returns -1 if the UTF-8 is malformed, in which case advances the argument to point at the first byte after the point where malformation was detected.
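For readers curious about what such a routine involves, here is a rough sketch of a UTF-8 decoder with a similar calling convention. It is not Genx’s implementation, and nextChar is a made-up name. One deliberate assumption: on a bad continuation byte it leaves the pointer at the offending byte, so that a terminating zero is never skipped.

```c
/* Sketch: decodes one character from *sp and advances *sp.
 * Returns the Unicode scalar value, or -1 on malformed UTF-8. */
int nextChar(const unsigned char ** sp)
{
  const unsigned char * s = *sp;
  int c, extra, min, i;

  if (s[0] < 0x80)      { c = s[0];        extra = 0; min = 0; }
  else if (s[0] < 0xC0) { *sp = s + 1; return -1; }   /* stray continuation */
  else if (s[0] < 0xE0) { c = s[0] & 0x1F; extra = 1; min = 0x80; }
  else if (s[0] < 0xF0) { c = s[0] & 0x0F; extra = 2; min = 0x800; }
  else if (s[0] < 0xF8) { c = s[0] & 0x07; extra = 3; min = 0x10000; }
  else                  { *sp = s + 1; return -1; }   /* never legal */

  s++;
  for (i = 0; i < extra; i++, s++)
  {
    if ((*s & 0xC0) != 0x80)         /* also catches a premature zero byte */
    {
      *sp = s;                       /* stop at the offending byte */
      return -1;
    }
    c = (c << 6) | (*s & 0x3F);
  }
  *sp = s;

  /* reject overlong forms, surrogates, and values past U+10FFFF */
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return -1;
  return c;
}
```

A caller would typically loop while the byte at the pointer is nonzero, since a zero byte decodes as character 0 and moves the pointer past the terminator.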

genxCheckText

genxStatus genxCheckText(genxWriter w, constUtf8 s);

This utility routine checks the null-terminated text provided and returns one of GENX_SUCCESS, GENX_BAD_UTF8, or GENX_NON_XML_CHARACTER.

genxCharClass

int genxCharClass(genxWriter w, int c);

The argument is a single Unicode scalar character value. Returns an integer which is the OR of one or more of GENX_XML_CHAR, GENX_LETTER, and GENX_NAMECHAR.

genxScrubText

int genxScrubText(genxWriter w, constUtf8 in, utf8 out);

Copies the zero-terminated text from in to out, removing any bytes which are not well-formed UTF-8 or which represent characters that are not legal in XML 1.0. The output length can never be greater than the input length. Returns a nonzero value if any changes were made while copying.

genxGetErrorMessage

char * genxGetErrorMessage(genxWriter w, genxStatus status);

Returns an English string containing the error message corresponding to the provided genxStatus code.

genxLastErrorMessage

char * genxLastErrorMessage(genxWriter w);

Returns an English string containing the error message corresponding to the last error Genx encountered.

genxGetVersion

char * genxGetVersion();

Returns a string representation of the current version of Genx.

For the package you are reading, it returns:

getVersion

Acknowledgments

The design of Genx was substantially shaped by discussion in the XML-dev mailing list. Particular credit is due to John Cowan, David Tolpin, Rich Salz, Elliotte Rusty Harold, and Mark Lentczner; not that they or anyone but Tim Bray should be blamed for the inevitable infelicities and outright bugs herein.