January 15, 2004

XML and Information Theory

Greg Reinacker stirred up a controversy by announcing that he would update the Atom parser in NewsGator to parse and display badly formed XML. Those who criticize Greg's decision point to the fact that if parsers reject badly formed XML, then that puts pressure on content creators to fix their badly formed XML. Greg bases his decision on what he thinks his customers would want.

In this piece, I won't take a position one way or another. But I would like to enter some more information into the debate.

I claim that the ability to parse badly formed files is a feature of all text-based formats, including XML. If Atom were a binary format, like BER-encoded ASN.1, or like PNG, or like zip, then there would be no controversy. When a PNG file is badly formed, no one expects a parser to do anything but reject it. However, with a text-based format, whether HTML, or XML, or MIME, or mbox, or .ini, or any other text-based format, a clever parser may succeed in parsing a badly formed file.

Having a background in information theory, I see the situation from a more theoretical view. One of the features of text-based protocols is that the information rate is low. In other words, the message format contains a lot of redundancy. In XML, we can easily recognize the redundancy: white space between elements is insignificant, the closing tags are more verbose than they need to be (</> would suffice), and overall, XML is more verbose than it needs to be. But even at a lower level, the alphabet XML uses is constrained: most control characters are not allowed, and some characters such as '<' and '&' have special meaning. In contrast, binary protocols have a higher information rate, as redundancy is squeezed out.

It is a basic principle of information theory (Shannon's Theorem) that one can transmit information reliably over a noisy channel, provided one uses a clever encoding, and provided the information rate is below the channel's maximum rate. When we think about text-based protocols, we can think of the badly formed messages as having been affected by a noisy channel -- the defects are "noise." Because text-based protocols have a low information rate, that means, theoretically, it's possible to pass the messages through a noisy channel and convey the information reliably. In other words, the messages can be badly formed, but a parser may be able to recover the information.

It's interesing to consider how redundancy works, practically, to make text-based protocols more robust. I present one good example here. In many text-based protocols, CRLF has special meaning: it divides the message into records. CRLF is not allowed within the records. Because CRLF affects the way a message is displayed in a text editor, it is easy to see where CRLF occurs illegally. Therefore these errors occur infrequently. In contrast, in the BER-encoding of ASN.1, records are indicated by a length field that precedes the record. In general, text-based protocols use record separators (such as CRLF) and field separators (such as white space), rather than length indicators, and the separators are more robust. But separators may only be used if the content of the record or field is constrained; hence, the redundancy.

One more example. In XML, closing tags are verbose. We use </tag> where, for example, </> would suffice. Because we use verbose closing tags, it's possible to write a clever parser that can recover from certain errors, like a missing closing tag.

So, what's all this have to do with the controversy over the handling of badly formed XML? A simple observation: people will write forgiving XML parsers because they can.

Posted by Doug Sauder at January 15, 2004 07:26 AM