Base64 encoding is here to stay. If you have ever had any hopes that base64 encoding in the Internet mail system would become obsolete, this article should shatter those hopes.
First, let's clear up some confusion. The terms “8-bit” and “binary” are not synonyms. Sometimes the term “8-bit clean” is used. “8-bit clean” also does not mean the same thing as “able to handle binary content.”
8-bit clean just refers to the fact that a program or communications channel does not change the most significant bit (msb) of the 8-bit bytes in the content. Back in the stone age, some programs used the msb for special purposes, and some programs cleared the msb. These days, 8-bit clean is almost meaningless, because programs that change the msb in any way are virtually non-existent.
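For a concrete picture of what a channel that is not 8-bit clean does, here is a minimal Python sketch. The function name strip_msb is made up for illustration; it simply clears the top bit of every byte, the way an old 7-bit program might have.

```python
def strip_msb(data: bytes) -> bytes:
    """Clear the most significant bit of every byte (hypothetical 7-bit channel)."""
    return bytes(b & 0x7F for b in data)

original = "café".encode("latin-1")   # b'caf\xe9'; the e-acute has its msb set
damaged = strip_msb(original)          # 0xE9 & 0x7F == 0x69, which is ASCII 'i'

print(original)  # b'caf\xe9'
print(damaged)   # b'cafi'
```

Plain ASCII passes through such a channel unchanged, which is why the damage went unnoticed for so long.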
8-bit content is not the same as binary content. 8-bit content is taken to mean content that includes no NUL characters. Binary content, on the other hand, may contain NUL characters. In a narrower interpretation, such as the MIME specification, 8-bit is taken to mean lines of text encoded into 8-bit bytes. In this narrower interpretation, lines are terminated by CR LF and a maximum line length is imposed. (As an interesting sidenote, because XHTML permits lines of any length, which may be terminated by LF alone, XHTML content is assigned the MIME type application/xhtml+xml, rather than text/xhtml as one might expect.) The key difference between 8-bit content and binary content is whether or not NUL bytes are allowed. By this definition, a JPEG file is binary content but not 8-bit content.
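The distinction can be sketched as a small classifier. This is only an illustration: the function name classify and its strict_lines flag are invented here, and the 998-byte line limit used for the narrower interpretation is taken from RFC 2045 rather than from this article.

```python
def classify(data: bytes, strict_lines: bool = False, max_line: int = 998) -> str:
    """Label content as '7bit', '8bit', or 'binary'.

    By default only NUL bytes make content binary (the broad definition).
    With strict_lines=True, the narrower MIME reading is applied as well:
    lines must end in CR LF and must not exceed max_line bytes.
    """
    if b"\x00" in data:
        return "binary"
    if strict_lines:
        for line in data.split(b"\r\n"):
            # A bare CR or LF, or an overlong line, breaks the line discipline.
            if len(line) > max_line or b"\r" in line or b"\n" in line:
                return "binary"
    if any(b >= 0x80 for b in data):
        return "8bit"
    return "7bit"

jpeg_start = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"   # first bytes of a JFIF file
print(classify(b"Hello\r\n"))                       # 7bit
print(classify("café\r\n".encode("utf-8")))         # 8bit
print(classify(jpeg_start))                         # binary
print(classify(b"bare LF\n", strict_lines=True))    # binary
```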
Now, there is no reason why 8-bit text should not be able to pass through almost any communications channel. However, binary content is still at risk. The NUL bytes in binary content can affect many programs that assume NUL terminates a character string. In addition, the conversion of end-of-line characters is hostile to binary content. FTP is culpable in this regard. In many FTP clients, text transfers are the default, meaning that end-of-line characters are converted and binary content is corrupted.
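The NUL problem is easy to reproduce. The sketch below uses Python's ctypes module to view a byte sequence the way C string functions would; the helper name as_c_string is invented for illustration.

```python
import ctypes

def as_c_string(data: bytes) -> bytes:
    """Return what code that treats data as a NUL-terminated C string would see."""
    buf = ctypes.create_string_buffer(data)  # copies data and appends a NUL
    return buf.value                         # .value stops at the first NUL byte

jpeg_start = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"   # first bytes of a JFIF file

print(len(jpeg_start))              # 11
print(as_c_string(jpeg_start))      # b'\xff\xd8\xff\xe0'
print(as_c_string(b"plain text"))   # b'plain text'
```

Everything after the fifth byte of the JPEG data is silently lost, while ordinary text passes through untouched.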
If we fixed common application programs -- if we made all FTP clients use binary mode by default and fixed every SMTP server and email client that can't handle binary content -- we would still have a programming environment that is unfriendly to the processing of binary content. The end-of-line characters issue is a major problem. The standard libraries for many programming languages -- C, C++, Perl, and Python come to mind -- open files in text mode by default, meaning that end-of-line characters are converted. The problem of end-of-line characters is never going away. Windows will always use CR LF as the end-of-line characters, and Linux/Unix will always use LF. For programmers, the choices they must make are not always trivial. Using text mode for I/O is convenient for writing cross-platform text processing utilities because it lets the run-time library deal with end-of-line character conversions. However, text mode I/O corrupts binary content.
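To see what text mode does to binary content without needing a Windows machine, the following Python sketch pushes bytes through a text-mode writer with an explicit newline setting. The helper name write_via_text_mode is invented here, and Latin-1 is used only because it maps every byte to one character, so that any damage comes from the newline handling alone.

```python
import io

def write_via_text_mode(payload: bytes, newline: str) -> bytes:
    """Write payload through a text-mode stream and return the bytes that land on 'disk'."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding="latin-1", newline=newline)
    text.write(payload.decode("latin-1"))   # lossless byte-to-character mapping
    text.flush()
    return raw.getvalue()

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # the 8-byte signature that starts every PNG file

# newline="\r\n" mimics text mode on Windows: each LF written becomes CR LF.
print(write_via_text_mode(PNG_SIGNATURE, "\r\n"))   # b'\x89PNG\r\r\n\x1a\r\n'

# newline="" disables translation, which is what binary mode gives you.
print(write_via_text_mode(PNG_SIGNATURE, "") == PNG_SIGNATURE)   # True
```

The PNG signature contains both a CR LF pair and a lone LF, so this kind of conversion changes the very first bytes of the file.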
Dealing with end-of-line character conversions is not the only issue with text processing, though. Modern text processing involves transcoding text from an external encoding to an internal Unicode encoding, processing it in some way, then transcoding it back into an external encoding. Windows NT, Java, and .NET programmers know this procedure well. Binary content does not survive the transcoding from an external encoding to the internal encoding and back to the external encoding. This situation causes some problems for programmers, because the programming environment provides great facilities for processing text, but minimal facilities for processing binary content. Therefore, a programmer who must parse content that contains both text and binary data is tempted to convert the whole message to text in order to use the text processing facilities. A good example is processing an HTTP message in Java. The programmer is tempted to convert the HTTP message to an instance of String in order to parse the header fields, which are text. However, an HTTP message that carries an image has a binary body, and the conversion to text corrupts the binary content. (See the footnote below.) The fact that text processing is so convenient means that programmers will always have a preference for text processing, and will face difficult choices when both text processing and binary data processing are required. And sloppy programming will result in corrupted binary data.
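The HTTP case can be sketched in a few lines. Python stands in for Java here, and the message is a made-up example; the point is only that decoding the whole message and encoding it again does not return the original bytes, whereas splitting at the byte level first leaves the body alone.

```python
def round_trip(data: bytes, encoding: str = "utf-8") -> bytes:
    """Transcode bytes to text and back, replacing anything that does not decode."""
    return data.decode(encoding, errors="replace").encode(encoding)

# Hypothetical HTTP response: text header fields followed by the start of a JPEG body.
body = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
message = b"HTTP/1.1 200 OK\r\nContent-Type: image/jpeg\r\n\r\n" + body

# Tempting but wrong: treat the whole message as text.
corrupted = round_trip(message)
print(corrupted.endswith(body))   # False; the body bytes were replaced

# Safer: split on the blank line as bytes, then decode only the header fields.
head, _, raw_body = message.partition(b"\r\n\r\n")
header_text = head.decode("ascii")
print(header_text.splitlines()[1])   # Content-Type: image/jpeg
print(raw_body == body)              # True
```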
If you think about it, text content is actually a subset of binary content. A byte array may contain any kind of data, binary or text, including UTF-16 Unicode text. The opposite is not true: a character array may not contain unencoded binary data. It seems odd, though, that we programmers write lots of code for text processing, and we want to put binary content into text content rather than vice versa. Think about XML.
So, base64 encoding will be with us for a long time to come. That's probably not all bad. First, base64 encoding is extremely fast, as encodings go. Second, text processing using almost any programming language is much easier than binary data processing. By encoding binary content in base64, or even base16, we get the convenience of text processing utilities. Third, the expansion of binary content due to base64 encoding is not that bad, relatively speaking. Consider that Russian, Greek, Hebrew, or Arabic text encoded in UTF-8 takes two bytes per letter, so compared with the single-byte legacy encodings for those scripts it expands by far more than the 33% that we see with base64 encoding of binary data. And fourth, base64 encoding protects binary content from end-of-line character conversions that happen all too frequently as content is moved around.
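The size figures are easy to check. The sketch below compares the base64 expansion of some random bytes with the UTF-8 expansion of a short Russian phrase; KOI8-R is used as the single-byte baseline, which is an assumption about what the comparison is measured against. The helper name expansion is invented for illustration.

```python
import base64
import os

def expansion(before: int, after: int) -> float:
    """Return the size increase as a percentage of the original size."""
    return (after - before) / before * 100

binary = os.urandom(300)              # stand-in for arbitrary binary content
encoded = base64.b64encode(binary)    # every 3 input bytes become 4 output characters
print(len(binary), len(encoded))      # 300 400

russian = "Привет, мир"
legacy = russian.encode("koi8_r")     # one byte per character
utf8 = russian.encode("utf-8")        # two bytes per Cyrillic letter
print(len(legacy), len(utf8))         # 11 20

print(round(expansion(len(binary), len(encoded))))   # 33
print(round(expansion(len(legacy), len(utf8))))      # 82

# The encoded form contains no NUL, CR, or LF bytes for a conversion to damage.
print(any(c in encoded for c in b"\x00\r\n"))        # False
print(base64.b64decode(encoded) == binary)           # True
```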
[Footnote: A good solution to the problem is to offer better facilities for processing binary content. In Hunny Software's JMIME library, we created ByteString and ByteStringBuffer classes that feel just like the String and StringBuffer classes -- but no transcoding happens, so that binary content remains uncorrupted. Similarly, in MIME.NET we created ByteString and ByteStringBuilder classes. This turns out to be a very elegant solution. One problem these classes solve is the need to parse text in order to find its character encoding. At first glance, the problem of character encoding seems to be a chicken-and-egg problem. You must know the text encoding in order to convert the text to Unicode for parsing, but you must parse the text in order to discover the text encoding. With the ByteString facility, you can parse the text as a ByteString, discover the text encoding, then transcode the text to UTF-16. ByteString is also more efficient in certain types of text processing, because it avoids the transcoding to UTF-16 and back again.]
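The chicken-and-egg problem described in the footnote can be sketched with Python's bytes type, which plays roughly the role a byte-string class would. This is not the JMIME or MIME.NET API; the function decode_entity, the regular expression, and the sample message are all invented for illustration.

```python
import re

# Matches a charset parameter in raw header bytes, with or without quotes.
CHARSET_RE = re.compile(rb'charset\s*=\s*"?([A-Za-z0-9._-]+)"?', re.IGNORECASE)

def decode_entity(raw: bytes, default: str = "us-ascii") -> str:
    """Parse the header as bytes to find the charset, then transcode only the body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    match = CHARSET_RE.search(head)
    charset = match.group(1).decode("ascii") if match else default
    return body.decode(charset)

# Hypothetical message whose body is Russian text in KOI8-R.
raw = b"Content-Type: text/plain; charset=koi8-r\r\n\r\n" + "Привет".encode("koi8_r")
print(decode_entity(raw))   # Привет
```

No guess about the encoding is needed before parsing, because the header is searched as bytes and only the body is ever converted to text.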
Posted by Doug Sauder at November 29, 2003 12:38 AM