November 29, 2003

Base64 Encoding Is Here To Stay

Base64 encoding is here to stay. If you have ever had any hopes that base64 encoding in the Internet mail system would become obsolete, this article should shatter those hopes.

First, let's clear up some confusion. The terms “8-bit” and “binary” are not synonyms. Sometimes the term “8-bit clean” is used. “8-bit clean” also does not mean the same thing as “able to handle binary content.”

8-bit clean just refers to the fact that a program or communications channel does not change the most significant bit (msb) of the 8-bit bytes in the content. Back in the stone age, some programs used the msb for special purposes, and some programs cleared the msb. These days, 8-bit clean is almost meaningless, because programs that change the msb in any way are virtually non-existent.

8-bit content is not the same as binary content. 8-bit content is taken to mean content that includes no NUL characters. Binary content, on the other hand, may contain NUL characters. In a narrower interpretation, such as the MIME specification, 8-bit is taken to mean lines of text encoded into 8-bit bytes. In this narrower interpretation, lines are terminated by CR LF and a maximum line length is imposed. (As an interesting sidenote, because XHTML permits lines of any length, which may be terminated by LF alone, XHTML content is assigned the MIME type application/xhtml+xml, rather than text/xhtml as one might expect.) The key difference between 8-bit content and binary content is whether or not NUL bytes are allowed. By this definition, a JPEG file is binary content but not 8-bit content.
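The distinction can be reduced to a one-line test. Here is a minimal Python sketch of the definition used in this article; the function name and the sample byte strings are only illustrative.

```python
def is_binary(content: bytes) -> bool:
    """Return True if the content holds a NUL byte (this article's test for 'binary')."""
    return b"\x00" in content


# Text with a high-bit character but no NUL byte: 8-bit content.
text_8bit = "café\r\n".encode("latin-1")

# The usual opening bytes of a JPEG/JFIF file, which include NUL bytes.
jpeg_like = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"

print(is_binary(text_8bit))
print(is_binary(jpeg_like))
```

The MIME line-length and CR LF rules are not checked here; the sketch covers only the NUL-byte criterion.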

Now, there is no reason why 8-bit text should not be able to pass through almost any communications channel. However, binary content is still at risk. The NUL bytes in binary content can affect many programs that assume NUL terminates a character string. In addition, the conversion of end-of-line characters is hostile to binary content. FTP is culpable in this regard. In many FTP clients, text transfers are the default, meaning that end-of-line characters are converted and binary content is corrupted.

If we fixed common application programs -- if we made all FTP clients use binary mode by default and fixed every SMTP server and email client that can't handle binary content -- we would still have a programming environment that is unfriendly to the processing of binary content. The end-of-line characters issue is a major problem. The standard libraries for many programming languages -- C, C++, Perl, and Python come to mind -- open files in text mode by default, meaning that end-of-line characters are converted. The problem of end-of-line characters is never going away. Windows will always use CR LF as the end-of-line characters, and Linux/Unix will always use LF. For programmers, the choices they must make are not always trivial. Using text mode for I/O is convenient for writing cross-platform text processing utilities because it lets the run-time library deal with end-of-line character conversions. However, text mode I/O corrupts binary content.
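To make the text-mode problem concrete, here is a small Python sketch. It writes the eight-byte PNG signature (which contains CR LF on purpose, so that this kind of damage is detectable) to a temporary file, reads it back in text mode, and compares. The file name is arbitrary, and Latin-1 is chosen only because it maps every byte to a character, which isolates the end-of-line conversion from any transcoding damage.

```python
import os
import tempfile

# The 8-byte PNG signature: 0x89 "PNG" CR LF 0x1A LF.
original = b"\x89PNG\r\n\x1a\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")

    # Binary mode writes the bytes untouched.
    with open(path, "wb") as f:
        f.write(original)

    # Text mode (the default for open) translates CR LF to LF on input.
    with open(path, "r", encoding="latin-1") as f:
        text = f.read()

round_tripped = text.encode("latin-1")

print(original)
print(round_tripped)
print(round_tripped == original)
```

Reading the same file with mode "rb" returns the original bytes unchanged.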

Dealing with end-of-line character conversions is not the only issue with text processing, though. Modern text processing involves transcoding text from an external encoding to an internal Unicode encoding, processing it in some way, then transcoding it back into an external encoding. Windows NT, Java, and .NET programmers know this procedure well. Binary content does not survive the transcoding from an external encoding to the internal encoding and back to the external encoding. This situation causes some problems for programmers, because the programming environment provides great facilities for processing text, but minimal facilities for processing binary content. Therefore, a programmer who must parse content that contains both text and binary content is tempted to convert the message to text in order to use the text processing facilities. A good example is processing an HTTP message in Java. The programmer is tempted to convert the HTTP message to an instance of String in order to parse the header fields, which are text. However, an HTTP message that contains an image contains a binary body, and the conversion to text corrupts the binary content. (See the footnote below.) The fact that text processing is so convenient means that programmers will always have a preference for text processing, and will face difficult choices when both text processing and binary data processing are required. And sloppy programming will result in corrupted binary data.
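The same trap exists outside Java. The following Python sketch first shows a byte sequence failing to survive a decode/encode round trip, and then shows one way to avoid the problem for an HTTP-style message: split on the blank line while still working with bytes, and decode only the header block. The message and the byte values are made up for illustration.

```python
# Arbitrary bytes that are not valid UTF-8, standing in for image data.
binary_body = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF, 0xFE, 0x00, 0x80])

# Tempting but wrong: treat everything as text, then convert back.
as_text = binary_body.decode("utf-8", errors="replace")
back = as_text.encode("utf-8")
corrupted = back != binary_body
print(corrupted)

# Safer: keep the message as bytes and decode only the header fields.
message = b"HTTP/1.1 200 OK\r\nContent-Type: image/png\r\n\r\n" + binary_body
head, _, body = message.partition(b"\r\n\r\n")
header_text = head.decode("iso-8859-1")

print(header_text.splitlines())
print(body == binary_body)
```

The invalid bytes are replaced by U+FFFD during decoding, so the original values cannot be recovered afterwards.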

If you think about it, text content is actually a subset of binary content. A byte array may contain any kind of data, binary or text, including UTF-16 Unicode text. The opposite is not true: a character array may not contain unencoded binary data. It seems odd, though, that we programmers write lots of code for text processing, and we want to put binary content into text content rather than vice versa. Think about XML.

So, base64 encoding will be with us for a long time to come. That's probably not all bad. First, base64 encoding is extremely fast, as encodings go. Second, text processing using almost any programming language is much easier than binary data processing. By encoding binary content in base64, or even base16, we get the convenience of text processing utilities. Third, the expansion of binary content due to base64 encoding is not that bad, relatively speaking. Consider that Russian, Greek, Hebrew, or Arabic text encoded in UTF-8 will expand, relative to a single-byte encoding such as KOI8-R or ISO-8859-7, by more than the 33% that we see with base64 encoding of binary data. And fourth, base64 encoding protects binary content from end-of-line character conversions that happen all too frequently as content is moved around.
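The size comparison is easy to check. This Python sketch measures the growth from base64 encoding and the growth of a short Russian phrase when it is stored as UTF-8 rather than KOI8-R. The sample phrase and the choice of KOI8-R as the baseline are assumptions made for the sake of the illustration.

```python
import base64
import os

# 3000 random bytes stand in for arbitrary binary content.
payload = os.urandom(3000)
encoded = base64.b64encode(payload)
base64_growth = len(encoded) / len(payload) - 1

# The same Russian text in a single-byte encoding and in UTF-8.
russian = "Привет, мир"
single_byte = russian.encode("koi8-r")
utf8 = russian.encode("utf-8")
utf8_growth = len(utf8) / len(single_byte) - 1

print(f"base64: {len(payload)} -> {len(encoded)} bytes ({base64_growth:.0%})")
print(f"UTF-8:  {len(single_byte)} -> {len(utf8)} bytes ({utf8_growth:.0%})")
```

Here base64 turns 3000 bytes into 4000, while the phrase grows from 11 bytes to 20. Base64 as used in MIME also inserts line breaks, which adds a little more on top of the 33%.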

[Footnote: A good solution to the problem is to offer better facilities for processing binary content. In Hunny Software's JMIME library, we created ByteString and ByteStringBuffer classes that feel just like the String and StringBuffer classes -- but no transcoding happens, so that binary content remains uncorrupted. Similarly, in MIME.NET we created ByteString and ByteStringBuilder classes. This turns out to be a very elegant solution. One problem these classes solve is the need to parse text to find the character encoding. At first glance, the problem of character encoding seems to be a chicken-and-egg problem. You must know the text encoding in order to convert it to Unicode for parsing, but you must parse the text in order to discover the text encoding. With the ByteString facility, you can parse the text as a ByteString, discover the text encoding, then transcode the text to UTF-16. ByteString is also more efficient in certain types of text processing, too, because it avoids the transcoding to UTF-16 and back again.]
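The parse-first, transcode-later idea in the footnote can be sketched with Python's bytes type, which plays a role similar to ByteString. This is not the JMIME or MIME.NET API, only an illustration of the approach; the sample message and the regular expression are assumptions.

```python
import re

# A tiny message whose body is KOI8-R text; the header says so.
raw = (b"Content-Type: text/plain; charset=koi8-r\r\n"
       b"\r\n" + "Привет".encode("koi8-r"))

# Step 1: parse as bytes. Nothing is transcoded yet.
head, _, body = raw.partition(b"\r\n\r\n")
match = re.search(rb'charset="?([A-Za-z0-9._-]+)"?', head, re.IGNORECASE)

# Step 2: the charset name itself is plain ASCII; fall back to us-ascii.
charset = match.group(1).decode("ascii") if match else "us-ascii"

# Step 3: only now transcode the body to Unicode.
text = body.decode(charset)

print(charset)
print(text)
```

Because the header is searched as bytes, the encoding is known before any conversion to Unicode happens, which sidesteps the chicken-and-egg problem.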

Posted by Doug Sauder at 12:38 AM | permalink

November 18, 2003

Who needs a URL when you have Google?

URLs are fragile. Yet we persist in using them in bibliographic references in published books and journals. There is a better way.

Why not use a URN instead? Any capable search engine should be able to resolve the URN to a URL. In many cases, just a UUID would work. Stick a UUID in your document, publish the UUID, and then let anyone who wants to find your document search for the UUID. A standard way to use URNs could eventually be established. If we used URNs, documents could be moved around but we could still find them.

Some things could go wrong. URNs could be copied in order to spam those seeking a popular document. Modified documents could be disseminated, with the modifications difficult to detect. I think these problems could be fixed, but a solution doesn't come immediately to mind, except that one could include a domain name as part of the URN. A domain name cannot be easily hijacked, which would mitigate the spam problem.

Using URNs could also open up the door for content to be disseminated in other ways, such as file sharing networks.

So, here's a little test. I'm putting this UUID into this post. I'm going to try to find this post later by entering the UUID into a Google search.

urn:uuid:59c5b81b-49b4-4df5-bfb1-edda2e86cd66
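For anyone who wants to try the same experiment, generating such an identifier takes only a few lines. This Python sketch creates a fresh random UUID in URN form and also parses the one above; the variable names are arbitrary.

```python
import uuid

# A fresh random (version 4) UUID, printed in URN form.
doc_id = uuid.uuid4()
print(doc_id.urn)

# The constructor also accepts the URN form, so a published
# identifier can be parsed back for comparison.
parsed = uuid.UUID("urn:uuid:59c5b81b-49b4-4df5-bfb1-edda2e86cd66")
print(parsed.version)
print(parsed == doc_id)
```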

Posted by Doug Sauder at 08:07 AM | permalink

November 15, 2003

POP3 Vulnerability

In my earlier post I mentioned how the POP3 servers I use do not support the APOP or AUTH command. Why is that so? Here's one thought: APOP and AUTH CRAM-MD5 require the server to compute an MD5 hash. I'm guessing that the ISPs have optimized the POP3 servers to handle login requests. It's common for users with always-on connections to poll the POP3 server frequently to check for new mail. If a large percentage of the server's processing time is spent handling login requests, rather than actually transferring messages, then it makes sense that they would do this. And I could easily imagine that a login that requires computing an MD5 hash could be an order of magnitude more expensive than a simple password lookup. This seems to be a reasonable explanation.
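For reference, the work that APOP asks of the server is small to describe: it computes the MD5 hash of the timestamp from its greeting banner followed by the shared secret, and compares the result with the digest the client sent. Here is a Python sketch; the function names are made up, and the timestamp and secret are the sample values from the POP3 specification (RFC 1939).

```python
import hashlib
import hmac


def apop_digest(timestamp: str, secret: str) -> str:
    """MD5 over the banner timestamp followed by the shared secret, as hex."""
    return hashlib.md5((timestamp + secret).encode("ascii")).hexdigest()


def server_accepts(timestamp: str, secret: str, client_digest: str) -> bool:
    """What the server does at login: recompute the digest and compare."""
    return hmac.compare_digest(apop_digest(timestamp, secret), client_digest)


# Sample values taken from RFC 1939.
timestamp = "<1896.697170952@dbc.mtview.ca.us>"
secret = "tanstaaf"

digest = apop_digest(timestamp, secret)
print("APOP mrose " + digest)
print(server_accepts(timestamp, secret, digest))
```

The password itself never crosses the wire, only the digest, and the timestamp changes with every connection.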

Having said that, though, sending passwords in the clear is never a good idea. And in the case of POP3, the vulnerability goes beyond just reading someone's mail. Two of the ISPs that I use also support authenticated SMTP. The authentication is through POP3. That means you must log in first to the POP3 server, and then for a limited period of time, the SMTP server will relay messages that originate from the same IP address. Therefore, if someone were to discover the POP3 user name and password, they could use the SMTP server as though it were an open relay. I would guess that there are many POP3 accounts where the user name and password are not that difficult to guess, so sniffing might not even be necessary.
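The mechanism is simple enough to sketch. I don't know how any particular ISP implements it, so the class below is hypothetical, including the ten-minute window; it only shows why a stolen POP3 password is all that stands between an attacker and the relay. The clock is injectable so that the expiry can be demonstrated without waiting.

```python
import time


class PopBeforeSmtpTable:
    """Hypothetical relay table: IP address -> time of the last POP3 login."""

    def __init__(self, window_seconds=600.0, clock=time.monotonic):
        self._window = window_seconds  # assumed grace period
        self._clock = clock
        self._last_login = {}

    def record_pop3_login(self, ip):
        """Called by the POP3 server after a successful login."""
        self._last_login[ip] = self._clock()

    def may_relay(self, ip):
        """Called by the SMTP server before relaying a message."""
        seen = self._last_login.get(ip)
        return seen is not None and self._clock() - seen <= self._window


# A fake clock makes the time window easy to exercise.
now = [0.0]
table = PopBeforeSmtpTable(window_seconds=600.0, clock=lambda: now[0])

table.record_pop3_login("192.0.2.10")
allowed_soon_after = table.may_relay("192.0.2.10")

now[0] = 601.0
allowed_after_window = table.may_relay("192.0.2.10")
never_logged_in = table.may_relay("198.51.100.7")

print(allowed_soon_after, allowed_after_window, never_logged_in)
```

Note that the SMTP side checks nothing but the IP address, so whoever can log in to POP3 can relay.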

Posted by Doug Sauder at 09:49 AM | permalink

Is the IETF Becoming Irrelevant?

I log in many times a day to three different POP3 servers. Not one of these POP3 servers supports the APOP or AUTH command. That means my password is sent across the Internet many times a day in clear text. APOP is specified in RFC 1460, published in 1993. AUTH is specified in RFC 1734, published in 1994. There are good, open source -- that is, free -- mail servers that support both. So why don't any of these ISPs support APOP or AUTH? I'm just guessing that it's lethargy. Or maybe apathy.

Considering the situation with POP3, as well as many similar situations that I won't bother to mention, I have to wonder if the IETF has now become irrelevant. There is no simple answer to the question, of course. The IETF is relevant in certain areas, irrelevant in other areas. I do believe, though, that the Internet has become so large that the IETF cannot create a new protocol that will be widely adopted. The IETF, for instance, is helpless to solve the spam problem: it cannot fix or replace SMTP.

In contrast to the IETF, blogging is happening. RSS is happening. Atom is happening.

Posted by Doug Sauder at 09:30 AM | permalink

November 13, 2003

Metadata Overloaded

Let me ask a rhetorical question: What is metadata?

Here's a really simple answer: Metadata is data about data.

In practice, the term metadata seems to be taking on a meaning of its own. Consider the Dublin Core Metadata Initiative. The elements of the Dublin Core include Creator, Subject, Date, Format, and so on. Are these elements metadata? Or are they attributes?

From a certain perspective, I suppose you could say the Dublin Core elements are data about data. If I were not a programmer -- if I were a journalist, for instance -- then I would probably think of the content of an article as the data and the other information -- when it was published, a brief description, copyright information, and such -- as data about the data, and hence metadata.

However, as a programmer, I know that data also has to be processed, and that metadata contains "data about the data," which provides information related to the processing of that data. To me, the Dublin Core metadata elements are just additional attributes. In the case of a published article, the article has a main body, which is its primary attribute, and the Dublin Core "metadata," which are secondary attributes, but attributes nonetheless. The metadata, on the other hand, includes information such as the MIME type and the text character encoding -- which are all data elements related to the processing of the data.

In short, the term "metadata" is overloaded. One man's data is another man's metadata.

Here are some more examples:

  • In .NET programming, a compiled assembly contains metadata about the types the assembly defines. I agree with this use of the term metadata.

  • In the new WinFS file system announced by Microsoft as part of the next version of Windows, stored items contain metadata. The metadata includes attributes similar to the DC metadata: the creation date of the content, the author, the subject, links to related items, and so on. For much of this data, I disagree with the use of the term "metadata." The date that the content was created is not metadata: it is an attribute of the content. If the content is a photograph, the time the picture was taken is not metadata, but the format of the data -- JPEG, BMP, or TIFF -- is metadata.

  • In an email message, the subject, the list of recipients, the sender, and the date are not metadata. The information in the MIME header fields is metadata: the content type, the transfer encoding, and the text character encoding.

  • XML Schema provides metadata about XML content. XML tags themselves provide metadata about the information in an XML document.
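The email example can be made concrete with Python's standard email package. In this sketch the sample message is made up, and the split into two dictionaries simply follows the distinction argued for above: fields that describe the message versus fields that a program needs in order to process the body.

```python
from email import message_from_bytes

raw = (b"From: alice@example.com\r\n"
       b"To: bob@example.com\r\n"
       b"Subject: Test\r\n"
       b"Date: Thu, 13 Nov 2003 23:20:00 -0500\r\n"
       b"MIME-Version: 1.0\r\n"
       b"Content-Type: text/plain; charset=iso-8859-1\r\n"
       b"Content-Transfer-Encoding: quoted-printable\r\n"
       b"\r\n"
       b"caf=E9\r\n")

msg = message_from_bytes(raw)

# Attributes of the message: who, what, when.
attributes = {name: msg[name] for name in ("From", "To", "Subject", "Date")}

# Metadata in the narrow sense: what is needed to process the body.
processing = {
    "content_type": msg.get_content_type(),
    "charset": msg.get_content_charset(),
    "transfer_encoding": msg["Content-Transfer-Encoding"],
}

# The body cannot be read correctly without the processing fields:
# undo the transfer encoding first, then decode with the charset.
body = msg.get_payload(decode=True).decode(processing["charset"])

print(attributes)
print(processing)
print(body.strip())
```

Dropping the attributes loses information about the message, but dropping the processing fields makes the body itself unreadable.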

Let me rephrase my original question: What is "data about the data"?

Posted by Doug Sauder at 11:20 PM | permalink

Offshoring Not a Level Playing Field

From this article in McKinsey Quarterly, we read:

Companies in the United States and Britain account for roughly 70 percent of the market of companies that are moving their business processes offshore. Relatively liberal employment and labor laws give such companies flexibility in reassigning their activities and eliminating jobs, and they can take advantage of the sizable English-speaking populations in many low-wage countries, such as India, Ireland, the Philippines and South Africa. With a shared language, errors are far less likely and functions that require voice interaction or text-based work are straightforward. The opportunities for continental European and Japanese companies are thus more limited.

and this

India each year produces 2 million college graduates -- more than 80 percent of them English speakers -- while China produces 850,000, though with minimal English skills. Even a small country like the Philippines annually produces 290,000 college graduates, all English speakers.

Will offshoring yield a significant advantage to the US and the UK? It could.

The US and the UK would seem to have a language advantage, in that English is the lingua franca that allows offshoring to work better in English than in French, German, Japanese, or other languages. This is especially the case with the offshoring of call centers. But to the extent that language is a barrier, the advantage probably applies to other activities, including software development.

Globalization really is a fascinating topic.

Some questions: Will the US and the UK truly have an advantage, since English is a lingua franca? How will language differences affect globalization? How will globalization affect English?

Posted by Doug Sauder at 12:51 AM | permalink

November 11, 2003

Microsoft wants to punish

Microsoft offers a bounty to catch MSBlast and SoBig creators. What's the point?

The messenger brings bad news, so you shoot the messenger, right? That's what this seems like. I know it's not a perfect analogy. But the creators of these worms did embarrass Microsoft by exploiting security flaws. And now Microsoft feels that they should be hunted down and punished.

It's so strange that they should announce this bounty. I had been thinking over the past few weeks that Microsoft, if it is really serious about building secure software, should offer a cash reward to anyone who reports a security flaw in Microsoft's software. That makes sense. You can hire a few full-time employees to search out and report security flaws. Or, you can offer a cash reward to others who will search out and report the security flaws. Or, you can do both. Microsoft should do both.

I'm concerned about the impact of the proposed bounty. The regular release of malware is the normal state of the Internet. There are bad people out there who want to deliberately cause damage. The fact that those people are out there keeps us vigilant. It's smart to be mistrustful by default on the Internet. If we are able to catch the creators of the most visible malware, then what remains will be more insidious malware, including malware that is more difficult to detect. The threat is no less serious, just not as visible. I prefer the visible malware, because it demands that we take immediate action.

I'm disturbed that so few people take computer security seriously. Of the three POP3 accounts that I have currently with various service providers, all of them require me to send a password in clear text. We know that sending passwords in the clear is bad for security, and we have had alternatives available for many years. So why do these service providers not allow an alternative to sending cleartext passwords? It's because of complacency. When there is an attack that affects the service provider, then they are jolted out of their complacency to take security seriously. That's why we need a highly visible breach in security a few times a year. Without that, we would not make any progress in computer security.

As for Microsoft, trying to punish anyone who would dare to embarrass it is pointless.

Posted by Doug Sauder at 07:58 AM | permalink