Backward Compatibility of Character Set Encodings
I recently spent some time using Linux. Programs for Linux aren’t perfect, and sometimes they have bugs. One of my worries is the compatibility of character set encodings.
The distribution I installed was Red Hat Fedora Core 2. By default, it uses UTF-8 encoding for all languages, which is good for internationalization. My worry is not whether UTF-8 supports Chinese, but whether Linux programs are fully compatible with UTF-8.
Let’s look at an example. The following code illustrates a typical substring check:
return (strstr(s, "/..") != NULL) ? 1 : 0;
My concern is this: in a multi-byte encoding, a character may be composed of two bytes. The first byte of the multi-byte character is >= 0x80, but the second may be less than that. If the second byte happened to be ‘/’, then strstr could report a match even though that ‘/’ byte is only half of a character and not a character in its own right.
In GBK, ‘/’ (0x2F) is never used as the second byte, but ‘\’ (0x5C) is. Recently, I read an article about UTF-8 on Wikipedia. It assured me that UTF-8 never uses bytes in the range 0x00 to 0x7F inside multi-byte characters. That made me happy. GB2312 (as commonly encoded in EUC-CN) also has this property, while GBK doesn’t.
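To make the GBK case concrete, here is a minimal sketch of how a byte-wise search for "\.." can match in the middle of a character, and how a search that only tests character boundaries avoids it. The helper names gbk_is_lead, naive_has and gbk_has are my own inventions for illustration, and the code assumes the input is well-formed GBK with lead bytes in 0x81 to 0xFE.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: 1 if c can start a two-byte GBK character. */
static int gbk_is_lead(unsigned char c)
{
    return c >= 0x81 && c <= 0xFE;
}

/* Byte-wise search, the same idea as the strstr line above. */
int naive_has(const char *s, const char *pat)
{
    return strstr(s, pat) != NULL;
}

/* GBK-aware search: compares only at character boundaries.
 * Assumes s is well-formed GBK and pat is plain ASCII. */
int gbk_has(const char *s, const char *pat)
{
    size_t n = strlen(pat);

    while (*s != '\0') {
        if (strncmp(s, pat, n) == 0)
            return 1;
        /* Step over both bytes of a two-byte character at once. */
        if (gbk_is_lead((unsigned char)*s) && s[1] != '\0')
            s += 2;
        else
            s += 1;
    }
    return 0;
}
```

With the input bytes 0x81 0x5C followed by "..", naive_has reports a match for "\.." because the trail byte 0x5C looks like a backslash, while gbk_has does not. With UTF-8 input this extra care should not be needed for ASCII patterns, since no byte of a multi-byte sequence falls below 0x80.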
That makes the work simpler. Most code that doesn’t need to handle whole characters or extended ASCII characters should become compatible automatically. Care is needed only when something must be done with characters rather than bytes, e.g., truncating and displaying.
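Truncation is a good example of where bytes and characters differ. The sketch below picks a cut point that does not split a UTF-8 character, relying on the fact that continuation bytes always have the bit pattern 10xxxxxx. The function name utf8_truncate_len is hypothetical, and the code assumes the input is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the largest length <= max_bytes at which
 * s can be cut without splitting a UTF-8 character.
 * Assumes s is valid, NUL-terminated UTF-8. */
size_t utf8_truncate_len(const char *s, size_t max_bytes)
{
    size_t len = strlen(s);
    size_t cut = max_bytes;

    if (len <= max_bytes)
        return len;

    /* If the byte at the cut point is a continuation byte (10xxxxxx),
     * the cut would land inside a character, so back up to its start. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;

    return cut;
}
```

For the five-byte string made of "a", the three-byte character 0xE4 0xB8 0xAD, and "z", a limit of 2 or 3 bytes yields a cut at 1, so the multi-byte character is dropped whole rather than cut in half.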
There is one issue with console display, where GBK and GB2312 have an advantage over UTF-8. In console mode, there are no graphical function calls like GetTextExtent, which returns the size of the text it is given. On a console, a character takes up one or two ASCII character positions. If it takes one, GBK and GB2312 represent it in one byte. If it takes two, GBK and GB2312 represent it in two bytes. UTF-8 doesn’t have this property. So one way to determine how much space a character occupies in a console environment is to convert it to GBK or GB2312.
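The conversion idea can be sketched with the iconv interface. The function name gbk_columns is hypothetical, and I am assuming the system's iconv implementation provides a converter named "GBK", which may not be true everywhere. Characters that have no GBK equivalent make the conversion fail, so this is only a rough approach, not a complete solution.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: estimates the console columns used by a single
 * UTF-8 character as the number of bytes it occupies in GBK.
 * Returns -1 if the converter is unavailable or the conversion fails. */
int gbk_columns(const char *utf8_char)
{
    char out[8];
    char *in_ptr = (char *)utf8_char;   /* iconv takes a non-const pointer */
    char *out_ptr = out;
    size_t in_left = strlen(utf8_char);
    size_t out_left = sizeof out;
    size_t rc;

    iconv_t cd = iconv_open("GBK", "UTF-8");   /* to-encoding, from-encoding */
    if (cd == (iconv_t)-1)
        return -1;

    rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1 || in_left != 0)
        return -1;

    /* Bytes written to the output buffer equal columns on the console. */
    return (int)(sizeof out - out_left);
}
```

Where the GBK converter is available, an ASCII letter should come out as 1 and a Chinese character such as the one encoded as 0xE4 0xB8 0xAD in UTF-8 should come out as 2.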