Fri 13 Oct 2006
Listing Double Byte Characters
Posted by datacrush under Techs
Unicode is a blessing, and code pages are a faux pas in a multilingual environment.
Some developers have asked me about code page compatibility and character sorting sequence. Although this is well documented and easy to find with a little Googling, I suggested listing every possible double-byte character instead.
Characters are sorted by hex value, with a few exceptions. For the Latin letters A to Z, A has a lower hex value than Z, so when sorted in ascending order A is returned before Z.
But what about graphical characters like Japanese and Chinese?
A few years ago, on a banking project in China, I wrote a program that looped 65536 times and converted every possible two-byte integer (0x0000 to 0xFFFF) to a character. I found that Chinese characters are sorted by their pronunciation in Latin alphabetical order, and then by tone. The Chinese character 伸 (shen) appears before 是 (shi), and 人 (ren) before 壬 (ren). The difference between the two ren is in their tonal grouping.
Such a program is simple to code in Java. Here’s an example:
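import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class ListDoubleBytes {
    public static void main(String[] args) throws IOException {
        // Write every two-byte combination (0x0000 to 0xFFFF), one per line.
        try (OutputStream os = new FileOutputStream("chars.txt")) {
            for (int x = 0; x < 256; x++) {        // first byte
                for (int y = 0; y < 256; y++) {    // second byte
                    os.write(new byte[] {(byte) x, (byte) y, (byte) '\n'});
                }
            }
        }
    }
}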
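The file written by the loop holds raw byte pairs, and what they look like depends on the charset used to read them back. As a rough sketch of that decoding step, the hypothetical helper decodePair below turns one two-byte value into a character and returns null for pairs the charset rejects. The class and method names are my own, and UTF-16BE is used only because every JVM is required to ship it; substitute the double-byte charset you actually care about, assuming your JVM provides it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class PairDecoder {
    /**
     * Decodes one two-byte value under the given charset.
     * Returns null when the pair cannot be decoded in that charset.
     */
    static String decodePair(int value, Charset charset) {
        byte[] pair = {(byte) (value >>> 8), (byte) value};
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(pair))
                    .toString();
        } catch (CharacterCodingException e) {
            return null; // this pair is not decodable in this charset
        }
    }

    public static void main(String[] args) {
        // Assumption: UTF-16BE stands in for a real DBCS code page.
        Charset charset = StandardCharsets.UTF_16BE;
        for (int value = 0x4E00; value <= 0x4E0F; value++) {
            String ch = decodePair(value, charset);
            if (ch != null) {
                System.out.printf("%04X %s%n", value, ch);
            }
        }
    }
}
```

Reporting errors instead of silently replacing them keeps invalid pairs out of the listing, so the output shows only the values the charset really defines.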
If coding in RPG, create a data structure with two overlapping 2-byte fields: one unsigned integer (5U) and one character (2A). For reference, let’s name the two fields “ds.UInt” and “ds.Char”. Next, loop “ds.UInt” from 0 to 65535, and on each pass copy “ds.Char” into a result field, wrapped in Shift-Out (0x0E) and Shift-In (0x0F). The result field should be 4 bytes long.
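For readers without an RPG compiler at hand, the same overlay idea can be sketched in Java. This is only an illustration of the byte layout, not RPG code: the class ShiftWrapped and its wrap method are hypothetical names, and a ByteBuffer plays the part of the two overlapping fields by writing the integer and reading it back as raw bytes.

```java
import java.nio.ByteBuffer;

final class ShiftWrapped {
    static final byte SHIFT_OUT = 0x0E; // start of double-byte data
    static final byte SHIFT_IN = 0x0F;  // back to single-byte data

    /** Builds the 4-byte result field: Shift-Out, high byte, low byte, Shift-In. */
    static byte[] wrap(int value) {
        ByteBuffer field = ByteBuffer.allocate(4); // big-endian by default
        field.put(SHIFT_OUT);
        field.putShort((short) value); // the integer view of the two bytes
        field.put(SHIFT_IN);
        return field.array();          // the character view of the same bytes
    }

    public static void main(String[] args) {
        byte[] sample = wrap(0x4E2D);
        // prints 0E 4E 2D 0F
        System.out.printf("%02X %02X %02X %02X%n",
                sample[0], sample[1], sample[2], sample[3]);
    }
}
```

Calling wrap for every value from 0 to 65535 yields the same sequence of 4-byte fields that the RPG loop would produce.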
Depending on the code page you use to view or retrieve the result, you will get a DBCS listing ordered by hex value for the language of that code page.
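To see how much the ordering depends on the encoding, here is a small sketch that sorts the four characters mentioned earlier by their encoded byte values. The names EncodedOrder, compareUnsigned and sortByEncoding are hypothetical. The demo uses UTF-16BE because it is guaranteed to exist on every JVM; pass another charset name on the command line (for example GB2312, if your JVM provides it) to compare.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EncodedOrder {
    /** Compares two byte arrays as unsigned values, left to right. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Returns a copy of the input sorted by each string's encoded bytes. */
    static List<String> sortByEncoding(List<String> input, Charset charset) {
        List<String> sorted = new ArrayList<>(input);
        sorted.sort((x, y) ->
                compareUnsigned(x.getBytes(charset), y.getBytes(charset)));
        return sorted;
    }

    public static void main(String[] args) {
        // shen, shi, ren (person), ren (the second example in the post)
        List<String> chars = Arrays.asList("\u4F38", "\u662F", "\u4EBA", "\u58EC");
        Charset charset = args.length > 0
                ? Charset.forName(args[0])
                : StandardCharsets.UTF_16BE;
        System.out.println(charset + ": " + sortByEncoding(chars, charset));
    }
}
```

Under UTF-16BE the four characters come out as 人, 伸, 壬, 是, which follows Unicode's radical-and-stroke layout rather than pronunciation. A listing is only meaningful alongside the code page that produced it.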