もじコード【文字コード】


■ 文字コード means the numerical codes that are used to represent written characters in computers and telecommunications. There are code systems for many writing systems, with several different systems used for Japanese alone.

The main code for Japanese characters is laid down in the JIS X 0208-1997 standard. It has the kana, 6,355 kanji, a lot of symbols, and the more-or-less complete Latin (known as JIS-ASCII), Greek and Cyrillic alphabets.

Let's take as an example the very first kanji in JIS X 0208, which is 亜. In the standard it is defined by a pair of decimal numbers which place it in the 94 x 94 matrix used in that standard. This is known as the 区点(くてん) code of the kanji:

Kuten for 亜: 16-01
The raw "JIS" code is formed by adding 32 (hexadecimal 20) to each number of the kuten pair. This moves the code up into the "printable ASCII" range where it won't upset software by pretending to be escapes, line-feeds, etc.
(Raw) JIS for 亜: 3021 (hex) or 0! (ASCII)
The EUC (originally Extended Unix Code) coding is formed by turning on bit 8 (the MSB) of each byte of the raw JIS code:
EUC for 亜: B0A1
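As a rough illustration of the arithmetic so far, here is a minimal Python sketch. The function names kuten_to_jis and jis_to_euc are made up for this example and do not come from any library; the last line leans on Python's built-in euc_jp codec purely as a cross-check.

```python
def kuten_to_jis(ku: int, ten: int) -> bytes:
    """Add 32 (0x20) to each half of the kuten pair (hypothetical helper)."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must each be in 1..94")
    return bytes([ku + 0x20, ten + 0x20])


def jis_to_euc(jis: bytes) -> bytes:
    """Set the top bit of each byte of the raw JIS code (hypothetical helper)."""
    return bytes(b | 0x80 for b in jis)


jis = kuten_to_jis(16, 1)
euc = jis_to_euc(jis)
print(jis.hex(), jis.decode("ascii"))  # 3021 0!
print(euc.hex())                       # b0a1
print(euc.decode("euc_jp"))            # 亜
```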
The Shift-JIS coding is formed by putting the 14 bits of the raw JIS code through a transformation to make a pair of bytes. The MSB (Most Significant Bit) of the first byte is always set, and the second byte always lies in or above the printable ASCII range.
Shift-JIS for 亜: 889F
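The transformation itself is not spelled out here. The sketch below uses one commonly described formulation of it, so treat the constants as an assumption rather than a quotation from the standard. The helper name jis_to_sjis is hypothetical, and the loop at the end only compares the result with Python's built-in codecs for a handful of characters.

```python
def jis_to_sjis(jis: bytes) -> bytes:
    """Convert one two-byte raw JIS code to Shift-JIS (hypothetical helper)."""
    j1, j2 = jis
    if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
        raise ValueError("each raw JIS byte must be in 0x21..0x7E")
    # Two consecutive JIS rows share one Shift-JIS first byte.
    s1 = (j1 + 1) // 2 + (0x70 if j1 < 0x5F else 0xB0)
    if j1 % 2:
        # Odd row: second byte lands in 0x40..0x7E or 0x80..0x9E (0x7F is skipped).
        s2 = j2 + (0x1F if j2 < 0x60 else 0x20)
    else:
        # Even row: second byte lands in 0x9F..0xFC.
        s2 = j2 + 0x7E
    return bytes([s1, s2])


print(jis_to_sjis(b"0!").hex())  # 889f

# Cross-check a few characters against Python's built-in codecs.
for ch in "亜あ漢字":
    raw_jis = bytes(b & 0x7F for b in ch.encode("euc_jp"))
    assert jis_to_sjis(raw_jis) == ch.encode("shift_jis")
```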
For the sake of completeness, the JIS/ISO-2022-JP coding wraps the "raw" JIS in escape sequences:
ISO-2022-JP for 亜: ^[ $ B 0 ! ^[ ( B
Note that as many raw JIS codes as you like can be in the wrapper, although the RFC for Japanese in email limits it to 72.
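The wrapping can be sketched in a few lines of Python. The helper wrap_iso2022jp is invented for this example; it handles only the simple case of a run of two-byte raw JIS codes followed by a return to ASCII, and the result is compared with Python's built-in iso2022_jp codec.

```python
ESC = b"\x1b"  # the escape character, shown as ^[ above


def wrap_iso2022jp(raw_jis: bytes) -> bytes:
    """Wrap raw JIS codes in the 'ESC $ B' ... 'ESC ( B' escape sequences."""
    return ESC + b"$B" + raw_jis + ESC + b"(B"


wrapped = wrap_iso2022jp(b"0!")
print(wrapped)                       # b'\x1b$B0!\x1b(B'
print(wrapped.decode("iso2022_jp"))  # 亜
```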

There are some differences between JIS coding and ISO-2022-JP that are omitted here. There is also the Unix X-Windows CTEXT_JA (Compound Text) coding, which is similar to JIS/ISO-2022-JP. You only need to worry about it if you are deep into X.

EMAIL AND NEWSGROUPS

In general, Japanese text included in emails and news messages should be in the JIS/ISO-2022-JP codings. Originally there were two reasons for this: (a) the old email standards (developed by and for Americans) prohibited 8-bit characters in emails; (b) a lot of communications software only passed 7-bit characters. (Some even removed escape characters.)

In addition, some other software, such as window handlers, mail readers, etc., which had not been written for 8-bit text, was known to crash when subjected to 8-bit codes.

Many of these reasons no longer apply, especially the communication paths, which are almost universally 8-bit clear, but enough people still run old software that the receiver-friendly thing is to use JIS/ISO-2022-JP codings.

WWW PAGES

Fortunately, WWW pages are in better shape than mail and news. You are free to code Japanese in WWW pages in EUC, Shift-JIS or ISO-2022-JP; just don't mix them in the same page.

Browsers can be set to EUC/Shift-JIS, or to "autodetect" the codes. ISO-2022-JP can always be detected accurately, but telling the other two apart can be chancy, as the ranges overlap. Some people put characters in unique ranges in a comment at the front of the page to make sure the detection stays on the rails.
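To see why the detection can be chancy, here is a deliberately crude Python sketch. It is not how any particular browser works: the function guess_encodings is hypothetical, it treats the presence of an escape sequence as proof of ISO-2022-JP, and otherwise it simply reports which of the two 8-bit codings can decode the bytes without error.

```python
def guess_encodings(data: bytes) -> list[str]:
    """Return the codings that could plausibly explain these bytes."""
    # ISO-2022-JP announces itself with escape sequences.
    if b"\x1b$" in data or b"\x1b(" in data:
        return ["iso2022_jp"]
    candidates = []
    for name in ("euc_jp", "shift_jis"):
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        candidates.append(name)
    return candidates


print(guess_encodings("亜".encode("iso2022_jp")))  # ['iso2022_jp']
print(guess_encodings("亜".encode("shift_jis")))   # ['shift_jis']
print(guess_encodings("亜".encode("euc_jp")))      # ['euc_jp', 'shift_jis']
```

In the last case the EUC bytes B0 A1 are also a legal pair of single-byte half-width kana in Shift-JIS, so the bytes alone cannot settle the question.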

The best way to head up a page with Japanese in it is to have a "META HTTP" directive at the front. Here is what is on the jeKai pages:
	<META HTTP-EQUIV='Content-Type' CONTENT='text/html;CHARSET=x-sjis'>
This both tells the browser what the codes are in the page, and, more importantly, tells it how to code Japanese text in the input fields in a form before sending it to a server.

WHY SHIFT-JIS?

Why is there a "shift" in Shift-JIS? The reason actually goes back to an earlier standard now called JIS X 0201 (JIS201 for short), which is sort-of the extended Japanese version of ASCII. JIS201 is an 8-bit code, and thus has 256 possible values. The first 128 are pretty much ASCII, except that the backslash (\, or / leaning in the opposite direction) is replaced by the Yen symbol. As this standard, or to be more precise its predecessor, was one of the first to have Japanese characters in it, it was highly desirable to have at least a set of kana available for the early computers in Japan. As the full set of kana, when all the nigori/maru diacritical marks are added, overtaxes the space available in the remaining 128 codes, it was decided to use just the basic katakana set, and have the diacritic marks as separate characters. This led to the so-called "half-width kana" (半角(はんかく)カナ). In the early 1980s, electricity invoices, long-distance train tickets, etc. were in this rather clumsy form, and half-width kana continued to be used two decades later. (See examples here.)

Hankaku kana is from the button-boots era of Japanese computing, but Microsoft, in their Infinite Wisdom, decided that it was terribly necessary, once the full kanji set became available as a two-byte code, to enable files and text for/from legacy systems to co-exist with the newer form. And thus arose "Shift-JIS", where the codes in the JIS X 0208 standard are shifted aside to make room for the old hankaku kana codes from JIS201.
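The co-existence can be seen in a small Python sketch. The byte ranges used in split_sjis (a hypothetical helper) are the commonly documented Shift-JIS first-byte ranges, which are an assumption here since they are not listed in this entry; the encoding itself is done by Python's built-in shift_jis codec.

```python
def split_sjis(data: bytes) -> list[bytes]:
    """Split Shift-JIS bytes into one-byte and two-byte units."""
    units, i = [], 0
    while i < len(data):
        b = data[i]
        # Assumed first-byte ranges; 0xA1..0xDF in between is left free
        # for the single-byte half-width kana.
        width = 2 if (0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC) else 1
        units.append(data[i:i + width])
        i += width
    return units


# Half-width KA, half-width dakuten (together read as "ga"), then the kanji 亜.
text = "\uff76\uff9e\u4e9c"
sjis = text.encode("shift_jis")
print(sjis.hex())        # b6de889f
print(split_sjis(sjis))  # [b'\xb6', b'\xde', b'\x88\x9f']
```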

In this author's biased, after-the-event view, this was unnecessary, and has caused a much bigger problem for the information processing industry in Japan than would have been the case had the EUC approach been adopted universally. The problem is not that Shift-JIS is a particularly complicated code; it is that a large proportion of the two-byte code-space was wasted supporting an obsolete code. This has severely limited the range of kanji and other special characters that can fit in two bytes. In theory a two-byte sequence with the MSB set on the first byte can code 32,000+ characters. Shift-JIS is limited to about 10,000, thanks to its support for JIS201 kana. Moreover, it has helped keep JIS201 half-width kana alive and well. Just look at a cash-register docket next time you are in Japan.
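The figures above can be checked with a little arithmetic. The sketch below assumes the commonly documented Shift-JIS byte ranges (first byte 0x81 to 0x9F or 0xE0 to 0xFC, second byte 0x40 to 0x7E or 0x80 to 0xFC); these ranges are not given in this entry, so the result is an estimate under that assumption.

```python
# Any first byte with the MSB set, followed by any second byte.
theoretical = 128 * 256

# Assumed Shift-JIS two-byte ranges (see the note above).
first_bytes = list(range(0x81, 0xA0)) + list(range(0xE0, 0xFD))
second_bytes = list(range(0x40, 0x7F)) + list(range(0x80, 0xFD))
sjis_two_byte = len(first_bytes) * len(second_bytes)

print(theoretical)                          # 32768
print(len(first_bytes), len(second_bytes))  # 60 188
print(sjis_two_byte)                        # 11280
print(94 * 94)                              # 8836, the size of the JIS X 0208 matrix
```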

■ 文字コード have been the subject of some controversy in Japan. The original version of JIS X 0208 (JIS C 6226-1978) was compiled using several industrial "standards", and included almost every kanji used in the official names of towns. In this the standards committee made some blunders, misreading several hand-written kanji and thus inventing several kanji such as 墸, which cannot be found in Morohashi. Soon after the establishment of the original JIS character set, objections were raised to the omission or abbreviation of some fairly common characters. One often cited in objections is 鴎(かもめ), which appears in the name of the author 森鴎外(もりおうがい). Traditionally, this character has been written 鷗 (shown on the original page as a GIF image rather than by a character code).

The JIS committee got into more hot water with the 1983 revision of the standard, when they replaced several kanji with simplified forms hitherto unknown in Japan. Various fiddles were done at this time to handle the expansion of the 常用漢字 and the 人名用漢字.

In 1990, in order to address the problem of missing kanji, a further standard, JIS X 0212-1990, described as the 補助漢字 (supplementary kanji), was introduced. This added a further 5,801 kanji, plus some other characters missing from JIS X 0208, such as odd-ball kana moras and Latin characters with diacritics. While JIS X 0212 went a long way to solving complaints about missing kanji, it has been a total non-event. Since there is no room in Shift-JIS to carry these extra characters, virtually no-one has actually implemented the standard (although you can see the characters using WWWJDIC). The かもめ kanji above is in JIS X 0212.
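A quick way to see the effect is to ask Python's built-in codecs, on the assumption that its euc_jp codec carries the JIS X 0212 characters and its shift_jis codec does not. The helper can_encode is made up for this example, and the traditional かもめ character is written here by its Unicode code point, U+9DD7 (鷗).

```python
def can_encode(ch: str, codec: str) -> bool:
    """Report whether a character has a code in the given coding."""
    try:
        ch.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


simplified = "\u9d0e"   # 鴎, the form in JIS X 0208
traditional = "\u9dd7"  # 鷗, the traditional form

for codec in ("euc_jp", "shift_jis"):
    print(codec, can_encode(simplified, codec), can_encode(traditional, codec))

# In EUC the supplementary kanji take three bytes, led by a 0x8F prefix.
print(len(traditional.encode("euc_jp")))
```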

The early 1990s saw the emergence of the Unicode and ISO-10646 standards, which attempt to pull all the national code-sets into a single compatible set. A major part of these has been the "Han Unification", in which the kanji/hanzi/hanja from Japan, the PRC, Taiwan and Korea were merged into a single block of about 21,000 characters. The unification rules were quite strict, and thus every distinct kanji in JIS X 0208 and 0212 was included, even the cases such as 劒 and 劔 which are clearly orthographical variants. Kanji not in those standards, and not in the Chinese sets such as Big5, missed out totally.

The controversy over codes heated up in the late 1990s when a group of writers and critics headed by the late 江藤淳(えとうじゅん) made statements and held symposia to protest the limited number of kanji available on personal computers and to decry the Unicode standard, which has been seen by some Japanese to be dominated by Chinese or American interests. Their sometimes xenophobic arguments met sharp rebuttals from other critics, particularly those with experience in typesetting or desktop publishing.

Some groups in Japan and China unhappy with current coding systems have built operating systems using alternative character sets, such as the huge CCCII (Chinese Character Code for Information Interchange), which seeks to remedy the perceived deficiencies in the current standards, but there has been little sign that the new systems will be widely adopted. The controversy seems likely to be settled by the adoption of Unicode as the de facto standard, especially since Microsoft is publicly committed to its use on all platforms.


This entry was created by Jim Breen. The original version of the first part of the document is available here. The section on the 文字コード controversy was created by Tom Gally and extensively amended by Jim Breen.


Created 2000-08-15.

