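A year ago I wrote that this blog was going on the road. A year has passed, and I have decided to come back to ありえるえりあ.
In the end I wrote very few posts on the other blog, so I feel I have somewhat lost the habit of blogging.
The other day I gave a presentation to Indian students, so I am publishing the slides here. The title is "Internationalization" (internationalized programming), but someone who attended remarked sarcastically that I had talked about nothing but character encodings. That is correct. With my English, I could not get into harder topics. Keeping explanations simple is an important factor when developing together with people from other countries.
Extract inoue-i18n.tar.gz and open the i18n.html file in a web browser.
Below is the text of the slides.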
Internationalization
INOUE Seiichiro
Ariel Networks, Inc. CTO
Works Applications Guest Fellow
Who I am
- CTO of Ariel Networks, Inc.
- Guest Fellow of ATE division at Works Applications
- Formerly a Lotus Notes developer in Boston
Who I am (cont.)
- My books and articles:
- “Perfect Java”
- “Perfect JavaScript”
- “P2P textbook”
- “Server Side JavaScript”
- etc.
Today’s Topic
- Internationalization (in short, i18n)
- history and concept
- character code
Agenda
- i18n overview
- languages and characters in computer
- History of character code
- Unicode & Java String
1. i18n overview
What i18n is
making software handle
country/region/culture specific things
locale
country/region/culture specific things
=> locale
e.g. On Unix system

    $ man locale   # show the manual
    $ locale -a    # show all locales supported by your system
    $ echo $LANG   # show the current (shell) locale
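The same idea can be sketched in Java, where a locale is a `java.util.Locale` object. This is a minimal illustration rather than part of the slides; the class name `LocaleDemo` and the helper `fromTag` are made up for this example. The default locale depends on the environment, so no fixed output is shown for it.

```java
import java.util.Locale;

class LocaleDemo {
    // Hypothetical helper: builds a Locale from a BCP 47 language tag such as "ja-JP".
    static Locale fromTag(String tag) {
        return Locale.forLanguageTag(tag);
    }

    public static void main(String[] args) {
        // Roughly the counterpart of `echo $LANG`: the JVM-wide default locale.
        System.out.println("default: " + Locale.getDefault());

        // Roughly the counterpart of `locale -a`: how many locales the runtime knows.
        System.out.println(Locale.getAvailableLocales().length + " locales available");

        Locale ja = fromTag("ja-JP");
        System.out.println(ja.getLanguage() + " / " + ja.getCountry()); // ja / JP
    }
}
```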
i18n examples (1)
- language
- input
- output
i18n examples (2)
- timezone
- calendar (e.g. Japanese era, holidays, etc.)
- date/time format (e.g. 2013/7/3, Jul/3/2013)
i18n examples (3)
- number format (e.g. 100,000.00; symbols for the thousands separator and the decimal point)
- monetary/currency (e.g. dollar, yen, rupee)
- measurement units (e.g. mile, feet)
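As a small sketch of the number and currency items above, `java.text.NumberFormat` and `java.util.Currency` pick separators and currency codes from a locale. The class name `NumberFormatDemo` and the helper `format` are assumptions made for this example.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

class NumberFormatDemo {
    // Hypothetical helper: formats a number with the separators of the given locale.
    static String format(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(100000.5, Locale.US));      // 100,000.5
        System.out.println(format(100000.5, Locale.GERMANY)); // 100.000,5

        // The currency also follows from the country part of the locale.
        System.out.println(Currency.getInstance(Locale.JAPAN).getCurrencyCode()); // JPY
        System.out.println(Currency.getInstance(Locale.US).getCurrencyCode());    // USD
    }
}
```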
i18n examples (4)
- name format (e.g. order of family name/given name, middle name, Mr/Ms)
- address format (e.g. postal code, the order of address elements)
- telephone number format (e.g. number of digits)
- icon (e.g. postbox)
i18n evolution
Traditionally,
let software have per-process locale

Nowadays,
let software have per-user / per-content locale
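One way to read "per-user locale" in Java is to pass the locale explicitly to each formatting call instead of relying on the process-wide default. The following is only a sketch under that reading; `PerUserLocaleDemo` and `formatAmount` are hypothetical names.

```java
import java.util.Locale;

class PerUserLocaleDemo {
    // Hypothetical helper: the caller supplies the locale of the current user,
    // so two users served by the same process can get different output.
    static String formatAmount(double amount, Locale userLocale) {
        return String.format(userLocale, "%,.2f", amount);
    }

    public static void main(String[] args) {
        System.out.println(formatAmount(1234.5, Locale.US));      // 1,234.50
        System.out.println(formatAmount(1234.5, Locale.GERMANY)); // 1.234,50
    }
}
```

The per-process style would instead call `Locale.setDefault(...)` once and format without a locale argument, which affects every user of that process.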
2. languages and characters in computer
what we should do
two steps:
- define character set
- assign number (=code point) to each character
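To make "assign number to each character" concrete, this sketch prints the Unicode code points of a few characters. The names `CodePointDemo` and `toCodePoints` are assumptions for illustration; non-ASCII characters are written as `\u` escapes so that the source file encoding does not matter.

```java
import java.util.stream.Collectors;

class CodePointDemo {
    // Hypothetical helper: lists the code point of every character in the text.
    static String toCodePoints(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // a, A, HIRAGANA LETTER A, the kanji for "love", @
        System.out.println(toCodePoints("aA\u3042\u611b@"));
        // U+0061 U+0041 U+3042 U+611B U+0040
    }
}
```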
in order to define character set
we should know what characters are
- definition of character
- the elements of written languages
Question: what is a character?
a, A, あ, 愛, @, ...
Question: what is a character? (cont.)
- resolved only by agreement
- we have to think about languages
language examples
- Japanese
- English
- French
- Hindi
- Marathi
- Chinese
- …
language/script separation
some languages use multiple groups of characters.
  e.g. Japanese uses 'hiragana', 'katakana', 'kanji', 'number', 'mathematical symbols', etc.

some characters are used by multiple languages.
  e.g. Many languages use 'latin alphabet', 'number', etc.

'a group of characters' => script
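Java exposes the script of a code point through `Character.UnicodeScript`, which can serve as a quick illustration of the language/script separation. The class name `ScriptDemo` and the helper `scriptOf` are made up for this sketch.

```java
class ScriptDemo {
    // Hypothetical helper: returns the Unicode script a code point belongs to.
    static Character.UnicodeScript scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(scriptOf('a'));    // LATIN
        System.out.println(scriptOf(0x3042)); // HIRAGANA
        System.out.println(scriptOf(0x30AB)); // KATAKANA
        System.out.println(scriptOf(0x611B)); // HAN (kanji)
        System.out.println(scriptOf(0x0928)); // DEVANAGARI
        System.out.println(scriptOf('1'));    // COMMON (shared by many scripts)
    }
}
```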
script examples
- latin alphabet
- hiragana
- katakana
- kanji
- hangul
- devanagari
- …
language/script mapping
    M              N
language <---> script

cf. http://www.unicode.org/cldr/charts/supplemental/languages_and_scripts.html
character set
define a character set from scripts

      M                 N
character set <---> script
character set examples
- ASCII (latin alphabet, some symbols)
- JIS 0208 (latin alphabet, hiragana, katakana, kanji, etc.)
- ISCII (devanagari, etc)
- …
Unicode (tries to contain all scripts in the world)
Where should languages be taken care of?
According to Unicode standard,
it is application's responsibility

e.g. HTML's lang attribute
    <html lang="ja">
3. History of character code
brief history of character set
- 1963 ASCII
- 1967 ISO/IEC 646
- 1969 JIS 0201
- 1973 ISO 2022
- 1978 JIS 0208
- 1982 CP932(MS-DOS)
- 1985 EUC-JP
- 1987 ISO 8859
brief history of character set (cont.)
- 1988 Unicode 88
- 1991 ISCII
- 1991 Unicode v1.0 (Volume 2 followed in 1992)
- 1993 ISO 10646
- …(Unicode version up)
- 2012 Unicode v6.2
Terms
- ISO: International Organization for Standardization
- IEC: International Electrotechnical Commission
- JIS: Japanese Industrial Standards
- ASCII: American standard code for information interchange
- ISCII: Indian Script Code for Information Interchange
European languages history
ASCII(ISO 646): English alphabet
ISO 8859: various European alphabets

=> Both are 1 byte(octet) char
=> Precisely, ASCII is a 7-bit code; ISO 8859 is an 8-bit code whose lower half is ASCII
=> easy to use them simultaneously
   (determine by the highest order bit)
Japanese history (1)
JIS 0201: half-width katakana (a.k.a. hankaku kana)
- 1 byte char (7-bit char)
- similar to ISO 8859
- easy to use ASCII and JIS 0201 simultaneously
Japanese history (2)
JIS 0208: Hiragana, Katakana, Kanji (a.k.a. zenkaku (full-width))
- 2 byte char
- easy to use ASCII and JIS 0208 simultaneously
  (determine by the highest order bit)
- not easy to use JIS 0208 and JIS 0201 simultaneously
- ISO 8859 has the same issue
=> ISO 2022
ISO 2022
How can we use JIS 0208(Kanji) with ISO 8859(European scripts)?

=> Switch multiple character sets by control characters
   (in some cases, we can omit control characters)

e.g.
abcああabc
-->
0x61 0x62 0x63
0x1b 0x24 0x42
0x24 0x22 0x24 0x22
0x1b 0x28 0x42
0x61 0x62 0x63
Japanese history (3)
Legacy code
- EUC-JP (ISO 2022 compatible)
- Shift-JIS (Microsoft code)
- ISO 2022 JP (a.k.a. JIS code)

=> being replaced by Unicode
4. Unicode & Java String
Unicode
The Unicode Consortium and The Unicode Standard
http://www.unicode.org/
- character set
- encoding scheme
- collation rule
- various algorithms (e.g. BiDi)
Unicode brief history
- Unicode 88 …16bit code
- Original ISO 10646 …32bit code
- ISO 10646 accepted Unicode … super set of Unicode. 32bit code
- ISO 10646/Unicode grows …21bit code
ISO/IEC 10646
UCS(Universal Coded Character Set)

UCS-2: code range: U+0000 ... U+FFFF
UCS-4: code range: U+000000 ... U+10FFFF

=> UCS-2 == BMP(Basic Multilingual Plane) of UCS-4
=> UCS-2 is deprecated, so that UCS almost means UCS-4
Encoding scheme
UTF(Unicode/UCS Transformation Format)
- UTF-8
- UTF-16
- UTF-32
- (UTF-7)
UTF-8
- 8-bit variable-width encoding (a.k.a. multi-byte char)
- backward compatible with ASCII
- becoming the de facto standard on the Internet
UTF-8 (cont.)
UCS Code point                   UTF-8
         00000000 0aaaaaaa  -->  0aaaaaaa
         00000aaa aabbbbbb  -->  110aaaaa 10bbbbbb
         aaaabbbb bbcccccc  -->  1110aaaa 10bbbbbb 10cccccc
000aaabb bbbbcccc ccdddddd  -->  11110aaa 10bbbbbb 10cccccc 10dddddd
UTF-8 conversion example
UTF-8 conversion example (cont.)
あ
U+3042

00110000 01000010
aaaabbbb bbcccccc

== UTF-8 conversion ==

11100011 10000001 10000010
    aaaa   bbbbbb   cccccc

0xe3 0x81 0x82
UTF-16
- 16-bit, variable-width encoding
- Most characters are encoded as a single 16-bit unit (same as BMP/UCS-2)
- Uses surrogate pairs for characters outside the BMP
- Sometimes uses a BOM (Byte Order Mark)
- Java string internal encoding scheme
UTF-16 Surrogate pair
(1) U+0000 - U+ffff (except U+d800 - U+dfff)
  => 16bit encoding

(2) U+10000 - U+10ffff

000a aaaa xxxx xxyy yyyy yyyy
  ->
1101 10ww wwxx xxxx 1101 11yy yyyy yyyy

21bit -> 32bit
(wwww = aaaaa - 1)
UTF-16 Surrogate pair(cont.)
Reserved area for surrogate pair (= a kind of control code)
U+d800 - U+dbff : 11011000 00000000 - 11011011 11111111
and
U+dc00 - U+dfff : 11011100 00000000 - 11011111 11111111
UTF-16 conversion example
UTF-16 conversion example (cont.)
U+29e15

00000010 10011110 00010101
   aaaaa xxxxxxyy yyyyyyyy

== UTF-16 conversion ==

11011000 01100111 11011110 00010101
      ww wwxxxxxx       yy yyyyyyyy

U+d867 U+de15
UTF-32
- 32-bit fixed-width encoding
- Caution: because of composed (combining) characters, one user-perceived character can still take more than one 32-bit unit
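A rough way to see this in Java is to compare the number of code points (one UTF-32 unit each) with the number of user-perceived characters reported by `java.text.BreakIterator`. This is a sketch; `GraphemeDemo` and `countGraphemes` are made-up names, and the exact boundary rules depend on the JDK version.

```java
import java.text.BreakIterator;
import java.util.Locale;

class GraphemeDemo {
    // Hypothetical helper: counts character boundaries as seen by BreakIterator.
    static int countGraphemes(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // HIRAGANA LETTER KE followed by the combining voiced sound mark
        String s = "\u3051\u3099";
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
        System.out.println(countGraphemes(s));               // 1 user-perceived character
    }
}
```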
fyi, Unicode character database
- http://www.unicode.org/charts/
- http://www.unicode.org/charts/charindex.html
Java String
- The internal encoding is UTF-16
- Not UCS-2 (BMP only)
- (Up to Java 1.4, it was effectively UCS-2; supplementary characters are supported since Java 5)
- If you ignore surrogate pairs and composed characters, it is easy to use (it behaves like a 16-bit fixed-width encoding)
Java String’s surrogate pair
    public class SurrogatePairDemo {
        public static void main(String[] args) {
            // U+29E15 written as a UTF-16 surrogate pair
            String s = "\ud867\ude15";
            System.out.println(s.length()); //=>2 (UTF-16 code units)
            System.out.println(s.codePointCount(0, s.length())); //=>1 (code point)
        }
    }
Composed characters
- precomposed character
- combining
=> normalization (to canonical format)
Java String’s composed characters
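    import java.text.Normalizer;

    public class ComposedCharacterDemo {
        public static void main(String[] args) {
            String s = "\u3052"; // precomposed character
            System.out.println(s + ", length=" + s.length());

            s = "\u3051\u3099"; // base character + combining voiced sound mark
            System.out.println(s + ", length=" + s.length());

            // NFC composes the combining sequence into the precomposed character
            s = Normalizer.normalize(s, Normalizer.Form.NFC);
            System.out.println(s + ", length=" + s.length());
        }
    }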
Java String’s composed characters(cont.)
Output:
げ, length=1
げ, length=2
げ, length=1
Java String’s composed characters (cont.)
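    import java.text.Normalizer;

    public class DevanagariDemo {
        public static void main(String[] args) {
            // devanagari: six code points, including a virama and a vowel sign
            String s = "\u0928\u092e\u0938\u094d\u0924\u0947";
            System.out.println(s + ", length=" + s.length());

            // NFC does not shorten it: there is no precomposed form for these sequences
            s = Normalizer.normalize(s, Normalizer.Form.NFC);
            System.out.println(s + ", length=" + s.length());
        }
    }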
Java String’s composed characters (cont.)
Output:
नमस्ते, length=6
नमस्ते, length=6
What collation is
sort and equality algorithm of characters
- In the ASCII era, collation order was essentially the character code order
- However, we sometimes need a different sort order, such as case-insensitive order
- usual order: A,B,a,b
- case-ignore order: A,a,B,b
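The two orders above can be reproduced with plain Java comparators. This is a small sketch with made-up names (`CaseOrderDemo`, `sorted`); the case-insensitive result relies on `Arrays.sort` being stable for objects, so "A" stays ahead of "a" because it comes first in the input.

```java
import java.util.Arrays;
import java.util.Comparator;

class CaseOrderDemo {
    // Hypothetical helper: returns a sorted copy and leaves the input untouched.
    static String[] sorted(String[] input, Comparator<String> order) {
        String[] copy = input.clone();
        Arrays.sort(copy, order);
        return copy;
    }

    public static void main(String[] args) {
        String[] words = { "B", "A", "b", "a" };

        // Code order: all uppercase letters come before lowercase ones.
        System.out.println(Arrays.toString(sorted(words, Comparator.naturalOrder())));
        // [A, B, a, b]

        // Case-insensitive order.
        System.out.println(Arrays.toString(sorted(words, String.CASE_INSENSITIVE_ORDER)));
        // [A, a, B, b]
    }
}
```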
Unicode collation
- Each character has a collation value
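In Java, these collation values are used through `java.text.Collator`, whose strength setting decides which differences count. The sketch below assumes the default rules for `Locale.US`; `CollatorStrengthDemo` and `compare` are hypothetical names.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorStrengthDemo {
    // Hypothetical helper: compares two strings at the given collation strength.
    static int compare(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // PRIMARY: only base letters matter, so case is ignored.
        System.out.println(compare("a", "A", Collator.PRIMARY) == 0);  // true
        // TERTIARY: case differences are significant.
        System.out.println(compare("a", "A", Collator.TERTIARY) == 0); // false
        // Different base letters differ at every strength.
        System.out.println(compare("a", "b", Collator.PRIMARY) < 0);   // true
    }
}
```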
Java String’s collation
    import java.text.Collator;
    import java.util.Arrays;
    import java.util.Locale;

    public class CollationDemo {
        public static void main(String[] args) {
            String[] arr = new String[] { "かか", "がが", "カイ", "さ" };

            // Natural String order compares UTF-16 code units
            System.out.println("normal sort:");
            Arrays.sort(arr);
            for (String s : arr) {
                System.out.println(s);
            }
            System.out.println("");

            // Collator applies the collation rules of the given locale
            Collator coll = Collator.getInstance(Locale.JAPAN);
            System.out.println("collation sort:");
            Arrays.sort(arr, coll);
            for (String s : arr) {
                System.out.println(s);
            }
        }
    }
Output:
normal sort:
かか
がが
さ
カイ

collation sort:
カイ
かか
がが
さ
Summary
- Unicode and Java have resolved most i18n issues
- Still, application programs have to take care of some issues
- Please improve your systems with proper i18n knowledge
おおたに
> Someone who attended remarked sarcastically that I had talked about nothing but character encodings.
Some people really do say terrible things, don't they.