Internationalization
INOUE Seiichiro
Ariel Networks, Inc. CTO
Works Applications Guest Fellow
Who I am
- CTO of Ariel Networks, Inc.
- Guest Fellow of ATE division at Works Applications
- Formerly a developer of Lotus Notes in Boston
Who I am (cont.)
- My books and articles:
- "Perfect Java"
- "Perfect JavaScript"
- "P2P textbook"
- "Server Side JavaScript"
- etc.
Today's Topic
- Internationalization (in short, i18n)
- history and concept
- character code
Agenda
- i18n overview
- languages and characters in computer
- History of character code
- Unicode & Java String
1. i18n overview
What i18n is
making software handle
country/region/culture specific things
locale
country/region/culture specific things
=> locale
e.g.
On Unix system
$ man locale # show the manual
$ locale -a # show all locales supported by your system
$ echo $LANG # show the current (shell) locale
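A Java counterpart of the shell commands, as a minimal sketch. The class name `LocaleBasics` and the helper `fromTag` are made up for illustration; the JVM default locale is usually derived from OS settings such as `$LANG`, but that depends on the platform.

```java
import java.util.Locale;

public class LocaleBasics {
    // Build a locale from a BCP 47 language tag such as "ja-JP".
    static Locale fromTag(String tag) {
        return Locale.forLanguageTag(tag);
    }

    public static void main(String[] args) {
        // The JVM-wide (per-process) default locale.
        System.out.println("default: " + Locale.getDefault());

        Locale ja = fromTag("ja-JP");
        System.out.println(ja.getLanguage() + " / " + ja.getCountry());

        // Rough counterpart of `locale -a`: how many locales this runtime knows.
        System.out.println("available: " + Locale.getAvailableLocales().length);
    }
}
```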
i18n examples (1)
i18n examples (2)
- timezone
- calendar (e.g. Japanese era, holidays, etc.)
- date/time format (e.g. 2013/7/3, Jul/3/2013)
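The three items above can be tried with `java.time` (Java 8 or later). This is only a sketch: the class and helper names are hypothetical, and the localized date text is printed without an expected value because it depends on the locale data shipped with the runtime.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.chrono.JapaneseDate;
import java.time.chrono.JapaneseEra;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.time.temporal.ChronoField;
import java.util.Locale;

public class DateTimeI18n {
    // Same instant, seen from another time zone.
    static ZonedDateTime inZone(ZonedDateTime t, String zoneId) {
        return t.withZoneSameInstant(ZoneId.of(zoneId));
    }

    // Locale-dependent date text.
    static String formatDate(ZonedDateTime t, Locale locale) {
        return DateTimeFormatter.ofLocalizedDate(FormatStyle.MEDIUM)
                .withLocale(locale)
                .format(t);
    }

    // Era of a Gregorian date in the Japanese imperial calendar.
    static JapaneseEra eraOf(int year, int month, int day) {
        return JapaneseDate.of(year, month, day).getEra();
    }

    public static void main(String[] args) {
        ZonedDateTime tokyo = ZonedDateTime.of(
                LocalDateTime.of(2013, 7, 3, 12, 0), ZoneId.of("Asia/Tokyo"));

        System.out.println(inZone(tokyo, "UTC"));           // timezone
        System.out.println(formatDate(tokyo, Locale.US));    // date format
        System.out.println(formatDate(tokyo, Locale.JAPAN));

        JapaneseDate jd = JapaneseDate.of(2013, 7, 3);       // calendar
        System.out.println(eraOf(2013, 7, 3) + " " + jd.get(ChronoField.YEAR_OF_ERA));
    }
}
```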
i18n examples (3)
- number format (e.g. 100,000.00; symbols for the decimal point and digit grouping)
- monetary/currency (e.g. dollar, yen, rupee)
- measurement units (e.g. mile, feet)
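For number and currency formats, a small sketch with `java.text.NumberFormat` and `java.util.Currency`. The class name `NumberI18n` and the helper `formatNumber` are illustrative only.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

public class NumberI18n {
    // Format with exactly two fraction digits using the locale's separators.
    static String formatNumber(double value, Locale locale) {
        NumberFormat nf = NumberFormat.getNumberInstance(locale);
        nf.setMinimumFractionDigits(2);
        nf.setMaximumFractionDigits(2);
        return nf.format(value);
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(100000, Locale.US));      // 100,000.00
        System.out.println(formatNumber(100000, Locale.GERMANY)); // 100.000,00

        // Currencies differ in code and in the number of minor-unit digits.
        Currency yen = Currency.getInstance(Locale.JAPAN);
        Currency dollar = Currency.getInstance(Locale.US);
        System.out.println(yen.getCurrencyCode() + " digits=" + yen.getDefaultFractionDigits());
        System.out.println(dollar.getCurrencyCode() + " digits=" + dollar.getDefaultFractionDigits());
    }
}
```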
i18n examples (4)
- name format (e.g. order of family name/given name, middle name, Mr/Ms)
- address format (e.g. postal code, the order of address elements)
- telephone number format (e.g. number of digits)
- icon (e.g. postbox)
i18n evolution
Traditionally,
software had one locale per process
Nowadays,
software needs a locale per user / per content
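One way to see why a per-process locale is not enough: case conversion is locale-sensitive, so a server should pass each user's locale explicitly instead of relying on the JVM default. A sketch, with a hypothetical helper `shout`:

```java
import java.util.Locale;

public class PerUserLocale {
    // The locale is an explicit argument (per user / per request),
    // not the process-wide Locale.getDefault().
    static String shout(String text, Locale userLocale) {
        return text.toUpperCase(userLocale);
    }

    public static void main(String[] args) {
        Locale english = Locale.ENGLISH;
        Locale turkish = Locale.forLanguageTag("tr");

        System.out.println(shout("title", english)); // TITLE
        // In Turkish, lowercase i maps to dotted capital I (U+0130).
        System.out.println(shout("title", turkish)); // T\u0130TLE
    }
}
```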
2. languages and characters in computer
what we should do
two steps:
- define character set
- assign number (=code point) to each character
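The second step, "character to number", can be observed directly in Java. A minimal sketch; the class name and the `describe` helper are illustrative, and the code points shown are those Unicode assigns.

```java
public class CodePoints {
    // Return the code point of the first character in "U+XXXX" notation.
    static String describe(String s) {
        int cp = s.codePointAt(0);
        return String.format("U+%04X", cp);
    }

    public static void main(String[] args) {
        // a, A, hiragana a, kanji "love", @
        String[] samples = { "a", "A", "\u3042", "\u611b", "@" };
        for (String s : samples) {
            System.out.println(s + " -> " + describe(s));
        }
    }
}
```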
in order to define character set
we should know what characters are
- definition of character
- the elements of written languages
Question: what is a character?
a, A, あ, 愛, @, ...
Question: what is a character? (cont.)
- resolved only by agreement
- we have to think about languages
language examples
- Japanese
- English
- French
- Hindi
- Marathi
- Chinese
- ...
language/script separation
some languages use multiple groups of characters.
e.g. Japanese uses 'hiragana', 'katakana', 'kanji', 'number', 'mathematical symbols', etc.
some characters are used by multiple languages.
e.g. Many languages use 'latin alphabet', 'number', etc.
'a group of characters'
=> script
script examples
- latin alphabet
- hiragana
- katakana
- kanji
- hangul
- devanagari
- ...
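Java (7 or later) exposes the Unicode script of each code point through `Character.UnicodeScript`. A sketch with a hypothetical helper `scriptOf`; note that Unicode calls the kanji script HAN and files digits under COMMON, since they are shared by many scripts.

```java
public class ScriptLookup {
    static Character.UnicodeScript scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint);
    }

    public static void main(String[] args) {
        // latin a, digit 1, hiragana, katakana, kanji, hangul, devanagari
        String samples = "a1\u3042\u30a2\u611b\ud55c\u0928";
        samples.codePoints().forEach(cp ->
                System.out.println(new String(Character.toChars(cp)) + " -> " + scriptOf(cp)));
    }
}
```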
language/script mapping
language (M) <---> (N) script
cf.
http://www.unicode.org/cldr/charts/supplemental/languages_and_scripts.html
character set
define a character set from scripts
character set (M) <---> (N) script
character set examples
- ASCII (latin alphabet, some symbols)
- JIS 0208 (latin alphabet, hiragana, katakana, kanji, etc.)
- ISCII (devanagari, etc)
- ...
- Unicode (aims to cover all scripts in the world)
Where should languages be taken care of?
According to the Unicode Standard,
it is the application's responsibility
e.g. HTML's lang attribute
<html lang="ja">
3. History of character code
brief history of character set
- 1963 ASCII
- 1967 ISO/IEC 646
- 1969 JIS 0201
- 1973 ISO 2022
- 1978 JIS 0208
- 1982 CP932(MS-DOS)
- 1985 EUC-JP
- 1987 ISO 8859
brief history of character set (cont.)
- 1988 Unicode 88
- 1991 ISCII
- 1991 Unicode v1.0
- 1993 ISO 10646
- ...(Unicode version up)
- 2012 Unicode v6.2
Terms
- ISO: International Organization for Standardization
- IEC: International Electrotechnical Commission
- JIS: Japanese Industrial Standards
- ASCII: American Standard Code for Information Interchange
- ISCII: Indian Script Code for Information Interchange
European languages history
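ASCII (ISO 646): English alphabet
ISO 8859: various European alphabets
=> Both are 1-byte (octet) character codes
=> Precisely, ASCII is a 7-bit code; ISO 8859 keeps ASCII in the lower half and adds up to 96 characters in the upper half
=> easy to use them simultaneously
(determine by the highest-order bit)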
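The "highest-order bit" rule can be checked on real bytes. A sketch using ISO-8859-1 (Latin-1), one part of ISO 8859; the class name and the `isAscii` helper are illustrative.

```java
import java.nio.charset.StandardCharsets;

public class HighBitDemo {
    // A byte belongs to the ASCII half when its highest-order bit is 0.
    static boolean isAscii(byte b) {
        return (b & 0x80) == 0;
    }

    public static void main(String[] args) {
        // "caf" + e with acute accent (U+00E9)
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        for (byte b : bytes) {
            System.out.println(String.format("0x%02x", b & 0xff)
                    + (isAscii(b) ? " ASCII" : " ISO 8859 upper half"));
        }
    }
}
```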
Japanese history (1)
JIS 0201: half-width katakana (a.k.a. hankaku kana)
- 1 byte char (7-bit char)
- similar to ISO 8859
- easy to use ASCII and JIS 0201 simultaneously
Japanese history (2)
JIS 0208: Hiragana, Katakana, Kanji (a.k.a. zenkaku(full-width))
- 2 byte char
- easy to use ASCII and JIS 0208 simultaneously
(determine by the highest order bit)
- not easy to use JIS 0208 and JIS 0201 simultaneously
- ISO 8859 has the same issue
=> ISO 2022
ISO 2022
How can we use JIS 0208(Kanji) with ISO 8859(European scripts)?
=> Switch multiple character sets by control characters
(in some cases, we can omit control characters)
e.g.
abcああabc --> 0x61 0x62 0x63
0x1b 0x24 0x42
0x24 0x22 0x24 0x22
0x1b 0x28 0x42
0x61 0x62 0x63
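The byte sequence above can be fed to Java's ISO-2022-JP decoder to confirm that the escape sequences switch character sets. A sketch: the class name is hypothetical, and ISO-2022-JP is not one of the charsets every Java runtime must provide, so the code checks for it first.

```java
import java.nio.charset.Charset;

public class Iso2022Demo {
    static final byte[] SLIDE_BYTES = {
        0x61, 0x62, 0x63,        // abc (ASCII)
        0x1b, 0x24, 0x42,        // ESC $ B : switch to JIS 0208
        0x24, 0x22, 0x24, 0x22,  // two hiragana "a", 2 bytes each
        0x1b, 0x28, 0x42,        // ESC ( B : switch back to ASCII
        0x61, 0x62, 0x63         // abc (ASCII)
    };

    static String decode(byte[] bytes) {
        return new String(bytes, Charset.forName("ISO-2022-JP"));
    }

    public static void main(String[] args) {
        if (!Charset.isSupported("ISO-2022-JP")) {
            System.out.println("ISO-2022-JP is not available on this runtime");
            return;
        }
        System.out.println(decode(SLIDE_BYTES));
    }
}
```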
Japanese history (3)
Legacy code
- EUC-JP (ISO 2022 compatible)
- Shift-JIS (Microsoft code)
- ISO-2022-JP (a.k.a. JIS code)
=> being replaced by Unicode
4. Unicode & Java String
Unicode
The Unicode Consortium and The Unicode Standard
http://www.unicode.org/
- character set
- encoding scheme
- collation rule
- various algorithms (e.g. BiDi)
Unicode brief history
- Unicode 88 ...16bit code
- Original ISO 10646 ...32bit code
- ISO 10646 accepted Unicode ... super set of Unicode. 32bit code
- ISO 10646/Unicode grows ...21bit code
ISO/IEC 10646
UCS(Universal Coded Character Set)
UCS-2: code range: U+0000 ... U+FFFF
UCS-4: code range: U+000000 ... U+10FFFF
=> UCS-2 == BMP(Basic Multilingual Plane) of UCS-4
=> UCS-2 is deprecated, so UCS now effectively means UCS-4
Encoding scheme
UTF(Unicode/UCS Transformation Format)
- UTF-8
- UTF-16
- UTF-32
- (UTF-7)
UTF-8
- 8-bit variable-width encoding (a.k.a. multi-byte char)
- backward compatible with ASCII
- becoming the standard on the Internet
UTF-8 (cont.)
UCS code point                  UTF-8
         00000000 0aaaaaaa  --> 0aaaaaaa
         00000aaa aabbbbbb  --> 110aaaaa 10bbbbbb
         aaaabbbb bbcccccc  --> 1110aaaa 10bbbbbb 10cccccc
000aaabb bbbbcccc ccdddddd  --> 11110aaa 10bbbbbb 10cccccc 10dddddd
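The four rows of the table translate almost literally into bit operations. A hand-written sketch for study only (real code should use `String.getBytes(StandardCharsets.UTF_8)`); `Utf8ByHand`, `encode` and `hex` are illustrative names, and the input is assumed to be a valid code point outside the surrogate range.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ByHand {
    // Encode one code point following the four rows of the bit table.
    static byte[] encode(int cp) {
        if (cp < 0x80) {            // 1 byte: 0aaaaaaa
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {    // 2 bytes: 110aaaaa 10bbbbbb
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {  // 3 bytes: 1110aaaa 10bbbbbb 10cccccc
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                    // 4 bytes: 11110aaa 10bbbbbb 10cccccc 10dddddd
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("0x%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        int[] samples = { 0x61, 0xE9, 0x3042, 0x29E15 };
        for (int cp : samples) {
            // Cross-check against the library encoder.
            byte[] library = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.println(String.format("U+%04X", cp) + " -> " + hex(encode(cp))
                    + " (library: " + hex(library) + ")");
        }
    }
}
```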
UTF-8 conversion example
UTF-8 conversion example (cont.)
あ
U+3042
00110000 01000010
aaaabbbb bbcccccc
== UTF-8 conversion ==
11100011 10000001 10000010
    aaaa   bbbbbb   cccccc
0xe3 0x81 0x82
UTF-16
- 16-bit, variable-width encoding
- Most characters are encoded in 16-bit fixed-width (same as BMP/UCS-2)
- Uses surrogate pairs for characters outside the BMP
- Sometimes uses a BOM (Byte Order Mark)
- Java string internal encoding scheme
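The BOM is easy to observe from Java: the charset named "UTF-16" writes a big-endian BOM when encoding, while the explicit UTF-16BE and UTF-16LE charsets write none. A small sketch; `BomDemo` and `hex` are illustrative names.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("0x%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // "UTF-16": BOM (0xfe 0xff) followed by big-endian code units.
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_16)));
        // Explicit byte order: no BOM.
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_16BE)));
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_16LE)));
    }
}
```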
UTF-16 Surrogate pair
(1) U+0000 - U+ffff
(except U+d800 - U+dfff)
=> 16-bit encoding
(2) U+10000 - U+10ffff
000a aaaa xxxx xxyy yyyy yyyy -> 1101 10ww wwxx xxxx 1101 11yy yyyy yyyy
21 bits -> 32 bits
(wwww = aaaaa - 1)
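The same bit layout as arithmetic: subtracting 0x10000 is the "wwww = aaaaa - 1" step, and the remaining 20 bits are split 10/10. A sketch for study; `SurrogateByHand` and `toSurrogates` are hypothetical names, and the result is cross-checked against `Character.highSurrogate` / `Character.lowSurrogate` (Java 7 or later).

```java
public class SurrogateByHand {
    // Split a supplementary code point (U+10000 - U+10FFFF) into a surrogate pair.
    static char[] toSurrogates(int cp) {
        int v = cp - 0x10000;                      // 20 bits remain
        char high = (char) (0xD800 | (v >> 10));   // 1101 10ww wwxx xxxx
        char low = (char) (0xDC00 | (v & 0x3FF));  // 1101 11yy yyyy yyyy
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x29E15;
        char[] pair = toSurrogates(cp);
        System.out.println(String.format("0x%04x 0x%04x", (int) pair[0], (int) pair[1]));

        // The library gives the same answer.
        System.out.println(String.format("0x%04x 0x%04x",
                (int) Character.highSurrogate(cp), (int) Character.lowSurrogate(cp)));
    }
}
```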
UTF-16 Surrogate pair(cont.)
Reserved area for surrogate pair (= a kind of control code)
U+d800 - U+dbff : 11011000 00000000 - 11011011 11111111
and
U+dc00 - U+dfff : 11011100 00000000 - 11011111 11111111
UTF-16 conversion example
UTF-16 conversion example (cont.)
𩸕
U+29e15
00000010 10011110 00010101
   aaaaa xxxxxxyy yyyyyyyy
== UTF-16 conversion ==
11011000 01100111 11011110 00010101
      ww wwxxxxxx       yy yyyyyyyy
0xd867 0xde15
UTF-32
- 32-bit fixed-width encoding
- Caution: with composed characters, one displayed character can still need more than one 32-bit unit
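A quick check of this caution in Java, where `String.codePoints()` yields exactly what UTF-32 would store, one `int` per code point. A sketch with illustrative names:

```java
public class CodePointVsCharacter {
    // One int per code point, i.e. what UTF-32 would store.
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // U+29E15 as a surrogate pair: one code point, one 32-bit unit.
        System.out.println(codePoints("\ud867\ude15").length);

        // Hiragana ka + combining voiced sound mark: displayed as one
        // character (ga), but two code points, so two 32-bit units.
        System.out.println(codePoints("\u304b\u3099").length);
    }
}
```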
fyi, Unicode character database
- http://www.unicode.org/charts/
- http://www.unicode.org/charts/charindex.html
Java String
- The internal encoding is UTF-16
- Not UCS-2 (BMP only)
- (Through Java 1.4 it was UCS-2; supplementary characters are supported since Java 5)
- Easy to use as a 16-bit fixed-width encoding, as long as you ignore surrogate pairs and composed characters
Java String's surrogate pair
public class SurrogatePairLength {
    public static void main(String[] args) {
        String s = "\ud867\ude15"; // U+29E15 as a UTF-16 surrogate pair
        System.out.println(s.length()); //=>2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); //=>1 (code point)
    }
}
Composed characters
- precomposed character
- combining
=> normalization (to a canonical form)
Java String's composed characters
import java.text.Normalizer;

public class ComposedCharacters {
    public static void main(String[] args) {
        String s = "\u3052"; // precomposed character
        System.out.println(s + ", length=" + s.length());
        s = "\u3051\u3099"; // base character + combining voiced sound mark
        System.out.println(s + ", length=" + s.length());
        s = Normalizer.normalize(s, Normalizer.Form.NFC); // compose into U+3052
        System.out.println(s + ", length=" + s.length());
    }
}
Java String's composed characters(cont.)
Output:
げ, length=1
げ, length=2
げ, length=1
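The practical consequence of the output above is that `equals` compares code units, so the two spellings differ until both are normalized. A sketch; `NormalizedEquals` and `sameText` are hypothetical names, and NFC is chosen only to match the slide.

```java
import java.text.Normalizer;

public class NormalizedEquals {
    // Compare two strings after bringing both to the same normalization form.
    static boolean sameText(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u3052";
        String combining = "\u3051\u3099";

        System.out.println(precomposed.equals(combining));   // false
        System.out.println(sameText(precomposed, combining)); // true
    }
}
```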
Java String's composed characters (cont.)
import java.text.Normalizer;

public class DevanagariNormalization {
    public static void main(String[] args) {
        // devanagari: NFC does not merge these code points, so the length stays 6
        String s = "\u0928\u092e\u0938\u094d\u0924\u0947";
        System.out.println(s + ", length=" + s.length());
        s = Normalizer.normalize(s, Normalizer.Form.NFC);
        System.out.println(s + ", length=" + s.length());
    }
}
Java String's composed characters (cont.)
Output:
नमस्ते, length=6
नमस्ते, length=6
What collation is
sort and equality rules for characters
- In the ASCII era, collation order was close to character code order
- However, we sometimes need a different sort order, such as case-insensitive
- usual order: A,B,a,b
- case-ignore order: A,a,B,b
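Both orders can be reproduced with plain `Arrays.sort`. A sketch with illustrative names; the case-ignore comparator falls back to natural order for ties so that the result does not depend on the input order.

```java
import java.util.Arrays;
import java.util.Comparator;

public class CaseIgnoreSort {
    // Code unit order: all uppercase letters come before lowercase.
    static String[] usualOrder(String[] input) {
        String[] copy = input.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Case-insensitive order, ties broken by code unit order.
    static String[] caseIgnoreOrder(String[] input) {
        String[] copy = input.clone();
        Arrays.sort(copy, String.CASE_INSENSITIVE_ORDER
                .thenComparing(Comparator.<String>naturalOrder()));
        return copy;
    }

    public static void main(String[] args) {
        String[] input = { "b", "A", "a", "B" };
        System.out.println(Arrays.toString(usualOrder(input)));      // [A, B, a, b]
        System.out.println(Arrays.toString(caseIgnoreOrder(input))); // [A, a, B, b]
    }
}
```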
Unicode collation
- Each character has collation weights on several levels (base letter, accent, case)
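In Java, the level of comparison is selected with `Collator.setStrength`. A sketch assuming the default rules for `Locale.US`; the class and helper names are illustrative.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationStrength {
    // Compare two strings at the given collation strength.
    static int compare(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // PRIMARY looks at base letters only: "a" and "A" are equal.
        System.out.println(compare("a", "A", Collator.PRIMARY));
        // TERTIARY also looks at case: "a" and "A" differ.
        System.out.println(compare("a", "A", Collator.TERTIARY));
        // Different base letters differ at every strength.
        System.out.println(compare("a", "b", Collator.PRIMARY));
    }
}
```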
Java String's collation
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class JapaneseCollation {
    public static void main(String[] args) {
        String[] arr = new String[] { "かか", "がが", "カイ", "さ" };
        System.out.println("normal sort:");
        Arrays.sort(arr); // compares UTF-16 code units
        for (String s : arr) {
            System.out.println(s);
        }
        System.out.println("");
        Collator coll = Collator.getInstance(Locale.JAPAN); // locale-aware comparison
        System.out.println("collation sort:");
        Arrays.sort(arr, coll);
        for (String s : arr) {
            System.out.println(s);
        }
    }
}
Output:
normal sort:
かか
がが
さ
カイ
collation sort:
カイ
かか
がが
さ
Summary
- Unicode and Java have resolved most i18n issues
- Still, application programs must take care of some issues themselves
- Please improve your system with proper i18n knowledge