Internationalization

INOUE Seiichiro

Ariel Networks, Inc. CTO

Works Applications Guest Fellow

Who I am


Who I am (cont.)


Today's Topic


Agenda

  1. i18n overview
  2. languages and characters in computers
  3. History of character code
  4. Unicode & Java String

1. i18n overview


What i18n is

making software handle country/region/culture specific things


locale

country/region/culture specific things

=> locale

e.g.
On a Unix system
$ man locale   # show the manual
$ locale -a    # show all locales supported by your system
$ echo $LANG   # show the current (shell) locale
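
A rough Java counterpart of the shell commands above, as a minimal sketch. The class name LocaleSketch and the helper parse are made up for illustration; Locale.getDefault, Locale.getAvailableLocales and Locale.forLanguageTag are standard java.util.Locale methods.

```java
import java.util.Locale;

class LocaleSketch {
    // Parses a BCP 47 language tag such as "ja-JP" into a Locale.
    static Locale parse(String tag) {
        return Locale.forLanguageTag(tag);
    }

    public static void main(String[] args) {
        // Default locale of this JVM process (normally derived from the OS settings)
        System.out.println(Locale.getDefault());
        // Number of locales this Java runtime knows about (similar to `locale -a`)
        System.out.println(Locale.getAvailableLocales().length);
        // A locale is made of a language part and an optional country/region part
        Locale ja = parse("ja-JP");
        System.out.println(ja.getLanguage() + " / " + ja.getCountry());
    }
}
```

The first two lines of output depend on the machine, so no expected output is given for them.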

i18n examples (1)


i18n examples (2)


i18n examples (3)


i18n examples (4)


i18n evolution

  Traditionally, 
    let software have per-process locale

  Nowadays,
    let software have per-user / per-content locale
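
One way to read "per-user locale" in Java terms is to pass the locale explicitly to each formatting call instead of relying on the process-wide default. The sketch below assumes the user's locale is already known (for example from a profile or a request header); PerUserLocaleSketch and format are illustrative names only.

```java
import java.text.NumberFormat;
import java.util.Locale;

class PerUserLocaleSketch {
    // Formats a number for one user's locale instead of the process-wide default.
    static String format(double value, Locale userLocale) {
        return NumberFormat.getNumberInstance(userLocale).format(value);
    }

    public static void main(String[] args) {
        // The same value rendered for two users inside one process
        System.out.println(format(1234.5, Locale.US));       // 1,234.5
        System.out.println(format(1234.5, Locale.GERMANY));  // 1.234,5
    }
}
```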

2. languages and characters in computers


what we should do

two steps:

  1. define character set
  2. assign number (=code point) to each character
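
To make "assign a number to each character" concrete, the sketch below prints the Unicode code points of a, あ and 愛. The characters are written as \u escapes so that the source stays ASCII-only; CodePointSketch and codePointOf are names invented for this example.

```java
class CodePointSketch {
    // Returns the code point of the first character of s in "U+XXXX" notation.
    static String codePointOf(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(codePointOf("a"));       // U+0061 (Latin small letter a)
        System.out.println(codePointOf("\u3042"));  // U+3042 (hiragana letter a)
        System.out.println(codePointOf("\u611b"));  // U+611B (kanji "love")
    }
}
```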

in order to define character set

we should know what characters are

definition of character
the elements of written languages

Question: what is a character?

  a, A, あ, 愛, @, ...

Question: what is a character? (cont.)


Question: what is a character? (cont.)


language examples


language/script separation

   some languages use multiple groups of characters.
     e.g. Japanese uses 'hiragana', 'katakana', 'kanji', 'numbers', 'mathematical symbols', etc.

   some characters are used by multiple languages.
     e.g. Many languages use the 'Latin alphabet', 'numbers', etc.

    'a group of characters' 
       => script
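
Java exposes the Unicode script property through Character.UnicodeScript (Java 7 and later), which gives a quick way to see the language/script separation. In this sketch ScriptSketch and scriptOf are illustrative names; characters shared by many scripts, such as digits, are reported as COMMON.

```java
class ScriptSketch {
    // Looks up the Unicode script that a code point belongs to.
    static Character.UnicodeScript scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(scriptOf('\u3042'));  // HIRAGANA
        System.out.println(scriptOf('\u30ab'));  // KATAKANA
        System.out.println(scriptOf('\u611b'));  // HAN (kanji)
        System.out.println(scriptOf('a'));       // LATIN
        System.out.println(scriptOf('1'));       // COMMON (shared across scripts)
    }
}
```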

script examples


language/script mapping

            M   N
   language <---> script

   cf.
   http://www.unicode.org/cldr/charts/supplemental/languages_and_scripts.html

character set

  define a character set from scripts

                 M   N
   character set <---> script

character set examples


Unicode (tries to cover all scripts in the world)


Where should languages be taken care of?

   According to the Unicode Standard,
     it is the application's responsibility

    e.g. HTML's lang attribute
         <html lang="ja">

3. History of character code


brief history of character set


brief history of character set (cont.)


Terms


European languages history

 ASCII (ISO 646): English alphabet
 ISO 8859: various European alphabets

   => Both are 1-byte (octet) chars
   => Precisely, ASCII is a 7-bit code; ISO 8859 adds 8-bit chars (high bit set)
   => easy to use them simultaneously
      (distinguished by the highest-order bit)
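
The high-bit rule can be checked with a short sketch that encodes "café" as ISO 8859-1 and tests each byte. HighBitSketch and highBitSet are illustrative names; StandardCharsets.ISO_8859_1 is part of the standard library.

```java
import java.nio.charset.StandardCharsets;

class HighBitSketch {
    // True if the highest-order bit is set, i.e. the byte is outside 7-bit ASCII.
    static boolean highBitSet(byte b) {
        return (b & 0x80) != 0;
    }

    public static void main(String[] args) {
        // "caf" + U+00E9 (e with acute accent)
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        for (byte b : bytes) {
            System.out.println(String.format("0x%02x high bit: %b", b & 0xff, highBitSet(b)));
        }
    }
}
```

The three ASCII letters keep the high bit clear, while the accented letter is encoded as 0xe9 with the high bit set.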

Japanese history (1)

 JIS X 0201: half-width katakana (a.k.a. hankaku kana)
   - 1-byte char (katakana at 0xA1-0xDF in the 8-bit form)
   - similar to ISO 8859
   - easy to use ASCII and JIS X 0201 simultaneously

Japanese history (2)

 JIS X 0208: Hiragana, Katakana, Kanji (a.k.a. zenkaku (full-width))
   - 2-byte char
   - easy to use ASCII and JIS X 0208 simultaneously
     (distinguished by the highest-order bit)
   - not easy to use JIS X 0208 and JIS X 0201 simultaneously
   - ISO 8859 has the same issue
      => ISO 2022

ISO 2022

  How can we use JIS X 0208 (Kanji) together with ISO 8859 (European scripts)?
   => Switch between character sets with control characters (escape sequences)
      (in some cases, the control characters can be omitted)
  
  e.g.

  abcああabc  -->   0x61 0x62 0x63 
                      0x1b 0x24 0x42 
                      0x24 0x22 0x24 0x22 
                      0x1b 0x28 0x42 
                      0x61 0x62 0x63
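
The byte sequence above can be reproduced in Java, assuming the runtime ships the ISO-2022-JP charset (it belongs to the JDK's extended charsets and may be missing from trimmed-down runtimes). Iso2022Sketch, hex and encode are names invented for this sketch.

```java
import java.nio.charset.Charset;

class Iso2022Sketch {
    // Renders bytes as "0x61 0x62 ..." for easy comparison with the slide.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encodes s as ISO-2022-JP; throws if the charset is not available.
    static String encode(String s) {
        return hex(s.getBytes(Charset.forName("ISO-2022-JP")));
    }

    public static void main(String[] args) {
        // "abc" + hiragana a (twice) + "abc"
        System.out.println(encode("abc\u3042\u3042abc"));
    }
}
```

In the output, 0x1b 0x24 0x42 (ESC $ B) switches to JIS X 0208 and 0x1b 0x28 0x42 (ESC ( B) switches back to ASCII.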

Japanese history (3)

  Legacy encodings
   - EUC-JP (ISO 2022 compatible)
   - Shift-JIS (Microsoft code)
   - ISO-2022-JP (a.k.a. JIS code)

   => being replaced by Unicode
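
To see how the legacy encodings differ, the sketch below encodes the single character あ (JIS X 0208 code 0x2422) in EUC-JP, Shift_JIS and UTF-8. It assumes the runtime provides the "EUC-JP" and "Shift_JIS" charsets; LegacyEncodingSketch and hexOf are illustrative names.

```java
import java.nio.charset.Charset;

class LegacyEncodingSketch {
    // Encodes s in the named charset and returns the bytes as lower-case hex.
    static String hexOf(String s, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName(charsetName))) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "\u3042";  // hiragana letter a
        System.out.println("EUC-JP:    " + hexOf(a, "EUC-JP"));     // a4a2
        System.out.println("Shift_JIS: " + hexOf(a, "Shift_JIS"));  // 82a0
        System.out.println("UTF-8:     " + hexOf(a, "UTF-8"));      // e38182
    }
}
```

EUC-JP is the JIS X 0208 code with the high bit set on both bytes (0x24 | 0x80 = 0xa4, 0x22 | 0x80 = 0xa2), whereas Shift_JIS rearranges the code through its own arithmetic.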

4. Unicode & Java String


Unicode

The Unicode Consortium and The Unicode Standard

http://www.unicode.org/


Unicode brief history

  1. Unicode 88 ... 16-bit code
  2. Original ISO 10646 ... 32-bit code
  3. ISO 10646 accepted Unicode ... superset of Unicode, 32-bit code
  4. ISO 10646/Unicode grows ... 21-bit code

ISO/IEC 10646

 UCS(Universal Coded Character Set)

 UCS-2: code range: U+0000 ... U+FFFF
 UCS-4: code range: U+000000 ... U+10FFFF

  => UCS-2 == BMP (Basic Multilingual Plane) of UCS-4
  => UCS-2 is deprecated, so UCS now effectively means UCS-4
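
The code ranges map onto a few helpers in java.lang.Character. The sketch below is only an illustration: PlaneSketch and planeOf are invented names, and the plane number is simply the code point divided by 0x10000.

```java
class PlaneSketch {
    // Plane number of a code point; plane 0 is the BMP.
    static int planeOf(int codePoint) {
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(Character.isBmpCodePoint(0x3042));             // true
        System.out.println(Character.isSupplementaryCodePoint(0x29E15));  // true
        System.out.println(planeOf(0x29E15));                             // 2
        // The upper limit of the code space is U+10FFFF
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT));  // 10ffff
        System.out.println(Character.isValidCodePoint(0x110000));           // false
    }
}
```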

Encoding scheme

UTF(Unicode/UCS Transformation Format)


UTF-8


UTF-8 (cont.)

UCS Code point                   UTF-8
00000000 0aaaaaaa           -->  0aaaaaaa
00000aaa aabbbbbb           -->  110aaaaa 10bbbbbb
aaaabbbb bbcccccc           -->  1110aaaa 10bbbbbb 10cccccc
000aaabb bbbbcccc ccdddddd  -->  11110aaa 10bbbbbb 10cccccc 10dddddd
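
The four rows of the table translate almost directly into shifts and masks. The following is a minimal sketch of an encoder for a single code point, checked against the JDK's own UTF-8 encoder; Utf8Sketch and encode are illustrative names, and the sketch assumes a valid, non-surrogate code point and performs no validation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encodes one code point by following the bit layouts in the table.
    static byte[] encode(int cp) {
        if (cp < 0x80) {            // 0aaaaaaa
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {    // 110aaaaa 10bbbbbb
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {  // 1110aaaa 10bbbbbb 10cccccc
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                    // 11110aaa 10bbbbbb 10cccccc 10dddddd
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        int[] samples = { 0x61, 0xE9, 0x3042, 0x29E15 };
        for (int cp : samples) {
            byte[] builtIn = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.println(Integer.toHexString(cp) + " matches built-in encoder: "
                    + Arrays.equals(encode(cp), builtIn));
        }
    }
}
```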

UTF-8 conversion example


UTF-8 conversion example (cont.)

  あ
  U+3042
  00110000 01000010
  aaaabbbb bbcccccc
  
  == UTF-8 conversion ==

  11100011 10000001 10000010
      aaaa   bbbbbb   cccccc
  0xe3     0x81     0x82

UTF-16


UTF-16 Surrogate pair

(1) U+0000 - U+ffff
    (except U+d800-U+dfff)

    => 16bit encoding

(2) U+10000 - U+10ffff

  000a aaaa xxxx xxyy yyyy yyyy  -> 1101 10ww wwxx xxxx  1101 11yy yyyy yyyy
                           21bit -> 32bit

  (wwww = aaaaa - 1)
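
The bit layout above can be sketched in a few lines. Subtracting 0x10000 from the code point is the same as the "aaaaa - 1" step, because aaaaa occupies bits 16 and up. SurrogateSketch and toSurrogates are illustrative names; the result is compared with the JDK's Character.toChars, and the input is assumed to lie in U+10000..U+10FFFF.

```java
class SurrogateSketch {
    // Splits a supplementary code point (U+10000..U+10FFFF) into a surrogate pair.
    static char[] toSurrogates(int cp) {
        int v = cp - 0x10000;                       // 20 bits remain: wwww xxxxxx yyyyyyyyyy
        char high = (char) (0xD800 | (v >> 10));    // 1101 10ww wwxx xxxx
        char low  = (char) (0xDC00 | (v & 0x3FF));  // 1101 11yy yyyy yyyy
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x29E15);
        System.out.println(Integer.toHexString(pair[0]) + " " + Integer.toHexString(pair[1]));  // d867 de15
        // Cross-check against the standard library
        System.out.println(java.util.Arrays.equals(pair, Character.toChars(0x29E15)));  // true
    }
}
```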

UTF-16 Surrogate pair(cont.)

  Reserved area for surrogate pair (= a kind of control code)
    U+d800 - U+dbff : 11011000 00000000 - 11011011 11111111
      and
    U+dc00 - U+dfff : 11011100 00000000 - 11011111 11111111
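
Because the two reserved ranges do not overlap, a decoder can tell from a single code unit whether it is the first or second half of a pair. The sketch below reverses the encoding; SurrogateDecodeSketch and decode are illustrative names, while Character.isHighSurrogate, Character.isLowSurrogate and Character.toCodePoint are standard methods.

```java
class SurrogateDecodeSketch {
    // Combines a high (U+D800-U+DBFF) and low (U+DC00-U+DFFF) surrogate into one code point.
    static int decode(char high, char low) {
        if (!Character.isHighSurrogate(high) || !Character.isLowSurrogate(low)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(decode('\ud867', '\ude15')));  // 29e15
        // Same result as the standard library helper
        System.out.println(decode('\ud867', '\ude15') == Character.toCodePoint('\ud867', '\ude15'));  // true
    }
}
```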

UTF-16 conversion example


UTF-16 conversion example (cont.)

 𩸕
 U+29e15
 00000010 10011110 00010101
    aaaaa xxxxxxyy yyyyyyyy

 == UTF-16 conversion ==

 11011000 01100111 11011110 00010101
       ww wwxxxxxx       yy yyyyyyyy
 U+d867 U+de15

UTF-32


fyi, Unicode character database


Java String


Java String's surrogate pair

public static void main(String[] args) {
    String s = "\ud867\ude15";  // U+29E15 written as a UTF-16 surrogate pair
    System.out.println(s.length());  //=>2 (UTF-16 code units)
    System.out.println(s.codePointCount(0, s.length()));  //=>1 (code points)
}
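
Since length() counts UTF-16 code units, a loop that indexes with charAt can split a surrogate pair in half. One common pattern, sketched below, is to step through the string by code point with codePointAt and Character.charCount; CodePointIterationSketch and codePoints are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointIterationSketch {
    // Collects the code points of s, advancing by 1 or 2 chars as needed.
    static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp);  // 2 for a surrogate pair, otherwise 1
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a', U+29E15 (as a surrogate pair), 'b'
        for (int cp : codePoints("a\ud867\ude15b")) {
            System.out.println(String.format("U+%04X", cp));
        }
    }
}
```

For the sample string this yields three code points (U+0061, U+29E15, U+0062) even though length() is 4.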

Composed characters

=> normalization (to canonical format)


Java String's composed characters

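public static void main(String[] args) {
    String s = "\u3052";  // precomposed: HIRAGANA LETTER GE
    System.out.println(s + ", length=" + s.length());

    s = "\u3051\u3099";  // decomposed: HIRAGANA LETTER KE + combining voiced sound mark
    System.out.println(s + ", length=" + s.length());

    // NFC composes the pair back into the single code point U+3052
    s = java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFC);
    System.out.println(s + ", length=" + s.length());
}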

Java String's composed characters(cont.)

Output:
げ, length=1
げ, length=2
げ, length=1
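
A practical consequence of the output above is that the two spellings of げ look identical but are not equal as Java strings. A minimal sketch of a normalization-aware comparison follows; NormalizedEqualsSketch and equalsNfc are illustrative names, and NFC is just one possible choice of canonical form.

```java
import java.text.Normalizer;

class NormalizedEqualsSketch {
    // Compares two strings after bringing both to NFC.
    static boolean equalsNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u3052";          // single code point
        String decomposed = "\u3051\u3099";  // base letter + combining mark
        System.out.println(composed.equals(decomposed));      // false
        System.out.println(equalsNfc(composed, decomposed));  // true
        // NFD goes the other way and splits the precomposed character
        System.out.println(Normalizer.normalize(composed, Normalizer.Form.NFD).length());  // 2
    }
}
```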

Java String's composed characters (cont.)


Java String's composed characters (cont.)


Java String's composed characters (cont.)

public static void main(String[] args) {
    // devanagari
    String s = "\u0928\u092e\u0938\u094d\u0924\u0947";
    System.out.println(s + ", length=" + s.length());

    s = java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFC);
    System.out.println(s + ", length=" + s.length());
}

Java String's composed characters (cont.)

Output:
नमस्ते, length=6
नमस्ते, length=6
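
Normalization does not shorten the Devanagari string because no precomposed forms exist for these sequences. To count what a reader would perceive as characters, one option is java.text.BreakIterator, sketched below. GraphemeSketch and countClusters are illustrative names, and the count reported for Devanagari text depends on the JDK version's segmentation rules, so no expected value is given for it.

```java
import java.text.BreakIterator;
import java.util.Locale;

class GraphemeSketch {
    // Counts the segments between character boundaries reported by BreakIterator.
    static int countClusters(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Base letter + combining mark is treated as one unit
        System.out.println(countClusters("\u3051\u3099"));
        // The Devanagari sample from the slide; the result varies by JDK version
        System.out.println(countClusters("\u0928\u092e\u0938\u094d\u0924\u0947"));
    }
}
```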

What collation is

rules for sorting strings and for testing them for equality


Unicode collation


Java String's collation

import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public static void main(String[] args) throws Exception {
    String[] arr = new String[] { "かか", "がが", "カイ", "さ" };
    // natural String order: compares UTF-16 code units
    System.out.println("normal sort:");
    Arrays.sort(arr);
    for (String s : arr) {
        System.out.println(s);
    }
    System.out.println("");

    // locale-aware order: a Collator is a Comparator
    Collator coll = Collator.getInstance(Locale.JAPAN);
    System.out.println("collation sort:");
    Arrays.sort(arr, coll);
    for (String s : arr) {
        System.out.println(s);
    }
}

Output:
normal sort:
かか
がが
さ
カイ

collation sort:
カイ
かか
がが
さ
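
Collation also covers equality, not only ordering. In Java this is controlled by the collator's strength, as the following sketch suggests; CollatorStrengthSketch and sameAt are illustrative names, and Locale.US is an arbitrary choice for the example.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorStrengthSketch {
    // True if a and b compare as equal at the given collation strength.
    static boolean sameAt(int strength, String a, String b) {
        Collator coll = Collator.getInstance(Locale.US);
        coll.setStrength(strength);
        return coll.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // PRIMARY strength ignores case differences
        System.out.println(sameAt(Collator.PRIMARY, "abc", "ABC"));   // true
        // TERTIARY strength distinguishes case
        System.out.println(sameAt(Collator.TERTIARY, "abc", "ABC"));  // false
        // Different base letters differ at every strength
        System.out.println(sameAt(Collator.PRIMARY, "abc", "abd"));   // false
    }
}
```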

Summary
