Internationalization

INOUE Seiichiro

Ariel Networks, Inc. CTO

Works Applications Guest Fellow

Who I am

CTO of Ariel Networks, Inc.
Guest Fellow of ATE division at Works Applications
Formerly a developer of Lotus Notes in Boston

Who I am (cont.)

My books and articles:

“Perfect Java”
“Perfect JavaScript”
“P2P textbook”
“Server Side JavaScript”
etc.

Today’s Topic

Internationalization (in short, i18n)

history and concept
character code

Agenda

i18n overview
Languages and characters in computers
History of character code
Unicode & Java String

1. i18n overview

What i18n is

making software handle
country/region/culture specific things

locale

country/region/culture specific things

=> locale

e.g.
On a Unix system:
$ man locale   # show the manual
$ locale -a    # list all locales supported by your system
$ echo $LANG   # show the current (shell) locale

i18n examples (1)

language

input
output

i18n examples (2)

timezone
calendar (e.g. Japanese era, holidays, etc.)
date/time format (e.g. 2013/7/3, Jul/3/2013)

i18n examples (3)

number format (e.g. 100,000.00; symbols for the decimal point and digit grouping)
monetary/currency (e.g. dollar, yen, rupee)
measurement units (e.g. mile, feet)
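
A minimal Java sketch of locale-dependent number formatting, using the standard `java.text.NumberFormat` API. The class and method names (`NumberFormatSketch`, `format`) are made up for illustration.

```java
import java.text.NumberFormat;
import java.util.Locale;

class NumberFormatSketch {
    // Formats a value with exactly two fraction digits,
    // following the conventions of the given locale.
    static String format(double value, Locale locale) {
        NumberFormat nf = NumberFormat.getNumberInstance(locale);
        nf.setMinimumFractionDigits(2);
        nf.setMaximumFractionDigits(2);
        return nf.format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(100000, Locale.US));      // 100,000.00
        System.out.println(format(100000, Locale.GERMANY)); // 100.000,00
    }
}
```

The same value is rendered with swapped grouping and decimal symbols, which is why the format should not be hard-coded.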

i18n examples (4)

name format (e.g. order of family name/given name, middle name, Mr/Ms)
address format (e.g. postal code, the order of address elements)
telephone number format (e.g. number of digits)
icon (e.g. postbox)

i18n evolution

  Traditionally,
    software had a single per-process locale

  Nowadays,
    software needs a per-user / per-content locale
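
The difference can be sketched in Java as follows. This is only an illustration: `PerUserLocaleSketch` and `formatPrice` are hypothetical names, and a real server would obtain the user's locale from a request or a profile.

```java
import java.util.Locale;

class PerUserLocaleSketch {
    // Per-user style: the caller passes the locale of the user (or content)
    // explicitly instead of relying on the process-wide default.
    static String formatPrice(double amount, Locale userLocale) {
        return String.format(userLocale, "%,.2f", amount);
    }

    public static void main(String[] args) {
        // Per-process style: one default locale for the whole JVM.
        System.out.println("process default: " + Locale.getDefault());

        // Per-user style: two users served by the same process.
        System.out.println(formatPrice(1234.5, Locale.US));      // 1,234.50
        System.out.println(formatPrice(1234.5, Locale.GERMANY)); // 1.234,50
    }
}
```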

2. Languages and characters in computers

what we should do

two steps:

define character set
assign number (=code point) to each character
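
As a small illustration of the second step, the Java sketch below prints the number (code point) that Unicode assigns to a few characters. The class and method names are hypothetical; non-ASCII characters are written as `\u` escapes so the source does not depend on file encoding.

```java
class CodePointSketch {
    // Returns the code point of the first character in "U+XXXX" notation.
    static String codePointOf(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        // a, A, hiragana "a", the kanji for "love", and @
        String[] samples = { "a", "A", "\u3042", "\u611b", "@" };
        for (String s : samples) {
            System.out.println(s + " -> " + codePointOf(s));
        }
    }
}
```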

in order to define character set

we should know what characters are

definition of character: the elements of written languages

Question: what is a character?

  a, A, あ, 愛, @, ...

Question: what is a character? (cont.)

resolved only by agreement
we have to think about languages

language examples

Japanese
English
French
Hindi
Marathi
Chinese
…

language/script separation

   Some languages use multiple groups of characters.
     e.g. Japanese uses 'hiragana', 'katakana', 'kanji', 'numbers', 'mathematical symbols', etc.

   Some characters are used by multiple languages.
     e.g. Many languages use the 'latin alphabet', 'numbers', etc.

    'a group of characters'
       => script

script examples

latin alphabet
hiragana
katakana
kanji
hangul
devanagari
…
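
Java exposes the script of a code point through `Character.UnicodeScript` (available since Java 7). A minimal sketch, with a hypothetical helper `scriptOf`:

```java
import java.lang.Character.UnicodeScript;

class ScriptSketch {
    // Returns the Unicode script of the first code point of the string.
    static UnicodeScript scriptOf(String s) {
        return UnicodeScript.of(s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(scriptOf("a"));      // LATIN
        System.out.println(scriptOf("\u3042")); // HIRAGANA
        System.out.println(scriptOf("\u30ab")); // KATAKANA
        System.out.println(scriptOf("\u611b")); // HAN (kanji)
        System.out.println(scriptOf("\u0928")); // DEVANAGARI
        System.out.println(scriptOf("1"));      // COMMON (shared by many scripts)
    }
}
```

Digits fall under `COMMON` because they are not tied to a single script.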

language/script mapping

            M   N
   language <---> script

   cf.
   http://www.unicode.org/cldr/charts/supplemental/languages_and_scripts.html

character set

  define a character set from scripts

                 M   N
   character set <---> script

character set examples

ASCII (latin alphabet, some symbols)
JIS 0208 (latin alphabet, hiragana, katakana, kanji, etc.)
ISCII (devanagari, etc)
…

Unicode (aims to cover all scripts in the world)

Where should languages be taken care of?

   According to the Unicode Standard,
     it is the application's responsibility

    e.g. HTML's lang attribute
         <html lang="ja">

3. History of character code

brief history of character set

1963 ASCII
1967 ISO/IEC 646
1969 JIS 0201
1973 ISO 2022
1978 JIS 0208
1982 CP932(MS-DOS)
1985 EUC-JP
1987 ISO 8859

brief history of character set (cont.)

1988 Unicode 88
1991 ISCII
1991 Unicode v1.0 (Volume 2 in 1992)
1993 ISO 10646
…(Unicode version up)
2012 Unicode v6.2

Terms

ISO: International Organization for Standardization
IEC: International Electrotechnical Commission
JIS: Japanese Industrial Standards
ASCII: American Standard Code for Information Interchange
ISCII: Indian Script Code for Information Interchange

European languages history

 ASCII (ISO 646): English alphabet
 ISO 8859: various European alphabets

   => Both are 1-byte (octet) character codes
   => Precisely, ASCII is a 7-bit code; ISO 8859 adds characters in the upper half of the 8-bit range
   => easy to use them together
      (distinguish them by the highest-order bit)
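
A minimal Java sketch of the highest-order-bit check. `HighBitSketch` and `isUpperHalf` are hypothetical names; ISO 8859-1 is used as one concrete member of the ISO 8859 family.

```java
import java.nio.charset.StandardCharsets;

class HighBitSketch {
    // True when the byte has its highest-order bit set, i.e. it lies in the
    // upper half (0x80-0xFF) that ISO 8859 adds on top of 7-bit ASCII.
    static boolean isUpperHalf(byte b) {
        return (b & 0x80) != 0;
    }

    public static void main(String[] args) {
        // "caf" in ASCII followed by 0xE9 (e with acute accent in ISO 8859-1)
        byte[] bytes = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        for (byte b : bytes) {
            System.out.printf("0x%02x upper half: %b%n", b, isUpperHalf(b));
        }
        String text = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(text.equals("caf\u00e9")); // true
    }
}
```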

Japanese history (1)

 JIS 0201: half-width katakana (a.k.a. hankaku kana)
   - 1-byte char
   - similar to ISO 8859 (in the 8-bit form, katakana sits in the upper half)
   - easy to use ASCII and JIS 0201 together

Japanese history (2)

 JIS 0208: hiragana, katakana, kanji (a.k.a. zenkaku (full-width))
   - 2-byte char
   - easy to use ASCII and JIS 0208 together
     (distinguish them by the highest-order bit)
   - not easy to use JIS 0208 and JIS 0201 together
   - ISO 8859 has the same issue
      => ISO 2022

ISO 2022

  How can we use JIS 0208 (kanji) with ISO 8859 (European scripts)?
   => Switch between character sets with control characters (escape sequences)
      (in some cases, the escape sequences can be omitted)

  e.g.

  abcああabc  -->   0x61 0x62 0x63          (abc)
                      0x1b 0x24 0x42          (ESC $ B: switch to JIS 0208)
                      0x24 0x22 0x24 0x22     (ああ)
                      0x1b 0x28 0x42          (ESC ( B: switch back to ASCII)
                      0x61 0x62 0x63          (abc)
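
The byte sequence above can be checked from Java, assuming the runtime ships the optional `ISO-2022-JP` charset (most full JDKs do, but it is not guaranteed). `Iso2022Sketch` and `hexBytes` are hypothetical names.

```java
import java.nio.charset.Charset;

class Iso2022Sketch {
    // Encodes text with the named charset and returns the bytes as space-separated hex.
    static String hexBytes(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ISO-2022-JP is an optional charset, so check before using it.
        if (Charset.isSupported("ISO-2022-JP")) {
            // "abc", hiragana "a" twice, then "abc"
            System.out.println(hexBytes("abc\u3042\u3042abc", "ISO-2022-JP"));
        } else {
            System.out.println("ISO-2022-JP is not available in this runtime");
        }
    }
}
```

When the charset is present, the printed bytes should include the two escape sequences around the 2-byte characters.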

Japanese history (3)

  Legacy encodings
   - EUC-JP (ISO 2022 compatible)
   - Shift-JIS (Microsoft code)
   - ISO 2022 JP (a.k.a. JIS code)

   => being replaced by Unicode

4. Unicode & Java String

Unicode

The Unicode Consortium and The Unicode Standard

http://www.unicode.org/

character set
encoding scheme
collation rules
various algorithms (e.g. BiDi)

Unicode brief history

Unicode 88 … 16-bit code
Original ISO 10646 … 32-bit code
ISO 10646 adopted Unicode … superset of Unicode, 32-bit code
ISO 10646/Unicode grows … 21-bit code space (up to U+10FFFF)

ISO/IEC 10646

 UCS (Universal Coded Character Set)

 UCS-2: code range: U+0000 ... U+FFFF
 UCS-4: code range: U+000000 ... U+10FFFF

  => UCS-2 == BMP (Basic Multilingual Plane) of UCS-4
  => UCS-2 is deprecated, so UCS now effectively means UCS-4

Encoding scheme

UTF(Unicode/UCS Transformation Format)

UTF-8
UTF-16
UTF-32
(UTF-7)

UTF-8

8-bit variable-width encoding (a.k.a. multi-byte char)
backward compatible with ASCII
becoming the de facto standard on the Internet

UTF-8 (cont.)

UCS Code point                   UTF-8
00000000 0aaaaaaa           -->  0aaaaaaa
00000aaa aabbbbbb           -->  110aaaaa 10bbbbbb
aaaabbbb bbcccccc           -->  1110aaaa 10bbbbbb 10cccccc
000aaabb bbbbcccc ccdddddd  -->  11110aaa 10bbbbbb 10cccccc 10dddddd

UTF-8 conversion example

UTF-8 conversion example (cont.)

  あ
  U+3042
  00110000 01000010
  aaaabbbb bbcccccc
  
  == UTF-8 conversion ==

  11100011 10000001 10000010
      aaaa   bbbbbb   cccccc
  0xe3     0x81     0x82
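
The conversion can also be written out by hand. The Java sketch below follows the UTF-8 bit layout directly and cross-checks the result against the standard library; `Utf8Sketch` and `encode` are hypothetical names, and input validation is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encodes one code point into UTF-8 by following the bit layout.
    // Surrogate code points (U+D800-U+DFFF) are not rejected in this sketch.
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x3042);
        byte[] library = "\u3042".getBytes(StandardCharsets.UTF_8);
        for (byte b : manual) {
            System.out.printf("0x%02x ", b); // 0xe3 0x81 0x82
        }
        System.out.println();
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```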

UTF-16

16-bit variable-width encoding
Most characters are encoded as a single 16-bit unit (the same as BMP/UCS-2)
Uses surrogate pairs for characters outside the BMP
Sometimes uses a BOM (Byte Order Mark)
Internal encoding scheme of Java strings
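
A short Java sketch showing when a BOM appears. The helper `hex` and the class name are hypothetical; the charsets are the standard ones from `java.nio.charset.StandardCharsets`.

```java
import java.nio.charset.StandardCharsets;

class BomSketch {
    // Renders bytes as lowercase hex without separators.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Java's "UTF-16" encoder writes a big-endian BOM (FE FF) first.
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_16)));   // feff0061
        // The explicit-endian variants write no BOM.
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_16BE))); // 0061
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_16LE))); // 6100
    }
}
```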

UTF-16 Surrogate pair

(1) U+0000 - U+ffff
    (except U+d800-U+dfff)

    => 16-bit encoding

(2) U+10000 - U+10ffff

  000a aaaa xxxx xxyy yyyy yyyy  -> 1101 10ww wwxx xxxx  1101 11yy yyyy yyyy
                           21bit -> 32bit

  (wwww = aaaaa - 1)

UTF-16 Surrogate pair(cont.)

  Reserved area for surrogate pair (= a kind of control code)
    U+d800 - U+dbff : 11011000 00000000 - 11011011 11111111
      and
    U+dc00 - U+dfff : 11011100 00000000 - 11011111 11111111

UTF-16 conversion example

UTF-16 conversion example (cont.)

  U+29e15
 00000010 10011110 00010101
    aaaaa xxxxxxyy yyyyyyyy

 == UTF-16 conversion ==

 11011000 01100111 11011110 00010101
       ww wwxxxxxx       yy yyyyyyyy
 U+d867 U+de15
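
The same conversion as a Java sketch. Subtracting 0x10000 is equivalent to the `wwww = aaaaa - 1` step, because `aaaaa` occupies bits 16 to 20. `SurrogateSketch` and `toSurrogates` are hypothetical names; the result is cross-checked against `Character.toChars`.

```java
import java.util.Arrays;

class SurrogateSketch {
    // Splits a supplementary code point (U+10000-U+10FFFF) into a surrogate pair.
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = cp - 0x10000;                     // 20 bits: wwww xxxxxx yyyyyyyyyy
        char high = (char) (0xD800 | (v >> 10));  // 1101 10ww wwxx xxxx
        char low = (char) (0xDC00 | (v & 0x3FF)); // 1101 11yy yyyy yyyy
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x29E15);
        System.out.printf("%04x %04x%n", (int) pair[0], (int) pair[1]); // d867 de15
        // Cross-check against the standard library.
        System.out.println(Arrays.equals(pair, Character.toChars(0x29E15))); // true
    }
}
```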

UTF-32

32-bit fixed-width encoding
Caution: because of composed characters, one user-perceived character can still take more than one 32-bit unit
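
A minimal Java sketch of that caution, assuming the JDK's `java.text.BreakIterator` as the way to count user-perceived characters. `GraphemeSketch` and `countGraphemes` are hypothetical names.

```java
import java.text.BreakIterator;

class GraphemeSketch {
    // Counts user-perceived characters with the JDK's character BreakIterator.
    static int countGraphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "\u3051\u3099"; // hiragana "ke" + combining voiced sound mark
        System.out.println(s.codePointCount(0, s.length())); // 2 code points, so 2 UTF-32 units
        System.out.println(countGraphemes(s));               // 1 character on screen
    }
}
```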

fyi, Unicode character database

http://www.unicode.org/charts/
http://www.unicode.org/charts/charindex.html

Java String

The internal encoding is UTF-16
Not UCS-2 (BMP only)
(Up to Java 1.4, it was UCS-2; supplementary characters are supported since Java 5)
Easy to use as a 16-bit fixed-width encoding, as long as you ignore surrogate pairs and composed characters

Java String’s surrogate pair

public class SurrogatePairExample {
    public static void main(String[] args) {
        String s = "\ud867\ude15";  // UTF-16 surrogate pair for U+29E15
        System.out.println(s.length());  // => 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));  // => 1 (code point)
    }
}
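
To walk a string character by character without splitting a surrogate pair, one option is `String.codePoints()` (Java 8 or later). A sketch with hypothetical names:

```java
class CodePointIterationSketch {
    // Collects each code point of the string as "U+XXXX", joined by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\ud867\ude15b"; // 'a', U+29E15 as a surrogate pair, 'b'
        System.out.println(s.length());                             // 4 UTF-16 code units
        System.out.println(describe(s));                            // U+0061 U+29E15 U+0062
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
    }
}
```

Indexing with `charAt` sees the two halves of the pair separately, while the code point view sees three characters.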

Composed characters

precomposed character
combining character sequence

=> normalization (to a canonical form)

Java String’s composed characters

import java.text.Normalizer;

public class ComposedCharacterExample {
    public static void main(String[] args) {
        String s = "\u3052";  // precomposed character
        System.out.println(s + ", length=" + s.length());

        s = "\u3051\u3099";  // base character + combining mark
        System.out.println(s + ", length=" + s.length());

        s = Normalizer.normalize(s, Normalizer.Form.NFC);  // compose into one code point
        System.out.println(s + ", length=" + s.length());
    }
}

Java String’s composed characters(cont.)

Output:
げ, length=1
げ, length=2
げ, length=1
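
One practical consequence: two strings that look identical may not be `equals` until both are normalized. A minimal sketch, where `canonicallyEqual` is a hypothetical helper and NFC is one reasonable choice of form:

```java
import java.text.Normalizer;

class NormalizationSketch {
    // Compares two strings after normalizing both to NFC.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u3052";     // hiragana "ge" as one code point
        String combining = "\u3051\u3099"; // "ke" + combining voiced sound mark
        System.out.println(precomposed.equals(combining));            // false
        System.out.println(canonicallyEqual(precomposed, combining)); // true
        // NFD goes the other way and decomposes the precomposed form.
        System.out.println(Normalizer.normalize(precomposed, Normalizer.Form.NFD).length()); // 2
    }
}
```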

Java String’s composed characters (cont.)

import java.text.Normalizer;

public class DevanagariExample {
    public static void main(String[] args) {
        // devanagari
        String s = "\u0928\u092e\u0938\u094d\u0924\u0947";
        System.out.println(s + ", length=" + s.length());

        // NFC does not shorten this string
        s = Normalizer.normalize(s, Normalizer.Form.NFC);
        System.out.println(s + ", length=" + s.length());
    }
}

Java String’s composed characters (cont.)

Output:
नमस्ते, length=6
नमस्ते, length=6

What collation is

rules for sorting and comparing characters

In the ASCII era, collation order was almost the same as character code order
However, we sometimes need a different sort order, such as case-insensitive order

usual order: A,B,a,b
case-ignore order: A,a,B,b
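
Both orders can be reproduced in Java with the built-in comparators. `CaseIgnoreSortSketch` and `sorted` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Comparator;

class CaseIgnoreSortSketch {
    // Returns a sorted copy joined by commas; a null comparator means natural order.
    static String sorted(String[] input, Comparator<String> order) {
        String[] copy = input.clone();
        Arrays.sort(copy, order);
        return String.join(",", copy);
    }

    public static void main(String[] args) {
        String[] letters = { "B", "b", "A", "a" };
        // Natural order compares UTF-16 code units, so uppercase comes first.
        System.out.println(sorted(letters, null)); // A,B,a,b
        // The sort is stable, so "A" stays ahead of "a" as in the input.
        System.out.println(sorted(letters, String.CASE_INSENSITIVE_ORDER)); // A,a,B,b
    }
}
```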

Unicode collation

Each character has a collation value
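
In Java, these values are compared level by level, and `java.text.Collator` lets the caller choose how many levels count through its strength setting. A minimal sketch with a hypothetical `compare` helper:

```java
import java.text.Collator;
import java.util.Locale;

class CollatorStrengthSketch {
    // Compares two strings with a collator of the given strength for the given locale.
    static int compare(String a, String b, Locale locale, int strength) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // PRIMARY ignores case differences; TERTIARY distinguishes them.
        System.out.println(compare("a", "A", Locale.US, Collator.PRIMARY) == 0);  // true
        System.out.println(compare("a", "A", Locale.US, Collator.TERTIARY) != 0); // true
        System.out.println(compare("a", "b", Locale.US, Collator.PRIMARY) < 0);   // true
    }
}
```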

Java String’s collation

import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationExample {
    public static void main(String[] args) {
        String[] arr = { "かか", "がが", "カイ", "さ" };

        // Natural order: compares UTF-16 code units
        System.out.println("normal sort:");
        Arrays.sort(arr);
        for (String s : arr) {
            System.out.println(s);
        }
        System.out.println("");

        // Collation order: follows the rules for the Japanese locale
        Collator coll = Collator.getInstance(Locale.JAPAN);
        System.out.println("collation sort:");
        Arrays.sort(arr, coll);
        for (String s : arr) {
            System.out.println(s);
        }
    }
}

Output:
normal sort:
かか
がが
さ
カイ

collation sort:
カイ
かか
がが
さ

Summary

Unicode and Java have resolved most i18n issues
Still, application programs must take care of some issues themselves
Please improve your systems with proper i18n knowledge
