

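A year ago I wrote that I was taking my blogging elsewhere for a while. One year on, I have decided to come back to ありえるえりあ.

In the end I did not write many posts on the other blog, and I feel I have somewhat lost the habit of blogging.

The other day I gave a presentation to Indian students, so I am publishing the slides here. The title is "Internationalization" (internationalization programming), but someone who attended remarked sarcastically, "You only talked about character encodings, didn't you?" That is exactly right. With my level of English, I could not cover anything difficult. Keeping explanations simple is an important factor when developing together with people from other countries.

Extract inoue-i18n.tar.gz and open the i18n.html file in a Web browser.

Below is the text portion of the slides.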



Internationalization

INOUE Seiichiro

Ariel Networks, Inc. CTO

Works Applications Guest Fellow

Who I am

  • CTO of Ariel Networks, Inc.
  • Guest Fellow of ATE division at Works Applications
  • Formerly a developer of Lotus Notes in Boston

Who I am (cont.)

  • My books and articles:
    • “Perfect Java”
    • “Perfect JavaScript”
    • “P2P textbook”
    • “Server Side JavaScript”
    • etc.

Today’s Topic

  • Internationalization (in short, i18n)
    • history and concept
    • character code

Agenda

  1. i18n overview
  2. languages and characters in computers
  3. History of character code
  4. Unicode & Java String

1. i18n overview

What i18n is

making software handle
country/region/culture specific things

locale

country/region/culture specific things

=> locale

i18n examples (1)

  • language
    • input
    • output

i18n examples (2)

  • timezone
  • calendar (e.g. Japanese era, holidays, etc.)
  • date/time format (e.g. 2013/7/3, Jul/3/2013)
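The date, time zone, and calendar examples above can be sketched with the standard java.time API. This is only an illustration, not part of the original slides: the class name `DateTimeLocaleSketch` and its method names are made up for this example, and the two patterns are chosen simply to reproduce the two date formats quoted on the slide.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.chrono.JapaneseDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoField;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class DateTimeLocaleSketch {
    // Year first, as in "2013/7/3".
    static String yearFirst(LocalDate d) {
        return d.format(DateTimeFormatter.ofPattern("yyyy/M/d", Locale.ROOT));
    }

    // Abbreviated English month name first, as in "Jul/3/2013".
    static String monthFirst(LocalDate d) {
        return d.format(DateTimeFormatter.ofPattern("MMM/d/yyyy", Locale.US));
    }

    // The same instant read on a wall clock in another time zone.
    static String tokyoToKolkata(LocalDateTime tokyoTime) {
        ZonedDateTime inTokyo = tokyoTime.atZone(ZoneId.of("Asia/Tokyo"));
        return inTokyo.withZoneSameInstant(ZoneId.of("Asia/Kolkata"))
                      .toLocalTime().toString();
    }

    // Year within the Japanese era (e.g. Heisei) for an ISO date.
    static int japaneseEraYear(LocalDate d) {
        return JapaneseDate.from(d).get(ChronoField.YEAR_OF_ERA);
    }

    public static void main(String[] args) {
        LocalDate d = LocalDate.of(2013, 7, 3);
        System.out.println(yearFirst(d));
        System.out.println(monthFirst(d));
        System.out.println(tokyoToKolkata(LocalDateTime.of(2013, 7, 3, 9, 0)));
        System.out.println(japaneseEraYear(d));
    }
}
```

The point of the sketch is that the underlying value (one date, one instant) stays the same; only the locale-dependent presentation changes.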

i18n examples (3)

  • number format (e.g. 100,000.00; the symbols used for the decimal point and for digit grouping)
  • monetary/currency (e.g. dollar, yen, rupee)
  • measurement units (e.g. mile, feet)
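As a rough sketch of the number and currency items above, java.text.NumberFormat and java.util.Currency can show how the same value is rendered differently per locale. The class `NumberLocaleSketch` and its methods are hypothetical names introduced only for this example.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class NumberLocaleSketch {
    // Format a value with exactly two fraction digits using the
    // grouping and decimal symbols of the given locale.
    static String format(double value, Locale locale) {
        NumberFormat nf = NumberFormat.getNumberInstance(locale);
        nf.setMinimumFractionDigits(2);
        nf.setMaximumFractionDigits(2);
        return nf.format(value);
    }

    // Number of minor-unit digits for an ISO 4217 currency code.
    static int fractionDigits(String currencyCode) {
        return Currency.getInstance(currencyCode).getDefaultFractionDigits();
    }

    public static void main(String[] args) {
        System.out.println(format(100000, Locale.US));       // comma groups, dot decimal
        System.out.println(format(100000, Locale.GERMANY));  // dot groups, comma decimal
        System.out.println(fractionDigits("USD"));
        System.out.println(fractionDigits("JPY"));
        System.out.println(fractionDigits("INR"));
    }
}
```

Note that the yen has no minor unit, so an application that always prints two decimal places is already making a locale-specific assumption.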

i18n examples (4)

  • name format (e.g. order of family name/given name, middle name, Mr/Ms)
  • address format (e.g. postal code, the order of address elements)
  • telephone number format (e.g. number of digits)
  • icon (e.g. postbox)

i18n evolution

2. languages and characters in computers

what we should do

two steps:

  1. define a character set
  2. assign a number (= code point) to each character
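To make the second step concrete, here is a minimal sketch that prints the number Unicode assigns to a character. The class name `CodePointSketch` and the method `describe` are assumptions made for this example; `String.codePointAt` and `Character.getName` are standard Java methods.

```java
// Hypothetical helper class, for illustration only.
class CodePointSketch {
    // Render the code point of the first character in the usual U+XXXX notation.
    static String describe(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        // Latin A, hiragana A, devanagari A (written as escapes to avoid
        // depending on the encoding of this source file).
        String[] samples = { "A", "\u3042", "\u0905" };
        for (String s : samples) {
            System.out.println(describe(s) + " " + Character.getName(s.codePointAt(0)));
        }
    }
}
```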

in order to define a character set

we should know what characters are

definition of a character:
the elements of written languages

Question: what is a character?

Question: what is a character? (cont.)

Question: what is a character? (cont.)

  • resolved only by agreement
  • we have to think about languages

language examples

  • Japanese
  • English
  • French
  • Hindi
  • Marathi
  • Chinese

language/script separation

script examples

  • latin alphabet
  • hiragana
  • katakana
  • kanji
  • hangul
  • devanagari
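The language/script separation is visible in Java as well: the standard enum `Character.UnicodeScript` reports the script of a code point, independent of any language. The sketch below uses a hypothetical class `ScriptSketch`; the sample characters are chosen for this example, one for each script in the list above.

```java
// Hypothetical helper class, for illustration only.
class ScriptSketch {
    // Script of the first code point in the string.
    static Character.UnicodeScript scriptOf(String s) {
        return Character.UnicodeScript.of(s.codePointAt(0));
    }

    public static void main(String[] args) {
        // Escapes are used so the example does not depend on source encoding.
        String[] samples = {
            "A",       // latin alphabet
            "\u3042",  // hiragana
            "\u30A2",  // katakana
            "\u6F22",  // kanji (reported by Unicode as the script "Han")
            "\uD55C",  // hangul
            "\u0905"   // devanagari
        };
        for (String s : samples) {
            System.out.println(s + " -> " + scriptOf(s));
        }
    }
}
```

Unicode has no separate "kanji" script: the characters shared by Chinese, Japanese, and Korean are all classified as Han, which is one reason a script cannot be used to infer the language.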

language/script mapping

character set

character set examples

  • ASCII (latin alphabet, some symbols)
  • JIS X 0208 (latin alphabet, hiragana, katakana, kanji, etc.)
  • ISCII (devanagari, etc.)

Unicode (tries to cover all scripts in the world)

Where should languages be taken care of?

3. History of character code

brief history of character set

  • 1963 ASCII
  • 1967 ISO/IEC 646
  • 1969 JIS X 0201
  • 1973 ISO 2022
  • 1978 JIS X 0208
  • 1982 CP932 (MS-DOS)
  • 1985 EUC-JP
  • 1987 ISO 8859

brief history of character set (cont.)

  • 1988 Unicode 88
  • 1991 ISCII
  • 1991 Unicode v1.0
  • 1993 ISO 10646
  • … (subsequent Unicode versions)
  • 2012 Unicode v6.2

Terms

  • ISO: International Organization for Standardization
  • IEC: International Electrotechnical Commission
  • JIS: Japanese Industrial Standards
  • ASCII: American Standard Code for Information Interchange
  • ISCII: Indian Script Code for Information Interchange

European languages history

Japanese history (1)

Japanese history (2)

ISO 2022

Japanese history (3)

4. Unicode & Java String

Unicode

The Unicode Consortium and The Unicode Standard

http://www.unicode.org/

  • character set
  • encoding scheme
  • collation rule
  • various algorithms (e.g. BiDi)

Unicode brief history

  1. Unicode 88 … 16-bit code
  2. Original ISO 10646 … 32-bit code
  3. ISO 10646 adopted Unicode … a superset of Unicode, 32-bit code
  4. ISO 10646/Unicode grew … 21-bit code (U+0000 to U+10FFFF)

ISO/IEC 10646

Encoding scheme

UTF (Unicode/UCS Transformation Format)

  • UTF-8
  • UTF-16
  • UTF-32
  • (UTF-7)

UTF-8

  • 8-bit variable-width encoding (a.k.a. multi-byte characters)
  • backward compatible with ASCII (ASCII text is valid UTF-8 as-is)
  • becoming the de facto standard on the Internet

UTF-8 (cont.)

UTF-8 conversion example

UTF-8 conversion example (cont.)
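The conversion itself is not reproduced in this text version of the slides, so the following is a minimal sketch of the standard UTF-8 bit layout rather than the original example. `Utf8Sketch`, `encode`, and `hex` are hypothetical names; real code would normally just call `String.getBytes(StandardCharsets.UTF_8)`.

```java
// Hypothetical helper class, for illustration only.
class Utf8Sketch {
    // Encode one Unicode code point into 1 to 4 UTF-8 bytes.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value");
        }
        if (cp <= 0x7F) {            // 0xxxxxxx (identical to ASCII)
            return new byte[] { (byte) cp };
        }
        if (cp <= 0x7FF) {           // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp <= 0xFFFF) {          // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {          // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    // Space-separated uppercase hex dump of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x41)));     // Latin A
        System.out.println(hex(encode(0xE9)));     // e with acute accent
        System.out.println(hex(encode(0x3042)));   // hiragana A
        System.out.println(hex(encode(0x20B9F)));  // a kanji outside the BMP
    }
}
```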

UTF-16

  • 16-bit variable-width encoding
  • Most characters are encoded as a single 16-bit unit (same as UCS-2 for the BMP)
  • Uses surrogate pairs for characters outside the BMP
  • Sometimes uses a BOM (Byte Order Mark)
  • The internal encoding scheme of Java strings
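The BOM item can be observed directly with the charsets in java.nio.charset.StandardCharsets. In this sketch the class `BomSketch` and its `hex` helper are hypothetical; the behaviour shown (Java's generic "UTF-16" charset writes a big-endian BOM when encoding, while UTF-16BE and UTF-16LE write none) is that of the standard JDK charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class BomSketch {
    // Encode a string and return a space-separated hex dump of the bytes.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("A", StandardCharsets.UTF_16));    // BOM, then big-endian data
        System.out.println(hex("A", StandardCharsets.UTF_16BE));  // big-endian, no BOM
        System.out.println(hex("A", StandardCharsets.UTF_16LE));  // little-endian, no BOM
    }
}
```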

UTF-16 Surrogate pair

UTF-16 Surrogate pair (cont.)

UTF-16 conversion example

UTF-16 conversion example (cont.)
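The original conversion example is not included in this text version, so here is a sketch of the standard surrogate-pair arithmetic instead. `SurrogateSketch` and its methods are hypothetical names; in practice `Character.toChars` and `Character.toCodePoint` do the same job.

```java
// Hypothetical helper class, for illustration only.
class SurrogateSketch {
    // Split a supplementary code point (U+10000..U+10FFFF) into a
    // high surrogate and a low surrogate.
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = cp - 0x10000;                       // 20-bit value
        char high = (char) (0xD800 + (v >> 10));    // upper 10 bits
        char low  = (char) (0xDC00 + (v & 0x3FF));  // lower 10 bits
        return new char[] { high, low };
    }

    // Inverse operation: combine a surrogate pair into a code point.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x20B9F);
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1]));
        System.out.println(String.format("U+%X", fromSurrogates(pair[0], pair[1])));
    }
}
```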

UTF-32

  • 32-bit fixed-width encoding
  • Caution: because of composed characters (a base character plus combining marks), one user-perceived character can still take more than one 32-bit unit

fyi, Unicode character database

  • http://www.unicode.org/charts/
  • http://www.unicode.org/charts/charindex.html

Java String

  • The internal encoding is UTF-16
  • Not UCS-2 (BMP only)
  • (Up to Java 1.4 it was effectively UCS-2; supplementary characters are supported since J2SE 5.0)
  • Easy to use as a 16-bit fixed-width encoding, but only as long as you ignore surrogate pairs and composed characters

Java String’s surrogate pair
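A small sketch of what surrogate pairs mean for everyday String code follows. The class `SurrogateStringSketch` and its methods are hypothetical; the sample string is an assumption made for this example (one kanji outside the BMP followed by one hiragana). `length`, `codePointCount`, and `offsetByCodePoints` are standard String methods.

```java
// Hypothetical helper class, for illustration only.
class SurrogateStringSketch {
    // U+20B9F (outside the BMP, stored as a surrogate pair) followed by U+308B.
    static final String SAMPLE = "\uD842\uDF9F\u308B";

    // Number of UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // First code point as a string, without cutting a surrogate pair in half.
    static String firstCodePoint(String s) {
        return s.substring(0, s.offsetByCodePoints(0, 1));
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(SAMPLE));   // counts char values
        System.out.println(codePoints(SAMPLE));  // counts characters as Unicode sees them
        // charAt(0) alone is only half a character here.
        System.out.println(Character.isHighSurrogate(SAMPLE.charAt(0)));
        System.out.println(firstCodePoint(SAMPLE).length());
    }
}
```

Code that indexes with `charAt` or cuts with `substring(0, n)` works on code units, so it may split a pair; the code-point-aware methods avoid that.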

Composed characters

  • precomposed character
  • combining character sequence

=> normalization (to a canonical form)

Java String’s composed characters

Java String’s composed characters (cont.)

Java String’s composed characters (cont.)

Java String’s composed characters (cont.)

Java String’s composed characters (cont.)

Java String’s composed characters (cont.)
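The slide bodies for this part are not in the text version, so the following is a sketch, under stated assumptions, of the issue they cover: the same visible character can be stored precomposed or as a base character plus a combining mark, and `String.equals` treats the two as different unless both are normalized first. `NormalizeSketch` and `canonicallyEqual` are hypothetical names; `java.text.Normalizer` is the standard API.

```java
import java.text.Normalizer;

// Hypothetical helper class, for illustration only.
class NormalizeSketch {
    // Compare two strings after normalizing both to NFC
    // (canonical composition).
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";   // e with acute, one code point
        String combining   = "e\u0301";  // e + combining acute accent
        System.out.println(precomposed.equals(combining));            // plain comparison
        System.out.println(canonicallyEqual(precomposed, combining)); // after normalization

        // The same applies to Japanese kana with a voiced sound mark.
        String ga         = "\u304C";        // hiragana GA, precomposed
        String kaPlusMark = "\u304B\u3099";  // hiragana KA + combining dakuten
        System.out.println(canonicallyEqual(ga, kaPlusMark));
    }
}
```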

What collation is

rules for sorting characters and for deciding whether they are equal

  • In the ASCII era, collation order was essentially the character code order
  • However, we sometimes need a different sort order, such as one that ignores case
    • usual order: A,B,a,b
    • case-ignore order: A,a,B,b

Unicode collation

  • Each character has a collation value (weight); strings are compared by these weights, not by code point

Java String’s collation
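The slide body is not in the text version, so this is only a sketch of how the orders discussed above can be obtained in Java. `CollationSketch` and `sorted` are hypothetical names; `java.text.Collator` and `String.CASE_INSENSITIVE_ORDER` are standard. The exact tie-breaking between "a" and "A" under a locale collator depends on the collation rules, so the comments do not promise a specific order for that case.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class CollationSketch {
    // Return a sorted copy of the input using the given comparator.
    static List<String> sorted(List<String> input, Comparator<? super String> order) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(order);  // List.sort is stable
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("b", "A", "a", "B");

        // Character code order: all uppercase letters before all lowercase.
        System.out.println(sorted(words, Comparator.<String>naturalOrder()));

        // Case-ignoring order: A/a grouped before B/b.
        System.out.println(sorted(words, String.CASE_INSENSITIVE_ORDER));

        // Locale-aware order: letters are grouped regardless of case.
        Collator collator = Collator.getInstance(Locale.US);
        System.out.println(sorted(words, collator));

        // With PRIMARY strength, case differences are ignored entirely.
        collator.setStrength(Collator.PRIMARY);
        System.out.println(collator.compare("a", "A") == 0);
    }
}
```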

Summary

  • Unicode and Java have resolved most i18n issues
  • Still, application programs have to take care of some issues themselves
  • Please improve your systems with proper i18n knowledge



One Response to “Returning to ありえるえりあ”

  1. おおたに

    > someone who attended remarked sarcastically, "You only talked about character encodings, didn't you?"

    Some people really do say terrible things, don't they.