Maybe a little offtopic, but here we go:
On Thu, 7 Dec 2006 08:55:32 -0800
"Ken Smith" <firstname.lastname@example.org> wrote:
On 12/7/06, Roberto Ierusalimschy <email@example.com> wrote:
If I understand correctly, even Asian languages use ASCII
punctuation (dots, spaces, newlines, commas, etc.), which takes 1
byte in UTF-8 but 2 in UTF-16. So, even for these languages, UTF-8
is not as much less compact as it seems.
I don't know about other Asian languages but Japanese has special
punctuation characters. There is even a wide character for space.
Here are some of them with their ASCII equivalents; I hope your mail
reader groks them.
. = 。
, = 、
" " = 「 」 (note the wide space within the Japanese-style
I believe newline is the same in Japanese character sets as it is in
ASCII and I presume this extends into UTF-8.
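
A quick way to check the byte counts being discussed, as a minimal sketch in Python. The sample list and the byte_sizes helper are my own names for illustration, and UTF-16 is measured without a byte-order mark:

```python
# Byte cost of ASCII punctuation versus its Japanese counterparts.
# UTF-16 is measured without a byte-order mark ("utf-16-le").
SAMPLES = {
    "ASCII full stop": ".",
    "ASCII comma": ",",
    "ASCII space": " ",
    "newline": "\n",
    "ideographic full stop": "。",
    "ideographic comma": "、",
    "ideographic space": "\u3000",
    "left corner bracket": "「",
}


def byte_sizes(ch):
    """Return (UTF-8 size, UTF-16 size) in bytes for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for name, ch in SAMPLES.items():
    utf8, utf16 = byte_sizes(ch)
    print(f"{name:24} U+{ord(ch):04X}  utf-8: {utf8}  utf-16: {utf16}")
```

The ASCII characters, newline included, take 1 byte in UTF-8 and 2 in UTF-16, while the wide Japanese punctuation takes 3 bytes in UTF-8 and 2 in UTF-16.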
I started learning Japanese (日本語) one month ago in my spare
I made a fairly complete Unicode setup, and as far as I know newline is
the same. I'm also very happy using UTF-8 for everything: matching
words with grep is possible (as someone pointed out, "traditional"
tools still work to some extent), for example:
$ echo 'こんいちーわ' > foo
$ echo 'すし' >> foo
$ grep 'し' foo
(Yes, I checked this on recent Linux/BSD and older Solaris
systems, and it still works; the same goes for most text utils... but
don't expect character ranges in regexps like '[あ-う]' to work,
because most apps assume one byte per glyph).
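
To illustrate why the plain search works while the range does not, here is a small Python sketch. It only imitates a byte-oriented tool by running the same pattern over raw UTF-8 bytes (an assumption about how such tools behave, not a trace of any particular grep), and the helper names are made up for the example:

```python
import re

lines = ["こんいちーわ", "すし"]
needle = "し"

# Byte-level substring search, roughly what a byte-oriented tool does.
# UTF-8 is self-synchronizing, so a whole character never matches
# in the middle of another character's bytes.
byte_hits = [s for s in lines if needle.encode("utf-8") in s.encode("utf-8")]

# Character-level search, for comparison.
char_hits = [s for s in lines if needle in s]

# The same range, compiled once over characters and once over raw bytes.
char_range = re.compile("[あ-う]")
byte_range = re.compile("[あ-う]".encode("utf-8"))


def in_range_chars(text):
    """True if some character of text lies in あ..う."""
    return char_range.search(text) is not None


def in_range_bytes(text):
    """True if the byte-level class matches anywhere in the UTF-8 bytes."""
    return byte_range.search(text.encode("utf-8")) is not None


print(byte_hits, char_hits)
print(in_range_chars("い"), in_range_chars("し"), in_range_bytes("し"))
```

Over bytes, the bracket expression turns into a set of individual byte values instead of a range of characters, so it also matches 'し', which lies outside あ..う.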
Also expect Japanese scripts (esp. hiragana, katakana) to take
about half the glyphs used by their transliteration into the Latin
alphabet (romaji). This is not true for all words, but the average
saving is around 50%. Just take an example with some random words:
word                   romaji       romaji glyphs  hiragana glyphs  hiragana
---------------------  -----------  -------------  ---------------  --------------
tree                   ki                       2                1  き
sushi                  sushi                    5                2  すし
camera                 kamera                   6                3  かめら
I                      watashi                  7                3  わたし
to be                  desu                     4                2  です
newspaper              shinbun                  7                4  しんぶん
superficial knowledge  icchihankai             11                7  いっちはんかい
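
The counts are easy to recompute; a minimal Python sketch follows. The PAIRS list is just the table retyped, with the last hiragana entry spelled out in full from its romaji, which is my assumption. len() counts code points, which equals glyphs here because none of these words use combining characters:

```python
# (romaji, hiragana) pairs from the table above.
PAIRS = [
    ("ki", "き"),
    ("sushi", "すし"),
    ("kamera", "かめら"),
    ("watashi", "わたし"),
    ("desu", "です"),
    ("shinbun", "しんぶん"),
    ("icchihankai", "いっちはんかい"),  # assumed full spelling
]

romaji_total = sum(len(romaji) for romaji, _ in PAIRS)
hiragana_total = sum(len(hiragana) for _, hiragana in PAIRS)

# Fraction of glyphs saved by writing hiragana instead of romaji.
saving = 1 - hiragana_total / romaji_total

for romaji, hiragana in PAIRS:
    print(f"{romaji:12} {len(romaji):2} -> {len(hiragana)}")
print(f"total {romaji_total} -> {hiragana_total}, saving {saving:.1%}")
```

For these seven words the overall saving comes out slightly under one half.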
As you can see, Japanese sometimes uses fewer words than English for
the same concepts (as in the last example), and even comparing
Japanese-romaji to Japanese-hiragana, the latter uses half of the
However, as some of the other readers have pointed out, many of the
multibyte characters express denser ideas, so the ideas per byte are
probably not much different from European languages. Here are
some characters the Japanese use frequently with their English
equivalents. I have chosen non-sino characters to try to make my
point more relevant to the English speaking readership.
☎ or ℡ = Tel (when listing telephone numbers)
a 〜 b = a to b or from a to b
Totally agree here, "superficial knowledge" may even be written as
一知半解, which is only 4 glyphs compared to the 21 glyphs used in
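
To put rough numbers on the "ideas per byte" point, here is a small Python sketch comparing the three spellings. The FORMS dict and measure helper are names made up for the example, the hiragana spelling is my assumption from the romaji, and UTF-16 is measured without a byte-order mark:

```python
# Three ways of writing the same idea.
FORMS = {
    "English": "superficial knowledge",
    "hiragana": "いっちはんかい",  # assumed spelling of icchihankai
    "kanji": "一知半解",
}


def measure(text):
    """Return (glyphs, UTF-8 bytes, UTF-16 bytes) for a string."""
    return len(text), len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in FORMS.items():
    glyphs, utf8, utf16 = measure(text)
    print(f"{name:9} glyphs: {glyphs:2}  utf-8: {utf8:2}  utf-16: {utf16:2}")
```

In UTF-8 the kanji form needs 12 bytes against 21 for the English phrase, so the denser script still wins on bytes here, although by less than the glyph count suggests.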
Just my two cents. I would really appreciate Unicode support in Lua. I
vote for enforcing UTF-8 as the encoding for source files. Python is
somewhat hackish: it tries to detect the encoding from a special comment
on the first or second line of code, like '# -*- encoding: utf-8 -*-'. It
works, but I think it's quite awkward...
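
For the curious, current Python 3 exposes this detection in the standard library as tokenize.detect_encoding, so the behaviour can be poked at directly. A minimal sketch, where declared_encoding and the three sample sources are hypothetical names of mine:

```python
import io
import tokenize


def declared_encoding(source: bytes) -> str:
    """Return the encoding Python's tokenizer would pick for these bytes."""
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# Declaration on the second line, after a shebang: honoured.
with_cookie = b"#!/usr/bin/env python\n# -*- encoding: euc-jp -*-\nprint('hi')\n"

# Declaration on the third line: too late, so it is ignored.
too_late = b"#\n#\n# -*- encoding: euc-jp -*-\nprint('hi')\n"

# No declaration at all: Python 3 falls back to UTF-8.
no_cookie = b"print('hi')\n"

for sample in (with_cookie, too_late, no_cookie):
    print(declared_encoding(sample))
```

Only the first two lines are inspected, and without a declaration Python 3 simply assumes UTF-8, which is close to what I am asking for in Lua.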
User: I'm having problems with my text editor.
Help desk: Which editor are you using?
User: I don't know, but it's version VI (pronounced: 6).
Help desk: Oh, then you should upgrade to version VIM (pronounced: