lua-users home
lua-l archive



Maybe a little off-topic, but here we go:

On Thu, 7 Dec 2006 08:55:32 -0800
"Ken Smith" <> wrote:

> On 12/7/06, Roberto Ierusalimschy <> wrote:
> > If I understand correctly, even Asian languages use ASCII
> > punctuation (dots, spaces, newlines, commas, etc.), which takes 1
> > byte in UTF-8 but 2 in UTF-16. So even for these languages UTF-8
> > is not as much less compact as it seems.
> I don't know about other Asian languages but Japanese has special
> punctuation characters.  There is even a wide character for space.
> Here are some of them with their ASCII equivalents; I hope your mail
> reader groks them.
> . = 。
> , = 、
> " " = 「 」 (note the wide space within the Japanese-style quotes)
> I believe newline is the same in Japanese character sets as it is in
> ASCII and I presume this extends into UTF-8.

I started learning Japanese (日本語) in my spare time a month ago, so
I set up a fairly complete Unicode environment, and as far as I know
newline is the same. I'm also very happy using UTF-8 for everything;
matching words with grep works (as someone pointed out, "traditional"
tools still work to some extent). For example:

  $ echo 'こんにちは' > foo
  $ echo 'すし' >> foo
  $ grep 'し' foo

(Yes, I checked this on recent Linux/BSD and older Solaris systems,
and it still works; the same goes for most text utilities. But don't
expect character ranges in regexps like '[あ-う]' to work, because
most apps assume one byte per glyph.)
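To see why byte-oriented tools handle literal matches but not character ranges, here is a small sketch (mine, not from the thread): every hiragana character occupies three bytes in UTF-8, so a byte-wise regex engine reads '[あ-う]' as a range over single bytes rather than characters.

```python
# Each hiragana glyph encodes to 3 bytes in UTF-8, so a byte-oriented
# regex engine cannot treat '[あ-う]' as a range of characters.
for ch in "あいう":
    print(ch, ch.encode("utf-8"), len(ch.encode("utf-8")))

# Literal substring search still works, because UTF-8 is
# self-synchronizing: one character's byte sequence never occurs
# inside another character's sequence.
haystack = "すし".encode("utf-8")
needle = "し".encode("utf-8")
print(needle in haystack)  # True -- the same byte-wise match grep does
```

This is exactly why the grep above succeeds: it is comparing raw byte sequences, which happens to be correct for literal UTF-8 strings.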

Also, expect Japanese scripts (esp. hiragana and katakana) to take
about half the glyphs of their transliteration into the Latin
alphabet (romaji). This isn't true for every word, but the average
saving is around 50%. Here is an example with some random words:

 word (English)         romaji      romaji  kana   hiragana
                                    glyphs  glyphs
 ---------------------  ----------- ------  ------ ------------
 tree                   ki          2       1      き
 sushi                  sushi       5       2      すし
 camera                 kamera      6       3      かめら
 I                      watashi     7       3      わたし
 to be                  desu        4       2      です
 newspaper              shinbun     7       4      しんぶん
 superficial knowledge  icchihankai 11      7      いっちはんかい

As you can see, Japanese sometimes needs fewer glyphs than English for
the same concept (as in the last example), and even comparing Japanese
romaji to Japanese hiragana, the latter uses about half the glyphs =)
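The counts in the table are easy to verify with a few lines of Python (a sketch of mine; len counts code points, which equals glyphs here since kana use no combining marks):

```python
# Compare glyph counts of romaji transliterations vs. hiragana,
# using the words from the table above.
words = [
    ("sushi", "すし"),
    ("kamera", "かめら"),
    ("watashi", "わたし"),
    ("shinbun", "しんぶん"),
    ("icchihankai", "いっちはんかい"),
]
for romaji, kana in words:
    saving = 1 - len(kana) / len(romaji)
    print(f"{romaji:12} {len(romaji):2} -> {len(kana)} glyphs "
          f"({saving:.0%} saved)")
```

For "sushi" the saving is 60%, and the per-word savings average out to roughly half, matching the claim above.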

> However, as some of the other readers have pointed out, many of the
> multibyte characters express denser ideas so the ideas per byte is
> probably not too much different from European languages.  Here are
> some characters the Japanese use frequently with their English
> equivalents.  I have chosen non-sino characters to try to make my
> point more relevant to the English speaking readership.
> ☎ or ℡ = Tel (when listing telephone numbers)
> a 〜 b = a to b or from a to b

Totally agree here; "superficial knowledge" may even be written as
一知半解, which is only 4 glyphs compared to the 21 used by the
English phrase.
Just my two cents. I would really appreciate Unicode support in Lua,
and I vote for enforcing UTF-8 as the encoding for source files.
Python's approach is somewhat hackish: it detects the encoding from a
special comment like '# -*- coding: utf-8 -*-' on one of the first two
lines of the file. It works, but I find it quite awkward...
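For reference, CPython exposes that detection logic (PEP 263: a 'coding:' comment on the first or second line, with UTF-8 as the default) through the standard tokenize module; a quick sketch:

```python
import io
import tokenize

# tokenize.detect_encoding scans at most the first two lines of a
# source file for a PEP 263 coding comment.
src = b"# -*- coding: latin-1 -*-\nx = 1\n"
enc, _ = tokenize.detect_encoding(io.BytesIO(src).readline)
print(enc)  # the declared codec, normalized: 'iso-8859-1'

# Without a comment, the default is UTF-8:
enc, _ = tokenize.detect_encoding(io.BytesIO(b"x = 1\n").readline)
print(enc)  # 'utf-8'
```

Note that the codec name gets normalized ('latin-1' becomes 'iso-8859-1'), which hints at how much machinery hides behind that "special comment".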


User:       I'm having problems with my text editor.
Help desk:  Which editor are you using?
User:       I don't know, but it's version VI (pronounced: 6).
Help desk:  Oh, then you should upgrade to version VIM (pronounced:
