Re: question about Unicode


Subject: Re: question about Unicode
From: Asko Kauppi <askok@...>
Date: Thu, 7 Dec 2006 22:30:45 +0200

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

For what it's worth, Unicode has built-in type identifiers (byte order marks) that, at the beginning of a file, reveal its byte order and encoding. This is of course only needed in UTF-16 et al.; with UTF-8 byte order does not matter. But there may be some identifier "stamp" that can be used to know a file is UTF-8, no?
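Such a "stamp" does exist for UTF-8: the byte order mark U+FEFF encodes as the three bytes EF BB BF, although many UTF-8 files omit it. A minimal sketch of sniffing it in Python follows; `sniff_bom` is a made-up helper name, and the BOM constants come from the standard `codecs` module.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM,
# so checking UTF-16 first would misreport UTF-32 LE data.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(sniff_bom(b"\xef\xbb\xbfprint('hi')"))
print(sniff_bom(b"print('hi')"))
```

A missing BOM proves nothing about the encoding, so a real loader would still need a default (UTF-8 being the obvious candidate in this thread).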

Anyways, David's idea about allowing all higher codes as identifiers sounds good.


- -asko


Adrian Perez wrote on 7.12.2006 at 19:42:


Hello,

Maybe a little offtopic, but here we go:

On Thu, 7 Dec 2006 08:55:32 -0800
"Ken Smith" <kgsmith@gmail.com> wrote:

On 12/7/06, Roberto Ierusalimschy <roberto@inf.puc-rio.br> wrote:

If I understand correctly, even Asian languages use ASCII
punctuation (dots, spaces, newlines, commas, etc.), which takes 1
byte in UTF-8 but 2 in UTF-16. So, even for these languages, UTF-8
is less of a size penalty than it seems.


I don't know about other Asian languages but Japanese has special
punctuation characters.  There is even a wide character for space.
Here are some of them with their ASCII equivalents; I hope your mail
reader groks them.

. = 。
, = 、

" " = 「　」 (note the wide space within the Japanese-style quotes)


I believe newline is the same in Japanese character sets as it is in
ASCII and I presume this extends into UTF-8.
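The size trade-off being discussed can be checked directly. The sketch below, with a hypothetical helper `encoded_sizes`, encodes a few of the punctuation characters mentioned above and counts bytes; `utf-16-le` is used so that no BOM is added to the count.

```python
def encoded_sizes(text: str):
    """Return (utf8_bytes, utf16_bytes) for text, not counting any BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


SAMPLES = [
    ("ASCII full stop", "."),
    ("ideographic full stop", "\u3002"),  # 。
    ("ideographic comma", "\u3001"),      # 、
    ("ideographic space", "\u3000"),      # the wide space
    ("newline", "\n"),
]

for label, ch in SAMPLES:
    utf8, utf16 = encoded_sizes(ch)
    print(f"{label:22} utf-8: {utf8}  utf-16: {utf16}")
```

The ASCII characters and the newline take one byte in UTF-8 and two in UTF-16, while the Japanese punctuation takes three bytes in UTF-8 and two in UTF-16.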

I started learning Japanese (日本語) one month ago in my spare time, so
I made a quite complete Unicode setup, and as far as I know newline is
the same. I'm also very happy using UTF-8 for everything: matching
words with grep is possible, for example (as someone pointed out,
"traditional" tools still work to some extent):

  $ echo 'こんいちーわ' > foo
  $ echo 'すし' >> foo
  $ grep 'し' foo
  すし

(Yes, I checked this on recent Linux/BSD and older Solaris
systems, and it still works; the same goes for most text utilities. But
don't expect character ranges in regexps like '[あ-う]' to work,
because most apps assume one byte per glyph.)

Also expect Japanese scripts (esp. hiragana, katakana) to take
about half the glyphs of their transliteration into the Latin alphabet
(romaji). This is not true of every word, but the average saving is
around 50%. Just take some random words as an example:

 word                   romaji       romaji glyphs  hiragana glyphs  hiragana
 ---------------------  -----------  -------------  ---------------  --------------
 tree                   ki           2              1                き
 sushi                  sushi        5              2                すし
 camera                 kamera       6              3                かめら
 I                      watashi      7              3                わたし
 to be                  desu         4              2                です
 newspaper              shinbun      7              4                しんぶん
 superficial knowledge  icchihankai  11             7                いっちはんかい

As you can see, Japanese sometimes uses fewer words than English for the
same concepts (as in the last example), and even comparing
Japanese-romaji to Japanese-hiragana, the latter uses about half the
glyphs =)
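The glyph counts in the table can be recomputed, and set next to UTF-8 byte counts, with a short sketch like the one below. The `row` helper is hypothetical; the words are the ones from the table, normalized to NFC so that voiced kana such as で count as one code point.

```python
import unicodedata

PAIRS = [
    ("ki", "き"),
    ("sushi", "すし"),
    ("kamera", "かめら"),
    ("watashi", "わたし"),
    ("desu", "です"),
    ("shinbun", "しんぶん"),
    ("icchihankai", "いっちはんかい"),
]


def row(romaji: str, kana: str):
    """Return (romaji glyphs, kana glyphs, romaji UTF-8 bytes, kana UTF-8 bytes)."""
    kana = unicodedata.normalize("NFC", kana)
    return (
        len(romaji),
        len(kana),
        len(romaji.encode("utf-8")),
        len(kana.encode("utf-8")),
    )


for romaji, kana in PAIRS:
    print(romaji, kana, row(romaji, kana))

totals = [sum(column) for column in zip(*(row(r, k) for r, k in PAIRS))]
print("totals:", totals)
```

Counted this way, the hiragana forms use roughly half the glyphs, but since each kana takes three bytes in UTF-8, the byte totals come out larger for hiragana than for romaji.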

However, as some of the other readers have pointed out, many of the
multibyte characters express denser ideas so the ideas per byte is
probably not too much different from European languages.  Here are
some characters the Japanese use frequently with their English
equivalents.  I have chosen non-sino characters to try to make my
point more relevant to the English speaking readership.

☎ or ℡ = Tel (when listing telephone numbers)
a 〜 b = a to b or from a to b


Totally agree here: "superficial knowledge" may even be written as
一知半解, which is only 4 glyphs compared to the 21 characters (counting
the space) used in English!

Just my two cents. I would really appreciate Unicode support in Lua. I
vote for enforcing UTF-8 as the encoding for source files. Python is
somewhat hackish: it tries to detect the encoding from a special comment
on the first or second line of the file, like '# -*- encoding: utf-8 -*-'.
It works, but I think it's quite awkward...
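For reference, Python exposes this detection through `tokenize.detect_encoding` in its standard library, and as far as I know the magic comment is only honoured on the first or second line of a file (PEP 263). A small sketch, where `declared_encoding` is a made-up wrapper:

```python
import io
import tokenize


def declared_encoding(source: bytes) -> str:
    """Return the encoding Python would assume for this source text."""
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# The comment style quoted above is recognised on line 1.
print(declared_encoding(b"# -*- encoding: utf-8 -*-\nprint(1)\n"))

# It is also recognised on line 2, e.g. after a shebang line.
print(declared_encoding(b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nx = 1\n"))

# On line 3 it is ignored and the default encoding is reported instead.
print(declared_encoding(b"# one\n# two\n# -*- coding: latin-1 -*-\nx = 1\n"))
```

Mandating UTF-8 for source files, as proposed here for Lua, would make this kind of sniffing unnecessary.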

Cheers,

--
User:       I'm having problems with my text editor.
Help desk:  Which editor are you using?
User:       I don't know, but it's version VI (pronounced: 6).
Help desk:  Oh, then you should upgrade to version VIM (pronounced: 994).


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Darwin)

iD8DBQFFeHn3GJtHlJZfjQoRAo0kAJ9AXkmPd/+QQzLlvw5RhSZCxVgHaQCfXQjO
FmFNS2QujRMntzxEgamwL6w=
=oUU2
-----END PGP SIGNATURE-----
