-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


For what it's worth, Unicode has a built-in type identifier (the byte order mark) that, at the beginning of a file, reveals its byte order and encoding. This is of course only needed in UTF-16 et al.; with UTF-8, byte order does not matter. But there may be some identifier "stamp" that can be used to know a file is UTF-8, no?
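
A minimal sketch of how such a "stamp" could be checked, assuming the stamp in question is the Unicode byte order mark (U+FEFF, which encodes to EF BB BF in UTF-8). The function name `sniff_bom` and the table `BOMS` are made up for illustration; the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Known byte order marks. The UTF-32 entries come first because the
# UTF-32 LE mark starts with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xef\xbb\xbfprint('hi')"))  # UTF-8 stamp present
print(sniff_bom(b"print('hi')"))              # no stamp at all
```

Note that a missing BOM proves nothing: plenty of UTF-8 files are written without one, so this can only confirm an encoding, never rule one out.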

Anyway, David's idea about allowing all higher codes as identifiers sounds good.

- -asko


Adrian Perez wrote on 7.12.2006 at 19.42:


Hello,

Maybe a little offtopic, but here we go:

On Thu, 7 Dec 2006 08:55:32 -0800
"Ken Smith" <kgsmith@gmail.com> wrote:

On 12/7/06, Roberto Ierusalimschy <roberto@inf.puc-rio.br> wrote:
If I understand correctly, even Asian languages use ASCII
punctuation (dots, spaces, newlines, commas, etc.), which uses 1
byte in UTF-8 but 2 in UTF-16. So, even for these languages, UTF-8
is not as much less compact as it seems.

I don't know about other Asian languages but Japanese has special
punctuation characters.  There is even a wide character for space.
Here are some of them with their ASCII equivalents; I hope your mail
reader groks them.

. = 。
, = 、
" " = 「 」 (note the wide space within the Japanese-style quotes)

I believe newline is the same in Japanese character sets as it is in
ASCII and I presume this extends into UTF-8.
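
A small sketch to put numbers on the punctuation argument: it compares the encoded size of the ASCII punctuation with the Japanese punctuation mentioned above. The helper name `encoded_sizes` is made up for illustration, and UTF-16 is measured without a BOM (`utf-16-le`).

```python
# Characters from the discussion: ASCII punctuation and the Japanese
# (ideographic) counterparts. U+3000 is the wide (ideographic) space.
samples = {
    "ASCII full stop": ".",
    "ASCII space": " ",
    "newline": "\n",
    "ideographic full stop": "。",
    "ideographic comma": "、",
    "ideographic space": "\u3000",
}

def encoded_sizes(text):
    """Return (bytes in UTF-8, bytes in UTF-16 without a BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for name, ch in samples.items():
    u8, u16 = encoded_sizes(ch)
    print(f"{name:22} utf-8: {u8}  utf-16: {u16}")
```

The ASCII characters take 1 byte in UTF-8 against 2 in UTF-16, while the ideographic ones take 3 against 2, so which encoding is more compact depends on how the two kinds are mixed in a given text.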

I started learning Japanese (日本語) one month ago in my spare time, so
I made a quite complete Unicode setup and as far as I know newline is
the same. Also I'm very happy using UTF-8 for all the stuff, for
example matching words with grep is possible (as someone pointed out,
"traditional" tools still work to some extent), for example:

  $ echo 'こんいちーわ' > foo
  $ echo 'すし' >> foo
  $ grep 'し' foo
  すし

(Yes, I checked this even on recent Linux/BSD and older Solaris
systems, and it still works; the same goes for most text utils... but
don't expect character ranges in regexps like '[あ-う]' to work,
because most apps assume one byte per glyph).
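
A sketch of why the plain search works while the range does not, assuming the tool treats its input as raw bytes. UTF-8 is self-synchronizing, so a byte-for-byte match of an encoded character cannot start in the middle of another character; a range, however, gets split into individual bytes. The function names `grep_bytes` and `grep_range` are made up for illustration.

```python
import re

lines = ["こんいちーわ", "すし"]

def grep_bytes(needle, lines):
    """Byte-oriented substring search, like a tool unaware of UTF-8."""
    needle_b = needle.encode("utf-8")
    return [l for l in lines if needle_b in l.encode("utf-8")]

def grep_range(pattern, lines):
    """Code-point-oriented regex search on decoded text."""
    rx = re.compile(pattern)
    return [l for l in lines if rx.search(l)]

print(grep_bytes("し", lines))       # only the line containing し
print(grep_range("[あ-う]", lines))  # only the line containing い

# The same range applied to raw bytes: the class now lists single bytes
# and the byte range 0x82-0xE3, so it matches any line with kana in it.
naive = re.compile("[あ-う]".encode("utf-8"))
print(bool(naive.search("すし".encode("utf-8"))))  # a false positive
```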

Also, expect Japanese scripts (esp. hiragana, katakana) to take
about half the glyphs used by their transliteration into the Latin
alphabet (romaji). This is not true of all words, but the average
saving is around 50%. Just take an example with some random words:

 word                   romaji       glyphs (romaji)  glyphs (hiragana)  hiragana
 ---------------------  -----------  ---------------  -----------------  --------------
 tree                   ki           2                1                  き
 sushi                  sushi        5                2                  すし
 camera                 kamera       6                3                  かめら
 I                      watashi      7                3                  わたし
 to be                  desu         4                2                  です
 newspaper              shinbun      7                4                  しんぶん
 superficial knowledge  icchihankai  11               7                  いっちはんかい

As you can see, Japanese sometimes uses fewer words than English for the
same concepts (as in the last example), and even comparing
Japanese-romaji to Japanese-hiragana, the latter uses about half of the
glyphs =)
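
The counts in the table can be re-derived mechanically. This is only a sketch: it counts code points with `len()`, which coincides with glyphs for these particular words once the text is in NFC form, and it adds the UTF-8 byte size, which is not in the table. The names `pairs` and `totals` are made up for illustration.

```python
import unicodedata

# (romaji, hiragana) pairs taken from the table.
pairs = [
    ("ki", "き"),
    ("sushi", "すし"),
    ("kamera", "かめら"),
    ("watashi", "わたし"),
    ("desu", "です"),
    ("shinbun", "しんぶん"),
    ("icchihankai", "いっちはんかい"),
]

def totals(pairs):
    """Return (romaji glyphs, hiragana glyphs, hiragana bytes in UTF-8)."""
    romaji = sum(len(r) for r, _ in pairs)
    # NFC keeps voiced kana such as ぶ as a single code point.
    kana = [unicodedata.normalize("NFC", h) for _, h in pairs]
    glyphs = sum(len(h) for h in kana)
    nbytes = sum(len(h.encode("utf-8")) for h in kana)
    return romaji, glyphs, nbytes

romaji, glyphs, nbytes = totals(pairs)
print(f"romaji glyphs: {romaji}, hiragana glyphs: {glyphs}, "
      f"hiragana UTF-8 bytes: {nbytes}")
```

Over these seven words hiragana needs roughly half the glyphs, but since each kana takes 3 bytes in UTF-8, the saving in glyphs does not carry over to bytes.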

However, as some of the other readers have pointed out, many of the
multibyte characters express denser ideas, so the number of ideas per
byte is probably not too different from European languages.  Here are
some characters the Japanese use frequently with their English
equivalents.  I have chosen non-sino characters to try to make my
point more relevant to the English speaking readership.

☎ or ℡ = Tel (when listing telephone numbers)
a 〜 b = a to b or from a to b

Totally agree here: "superficial knowledge" may even be written as
一知半解, which is only 4 glyphs compared to the 21 glyphs used in
English!

Just my two cents. I would really appreciate Unicode support in Lua. I
vote for enforcing UTF-8 as the encoding for source files. Python is
somewhat hackish: it tries to detect the encoding by looking for a
special comment on the first two lines of code, like
'# -*- coding: utf-8 -*-'. It works but I think it's quite awkward...
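
For reference, a sketch of how that detection behaves in current Python, using `tokenize.detect_encoding` from the standard library. As far as I know it only inspects the first two lines (PEP 263) and falls back to UTF-8 otherwise. The wrapper name `declared_encoding` is made up for illustration.

```python
import io
import tokenize

def declared_encoding(source: bytes) -> str:
    """Return the source encoding Python would assume for these bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding

# Declaration on the first line: honoured.
print(declared_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n"))

# Declaration on the second line, after a shebang: honoured.
print(declared_encoding(b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\n"))

# Declaration on the third line: ignored, so the default applies.
print(declared_encoding(b"# one\n# two\n# -*- coding: latin-1 -*-\n"))
```

A rule like "source files are always UTF-8" would make this kind of sniffing unnecessary, which is the point of the vote above.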

Cheers,

--
User:       I'm having problems with my text editor.
Help desk:  Which editor are you using?
User:       I don't know, but it's version VI (pronounced: 6).
Help desk:  Oh, then you should upgrade to version VIM (pronounced:
994).

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Darwin)

iD8DBQFFeHn3GJtHlJZfjQoRAo0kAJ9AXkmPd/+QQzLlvw5RhSZCxVgHaQCfXQjO
FmFNS2QujRMntzxEgamwL6w=
=oUU2
-----END PGP SIGNATURE-----