Jay Carlson <nop@nop.com> writes:
>> > I imagine it already has in non-byte locales. A Korean Lua program will not
>> > use libraries from a Greek one. Unless everybody is already using
>> > UTF-8--with no sanity checking.
>>
>> Of course everybody's just using UTF-8 with no sanity checking...
>
> Got some Koreans to back you up on that? EUC-KR (realistically, CP949)
> still lives; see the <head> of http://chosun.com and donga.com.
> naver.com is UTF-8 (not a total surprise). Perhaps all the Korean text
> processing *in Lua* is being done in unchecked UTF-8, but I kinda
> doubt it. ...

Of course I'm not claiming that all text-processing (even Lua
text-processing) is now done in UTF-8 -- there was a touch of
tongue-in-cheek to my message (but an element of truth as well).... :]

I live in Japan and write software for a Japanese company, so I have a
little experience in the matter.  Shift-JIS (for "local" use) and
EUC-JP (for email) are still _hugely_ used.  [At my work, we tried to
standardize on UTF-8 for a project, but ended up using Shift-JIS
simply because it's the only encoding that MS dev tools support
sanely, and has a lot of legacy support in other tools.  The encoding
support in our own code is a mess that tries to generally support
various multibyte encodings, but in practice probably only has to work
properly for Shift-JIS and UTF-8.]

Nonetheless, my intuition is that:

  (a) The Lua universe is not the more general universe.  Projects
  using Lua tend to be smaller, and less dependent on giant
  frameworks.  The tradeoffs for Lua projects are often somewhat
  different as a result.  Simple is good.

  (b) UTF-8 is generally considered the future here (something that
  must be supported now, and will increasingly replace other
  encodings) even in areas where there's a lot of legacy need/support
  for other encodings.  People don't use older encodings because they
  _can_, but because they _must_.  There's definitely an awareness
  that moving to UTF-8 is something that should be done if it's
  possible though.

  (c) In a lot of applications, not all that much "detail handling" is
  needed for what text-processing they do -- and if UTF-8 can be
  assumed, things get _much simpler_.  [E.g., if you have to manipulate
  multi-byte-encoded pathnames (or anything with meaningful ASCII
  syntax), it's really simple in UTF-8 -- the same code one uses for
  ASCII will work fine -- but miserable in Shift-JIS, because bytes in
  the ASCII range (0x5C, the backslash, is the classic case) can occur
  as the second byte of a multi-byte character.]
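
To make (c) concrete, here is a small sketch (in Python, only because
it ships with a Shift-JIS codec; the byte-level logic is the same in
any language).  The function name split_path and the example path are
made up for illustration.  It splits a path on the backslash byte, the
way ASCII-only code would, and runs that on the same text encoded as
UTF-8 and as Shift-JIS.  The directory name is the single katakana
character "so" (U+30BD), whose Shift-JIS encoding is 0x83 0x5C.

```python
def split_path(raw, sep=b"\\"):
    """Split a byte string on an ASCII separator, as ASCII-only code would."""
    return raw.split(sep)


# Hypothetical example path: <dir>\<file>, where <dir> is U+30BD.
PATH = "\u30bd\\a.txt"

utf8_parts = split_path(PATH.encode("utf-8"))
sjis_parts = split_path(PATH.encode("shift_jis"))

# UTF-8: every byte of a multi-byte character is >= 0x80, so the only
# 0x5C byte in the string is the real separator.
print(utf8_parts)   # [b'\xe3\x82\xbd', b'a.txt']

# Shift-JIS: the trail byte of U+30BD is 0x5C, so the naive split cuts
# the character in half and produces a bogus empty component.
print(sjis_parts)   # [b'\x83', b'', b'a.txt']
```

Getting the Shift-JIS case right means tracking lead bytes while
scanning, which is exactly the sort of intrusive detail that UTF-8
lets you skip.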

So I'd say there are several levels of text-processing support:

  (1) If you can, don't even bother: treat strings as blobs, and
  don't care what's in them (the default Lua state).

  (2) If you need to do a little manipulation, try to use UTF-8 for
  the encoding, but don't make any particular attempt to hide the
  fact that they are encoded (i.e., don't pretend that strings are
  "sequences of characters").  Only use what small functions you can
  get away with (assuming UTF-8 makes this _much_ easier), e.g.,
  counting characters, converting between byte- and character- offsets
  etc.  Such functions for UTF-8 are generally so easy that for many
  projects it's fine to just write them yourself, but a "tiny-utf8"
  library might not be a bad idea (which basically supports only stuff
  that's either trivial as a result of the encoding properties, or can
  be very compactly encoded).

  (3) If you're doing full-fat text-processing (text-editor, etc),
  maybe you do need real Unicode support, giant tables and all.  It
  would be good to have a standard Lua library for this (there is one,
  I think, but I don't remember the name).
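
As a rough idea of how small the level-(2) helpers can be, here is a
sketch of a "tiny-utf8" in Python (the function names are invented for
this example, not taken from any existing library).  It assumes the
input is already valid UTF-8 and does no sanity checking, and it
treats "character" as "code point".  Everything rests on one property
of the encoding: continuation bytes always look like 10xxxxxx, so any
other byte starts a character.

```python
def is_continuation(byte):
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def utf8_len(s):
    """Number of code points: count the bytes that start a character."""
    return sum(1 for b in s if not is_continuation(b))


def char_to_byte_offset(s, char_index):
    """Byte offset where the char_index-th (0-based) character starts.

    char_index == utf8_len(s) is allowed and returns len(s).
    """
    seen = 0
    for i, b in enumerate(s):
        if not is_continuation(b):
            if seen == char_index:
                return i
            seen += 1
    if seen == char_index:
        return len(s)
    raise IndexError("character index out of range")


def byte_to_char_offset(s, byte_index):
    """Number of characters that start before byte_index."""
    return utf8_len(s[:byte_index])


# Three 3-byte characters, a space, then "Lua".
text = "\u65e5\u672c\u8a9e Lua".encode("utf-8")
print(len(text), utf8_len(text))       # 13 7
print(char_to_byte_offset(text, 3))    # 9
print(byte_to_char_offset(text, 10))   # 4
```

None of this needs tables or knowledge of Unicode beyond the bit
layout, which is why it's reasonable to just write it yourself.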

For cases where legacy non-ASCII encodings _need_ to be supported,
especially if you need "full-fat" features, I dunno what good choices
there are, especially if you want to be portable (and so can't rely on
e.g. iconv)...  Handling legacy multi-byte encodings is generally a
lot messier and more intrusive, and so should be avoided if possible;
sometimes platform libraries can make things easier, but that sort of
moves out of the realm of general Lua discussion.

-miles

-- 
Any man who is a triangle, has thee right, when in Cartesian Space,
to have angles, which when summed, come to know more, nor no less,
than nine score degrees, should he so wish.  [TEMPLE OV THEE LEMUR]