On Mon, Feb 6, 2012 at 19:12, Miles Bader <miles@gnu.org> wrote:
> Jay Carlson <nop@nop.com> writes:
>>> > I imagine it already has in non-byte locales. A Korean Lua program will not
>>> > use libraries from a Greek one. Unless everybody is already using
>>> > UTF-8--with no sanity checking.
>>>
>>> Of course everybody's just using UTF-8 with no sanity checking...
>>
>> Got some Koreans to back you up on that? EUC-KR (realistically, CP949)
>> still lives; see the <head> of http://chosun.com and donga.com.
>> naver.com is UTF-8 (not a total surprise). Perhaps all the Korean text
>> processing *in Lua* is being done in unchecked UTF-8, but I kinda
>> doubt it. ...
>
> Of course I'm not claiming that all text-processing (even Lua
> text-processing) is now done in UTF-8 -- there was a touch of
> tongue-in-cheek to my message (but an element of truth as well).... :]
>
> I live in Japan and write software for a Japanese company, so I have a
> little experience in the matter.  Shift-JIS (for "local" use) and
> EUC-JP (for email) are still _hugely_ used.  [At my work, we tried to
> standardize on UTF-8 for a project, but ended up using Shift-JIS
> simply because it's the only encoding that MS dev tools support
> sanely, and has a lot of legacy support in other tools.  The encoding
> support in our own code is a mess that tries to generally support
> various multibyte encodings, but in practice probably only has to work
> properly for Shift-JIS and UTF-8.]
>
> Nonetheless, my intuition is that:
>
>  (a) The Lua universe is not the more general universe.  Projects
>  using Lua tend to be smaller, and less dependent on giant
>  frameworks.  The tradeoffs for Lua projects are often somewhat
>  different as a result.  Simple is good.
>
>  (b) UTF-8 is generally considered the future here (something that
>  must be supported now, and will increasingly replace other
>  encodings) even in areas where there's a lot of legacy need/support
>  for other encodings.  People don't use older encodings because they
>  _can_, but because they _must_.  There's definitely an awareness
>  that moving to UTF-8 is something that should be done if it's
>  possible though.
>
>  (c) In a lot of applications, not all that much "detail handling" is
>  needed for what text-processing they do -- and if UTF-8 can be
>  assumed, things get _much simpler_.  [E.g., if you have to manipulate
>  multi-byte-encoded pathnames (or anything with meaningful ASCII
>  syntax), it's really simple in UTF-8 -- the same code one uses for
>  ASCII will work fine -- but miserable in Shift-JIS, because random
>  ASCII characters can occur in the middle of multi-byte characters.]
>
> So I'd say there's several levels of text-processing support:
>
>  (1) If you can, don't even bother: treat strings as blobs, and
>  don't care what's in them (the default Lua state).
>
>  (2) If you need to do a little manipulation, try to use UTF-8 for
>  the encoding, but don't make any particular attempt to hide the
>  fact that they are encoded (i.e., don't pretend that strings are
>  "sequences of characters").  Only use what small functions you can
>  get away with (assuming UTF-8 makes this _much_ easier), e.g.,
>  counting characters, converting between byte- and character- offsets
>  etc.  Such functions for UTF-8 are generally so easy that for many
>  projects it's fine to just write them yourself, but a "tiny-utf8"
>  library might not be a bad idea (which basically supports only stuff
>  that's either trivial as a result of the encoding properties, or can
>  be very compactly encoded).
>
>  (3) If you're doing full-fat text-processing (text-editor, etc),
>  maybe you do need real unicode support, giant tables and all.  It
>  would be good to have a standard Lua library for this (there is one,
>  I think, but I don't remember the name).
>
> For cases where legacy non-ASCII encodings _need_ to be supported,
> especially if you need "full-fat" features, I dunno what good choices
> there are, especially if you want to be portable (and so can't rely on
> e.g. iconv)...  Handling legacy multi-byte encodings is generally a
> lot messier and more intrusive, and so should be avoided if possible;
> sometimes platform libraries can make things easier, but that sort of
> moves out of the realm of general Lua discussion.
>
> -miles
>
> --
> Any man who is a triangle, has thee right, when in Cartesian Space,
> to have angles, which when summed, come to know more, nor no less,
> than nine score degrees, should he so wish.  [TEMPLE OV THEE LEMUR]
>
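Point (c) above is easy to see with a concrete byte sequence. The
sketch below splits a path on an ASCII separator with no knowledge of
the encoding. The split helper is hypothetical, and the test path (a
directory named with the single character U+8868, then a.txt) is made
up for illustration. In UTF-8 every byte of a multi-byte character is
>= 0x80, so the split is safe; in Shift-JIS that same character is
encoded as 0x95 0x5C, and 0x5C is the ASCII backslash.

```lua
-- Split a path on a single-byte ASCII separator, working purely on bytes.
local function split(path, sep)
  local parts = {}
  -- "%" .. sep makes the separator literal inside the character class.
  for part in path:gmatch("[^%" .. sep .. "]+") do
    parts[#parts + 1] = part
  end
  return parts
end

-- The same two-component path in both encodings (decimal byte escapes).
local utf8_path = "\232\161\168\\a.txt"  -- U+8868 as E8 A1 A8, then "\a.txt"
local sjis_path = "\149\92\\a.txt"       -- U+8868 as 95 5C, then "\a.txt"

local u = split(utf8_path, "\\")
local s = split(sjis_path, "\\")

print(#u[1])  -- 3: the whole character survives
print(#s[1])  -- 1: only the lead byte 0x95 is left, the character is cut in half
```

Doing this correctly for Shift-JIS means tracking lead bytes while
scanning, which is the "miserable" part.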

I do think a simple UTF-8 library would be quite a good thing to have
- basically just have all of Lua's string methods, but operating on
characters instead of bytes. (So e.g. ustring.sub(str, 3, 6) would
extract the 3rd to 6th characters of str rather than its 3rd to 6th
bytes.) My
worry though would be ending up like PHP, where you have to remember
to use the mb_* functions instead of the normal ones.
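
A minimal sketch of what the core of such a module might look like,
assuming valid UTF-8 input (nothing is validated) and positive indices
only. Only the name ustring.sub comes from the paragraph above;
ustring.len and ustring.offset are hypothetical helpers. It relies on
the fact that UTF-8 continuation bytes are always in the range
0x80-0xBF, so every other byte starts a character.

```lua
local ustring = {}

-- A continuation byte has the bit pattern 10xxxxxx (128..191).
local function is_continuation(b)
  return b >= 128 and b < 192
end

-- Number of characters (code points) in s.
function ustring.len(s)
  local n = 0
  for i = 1, #s do
    if not is_continuation(s:byte(i)) then n = n + 1 end
  end
  return n
end

-- Byte offset at which the n-th character starts (n >= 1).
-- Returns #s + 1 when n is one past the last character, nil beyond that.
function ustring.offset(s, n)
  local count = 0
  for i = 1, #s do
    if not is_continuation(s:byte(i)) then
      count = count + 1
      if count == n then return i end
    end
  end
  if n == count + 1 then return #s + 1 end
  return nil
end

-- Characters i..j inclusive (1-based, positive indices only in this sketch).
function ustring.sub(s, i, j)
  local first = ustring.offset(s, i)
  if not first then return "" end
  local past = ustring.offset(s, j + 1) or (#s + 1)
  return s:sub(first, past - 1)
end

local s = "h\195\169llo"        -- "héllo": 5 characters, 6 bytes
print(#s, ustring.len(s))       -- 6   5
print(ustring.sub(s, 2, 3))     -- él
```

Each call scans from the start of the string, so this is O(n) per
operation; that is fine for short strings but would need caching for
heavy use.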

I suspect this could be accomplished by means of a function that
"converts" a string to a UTF-8 string, which would be represented as a
table or userdata with metamethods to make it behave like a string.
Then you could just write:
    str = U'this is a UTF-8 string'
    print(#str) -- gives number of characters, not number of bytes

The main problem I can see then would be that type(str) ~= "string"...
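
A rough sketch of how such a wrapper might be built from a table with
metamethods, assuming valid UTF-8 input. Apart from the name U,
everything here (the UString table, the bytes field, the len method)
is hypothetical. Note that __len is only honoured for tables in Lua
5.2 and later, so under 5.1 #str would still need a userdata or the
explicit str:len() call.

```lua
local UString = {}
UString.__index = UString

-- Wrap a plain byte string; the original bytes are kept in the "bytes" field.
local function U(bytes)
  return setmetatable({ bytes = bytes }, UString)
end

-- Count characters by counting bytes that are not UTF-8 continuation bytes.
function UString:len()
  local n = 0
  for i = 1, #self.bytes do
    local b = self.bytes:byte(i)
    if b < 128 or b >= 192 then n = n + 1 end
  end
  return n
end

UString.__len = UString.len  -- used by # on tables in Lua 5.2+, ignored in 5.1
UString.__tostring = function(self) return self.bytes end
UString.__eq = function(a, b) return a.bytes == b.bytes end
UString.__concat = function(a, b)
  -- Either operand may be a plain Lua string.
  local ab = type(a) == "table" and a.bytes or a
  local bb = type(b) == "table" and b.bytes or b
  return U(ab .. bb)
end

local str = U"na\195\175ve"   -- "naïve": 5 characters, 6 bytes
print(str:len())              -- 5
print(type(str))              -- table
print(tostring(str .. "!"))   -- naïve!
```

This shows the type(str) problem directly, and there is a second one:
U"a" == "a" is false, because Lua does not call __eq when the operands
have different types.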

-- 
Sent from my toaster.