lua-users home
lua-l archive

Bertrand Mansion <golgote@mamasam.com> writes:

> Hi,
>
> I am new to lua and currently reading the book.
>
> I am wondering if lua 5.1 supports utf-8 in string handling,
> comparisons, conversions and pattern matching and things like \u in
> strings.
> If not, are there plans to add utf-8 support in the future ?

There is slnunicode.  Personally, I'd like to see transparent handling
of utf-8.  However, this makes strings different from byte streams.
Also, any reasonable handling of utf-8 strings should be able to
assume that they contain only valid byte sequences.  That means one
needs read and write conversion functions even for files assumed to
be in utf-8 locales (in order to convert illegal byte sequences into
legal ones).
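
As a rough illustration of both points (written in Python rather than
Lua, since Lua 5.1 has no built-in utf-8 handling), the sketch below
shows that byte length and character length differ, and that a read
conversion can turn illegal byte sequences into legal ones.  The
function name read_as_valid_utf8 and the sample data are made up for
this example; replacing bad bytes with U+FFFD is just one possible
policy, not something slnunicode is claimed to do.

```python
# Illustration only: Python stands in for a hypothetical utf-8-aware Lua.

# "naive" with a diaeresis, encoded as utf-8, followed by two illegal bytes.
raw = b"na\xc3\xafve \xff\xfe end"


def read_as_valid_utf8(data: bytes) -> str:
    """Read conversion: replace each illegal byte sequence with U+FFFD."""
    return data.decode("utf-8", errors="replace")


text = read_as_valid_utf8(raw)

byte_len = len(raw)    # length of the byte stream
char_len = len(text)   # length in characters after conversion

# Writing the converted string back gives valid utf-8, but not the
# original bytes: the illegal bytes are gone for good.
rewritten = text.encode("utf-8")
```

With this policy the string is always valid afterwards, but the
conversion is lossy: writing the file back does not reproduce the
original bytes.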

And so on.  slnunicode does not actually do much in the area of
verification.  If one takes a look at the input handling of Emacs, one
feature is that interpreting a file filled with random bytes as utf-8
will still preserve its contents when being rewritten unmodified.
That is because invalid input bytes get turned into special sequences
(that are in turn not considered valid sequences in a file) that get
reconverted to bytes upon writing.
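
A comparable round trip can be sketched with Python's
"surrogateescape" error handler.  This is an analogy, not a
description of how Emacs implements it: each undecodable byte becomes
a lone surrogate code point (which is not valid in utf-8 output), and
the same handler turns it back into the original byte on writing.  The
helper names load and save are made up for this example.

```python
# Illustration only: an analogy to the Emacs behaviour, not its implementation.


def load(data: bytes) -> str:
    """Decode as utf-8, escaping invalid bytes as lone surrogates."""
    return data.decode("utf-8", errors="surrogateescape")


def save(text: str) -> bytes:
    """Encode as utf-8, turning escaped surrogates back into raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# Every possible byte value, twice over, plus some genuine utf-8 text.
blob = bytes(range(256)) * 2 + "h\u00e9llo".encode("utf-8")

text = load(blob)
round_trip_ok = save(text) == blob

# The escaped characters are not valid in strict utf-8 output.
try:
    text.encode("utf-8")
    strict_encode_fails = False
except UnicodeEncodeError:
    strict_encode_fails = True
```

The cost is that the in-memory string may contain code points that are
not legal characters, so code operating on it cannot assume that every
element is a real character.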

In my opinion it is not at all clear how one could create a
small-footprint Lua that supports both byte arrays (unibyte strings,
if you like) and multibyte character strings in which the characters
actually form atomic string components.

Emacs actually has a flag on every string that distinguishes unibyte
and multibyte strings.
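
To make the idea of such a flag concrete, here is a minimal sketch in
Python.  FlaggedString and its method at are hypothetical names
invented for this example; they are not Emacs or Lua APIs, and a real
implementation would not decode the whole string on every access.

```python
# Illustration only: a toy string type carrying a unibyte/multibyte flag.
from dataclasses import dataclass


@dataclass(frozen=True)
class FlaggedString:
    data: bytes        # the stored bytes are the same in both cases
    multibyte: bool    # the flag decides how they are interpreted

    def __len__(self) -> int:
        if self.multibyte:
            # Count characters, not bytes.
            return len(self.data.decode("utf-8"))
        return len(self.data)

    def at(self, i: int):
        """Return element i: a character if multibyte, else a byte value."""
        if self.multibyte:
            return self.data.decode("utf-8")[i]
        return self.data[i]


payload = "a\u00f1b".encode("utf-8")   # 'a', n-with-tilde, 'b': 4 bytes

as_bytes = FlaggedString(payload, multibyte=False)
as_chars = FlaggedString(payload, multibyte=True)
```

Here len(as_bytes) is 4 while len(as_chars) is 3, and index 1 is the
byte value 195 in one case and a whole character in the other.  Every
operation has to check the flag, which is part of the price.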

This is considered a design flaw by some, not least by the XEmacs
developers.  On the other hand, the XEmacs developers have been forced
to provide an XEmacs binary that supports _only_ unibyte strings in
addition to the version supporting _only_ multibyte strings, because
those who don't need multibyte strings are not willing to pay the
price.

In short: proper utf-8 support comes at a price, and even large,
closely related projects don't arrive at the same solutions.

-- 
David Kastrup