Re: Unicode?

lua-l archive


Subject: Re: Unicode?
From: RLake@...
Date: Thu, 12 Jun 2003 11:14:00 -0500

> What about string.byte and string.char?

> string.byte is documented as returning "the internal
> numerical code of the i-th _character_ of s".

> Would that mean the 16-bit (32-bit?) Unicode value for the i-th Unicode
> character, or the (probably 8-bit) value of the i-th raw byte?

> (same as above, in reverse perspective, for string.char)

Yes, that is the problem.

It is quite nice that Lua allows arbitrary octet sequences in strings,
and I think this should be part of the specification (i.e., a string
is an immutable vector of an opaque enumeration with at least 256
elements). It would be even nicer to say exactly 256 elements, to
allow reading of binary files, but there are architectures where
that may not be practical.
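
To make the octet view concrete, here is a small sketch, assuming
Lua 5.1 or later (for the # operator) and a build where a char is
8 bits. The sample strings are made up for illustration. It only
shows that string.byte and string.char deal in raw octets and know
nothing about Unicode characters.

```lua
-- "\195\169" is the UTF-8 encoding of U+00E9, written with decimal escapes.
local s = "caf\195\169"

print(#s)                  -- 5: the length counts octets, not characters
print(string.byte(s, 4))   -- 195: first octet of the two-octet sequence
print(string.byte(s, 5))   -- 169: second octet

-- string.char is the inverse: it builds a string from octet values.
print(string.char(195, 169) == string.sub(s, 4, 5))   -- true

-- Embedded zero octets are fine, which is what makes binary data usable.
local z = "a\0b"
print(#z)                  -- 3
```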

"Display strings" could be implemented as userdata with their own
__concat and __tostring methods, and member functions to extract
substrings, return the code of the i-th character, and convert
from an octet string. (The first two are linear time unless the
internal representation uses fixed-width units of at least 21 bits.)
Plus some sort of standard iterator method. In fact, there
could be a variety of such types.
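
A rough sketch of what such a type might look like, assuming Lua 5.1
or later and UTF-8 as the octet encoding. Everything here is
hypothetical: the name DisplayString, its methods, and the encode
helper are invented for illustration, and a table with a metatable
stands in for the userdata, since pure Lua cannot create one. Code
points are kept one per array slot, so indexing is constant time.
The decoder checks lead and continuation bytes but does not reject
overlong forms or surrogates.

```lua
local DisplayString = {}
DisplayString.__index = DisplayString

-- Wrap an array of code points in the (hypothetical) display-string type.
local function wrap(cps)
  return setmetatable({ cps = cps }, DisplayString)
end

-- Encode one code point as a UTF-8 octet string.
local function encode(cp)
  local f = math.floor
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(192 + f(cp / 64), 128 + cp % 64)
  elseif cp < 0x10000 then
    return string.char(224 + f(cp / 4096), 128 + f(cp / 64) % 64,
                       128 + cp % 64)
  else
    return string.char(240 + f(cp / 262144), 128 + f(cp / 4096) % 64,
                       128 + f(cp / 64) % 64, 128 + cp % 64)
  end
end

-- Convert from an octet string (assumed to be UTF-8).
function DisplayString.from_octets(bytes)
  local cps, i, n = {}, 1, #bytes
  while i <= n do
    local b = string.byte(bytes, i)
    local len, cp
    if b < 128 then len, cp = 1, b
    elseif b >= 192 and b < 224 then len, cp = 2, b - 192
    elseif b >= 224 and b < 240 then len, cp = 3, b - 224
    elseif b >= 240 and b < 248 then len, cp = 4, b - 240
    else error("invalid lead byte at position " .. i) end
    for j = i + 1, i + len - 1 do
      local c = string.byte(bytes, j)
      if not c or c < 128 or c >= 192 then
        error("invalid continuation byte at position " .. j)
      end
      cp = cp * 64 + (c - 128)
    end
    cps[#cps + 1] = cp
    i = i + len
  end
  return wrap(cps)
end

function DisplayString:len() return #self.cps end

-- Code of the i-th character (nil if out of range).
function DisplayString:code(i) return self.cps[i] end

-- Substring by character positions i..j (j defaults to the end).
function DisplayString:sub(i, j)
  local out = {}
  for k = i, j or #self.cps do out[#out + 1] = self.cps[k] end
  return wrap(out)
end

-- Iterator yielding (position, code point) pairs.
function DisplayString:chars()
  local i = 0
  return function()
    i = i + 1
    local cp = self.cps[i]
    if cp then return i, cp end
  end
end

DisplayString.__tostring = function(self)
  local parts = {}
  for i, cp in ipairs(self.cps) do parts[i] = encode(cp) end
  return table.concat(parts)
end

-- Either operand may be a plain octet string; it is decoded first.
DisplayString.__concat = function(a, b)
  if type(a) == "string" then a = DisplayString.from_octets(a) end
  if type(b) == "string" then b = DisplayString.from_octets(b) end
  local out = {}
  for _, cp in ipairs(a.cps) do out[#out + 1] = cp end
  for _, cp in ipairs(b.cps) do out[#out + 1] = cp end
  return wrap(out)
end

local d = DisplayString.from_octets("h\195\169llo")
print(d:len(), d:code(2))    -- 5 characters; the second is U+00E9 (233)
print(tostring(d .. "!") == "h\195\169llo!")   -- true
```

Storing decoded code points trades memory for constant-time indexing;
a variant that kept the UTF-8 octets internally would be smaller but
would make sub and code linear, as noted above.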

The current implementation, where string comparison (but not
equality) is affected by the locale setting, is slightly
unpleasant; it makes it difficult to use strings reliably
outside of the C locale.

For example, if the locale is not C, there is no guarantee that
  (s < t or t < s or s == t)
is true.
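
One possible workaround, sketched here assuming Lua 5.1 or later and
that plain octet order is acceptable, is to compare octet by octet
instead of using the < operator. The helper name byte_less is made
up for this example. Since it only looks at octet values, its result
does not depend on the locale, and for any two strings exactly one of
"s before t", "t before s", or "s == t" holds.

```lua
-- Locale-independent "less than": compares octet values, and a proper
-- prefix sorts before the longer string.
local function byte_less(s, t)
  local n = math.min(#s, #t)
  for i = 1, n do
    local a, b = string.byte(s, i), string.byte(t, i)
    if a ~= b then return a < b end
  end
  return #s < #t
end

-- Check that exactly one of the three relations holds for a pair.
local function trichotomy(s, t)
  local count = 0
  if byte_less(s, t) then count = count + 1 end
  if byte_less(t, s) then count = count + 1 end
  if s == t then count = count + 1 end
  return count == 1
end

print(byte_less("Z", "a"))         -- true: octet 90 comes before 97
print(trichotomy("abc", "abd"))    -- true
```

It can also be passed to table.sort as the comparison function when a
stable, locale-independent ordering is wanted.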
