- Subject: Re: Unicode?
- From: RLake@...
- Date: Thu, 12 Jun 2003 11:14:00 -0500
> What about string.byte and string.char?
> string.byte is documented as returning "the internal
> numerical code of the i-th _character_ of s".
> Would that mean the 16-bit (32-bit?) Unicode value for the i-th Unicode
> character, or the (probably 8-bit) value of the i-th raw byte?
> (same as above, in reverse perspective, for string.char)
Yes, that is the problem.
It is quite nice that Lua allows arbitrary octet-sequences in strings,
and I think this should be part of the specification (i.e., a string
is an immutable vector of values drawn from an opaque enumeration
with at least 256 elements; it would be even nicer to say exactly
256 elements, to allow reading of binary files, but there are
architectures where that may not be practical).
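
To make the octet-sequence view concrete, here is a small sketch, assuming Lua 5.1 or later (for the # operator) and a platform with 8-bit chars. The helper name octets is made up for illustration. It shows that string.byte and string.char work on raw octets, so a two-octet UTF-8 sequence counts as two "characters":

```lua
-- Build a string holding every octet value 0..255, embedded zero included.
local parts = {}
for b = 0, 255 do
  parts[#parts + 1] = string.char(b)
end
local all = table.concat(parts)

-- Hypothetical helper: list the raw octet at each position of s.
-- string.byte knows nothing about Unicode code points.
local function octets(s)
  local out = {}
  for i = 1, #s do
    out[i] = string.byte(s, i)
  end
  return out
end

local round_trip = octets(all)

-- "\195\169" is the UTF-8 encoding of U+00E9; Lua sees two elements.
local e_acute = "\195\169"
local e_octets = octets(e_acute)
```

Nothing here depends on the locale; the string is only a vector of small integers.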
"Display strings" could be implemented as userdata with their own
__concat and __tostring methods, plus member functions to extract
substrings, return the code of the i-th character (both of these
are linear time unless the internal representation uses at least
21 bits per character), and convert from an octet-string, along
with some sort of standard iterator method. In fact, there
could be a variety of such types.
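
A rough sketch of one such type follows. It is not a real library: it uses a plain Lua table with a metatable where the text suggests userdata, assumes the octet-string is well-formed UTF-8 (no validation), and assumes Lua 5.1 or later. The names DisplayString, from_octets, code, sub and chars are all invented for illustration. Code points are stored one per array slot, so indexing is constant time:

```lua
local DisplayString = {}
DisplayString.__index = DisplayString

-- Decode well-formed UTF-8 into an array of code points (no error checking).
local function decode_utf8(s)
  local cps, i, n = {}, 1, #s
  while i <= n do
    local b = string.byte(s, i)
    local len, cp
    if b < 0x80 then len, cp = 1, b
    elseif b < 0xE0 then len, cp = 2, b - 0xC0
    elseif b < 0xF0 then len, cp = 3, b - 0xE0
    else len, cp = 4, b - 0xF0 end
    for k = 1, len - 1 do
      cp = cp * 64 + (string.byte(s, i + k) - 0x80)
    end
    cps[#cps + 1] = cp
    i = i + len
  end
  return cps
end

-- Encode one code point as a UTF-8 octet-string.
local function encode_utf8(cp)
  local f = math.floor
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + f(cp / 64), 0x80 + cp % 64)
  elseif cp < 0x10000 then
    return string.char(0xE0 + f(cp / 4096), 0x80 + f(cp / 64) % 64,
                       0x80 + cp % 64)
  end
  return string.char(0xF0 + f(cp / 262144), 0x80 + f(cp / 4096) % 64,
                     0x80 + f(cp / 64) % 64, 0x80 + cp % 64)
end

local function wrap(cps)
  return setmetatable({ cps = cps }, DisplayString)
end

-- Convert from an octet-string.
function DisplayString.from_octets(s) return wrap(decode_utf8(s)) end

function DisplayString:len() return #self.cps end

-- Code of the i-th character.
function DisplayString:code(i) return self.cps[i] end

-- Substring by character positions i..j (inclusive).
function DisplayString:sub(i, j)
  local out = {}
  for k = i, j do out[#out + 1] = self.cps[k] end
  return wrap(out)
end

-- Iterator yielding (position, code point) pairs.
function DisplayString:chars()
  local i = 0
  return function()
    i = i + 1
    local cp = self.cps[i]
    if cp then return i, cp end
  end
end

DisplayString.__tostring = function(self)
  local out = {}
  for i, cp in ipairs(self.cps) do out[i] = encode_utf8(cp) end
  return table.concat(out)
end

DisplayString.__concat = function(a, b)
  -- Either operand may be a plain octet-string; decode it first.
  if type(a) == "string" then a = DisplayString.from_octets(a) end
  if type(b) == "string" then b = DisplayString.from_octets(b) end
  local out = {}
  for _, cp in ipairs(a.cps) do out[#out + 1] = cp end
  for _, cp in ipairs(b.cps) do out[#out + 1] = cp end
  return wrap(out)
end

-- "h\195\169llo" is six octets but five characters.
local d = DisplayString.from_octets("h\195\169llo")
```

A variant that kept the UTF-8 octets as its internal representation would save space, but code(i) and sub(i, j) would then have to scan from the start of the string.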
The current implementation, where string comparison (but not
equality) is affected by the locale setting, is slightly
unpleasant; it makes it difficult to use strings reliably
outside of the C locale.
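
One way to sidestep the locale, sketched here under the assumption of Lua 5.1 or later, is to order strings octet by octet instead of using the < operator. The function name byte_less is made up for this example; it is not part of the standard library:

```lua
-- Locale-independent "less than": compare octet by octet,
-- and treat a proper prefix as smaller than the longer string.
local function byte_less(s, t)
  local n = math.min(#s, #t)
  for i = 1, n do
    local a, b = string.byte(s, i), string.byte(t, i)
    if a ~= b then return a < b end
  end
  return #s < #t
end

-- Usable anywhere a comparison function is accepted.
local words = { "b", "a", "Z", "ab" }
table.sort(words, byte_less)
-- words is now { "Z", "a", "ab", "b" }
```

Calling os.setlocale("C", "collate") at startup is another option, though it changes collation for the whole process rather than for one comparison.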
For example, if the locale is not C, there is no guarantee that
(s < t or t < s or s == t)