lua-users home
lua-l archive



On 09.02.2012 19:37, Roberto Ierusalimschy wrote:

Getting a new library into the Lua core is unlikely, but could happen.

Very basic support for UTF-8, along the lines suggested by Miles Bader,
seems a good start. Something more or less like this:

utf8.len(s, [l]) ->  number of code points in s up to 'l'-th byte (or nil
if s is not properly formed)

utf8.byteoffset(s, l) ->  offset (in bytes) where 'l'-th code point
starts

utf8.frontier(s, l) ->  offset (in bytes) where code point containing
l-th byte starts (ends?)

utf8.codepoint(s, i, j) ->  code points in s from *byte* offset i to j
(default i=1, j=i); i adjusts backward and j adjusts forward until a
proper frontier. (Another function returning a table with those code
points might be useful; {utf8.codepoint(s, 1, -1)} may be too heavy.)

utf8.char(cp1, cp2, ...) ->  string formed by code points cp1, cp2, ...
(If cp1 is a table, string formed by the code points in it?)
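For illustration, the proposed utf8.len could be sketched in pure Lua along these lines. This is only a sketch under assumptions: the optional 'l' argument is left out, `utf8_len` is an illustrative name rather than an actual implementation, and the validation is deliberately loose (it checks lead and continuation bytes but does not reject every overlong encoding):

```lua
-- Sketch of the proposed utf8.len, without the optional 'l' argument:
-- counts code points, returning nil if s is not well-formed UTF-8.
-- Simplified validation; some overlong encodings slip through.
local function utf8_len(s)
  local n, i = 0, 1
  while i <= #s do
    local c = s:byte(i)
    local size
    if c < 0x80 then size = 1                      -- ASCII
    elseif c >= 0xC2 and c <= 0xDF then size = 2   -- 2-byte sequence
    elseif c >= 0xE0 and c <= 0xEF then size = 3   -- 3-byte sequence
    elseif c >= 0xF0 and c <= 0xF4 then size = 4   -- 4-byte sequence
    else return nil end                            -- invalid lead byte
    for j = i + 1, i + size - 1 do                 -- check continuation bytes
      local cb = s:byte(j)
      if not cb or cb < 0x80 or cb > 0xBF then return nil end
    end
    n, i = n + 1, i + size
  end
  return n
end

print(utf8_len("héllo"))      --> 5
print(utf8_len("h\195llo"))   --> nil (truncated sequence)
```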

For short strings I find a different approach more convenient: transform the Lua string into an array of strings, where each element contains one complete UTF-8 sequence, and then operate on that array. This may cost more memory, but IMO it's easier to handle, and probably also faster (no need to iterate through the string to find the n-th character, etc.). Except for the pattern-matching functions, most string functions can easily be rewritten for this data type, often as one-liners. After editing, a simple table.concat() transforms the structure back into a Lua string.
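A minimal sketch of this approach, assuming the input is already valid UTF-8 (`explode` is an illustrative name). The gmatch pattern matches one lead byte followed by its continuation bytes; NUL bytes are ignored for simplicity:

```lua
-- Split a UTF-8 string into an array with one complete sequence per
-- element. Assumes valid UTF-8 input; does not handle NUL bytes.
local function explode(s)
  local t = {}
  for seq in s:gmatch("[\1-\127\194-\253][\128-\191]*") do
    t[#t + 1] = seq
  end
  return t
end

local chars = explode("grüße")
print(#chars)              --> 5 (character count is now just #chars)

-- Reverse by code point; string.reverse on the raw bytes would
-- scramble the multi-byte sequences.
local rev = {}
for i = #chars, 1, -1 do rev[#rev + 1] = chars[i] end
print(table.concat(rev))   --> eßürg
```

After any such edits on the array, table.concat() yields an ordinary Lua string again, as described above.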

Regards,
Bernd

--
Bernd Eggink
http://sudrala.de