Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)

Subject: Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)
From: Bernd Eggink <monoped@...>
Date: Fri, 10 Feb 2012 14:53:31 +0100

On 09.02.2012 19:37, Roberto Ierusalimschy wrote:

Getting a new library into the lua core is unlikely, but could happen.


A very basic support for UTF-8, in the lines suggested by Miles Bader,
seems a good start. Something more or less like this:

utf8.len(s, [l]) ->  number of code points in s up to 'l'-th byte (or nil
if s is not properly formed)

utf8.byteoffset(s, l) ->  offset (in bytes) where 'l'-th code point
starts

utf8.frontier(s, l) ->  offset (in bytes) where code point containing
l-th byte starts (ends?)

utf8.codepoint(s, i, j) ->  code points in s from *byte* offset i to j
(default i=1, j=i); i adjusts backward and j adjusts forward until a
proper frontier. (It might be useful another function to return a table
with those code points; {utf8.codepoint(s, 1, -1)} may be too heavy.)

utf8.char(cp1, cp2, ...) ->  string formed by code points cp1, cp2, ...
(If cp1 is a table, string formed by the code points in it?)
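
For illustration only, two of the functions listed above could be sketched in pure Lua roughly as follows. This is not the proposed implementation: the table name `u8` and the helper `encode` are hypothetical, the optional 'l' argument of len is left out, and the validity check is deliberately simplified (it does not reject overlong three- and four-byte forms or surrogates).

```lua
-- Hypothetical stand-in for the proposed 'utf8' table.
local u8 = {}

-- Encode one code point (assumed to be in 0..0x10FFFF) as UTF-8 bytes.
-- Uses only arithmetic so it does not depend on bitwise operators.
local function encode(cp)
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + math.floor(cp / 0x40),
                       0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return string.char(0xE0 + math.floor(cp / 0x1000),
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  else
    return string.char(0xF0 + math.floor(cp / 0x40000),
                       0x80 + math.floor(cp / 0x1000) % 0x40,
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end

-- u8.char(cp1, cp2, ...) -> string formed by the given code points.
function u8.char(...)
  local parts = {}
  for i = 1, select("#", ...) do
    parts[i] = encode((select(i, ...)))
  end
  return table.concat(parts)
end

-- u8.len(s) -> number of code points in s, or nil if a lead byte is
-- invalid or is not followed by the expected continuation bytes.
function u8.len(s)
  local n, i = 0, 1
  while i <= #s do
    local b = s:byte(i)
    local extra
    if b < 0x80 then extra = 0
    elseif b >= 0xC2 and b <= 0xDF then extra = 1
    elseif b >= 0xE0 and b <= 0xEF then extra = 2
    elseif b >= 0xF0 and b <= 0xF4 then extra = 3
    else return nil end
    for k = i + 1, i + extra do
      local c = s:byte(k)
      if not c or c < 0x80 or c > 0xBF then return nil end
    end
    n = n + 1
    i = i + extra + 1
  end
  return n
end

local s = u8.char(0x47, 0xFC, 0x20AC)  -- "G", U+00FC, U+20AC
print(#s, u8.len(s))                   -- 6 bytes, 3 code points
```
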

For short strings I find a different approach more convenient: Transform
the Lua string into an array of strings, where each element contains a
complete UTF-8 sequence, and then operate on that array. This may be
more expensive with regard to memory, but IMO it's easier to handle, and
probably also faster (no need to iterate through the string to find the
n-th character, etc.). Except for the pattern matching functions, most
string functions can easily be re-written for this data type, often as
one-liners. After editing, a simple table.concat() transforms this
structure back into a Lua string.
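
A minimal sketch of this array-of-sequences approach, assuming the input is well-formed UTF-8. The names utf8_split, usub and ureverse are hypothetical and only serve the illustration. The string literal uses byte escapes so the example does not depend on the encoding of the source file.

```lua
-- Split s into an array of strings, one UTF-8 sequence per element.
-- The pattern takes any non-continuation byte followed by its
-- continuation bytes (0x80-0xBF); it does not validate the input.
local function utf8_split(s)
  local chars = {}
  for ch in s:gmatch("[^\128-\191][\128-\191]*") do
    chars[#chars + 1] = ch
  end
  return chars
end

-- Character-based substring: just a ranged table.concat.
local function usub(chars, i, j)
  return table.concat(chars, "", i, j)
end

-- Character-based reverse, returning a new array.
local function ureverse(chars)
  local out = {}
  for i = #chars, 1, -1 do
    out[#out + 1] = chars[i]
  end
  return out
end

local chars = utf8_split("Gr\195\188\195\159e")  -- "Gruesse" with U+00FC, U+00DF
print(#chars)                           -- 5 characters (7 bytes)
print(usub(chars, 3, 4))                -- the two non-ASCII characters
print(table.concat(ureverse(chars)))    -- reversed without splitting sequences

chars[3] = "u"                          -- edit one character in place
local edited = table.concat(chars)      -- back to an ordinary Lua string
```

Length is simply #chars and indexing is constant-time, at the cost of one small string per character.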


Regards,
Bernd

--
Bernd Eggink
http://sudrala.de

