When implementing UTF-8 support very similar to this in a recent project, I made my utf8.codepoints() function usable in the generic for construct, i.e. it iterates over the code points instead of returning them all at once. One application of this was checking for code points that cannot be represented in the 8-bit SMS alphabet. I am not sure there are many applications where fully materializing the code points is beneficial.
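
A minimal sketch of such an iterator in plain Lua, for illustration only. The names `codepoints` and `first_outside` are hypothetical, this is not the code from the project mentioned above, and the simple upper bound stands in for a real alphabet table. It assumes mostly well-formed input: overlong three- and four-byte forms and surrogates are not rejected.

```lua
-- Hypothetical helper: iterate over the code points of a UTF-8 string
-- without building a table. Yields (byte position, code point) pairs.
local function codepoints(s)
  local i = 1                          -- byte position of the next sequence
  return function()
    if i > #s then return nil end      -- end of string stops the generic for
    local b = s:byte(i)
    local n, cp                        -- n = number of continuation bytes
    if b < 0x80 then n, cp = 0, b
    elseif b >= 0xC2 and b < 0xE0 then n, cp = 1, b - 0xC0
    elseif b >= 0xE0 and b < 0xF0 then n, cp = 2, b - 0xE0
    elseif b >= 0xF0 and b < 0xF5 then n, cp = 3, b - 0xF0
    else error("invalid UTF-8 lead byte at position " .. i) end
    for k = 1, n do
      local c = s:byte(i + k)
      if not c or c < 0x80 or c > 0xBF then
        error("invalid UTF-8 continuation byte at position " .. (i + k))
      end
      cp = cp * 64 + (c - 0x80)        -- append 6 payload bits
    end
    local start = i
    i = i + n + 1
    return start, cp
  end
end

-- Placeholder for a "representable?" check: returns the byte position and
-- value of the first code point above `limit`, or nil if there is none.
local function first_outside(s, limit)
  for pos, cp in codepoints(s) do
    if cp > limit then return pos, cp end
  end
  return nil
end
```

Because the loop returns as soon as it finds an offending code point, nothing beyond that point is decoded and no intermediate table is allocated.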

On 09.02.2012, at 19:37, Roberto Ierusalimschy <roberto@inf.puc-rio.br> wrote:

>> Getting Lua's core to change its view of strings to being something
>> other than a byte-sequence isn't going to happen; it's not the Lua way,
> 
> Sure.
> 
> 
>> Getting a new library into the Lua core is unlikely, but could happen.
> 
> Very basic support for UTF-8, along the lines suggested by Miles Bader,
> seems like a good start. Something more or less like this:
> 
> utf8.len(s, [l]) -> number of code points in s up to 'l'-th byte (or nil
> if s is not properly formed)
> 
> utf8.byteoffset(s, l) -> offset (in bytes) where 'l'-th code point
> starts
> 
> utf8.frontier(s, l) -> offset (in bytes) where code point containing
> l-th byte starts (ends?)
> 
> utf8.codepoint(s, i, j) -> code points in s from *byte* offset i to j
> (default i=1, j=i); i adjusts backward and j adjusts forward until a
> proper frontier. (Another function that returns a table with those
> code points might be useful; {utf8.codepoint(s, 1, -1)} may be too heavy.)
> 
> utf8.char(cp1, cp2, ...) -> string formed by code points cp1, cp2, ...
> (If cp1 is a table, string formed by the code points in it?)
> 
> -- Roberto
>
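
For illustration, four of the functions proposed above could be sketched in plain Lua roughly as follows. This is only one reading of the proposal, not an implementation from the thread. The table name `u8` and the helpers `seqlen` and `iscont` are hypothetical, and the edge cases are assumptions: `len` counts code points that start at or before byte `l`, `frontier` returns the start of the containing code point, and out-of-range arguments yield nil. Overlong three- and four-byte forms and surrogates are not rejected, and utf8.codepoint is left out for brevity.

```lua
local u8 = {}   -- hypothetical namespace, to avoid clashing with any real utf8

-- Total byte length of the sequence introduced by lead byte b, or nil.
local function seqlen(b)
  if not b then return nil
  elseif b < 0x80 then return 1
  elseif b >= 0xC2 and b < 0xE0 then return 2
  elseif b >= 0xE0 and b < 0xF0 then return 3
  elseif b >= 0xF0 and b < 0xF5 then return 4
  end
  return nil
end

local function iscont(b) return b ~= nil and b >= 0x80 and b <= 0xBF end

-- Code points starting at or before byte l; nil if a sequence is malformed.
function u8.len(s, l)
  l = math.min(l or #s, #s)
  local i, n = 1, 0
  while i <= l do
    local k = seqlen(s:byte(i))
    if not k then return nil end
    for j = i + 1, i + k - 1 do
      if not iscont(s:byte(j)) then return nil end
    end
    i, n = i + k, n + 1
  end
  return n
end

-- Byte offset where the l-th code point starts (assumes well-formed s).
function u8.byteoffset(s, l)
  if l < 1 then return nil end
  local i = 1
  for _ = 2, l do
    local k = seqlen(s:byte(i))
    if not k then return nil end
    i = i + k
  end
  if i > #s then return nil end
  return i
end

-- Byte offset where the code point containing the l-th byte starts.
function u8.frontier(s, l)
  if l < 1 or l > #s then return nil end
  while l > 1 and iscont(s:byte(l)) do l = l - 1 end
  return l
end

-- String formed by the given code points.
function u8.char(...)
  local out = {}
  for idx = 1, select("#", ...) do
    local cp = select(idx, ...)
    assert(cp >= 0 and cp <= 0x10FFFF, "code point out of range")
    if cp < 0x80 then
      out[idx] = string.char(cp)
    elseif cp < 0x800 then
      out[idx] = string.char(0xC0 + math.floor(cp / 64), 0x80 + cp % 64)
    elseif cp < 0x10000 then
      out[idx] = string.char(0xE0 + math.floor(cp / 4096),
        0x80 + math.floor(cp / 64) % 64, 0x80 + cp % 64)
    else
      out[idx] = string.char(0xF0 + math.floor(cp / 262144),
        0x80 + math.floor(cp / 4096) % 64,
        0x80 + math.floor(cp / 64) % 64, 0x80 + cp % 64)
    end
  end
  return table.concat(out)
end
```

The sketch uses only arithmetic and string.byte/string.char, so the strings themselves remain ordinary byte sequences, in keeping with the point made at the top of the quoted message.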