- Subject: Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)
- From: André Naef <andre@...>
- Date: Thu, 9 Feb 2012 23:54:38 +0100
When I implemented UTF-8 support very similar to this in a recent project, I made my utf8.codepoints() function suitable for use in the generic for construct, i.e. it iterates over the codepoints one at a time. One application of this was checking for codepoints that cannot be represented in the 8-bit SMS alphabet. I am unsure whether there are many applications where fully materializing the codepoints is beneficial.
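For illustration, an iterator of this kind might look roughly like the sketch below. It is not the code from the project mentioned above: the names codepoints and first_unsupported are hypothetical, the input is assumed to be well-formed UTF-8, and the `allowed` table stands in for whatever target alphabet is being checked.

```lua
-- Hypothetical sketch: iterate the code points of a UTF-8 string.
-- Assumes s is well-formed UTF-8; malformed input is not detected.
local function codepoints(s)
  local i = 1                      -- byte position of the next code point
  return function()
    local b = s:byte(i)
    if not b then return nil end   -- end of string stops the generic for
    local n, cp                    -- n = number of continuation bytes
    if b < 0x80 then n, cp = 0, b
    elseif b < 0xE0 then n, cp = 1, b - 0xC0
    elseif b < 0xF0 then n, cp = 2, b - 0xE0
    else n, cp = 3, b - 0xF0 end
    for k = 1, n do
      -- each continuation byte contributes its low 6 bits
      cp = cp * 64 + (s:byte(i + k) - 0x80)
    end
    i = i + n + 1
    return cp
  end
end

-- Return the first code point of s that is not a key of `allowed`,
-- or nil if every code point is allowed. Nothing is materialized.
local function first_unsupported(s, allowed)
  for cp in codepoints(s) do
    if not allowed[cp] then return cp end
  end
  return nil
end
```

Because the loop stops at the first offending code point, no intermediate table of code points is ever built.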
On 09.02.2012, at 19:37, Roberto Ierusalimschy <roberto@inf.puc-rio.br> wrote:
>> Getting Lua's core to change its view of strings to being something
>> other than a byte sequence isn't going to happen; it's not the Lua way,
>
> Sure.
>
>
>> Getting a new library into the lua core is unlikely, but could happen.
>
> Very basic support for UTF-8, along the lines suggested by Miles Bader,
> seems a good start. Something more or less like this:
>
> utf8.len(s, [l]) -> number of code points in s up to 'l'-th byte (or nil
> if s is not properly formed)
>
> utf8.byteoffset(s, l) -> offset (in bytes) where 'l'-th code point
> starts
>
> utf8.frontier(s, l) -> offset (in bytes) where code point containing
> l-th byte starts (ends?)
>
> utf8.codepoint(s, i, j) -> code points in s from *byte* offset i to j
> (default i=1, j=i); i adjusts backward and j adjusts forward until a
> proper frontier. (Another function that returns a table with those
> code points might be useful; {utf8.codepoint(s, 1, -1)} may be too heavy.)
>
> utf8.char(cp1, cp2, ...) -> string formed by code points cp1, cp2, ...
> (If cp1 is a table, string formed by the code points in it?)
>
> -- Roberto
>
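To make the first two functions in the quoted list concrete, here is a rough sketch of how they could behave. The names utf8_len, utf8_byteoffset and is_cont are hypothetical, and the details are assumptions rather than anything the proposal fixes: validation is only structural (overlong forms and surrogates are not rejected), and a code point that starts at or before byte l is counted even if it ends after it.

```lua
-- Hypothetical helper: true if b is a continuation byte (10xxxxxx).
local function is_cont(b)
  return b ~= nil and b >= 0x80 and b < 0xC0
end

-- Count the code points that start within the first l bytes of s
-- (default: all of s). Returns nil on a structural error.
local function utf8_len(s, l)
  if not l or l > #s then l = #s end
  local i, n = 1, 0
  while i <= l do
    local b = s:byte(i)
    local extra                          -- expected continuation bytes
    if b < 0x80 then extra = 0
    elseif b < 0xC0 then return nil      -- stray continuation byte
    elseif b < 0xE0 then extra = 1
    elseif b < 0xF0 then extra = 2
    elseif b < 0xF8 then extra = 3
    else return nil end                  -- 0xF8..0xFF never start a sequence
    for k = 1, extra do
      if not is_cont(s:byte(i + k)) then return nil end
    end
    i = i + extra + 1
    n = n + 1
  end
  return n
end

-- Byte offset where the l-th code point starts, or nil if s has fewer
-- than l code points. Assumes s is well-formed.
local function utf8_byteoffset(s, l)
  local n = 0
  for i = 1, #s do
    if not is_cont(s:byte(i)) then       -- every non-continuation byte starts a code point
      n = n + 1
      if n == l then return i end
    end
  end
  return nil
end
```

Both functions work directly on the byte string, so the string type itself does not need to change.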