- Subject: Re: Lua 5.3.0-work2: utf8.sub?
- From: Sean Conner <sean@...>
- Date: Fri, 11 Apr 2014 02:10:21 -0400
It was thus said that the Great Dirk Laurie once stated:
> There's no utf8.sub. Anyone tried to code it in pure Lua yet?
> This is my attempt:
>
> function utf8.sub(s,i,j)
>    i = i or 1
>    j = j or -1
>    if i<1 or j<1 then
>       local n = utf8.len(s)
>       if not n then return nil end
>       if i<0 then i = n+1+i end
>       if j<0 then j = n+1+j end
>       if i<0 then i = 1 elseif i>n then i = n end
>       if j<0 then j = 1 elseif j>n then j = n end
>    end
>    if j<i then return "" end
>    i = utf8.offset(s,i)
>    j = utf8.offset(s,j+1)
>    if i and j then return s:sub(i,j-1)
>    elseif i then return s:sub(i)
>    else return ""
>    end
> end
>
> Notes:
>
> 1. Care is taken to avoid calculating utf8.len(s) when it is not necessary.
> 2. The use of utf8.offset implies that the result is undefined when s is not
> a valid UTF8 sequence.
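For comparison, here is a minimal sketch of a variant that clamps its
arguments the way string.sub does. It assumes the utf8 library as it
behaves in released Lua 5.3 (utf8.len returning nil on invalid input,
utf8.offset(s, n+1) returning #s+1 for a string of n code points), which
may not match 5.3.0-work2. The name utf8_sub is only illustrative, and
unlike the version above it always pays for utf8.len.

```lua
-- Sketch only: substring by code point index, following string.sub's
-- conventions for negative and out-of-range indices.
local function utf8_sub(s, i, j)
  i = i or 1
  j = j or -1
  local n = utf8.len(s)
  if not n then return nil end        -- s is not valid UTF-8
  if i < 0 then i = n + 1 + i end     -- negative indices count from the end
  if j < 0 then j = n + 1 + j end
  if i < 1 then i = 1 end             -- clamp the start up, as string.sub does
  if j > n then j = n end             -- clamp the end down
  if i > j then return "" end
  local first = utf8.offset(s, i)     -- byte where code point i starts
  local past  = utf8.offset(s, j + 1) -- byte just past code point j
  return s:sub(first, past - 1)
end

print(utf8_sub("h\u{E9}llo", 2, 3))   -- code points 2 and 3
print(utf8_sub("h\u{E9}llo", -2))     -- last two code points
```

The trade-off is simplicity against the cost of scanning the whole
string on every call.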
Do i and j represent Unicode code points or graphemes? I'm doing some
reading [1] and it appears the easiest thing to do is to treat the indices
as Unicode code points.
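
To make the distinction concrete, here is a small sketch, assuming the
utf8 library and the \u{XXX} string escapes of released Lua 5.3. The same
visible character can be one code point or two, and indexing by code
point can cut a combining mark away from its base character.

```lua
-- "e with acute" as one precomposed code point (U+00E9) versus
-- "e" followed by COMBINING ACUTE ACCENT (U+0301).
local precomposed = "\u{E9}"
local decomposed  = "e\u{301}"

print(utf8.len(precomposed), #precomposed)  -- 1 code point, 2 bytes
print(utf8.len(decomposed),  #decomposed)   -- 2 code points, 3 bytes

-- Taking only the first code point of the decomposed form drops the accent.
local cut  = utf8.offset(decomposed, 2)     -- byte where the 2nd code point starts
local head = decomposed:sub(1, cut - 1)
print(head)                                 -- a bare "e"
```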
-spc (Because sometimes a character and a combining character count as a
single grapheme, and sometimes they don't ... )
[1] http://www.unicode.org/faq/char_combmark.html