It was thus said that the Great Dirk Laurie once stated:
> There's no utf8.sub. Anyone tried to code it in pure Lua yet?
> This is my attempt:
> 
> function utf8.sub(s,i,j)
>    i = i or 1
>    j = j or -1
>    if i<1 or j<1 then
>       local n = utf8.len(s)
>       if not n then return nil end
>       if i<0 then i = n+1+i end
>       if j<0 then j = n+1+j end
>       if i<0 then i = 1 elseif i>n then i = n end
>       if j<0 then j = 1 elseif j>n then j = n end
>    end
>    if j<i then return "" end
>    i = utf8.offset(s,i)
>    j = utf8.offset(s,j+1)
>    if i and j then return s:sub(i,j-1)
>       elseif i then return s:sub(i)
>       else return ""
>    end
> end
> 
> Notes:
> 
> 1. Care is taken to avoid calculating utf8.len(s) when it is not necessary.
> 2. The use of utf8.offset implies that the result is undefined when s is not
> a valid UTF8 sequence.

  Do i and j represent Unicode code points or graphemes?  I'm doing some
reading [1] and it appears the easiest thing to do is to treat the indices
as Unicode code points.
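
  To make the code point reading concrete, here is a rough sketch, assuming
Lua 5.3 or later for the utf8 library and the "\u{...}" escapes.  The name
utf8_sub is made up for illustration and is not part of the standard
library.  It always calls utf8.len (so it gives up the optimization in
note 1) and clamps out-of-range indices the way string.sub does.  Treat it
as one possible approach, not a reference implementation.

```lua
-- Hypothetical helper: substring by code point index, not by grapheme.
-- Returns nil if s is not valid UTF-8.
local function utf8_sub(s, i, j)
  local n = utf8.len(s)            -- number of code points, or nil if invalid
  if not n then return nil end
  i = i or 1
  j = j or -1
  -- Negative indices count from the end, as in string.sub.
  if i < 0 then i = n + 1 + i end
  if j < 0 then j = n + 1 + j end
  -- Clamp to the valid range; an empty range yields "".
  if i < 1 then i = 1 end
  if j > n then j = n end
  if i > j then return "" end
  local first = utf8.offset(s, i)          -- byte where code point i starts
  local last = utf8.offset(s, j + 1) - 1   -- byte just before code point j+1
  return s:sub(first, last)
end

-- "e" followed by U+0301 (combining acute accent) and then "a":
-- two graphemes on screen, but three code points.
local s = "e\u{301}a"
print(utf8.len(s))                          --> 3
print(utf8_sub(s, 1, 1) == "e")             --> true (accent split off)
print(utf8_sub(s, 1, 2) == "e\u{301}")      --> true
print(utf8_sub(s, -1) == "a")               --> true
```

  Note that utf8_sub(s, 1, 1) cuts the accent off its base letter, which is
exactly the case where code points and graphemes disagree.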

  -spc (Because sometimes a character and combining character count as a
	single grapheme, and sometimes they don't ... )

[1]	http://www.unicode.org/faq/char_combmark.html