On Mon, 20 Jun 2022 at 09:56, Budi <budikusasi@gmail.com> wrote:
> In learning lua, suddenly met this:
> > utf8.codepoint("résumé", 1,2)
> 114     233
> > utf8.codepoint("résumé", 1,3)
> 114     233
> > utf8.codepoint("résumé", 1,4)
> 114     233     115
> > utf8.len("résumé")
> 6
> Any crystal clear explanation ?

RTFM?
"utf8.codepoint (s [, i [, j [, lax]]])
Returns the code points (as integers) from all characters in s that
start between byte position i and j (both included). The default for i
is 1 and for j is i. It raises an error if it meets any invalid byte
sequence."

The utf8 library treats strings as byte arrays: i and j are byte
positions, not character indices. Assuming nothing has mangled the
encoding along the way, pasting your string from the mail gives:

$ echo -n résumé | od -t u1
0000000 114 195 169 115 117 109 195 169
Bytes     1   2   3   4   5   6   7   8
Chars     1   2   2'  3   4   5   6   6'
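
The same byte/character mapping can be checked from Lua itself. A
minimal sketch, assuming Lua 5.3 or later; the string is spelled out
with byte escapes so the result does not depend on how the file or
terminal is encoded:

```lua
-- "résumé" written as explicit UTF-8 bytes (é = 195 169).
local s = "r\195\169sum\195\169"

-- utf8.codes yields, for each character, the byte position where it
-- starts and its code point.
local starts = {}
for pos, cp in utf8.codes(s) do
  starts[#starts + 1] = pos
  print(pos, cp)
end
-- prints:
-- 1       114
-- 2       233
-- 4       115
-- 5       117
-- 6       109
-- 7       233

print(#s)          -- 8, length in bytes
print(utf8.len(s)) -- 6, length in characters
```

No character starts at byte 3 or byte 8; those are the second bytes of
the two é.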

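From these it should be clear: é occupies bytes 2 and 3 and starts at
byte 2, so j=2 and j=3 both return the characters starting at bytes 1
and 2 (114, 233), while j=4 also picks up the 's' (115) that starts at
byte 4. (The following may also help: every character in your string
exists in latin1, where the byte values equal the code points.)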

$ echo -n résumé | recode utf8..latin1 | od -t u1
0000000 114 233 115 117 109 233
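
If what you wanted was "the first N characters" rather than "the
characters starting in a byte range", utf8.offset converts a character
index into a byte position. A rough sketch, assuming Lua 5.3 or later;
first_chars is a hypothetical helper name of mine, not part of the
utf8 library:

```lua
-- "résumé" written as explicit UTF-8 bytes (é = 195 169).
local s = "r\195\169sum\195\169"

-- Hypothetical helper: return the first n characters of str.
local function first_chars(str, n)
  -- Byte position where character n+1 starts. This is #str + 1 when
  -- str has exactly n characters, and nil when it has fewer.
  local next_start = utf8.offset(str, n + 1)
  if not next_start then
    return str
  end
  return str:sub(1, next_start - 1)
end

-- Character 3 starts at byte 4, so this matches utf8.codepoint(s, 1, 4).
print(utf8.codepoint(s, 1, utf8.offset(s, 3))) -- 114     233     115

-- Bytes 1..3, i.e. "r" plus the two bytes of "é".
print(#first_chars(s, 2)) -- 3
```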

FOS.