lua-l archive

https://en.m.wikipedia.org/wiki/UTF-8

Print out the raw bytes of the string in hexadecimal. Then try decoding those bytes by hand, following a description of how UTF-8 works.
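One way to print those bytes is sketched below. It assumes Lua 5.3 or later; hexdump is just a name made up for this example, and the string is written with \u{...} escapes so that the bytes do not depend on how your editor saves the file.

```lua
-- "résumé" with a precomposed é (U+00E9); the \u{...} escape needs Lua 5.3+.
local s = "r\u{E9}sum\u{E9}"

-- Hypothetical helper: return the bytes of a string as space-separated hex.
local function hexdump(str)
  local out = {}
  for i = 1, #str do
    out[i] = string.format("%02X", str:byte(i))
  end
  return table.concat(out, " ")
end

print(#s)          --> 8
print(hexdump(s))  --> 72 C3 A9 73 75 6D C3 A9
```

The string has six characters but eight bytes, because each é takes two bytes (C3 A9).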

Code points (Unicode’s generic word for “character”, given that some languages don’t use “characters” in the English sense) from 00 to 7F (hexadecimal) are each encoded as a single byte with the same value, so every ASCII string is also a valid UTF-8 string. From 80 upwards, UTF-8 uses two or more bytes to encode each code point, even though the numbers 80 to FF themselves would fit into one byte. That is why é (code point E9, decimal 233) takes two bytes, C3 A9.
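To make the byte layout concrete, here is a rough sketch of an encoder written by hand. The name encode is invented for this example, the bit operators need Lua 5.3 or later, and it does no validation (it will happily encode surrogates or values above 10FFFF), so treat it as an illustration rather than a replacement for utf8.char.

```lua
-- Hypothetical helper: encode one code point as UTF-8 by hand.
local function encode(cp)
  if cp < 0x80 then
    -- 0xxxxxxx: ASCII, one byte
    return string.char(cp)
  elseif cp < 0x800 then
    -- 110xxxxx 10xxxxxx
    return string.char(0xC0 | (cp >> 6),
                       0x80 | (cp & 0x3F))
  elseif cp < 0x10000 then
    -- 1110xxxx 10xxxxxx 10xxxxxx
    return string.char(0xE0 | (cp >> 12),
                       0x80 | ((cp >> 6) & 0x3F),
                       0x80 | (cp & 0x3F))
  else
    -- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return string.char(0xF0 | (cp >> 18),
                       0x80 | ((cp >> 12) & 0x3F),
                       0x80 | ((cp >> 6) & 0x3F),
                       0x80 | (cp & 0x3F))
  end
end

print(encode(0x72) == "r")               --> true
print(encode(0xE9) == "\xC3\xA9")        --> true
print(encode(0xE9) == utf8.char(0xE9))   --> true
```

For é, E9 is 11101001 in binary: the top two bits (011) go into the lead byte, giving C3, and the low six bits (101001) go into the continuation byte, giving A9.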

If you manually encode and decode the characters r, é, s, u and m you will get the hang of it.
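Once you have done that by hand, the results in the question follow from one fact in the reference manual: the i and j arguments of utf8.codepoint are byte positions, not character positions, and the function returns every character that starts between them. A small sketch, assuming Lua 5.3 or later, that lists the byte position at which each character starts:

```lua
-- "résumé" with a precomposed é (U+00E9); the \u{...} escape needs Lua 5.3+.
local s = "r\u{E9}sum\u{E9}"

-- utf8.codes yields the starting byte position and code point of each character.
for pos, cp in utf8.codes(s) do
  print(pos, cp, utf8.char(cp))
end
--> 1   114   r
--> 2   233   é
--> 4   115   s
--> 5   117   u
--> 6   109   m
--> 7   233   é

-- Byte 3 is the second byte of é, so no new character starts there.
print(utf8.codepoint(s, 1, 3))  --> 114   233
print(utf8.codepoint(s, 1, 4))  --> 114   233   115

-- utf8.offset converts a character index to a byte position.
print(utf8.offset(s, 3))        --> 4

-- Starting in the middle of a character is an error.
print(pcall(utf8.codepoint, s, 3))  --> false, followed by an error message
```

So the ranges 1,2 and 1,3 cover the same two characters, r and é, and only 1,4 reaches the s, which starts at byte 4. utf8.len counts characters, hence 6, whereas #s counts bytes and gives 8.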

A clever encoding system, designed one evening in a fast food restaurant in New Jersey if memory serves.

On 20 Jun 2022, at 08:56, Budi <budikusasi@gmail.com> wrote:

While learning Lua, I suddenly ran into this:

utf8.codepoint("résumé", 1,2)
114    233

utf8.codepoint("résumé", 1,3)
114    233

utf8.codepoint("résumé", 1,4)
114    233    115


utf8.len("résumé")
6

Is there a crystal-clear explanation?