Re: Unicode operation

lua-l archive


Subject: Re: Unicode operation
From: Paul Ducklin <pducklin@...>
Date: Mon, 20 Jun 2022 09:22:27 +0000

https://en.m.wikipedia.org/wiki/UTF-8

Print out the raw bytes of the string in hexadecimal. Then try decoding those bytes by hand, following a description of how UTF-8 works.
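A minimal sketch of that first step, assuming Lua 5.3 or later. The helper name hexdump is made up for illustration, and the string is written with \u{E9} escapes so the result does not depend on how the source file is saved:

```lua
-- Assumes Lua 5.3+; \u{E9} is the escape for é (U+00E9).
local s = "r\u{E9}sum\u{E9}"

-- hexdump is a hypothetical helper: it returns the bytes of str
-- as two-digit hexadecimal numbers separated by spaces.
local function hexdump(str)
  local out = {}
  for i = 1, #str do
    out[#out + 1] = string.format("%02X", str:byte(i))
  end
  return table.concat(out, " ")
end

print(#s)          -- 8 (bytes, not characters)
print(hexdump(s))  -- 72 C3 A9 73 75 6D C3 A9
```

Each é shows up as the two bytes C3 A9, which is why the string is 8 bytes long although it has 6 characters.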

Code points (Unicode’s generic term for a “character”, given that some languages don’t use “characters” in the English sense) from 00 to 7F (hexadecimal) are each encoded as a single byte with the same value, so every ASCII string is also a valid UTF-8 string. From 80 upwards, UTF-8 uses two or more bytes to encode each code point, even though the numbers 80 to FF would themselves fit into one byte.
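Decoding by hand can be sketched like this, assuming Lua 5.3 or later for the bitwise operators. The function name decode_at is hypothetical, and the sketch is deliberately simplified: it follows the basic bit layout only and does not reject overlong forms or surrogates the way a strict decoder would.

```lua
-- Decode the UTF-8 sequence that starts at byte position i of str.
-- Returns the code point and the number of bytes it occupies.
local function decode_at(str, i)
  local b = str:byte(i)
  local len, cp
  if b < 0x80 then                       -- 0xxxxxxx: one byte, ASCII
    len, cp = 1, b
  elseif b >= 0xC0 and b < 0xE0 then     -- 110xxxxx: two bytes
    len, cp = 2, b & 0x1F
  elseif b >= 0xE0 and b < 0xF0 then     -- 1110xxxx: three bytes
    len, cp = 3, b & 0x0F
  elseif b >= 0xF0 and b < 0xF8 then     -- 11110xxx: four bytes
    len, cp = 4, b & 0x07
  else
    error("byte " .. i .. " does not start a UTF-8 sequence")
  end
  -- Every following byte must look like 10xxxxxx and adds 6 bits.
  for k = 1, len - 1 do
    local c = str:byte(i + k)
    if not c or (c & 0xC0) ~= 0x80 then
      error("bad continuation byte at position " .. (i + k))
    end
    cp = (cp << 6) | (c & 0x3F)
  end
  return cp, len
end

local s = "r\u{E9}sum\u{E9}"
print(decode_at(s, 1))  -- 114  1   (r, byte 72)
print(decode_at(s, 2))  -- 233  2   (é, bytes C3 A9)
```

For é the arithmetic is: C3 is 11000011, so the payload is 00011; A9 is 10101001, so the payload is 101001; joined together that is 00011101001, which is E9, or 233 in decimal.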

If you manually encode and decode the characters r, é, s, u and m you will get the hang of it.
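To check a hand decoding against Lua itself, something along these lines should work, assuming Lua 5.3 or later. It relies on the standard utf8 library, where the i and j arguments of utf8.codepoint are byte positions rather than character positions; the variable names are only illustrative.

```lua
local s = "r\u{E9}sum\u{E9}"   -- "résumé", written with escapes

-- Walk the string one character at a time, recording the byte
-- position where each character starts.
local starts = {}
for pos, cp in utf8.codes(s) do
  starts[#starts + 1] = pos
  print(pos, cp, utf8.char(cp))
end
-- starts is now {1, 2, 4, 5, 6, 7}: é begins at byte 2 and occupies
-- bytes 2 and 3, so s begins at byte 4.

print(#s, utf8.len(s))         -- 8  6  (bytes, characters)

-- utf8.offset(s, n) gives the byte position where the n-th character
-- starts, which is what utf8.codepoint expects as an index.
local third = utf8.offset(s, 3)
print(third)                          -- 4
print(utf8.codepoint(s, 1, third))    -- 114  233  115
```

Since no character starts at byte 3, asking for the range 1 to 3 yields the same two code points as the range 1 to 2.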

A clever encoding system, designed one evening in a New Jersey diner, if memory serves.

On 20 Jun 2022, at 08:56, Budi <budikusasi@gmail.com> wrote:

While learning Lua, I ran into this:

utf8.codepoint("résumé", 1,2)

114 233

utf8.codepoint("résumé", 1,3)

114 233

utf8.codepoint("résumé", 1,4)

114 233 115

utf8.len("résumé")

6

Is there a crystal-clear explanation?
