Roberto Ierusalimschy wrote:
In fact, UTF-8 also uses a maximum of 4 bytes to represent
any code point, but requires 3 bytes to represent code points
in Asian languages, so in general terms it is less compact
than UTF-16, but in some applications ("mostly ASCII") it will
turn out to be better.

If I understand correctly, even Asian languages use ASCII punctuation
(dots, spaces, newlines, commas, etc.), which uses 1 byte in UTF-8 but 2
in UTF-16. So, even for these languages, UTF-8 is not as much less
compact as it seems.
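
The byte counts mentioned above can be checked with a small Python sketch. It is not part of the original thread: the helper name `encoded_sizes` and the sample strings are made up for illustration, and UTF-16 is measured without a byte order mark.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no BOM counted."""
    # "utf-16-le" encodes without prepending a byte order mark.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical samples: ASCII punctuation, two CJK ideographs, and a mix.
samples = {
    "ascii punctuation": ". ,\n",
    "cjk ideographs": "\u4e2d\u6587",
    "mixed": "\u4e2d\u6587, \u4e2d\u6587.\n",
}

for label, text in samples.items():
    utf8_bytes, utf16_bytes = encoded_sizes(text)
    print(f"{label}: utf-8={utf8_bytes} utf-16={utf16_bytes}")
```

For these particular samples the ASCII punctuation takes 4 bytes in UTF-8 against 8 in UTF-16, the two ideographs take 6 against 4, and the mixed string comes out at 16 bytes in both encodings. Real text will of course vary with its ratio of ASCII to non-ASCII characters.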

Asian languages hardly use spaces, but I get the impression that they need fewer characters to express ideas (translated books are not per se thicker), so in the end they are still relatively compact. If Chinese puts 30 chars on a line, that means some 100 bytes in UTF-8; a language using the Latin script with accents (French, Vietnamese, etc.) has some 70 chars per line, and quite a few of them are multibyte, which then also adds up to 100+. Arabic is a different story. I think that compactness is no real issue here (no more than German needing more characters to express an idea than e.g. French).

Hans
-----------------------------------------------------------------
                                         Hans Hagen | PRAGMA ADE
             Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
    tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                            | www.pragma-pod.nl
-----------------------------------------------------------------