- Subject: Re: Managing Unicode (UTF-8 and UTF-16) data in Lua
- From: Coda Highland <chighland@...>
- Date: Sun, 7 Aug 2016 13:22:15 -0700
On Sun, Aug 7, 2016 at 1:21 PM, Coda Highland <firstname.lastname@example.org> wrote:
> On Sun, Aug 7, 2016 at 7:59 AM, Egor Skriptunoff
> <email@example.com> wrote:
>>> > Operations on fixed width character strings (such as UTF-16) are
>>> > processed faster.
>>> UTF-16 isn't fixed char width.
>> Yes, you are absolutely correct.
>> UTF-16 uses surrogate pairs to represent code points from 0x10000 upward.
>> But Windows does not support them.
>> When you write a character encoded as a surrogate pair to the Windows console
>> (I've tested this on Win7 with a simple program using WriteConsoleW),
>> it gets displayed as two question marks,
>> that is, Windows treats it as two separate symbols instead of just one.
>> If Windows does not support surrogate pairs, why should we?
>> That's why we can treat UTF-16 on Windows as a fixed-width encoding.
>> Of course, this means that a 100% correct Unicode "print()" function
>> cannot be implemented for Windows console applications.
> Windows DOES "support" surrogates -- it upgraded from UCS-2
> (equivalent to UTF-16 constrained to the BMP) to UTF-16 a long time
> ago (Windows 2000, I think). But it supports them in the sense that it renders
> them correctly and won't screw them up if they exist. The support is
> roughly equivalent to Lua's UTF-8 support: if you know what you're
> doing and you explicitly ask for it, then it can deal with it, but if
> you just use the naive wide-string functions it'll treat them as
> multiple characters.
> /s/ Adam
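
To make the "multiple characters" point in the quoted text concrete, here is a minimal C sketch of how a code point outside the BMP becomes a surrogate pair, and how counting code points differs from counting 16-bit code units. The function names utf16_encode and utf16_count_code_points are hypothetical helpers invented for this illustration; they are not part of Lua, the Windows API, or any library mentioned in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16.
   Returns the number of 16-bit code units written to out (1 or 2),
   or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range, or a surrogate value */
    if (cp < 0x10000) {                 /* BMP: a single code unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}

/* Hypothetical helper: count code points in n UTF-16 code units.
   A well-formed high+low pair counts once; an unpaired surrogate
   is counted as one code point on its own. */
size_t utf16_count_code_points(const uint16_t *s, size_t n)
{
    size_t count = 0;
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i++;                        /* skip the low half of the pair */
        count++;
    }
    return count;
}
```

For example, U+1F600 encodes as the two code units 0xD83D 0xDE00. A routine that treats every 16-bit unit as a character sees two symbols there, while utf16_count_code_points sees one.
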
Though I should clarify: WINDOWS supports it, but the Windows CONSOLE
does not; I don't mean to argue with Egor's comment regarding