Windows internally uses either a code-page-dependent multibyte encoding ... there are a lot of different locale-specific code pages, but not UTF-8.
Alternatively, Windows can work with UTF-16, which also avoids the need to set a particular code page.

When you have a UTF-8 string on Windows, you can convert it to UTF-16 (wchar_t is 16 bits on Windows) using
MultiByteToWideChar(CP_UTF8, 0, utf8_buffer, -1, wchar_t_buffer, (int)wchar_t_buffer_length);

Windows cannot handle UTF-8 in any other way, but many of its APIs handle wide characters correctly.
To use this consistently, e.g. for file names, you also have to keep in mind that some standard functions, e.g. io.open, do not use the wide-character versions. For printing only, you might use the wprintf family of functions with wide characters.
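
A minimal sketch of the conversion described above, for illustration only. The helper names utf8_to_wide and fopen_utf8 are hypothetical and belong to neither Lua nor Windows. MultiByteToWideChar is called twice: first with a zero-sized output buffer to learn the required length, then to do the actual conversion. _wfopen is the wide-character counterpart of fopen in the Microsoft C runtime. Outside Windows the block compiles to nothing.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <wchar.h>
#include <windows.h>

/* Hypothetical helper: returns a malloc'd, NUL-terminated UTF-16 copy of a
   NUL-terminated UTF-8 string, or NULL on failure. The caller frees it. */
wchar_t *utf8_to_wide(const char *utf8)
{
    assert(utf8 != NULL);

    /* With an output size of 0 the function only reports the required
       length in wchar_t units; because the input length is -1, that
       length includes the terminating NUL. */
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (len == 0)
        return NULL;

    wchar_t *wide = malloc((size_t)len * sizeof *wide);
    if (wide == NULL)
        return NULL;

    if (MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len) == 0) {
        free(wide);
        return NULL;
    }
    return wide;
}

/* Hypothetical helper: opens a file whose name is given in UTF-8 by going
   through the wide-character version of fopen. */
FILE *fopen_utf8(const char *utf8_name, const wchar_t *mode)
{
    wchar_t *wide_name = utf8_to_wide(utf8_name);
    if (wide_name == NULL)
        return NULL;

    FILE *f = _wfopen(wide_name, mode);
    free(wide_name);
    return f;
}
#endif /* _WIN32 */
```

A caller would then write something like fopen_utf8(name, L"rb") wherever it would otherwise pass a UTF-8 file name to fopen.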

On Thu, Jan 14, 2021 at 7:30 PM Marcus Mason <m4gicks@gmail.com> wrote:
Well, I suppose I can write a C library for outputting UTF-8 encoded Lua strings. I hadn't considered the case where people are using `print` on raw bytes and not in a textual way. When I said "reading input" I simply meant that I have a function that reads codepoints from some input string, and I am outputting some substrings of interest via print. The specific case that provoked this was actually a string literal in a test source file.
On Thu, Jan 14, 2021, 18:05 Viacheslav Usov <via.usov@gmail.com> wrote:
On Wed, Jan 13, 2021 at 2:20 PM Marcus Mason <m4gicks@gmail.com> wrote:
>
> I have code that reads input using the utf8 library. I was testing it with some string literals and noticed my output was nonsense. I then tried the same code on Linux and it works correctly.

I am not sure what utf8 library that is. Lua's built-in utf8 library
does not read any inputs, it supports some UTF-8 manipulations. But I
suppose that means you have some strings with UTF-8 encoded data,
which you want to print.

What should be understood here is that Lua strings are (immutable)
sequences of bytes (pretty much like in C). Lua does not know nor does
it expect that these sequences should carry UTF-8 encoded data, ASCII
encoded data, or any other kind of "encoded text". It is binary data
to Lua and it does not transform the binary data in any way.
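
Since the comparison is with C, here is a small C sketch of the "sequence of bytes" view. The helper name utf8_codepoints is hypothetical, and it assumes its input is valid UTF-8. strlen counts bytes, while counting code points means skipping UTF-8 continuation bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the code points in a NUL-terminated string
   that is assumed to be valid UTF-8. Continuation bytes have the bit
   pattern 10xxxxxx, so every byte that does not match it starts a new
   code point. */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);

    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the string "caf\xC3\xA9" (the word "café" in UTF-8), strlen reports 5 bytes while utf8_codepoints reports 4 code points. In Lua 5.3 and later the same distinction shows up as #s versus utf8.len(s).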

This is in contrast to Python (as one example), where strings are
Unicode text. Note I said "Unicode", not UTF-8 because its internal
representation is generally not UTF-8.

Lua expects the user to know what the user's strings contain. If the
user wants to print strings, then it assumes they are in whatever
encoding the user's underlying C + OS stack expects. On most Unix-like
systems, that would normally be UTF-8 these days. On most Windows
systems, normally NOT these days.

Again, in the Python example, it knows that the user strings are
Unicode data (not UTF-8), so then it uses a way appropriate for the
given OS to print its Unicode strings as Unicode. On most Unix-like
systems, that would be by encoding the data as UTF-8. On Windows, that
would be by encoding the data as UTF-16.
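
To make the difference concrete, here is a rough C sketch of how one and the same code point turns into different byte or unit sequences under the two encodings. The function names encode_utf8 and encode_utf16 are hypothetical. This only illustrates the encodings themselves and says nothing about how Python implements its output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one code point as UTF-8 and returns the
   number of bytes written (1 to 4). Assumes cp is a Unicode scalar value,
   i.e. at most 0x10FFFF and not a surrogate. */
size_t encode_utf8(uint32_t cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));

    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Hypothetical helper: encodes one code point as UTF-16 and returns the
   number of 16-bit units written (1, or 2 for a surrogate pair). Same
   assumption about cp as above. */
size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

For U+00E9 this gives the two bytes C3 A9 in UTF-8 and the single unit 00E9 in UTF-16, which is why bytes that are correct for one output path come out as nonsense on the other.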

So you cannot compare Lua to Python in this respect, nor can you say
that Lua works "correctly" on Linux and not so on Windows. It is
_your_ Lua code that is correct on Linux (assuming UTF-8 would work)
and it is also _your_ code that is not correct on Windows (under the
same assumption).

Cheers,
V.