Re: [ANN] Working with UTF-8 filenames on Windows in pure Lua

East Asian versions of Windows (Chinese, Japanese, Korean) define their "ANSI" legacy code page as a multibyte charset (characters are encoded on 1 to 2 bytes), so they CAN use filenames with Chinese characters in these legacy charsets (though with a limited repertoire, a subset of what is available in Unicode). In more recent versions of Windows, these charsets were updated to support GB18030, which is the extension of legacy GBK that covers the WHOLE repertoire of Unicode/ISO 10646 while remaining upward compatible with legacy GBK, by extending some codes to use up to 4 bytes per character.
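To make the byte widths concrete, here is a small sketch in Python (used only because its standard library ships both codecs). The sample characters are my own choice, and it assumes Python's built-in "gbk" and "gb18030" codecs match the Windows behavior for them.

```python
# Sample characters chosen for illustration only (not from the thread):
# an ASCII letter, a common Han character present in GBK, and an emoji
# outside the Basic Multilingual Plane.
samples = ["A", "中", "😀"]


def encoded_length(ch, codec):
    """Return how many bytes `ch` takes in `codec`, or None if it cannot be encoded."""
    try:
        return len(ch.encode(codec))
    except UnicodeEncodeError:
        return None


# Map each sample to its (GBK length, GB18030 length).
widths = {
    ch: (encoded_length(ch, "gbk"), encoded_length(ch, "gb18030"))
    for ch in samples
}

for ch, (gbk_len, gb18030_len) in widths.items():
    print(repr(ch), "gbk:", gbk_len, "gb18030:", gb18030_len)

# Upward compatibility: a character that exists in GBK keeps the same bytes in GB18030.
same_bytes = "中".encode("gbk") == "中".encode("gb18030")
print("same bytes in both codecs:", same_bytes)
```

Only the character missing from GBK needs the 4-byte form; everything GBK already covered is left untouched.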

GB18030 support is mandatory in China and this support was added in Windows about 20 years ago (so that Unicode encodings such as UTF-8 and UTF-16 are not the only option for Chinese).

Today China, like all other countries, favors the Unicode UTFs because they offer better interoperability and do not require the filesystem to handle dual encodings (UTF-16 for Windows Unicode, and legacy ISO 8859-*/ANSI/OEM/GB*/HK*/KOI* charsets for the old console, all of them having Windows code pages). Even the console now supports Unicode ("CHCP 65001" selects the UTF-8 code page). You can still use the legacy Chinese charsets on Windows, but the old GBK code page may sometimes be lossy when outputting the Unicode result of Windows console apps.
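The lossy case can be sketched as follows, again in Python. The helper name `to_legacy_console` and the sample string are hypothetical, and Python's "gbk" codec stands in for the Windows GBK code page; the real console may substitute characters differently.

```python
def to_legacy_console(text, codec="gbk"):
    """Encode `text` for a legacy code page, replacing unencodable characters with '?'.

    Hypothetical helper: it only imitates what a legacy console code page
    can represent, it does not talk to a real console.
    """
    return text.encode(codec, errors="replace")


original = "文件 😀"  # two Han characters covered by GBK, then an emoji that is not

# What survives a trip through the legacy code page.
via_gbk = to_legacy_console(original).decode("gbk")

# What survives a trip through UTF-8.
via_utf8 = original.encode("utf-8").decode("utf-8")

print("via GBK:  ", via_gbk)
print("via UTF-8:", via_utf8)
```

The Han characters come back intact through GBK, but the emoji is irreversibly replaced, while the UTF-8 round trip returns the original string.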

If you use NTFS, the storage of legacy charset filenames can be disabled (it is now disabled by default in Windows 10, and you could already disable it in Windows 7/8/8.1 on NTFS volumes), and the generation of 8.3 short filenames is no longer necessary. With FAT32, the LFN extension for "long filenames" was designed to use Unicode UTF-16 natively.

Legacy charsets are just there for compatibility with Windows XP when using external drives (such as USB) formatted with FAT32.

On NTFS, only UTF-16 is needed. The NTFS volume already contains a special hidden file named "\$UpCase" to support correct indexing and sorting of case-insensitive filenames for searches and directory listings, even if the Unicode version is later updated. The conversion from UTF-16 to legacy charsets is made on the fly by the kernel, using this mapping file when needed (because Unicode is versioned).

I see no real reason to continue using any Windows app compiled with the legacy "ANSI" APIs of Win32. Everyone now compiles with "UNICODE" defined (when using the Win32 API), and Unicode is also the default for the .NET and UWP APIs. Now Windows 10 has started deprecating the Win32 API.

The Windows console is now fully compliant with Unicode. There just remain internal code pages used in legacy drivers at boot time, but these are also being migrated to support Unicode natively. This is already the case for all built-in Microsoft drivers and for drivers from well-known vendors, which need it to get WHQL certification. With the security requirements of Windows 10, which demand signing and WHQL certification to support Secure Boot, manufacturers have no choice: they must support Unicode or their devices won't be available at boot time, notably storage, input, and display devices. If these devices don't meet this requirement, Windows will only boot with its built-in compatibility drivers, using software emulation; the devices will be slow, without acceleration, which will only be turned on after the graphics environment is loaded, provided that the user-mode helper drivers loaded afterwards are also compiled with Unicode.

You actually no longer need any filesystem with legacy charsets except for external storage used by small devices (such as USB flash keys or SD cards). But these devices do not even need these charsets: the filenames they use (e.g. for naming photos, or for storing flashable firmware) are reduced to ASCII only. The only remaining case is when you transfer some music or videos on a flash drive to play them on a TV or audio system, and want the filenames displayed correctly in its menu instead of garbage boxes or "?" signs. Most of these devices only support FAT32, but they may recognize the LFN extension, which is encoded with 16-bit Unicode, so they won't actually use the legacy 8.3 names encoded with the legacy 8-bit code pages.

The legacy 8.3 names were not interoperable anyway, because FAT did not explicitly store in its volume metadata which code page they were encoded with, so these filenames were already interpreted differently depending on the default system locale of the OS mounting the volume. Modern OSes ignore these legacy filenames if Unicode LFN filenames are present, and the generation of new 8.3 filenames is notoriously unreliable on a FAT32 volume remounted between different host systems. You can still run "CHKDSK" on these devices to fix the mixed encoding of these 8.3 filenames according to the LFN Unicode filenames.
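The 8.3 ambiguity is easy to reproduce. This is only a sketch: the name "ФОТО" and the three OEM code pages (866, 437, 850) are picked for illustration, and real FAT directory entries carry more structure than the raw name bytes shown here.

```python
# Bytes of a short name as a host with a Russian OEM locale (code page 866) would write them.
short_name_bytes = "ФОТО".encode("cp866")

# The volume does not record the code page, so each host guesses from its own locale.
as_russian_host = short_name_bytes.decode("cp866")
as_us_host = short_name_bytes.decode("cp437")
as_western_host = short_name_bytes.decode("cp850")

print("cp866 host sees:", as_russian_host)
print("cp437 host sees:", as_us_host)
print("cp850 host sees:", as_western_host)

# An LFN entry stores UTF-16 code units, so every host decodes it the same way.
long_name_units = "ФОТО".encode("utf-16-le")
as_any_host = long_name_units.decode("utf-16-le")
print("LFN on any host:", as_any_host)
```

The same four bytes turn into unrelated Latin letters on a host with a different OEM code page, while the UTF-16 name needs no locale at all.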

So I see no interest in your development, except to support Windows 95/98/XP, whose support has been terminated by Microsoft and whose security is now too compromised, and to support the legacy compatibility modes of Windows 7/8/8.1, whose support is also terminated (even if they still get security patches).

We could make the same remark about legacy charsets used in Linux and Unix and in legacy protocols (like FTP, or HTML4): they are deprecated. Everyone should use Unicode by default (either UTF-8 or UTF-16). Modern installations of Linux all use UTF-8 by default now, and legacy charsets have limited support and cause bugs (which can turn into security risks caused by unexpected filename clashes; security risks are so important today that neither users nor manufacturers, OEMs, and OS providers want to take them on). Expect all these legacy charsets to die now. Unicode is now used in data much more often than all the legacy code pages combined (this is even true in China now, where the mandatory GB18030 support in systems is no longer used in practice, given that Unicode is fully interoperable with GB18030 but has a lower cost).

There's still the default "OEM" charset of the Windows console, which still uses them, but we should switch to code page 65001 (UTF-8) instead of everything else. Everything can work with Unicode only.

On Sun, Jan 20, 2019 at 12:13, Egor Skriptunoff <egor.skriptunoff@gmail.com> wrote:

On Sat, Jan 19, 2019 at 5:10 PM Viacheslav Usov wrote:
On Thu, Jan 17, 2019 at 10:50 PM Egor Skriptunoff wrote:
If you are creating a portable Lua script (Linux/Windows/MacOS)
then you have a problem: standard Lua functions such as "io.open"
expect a filename encoded in UTF-8 on all "normal OS",
but on Windows a filename must be in some Windows-specific encoding
depending on the locale.

There is a pure Lua solution to this problem.

There is no pure Lua solution to this problem.

You are aware of this fact and you mentioned a further constraint in your code:
-- Please note that filenames must contain only symbols from your Windows ANSI codepage (which depends on OS locale).
-- Unfortunately, it's impossible to work with a file having arbitrary UTF-8 symbols in its name.
Practically, if your code page is Cyrillic, you cannot specify a file with a Chinese name even though the file exists.
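The constraint above can be checked ahead of time. A rough sketch in Python, where `fits_code_page` is a hypothetical helper and "cp1251" / "gbk" stand in for a Cyrillic and a Simplified Chinese ANSI code page:

```python
def fits_code_page(filename, code_page):
    """Return True if every character of `filename` exists in `code_page`.

    Hypothetical helper: it only tests encodability and does not touch the filesystem.
    """
    try:
        filename.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


cyrillic_name = "отчёт.txt"
chinese_name = "报告.txt"

# On a machine whose ANSI code page is Cyrillic:
cyrillic_on_cyrillic = fits_code_page(cyrillic_name, "cp1251")
chinese_on_cyrillic = fits_code_page(chinese_name, "cp1251")

# On a machine whose ANSI code page is Simplified Chinese:
chinese_on_chinese = fits_code_page(chinese_name, "gbk")

print(cyrillic_on_cyrillic, chinese_on_cyrillic, chinese_on_chinese)
```

A name that fails this check cannot be passed through an "ANSI" API on that machine, whatever conversion is done on the Lua side.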

From the perfectionism point of view, you're correct :-)
But in practice, most use cases are covered by my module.

For example, on all my computers there are no files containing Chinese characters in their filenames.
Assuming you are not speaking Chinese, I want to ask you, do you have such filenames on your machines? ;-)

My module allows a user to use his native language in filenames ON HIS OWN COMPUTER.
But it's not a good idea to have, for example, Chinese filenames on a computer whose user doesn't speak Chinese.
Although it's technically possible in all modern OS, it's inconvenient for the user.
User should be able to understand the meaning of a filename.
(If a filename is not supposed to be human-understandable, it would be better for it to consist of digits/hexadecimals/GUIDs/etc. instead of human language words)

P.S. Sorry, I wasn't honest enough.
It appears that I do have some filenames containing non-Cyrillic symbols on my computer.
But anyway, such files shouldn't be considered seriously as they all are inside "porn" folder :-)