- Subject: Re: [ANN] Working with UTF-8 filenames on Windows in pure Lua
- From: "Soni \"They/Them\" L." <fakedme@...>
- Date: Sun, 20 Jan 2019 13:14:01 -0200
On 2019-01-20 10:15 a.m., Philippe Verdy wrote:
East-Asian versions of Windows (Chinese, Japanese, Korean) define their
"ANSI" legacy code page as a multibyte charset (characters are encoded
on 1 to 2 bytes), so they CAN use filenames with Chinese characters in
these legacy charsets (though with a limited repertoire, a subset of
what is available in Unicode). In more recent versions of Windows,
these charsets were updated to support GB18030, which is the extension
of legacy GBK that covers the WHOLE repertoire of Unicode/ISO 10646,
while remaining upward compatible with legacy GBK, by extending
some codes to use up to 4 bytes per character.
GB18030 support is mandatory in China and this support was added in
Windows about 20 years ago (so that Unicode encodings such as UTF-8
and UTF-16 are not the only option for Chinese).
Today China, like all other countries, favors the Unicode UTFs because
they offer better interoperability and do not require the filesystem
to maintain dual encodings (UTF-16 for Windows Unicode, and legacy ISO
8859-*/ANSI/OEM/GB*/HK*/KOI* charsets for the old console, all of
them having Windows codepages). Even the Console now supports Unicode
("CHCP 65001" selects the UTF-8 codepage). You can still use the
legacy Chinese charsets on Windows, but the old GBK codepage may
sometimes be lossy when outputting the Unicode result of Windows
console apps.
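That lossiness is simple to demonstrate. A Python sketch (cp936 is the
Windows codepage for GBK; Python is again used only as a neutral way to
show the codepage behaviour):

```python
# Converting Unicode console output down to the legacy GBK ("cp936")
# codepage drops any character outside that codepage's repertoire.
text = "中文 abc 😀"  # CJK and ASCII survive, the emoji cannot
lossy = text.encode("cp936", errors="replace").decode("cp936")
print(lossy)
assert "中文" in lossy   # representable characters survive
assert "😀" not in lossy  # the astral character was lost to "?"
```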
If you use NTFS, the storage of legacy-charset filenames can be
disabled (it is now disabled by default in Windows 10, and you could
disable it in Windows 7/8/8.1 on NTFS volumes), and the generation of
8.3 short filenames is no longer necessary; with FAT32, the LFN
extension for "long filenames" was made to use Unicode UTF-16
natively. Legacy charsets are just there for compatibility with
Windows XP when using external drives (such as USB) formatted with
FAT32.
On NTFS, only UTF-16 is needed; the NTFS volume already contains a
special hidden file named "\$UpCase" to support correct indexing and
sorting of case-insensitive filenames for searches and listing
directory contents, even if the Unicode version is later updated. The
conversion from UTF-16 to legacy charsets is made on the fly by the
kernel, using this mapping file when needed (because Unicode case
mappings can change between versions).
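The idea behind $UpCase is that case-insensitive matching is just
"map both names through a frozen upper-case table, then compare". A
minimal Python sketch of that idea, where str.upper() stands in for
the volume's per-volume $UpCase table (an assumption for illustration;
the real table is a fixed array written at format time):

```python
def ntfs_style_equal(name_a: str, name_b: str) -> bool:
    """Compare two filenames case-insensitively, the way NTFS
    conceptually does: map every character through an upper-case
    table, then compare the results code unit by code unit.
    str.upper() stands in for the volume's $UpCase table."""
    return name_a.upper() == name_b.upper()

print(ntfs_style_equal("Readme.TXT", "README.txt"))  # True
print(ntfs_style_equal("a.txt", "b.txt"))            # False
```

Because the table is stored on the volume, results stay consistent
even if the host OS later adopts a newer Unicode version.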
I see no real reason to continue using any Windows app compiled with
the legacy "ANSI" APIs of Win32. Everyone now compiles with "UNICODE"
(when using the Win32 API) and Unicode is also the default for the
.Net and UWP APIs. Now Windows 10 has started deprecating the Win32 API.
The Windows console is now fully compliant with Unicode. There just
remain internal codepages used in legacy drivers loaded at boot time,
but these are also being migrated to support Unicode natively (this is
now the case for all builtin Microsoft drivers and for drivers from
well-known vendors, as needed to get WHQL certification; and with the
security requirements of Windows 10, which demand signatures and WHQL
certification to support secure boot, manufacturers have no choice:
they must support Unicode or their devices won't be available at boot
time, notably storage and input devices, as well as display devices.
If these devices don't meet this requirement, Windows will only boot
with its builtin compatibility drivers, using software emulation, and
these devices will be slow, without acceleration, which will only be
turned on after the graphics environment is loaded, provided that the
user-mode helper drivers loaded afterwards are also compiled with
Unicode).
You actually no longer need any filesystems with legacy charsets
except for external storage drives used by small devices (such as USB
flash keys, or SD cards): but actually these devices do not even need
these charsets, and the filenames they use (e.g. for naming photos, or
for storing flashable firmwares) are reduced to ASCII only. The only
case is when you transfer some music/videos on a flash drive to play
on a TV or audio system, so that they display the filenames correctly
in their menu instead of just garbage boxes or "?" signs. Most of
these devices will only support FAT32, but possibly they may
recognize the LFN extension, which is encoded with 16-bit Unicode, so
that they actually won't use the legacy 8.3 names encoded with the
legacy 8-bit codepages. The legacy 8.3 names were anyway not
interoperable, because FAT did not explicitly store in its volume
metadata descriptor which codepage they were encoded with, so these
filenames were already interpreted differently depending on the
default system locale of the OS mounting the volume: modern OSes
ignore these legacy filenames if Unicode LFN filenames are present,
and the generation of new 8.3 filenames on these volumes is
notoriously unreliable if you do that on a FAT32 volume remounted
between different host systems; you can still run "CHKDSK" on these
devices to fix the mixed encoding of these 8.3 filenames according to
the LFN entries.
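The contrast between the two kinds of FAT directory entries can be
sketched concretely: LFN entries store the name as locale-independent
UTF-16LE code units, while the fallback 8.3 entry holds 8-bit bytes in
an unrecorded OEM codepage. A Python sketch (cp437 is merely an assumed
example of such an OEM codepage):

```python
name = "Моя музыка.mp3"  # a Cyrillic filename ("my music")

# LFN directory entries store the name as UTF-16LE code units,
# independent of any locale, so this always round-trips losslessly.
lfn_bytes = name.encode("utf-16-le")
assert lfn_bytes.decode("utf-16-le") == name

# The fallback 8.3 entry holds bytes in some unrecorded OEM codepage
# (cp437 assumed here): the Cyrillic letters simply cannot survive,
# which is why an 8.3 name read under the wrong locale shows garbage.
try:
    name.encode("cp437")
    print("representable in cp437")
except UnicodeEncodeError:
    print("not representable in cp437")
```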
So I see no interest in your development, except to support Windows
95/98/XP, whose support is now terminated by Microsoft and whose
security is now too much compromised, and to support the legacy
compatibility modes for Windows 7/8/8.1, whose mainstream support is
also terminated (even if they still have security patches).
We could make the same remark about legacy charsets used in Linux and
Unix and in legacy protocols (like FTP, or HTML4): they are deprecated.
Everyone should use Unicode by default (either UTF-8 or UTF-16).
Modern installations of Linux all use UTF-8 by default now, and legacy
charsets have limited support and cause bugs (which can turn into
security risks caused by unexpected filename clashes; and security
risks are so important today that users don't want to assume them, and
even manufacturers, OEMs, and OS providers don't want to assume them).
Expect all these legacy charsets to die now. Unicode is now used in
far more data than all the data available in all the legacy codepages
combined (this is even true in China now, where the mandatory GB18030
support in systems is scarcely used anymore, given that Unicode is
fully interoperable with GB18030 but has lower cost).
There's still the default "OEM" codepage of the Windows console, which
still uses the legacy charsets, but we should switch to codepage 65001
(UTF-8) instead of everything else. Everything can work with Unicode
only.
I guess this thing does help with ReactOS support. But I'm sure ReactOS
will catch up to the new stuff at some point and it'll no longer be
needed.
On Sun, Jan 20, 2019 at 12:13, Egor Skriptunoff
<email@example.com> wrote:
On Sat, Jan 19, 2019 at 5:10 PM Viacheslav Usov wrote:
On Thu, Jan 17, 2019 at 10:50 PM Egor Skriptunoff wrote:
If you are creating a portable Lua script (Linux/Windows/MacOS)
then you have a problem: standard Lua file functions
expect a filename encoded in UTF-8 on all "normal" OSes,
but on Windows a filename must be in some Windows-specific
encoding, depending on the locale.
There is a pure Lua solution to this problem.
There is no pure Lua solution to this problem.
You are aware of this fact and you mentioned a further
constraint in your code:
-- Please note that filenames must contain only symbols from
your Windows ANSI codepage (which depends on OS locale).
-- Unfortunately, it's impossible to work with a file having
arbitrary UTF-8 symbols in its name.
Practically, if your code page is Cyrillic, you cannot specify
a file with a Chinese name even though the file exists.
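That limitation can be shown outside Lua too. A Python sketch (Python
is used only as a neutral way to demonstrate the codepage behaviour;
cp1251 is the Windows Cyrillic "ANSI" codepage):

```python
# A Chinese filename cannot be expressed in the Cyrillic ANSI
# codepage (cp1251), so an ANSI-API open of it must fail, even
# though the file itself is perfectly valid on disk.
chinese_name = "文件.txt"
try:
    chinese_name.encode("cp1251")
    print("representable in cp1251")
except UnicodeEncodeError:
    print("not representable in cp1251")

# A Cyrillic filename, by contrast, round-trips fine:
assert "файл.txt".encode("cp1251").decode("cp1251") == "файл.txt"
```

This is exactly why the module's constraint limits filenames to the
symbols of the current ANSI codepage.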
From the perfectionism point of view, you're correct :-)
But in practice, most use cases are covered by my module.
For example, on all my computers there are no files containing
Chinese characters in their filenames.
Assuming you are not speaking Chinese, I want to ask you, do you
have such filenames on your machines? ;-)
My module allows a user to use their native language in filenames ON
THEIR OWN COMPUTER.
But it's not a good idea to have, for example, Chinese filenames
on a computer whose user doesn't speak Chinese.
Although it's technically possible in all modern OSes, it's
inconvenient for the user.
A user should be able to understand the meaning of a filename.
(If a filename is not supposed to be human-understandable, it
would better consist of digits/hexadecimals/GUIDs/etc.
instead of human-language words.)
P.S. Sorry, I wasn't honest enough.
It appears that I do have some filenames containing non-Cyrillic
symbols on my computer.
But anyway, such files shouldn't be taken seriously, as they
are all inside the "porn" folder :-)