- Subject: Re: [ANN] Working with UTF-8 filenames on Windows in pure Lua
- From: "Soni \"They/Them\" L." <fakedme@...>
- Date: Sun, 20 Jan 2019 13:14:01 -0200
On 2019-01-20 10:15 a.m., Philippe Verdy wrote:
East-Asian versions of Windows (Chinese, Japanese, Korean) define their
"ANSI" legacy code page as a multibyte charset (characters are encoded
on 1 to 2 bytes), so they CAN use filenames with Chinese characters in
these legacy charsets (though with a limited repertoire, a subset of
what is available in Unicode). In more recent versions of Windows,
these charsets were updated to support GB18030, which is the extension
of legacy GBK that covers the WHOLE repertoire of Unicode/ISO 10646,
while remaining upward compatible with legacy GBK, by extending
some codes to use up to 4 bytes per character.
GB18030 support is mandatory in China and this support was added in
Windows about 20 years ago (so that Unicode encodings such as UTF-8
and UTF-16 are not the only option for Chinese).
Today China, like all other countries, favors the Unicode UTFs because
they offer better interoperability and do not require the filesystem
to maintain dual encodings (UTF-16 for Windows Unicode, and legacy ISO
8859-*/ANSI/OEM/GB*/HK*/KOI* charsets for the old console, all of
them having Windows codepages). Even the Console now supports Unicode
("CHCP 65001" selects the UTF-8 codepage). You can still use the
legacy Chinese charsets on Windows, but the old GBK codepage may
sometimes be lossy when outputting the Unicode result of Windows
console apps.
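That lossiness is simple to demonstrate. A Python sketch (cp936 is the
Windows codepage for GBK; Python is again used only as a neutral way to
show the codepage behaviour):

```python
# Converting Unicode console output down to the legacy GBK ("cp936")
# codepage drops any character outside that codepage's repertoire.
text = "中文 abc 😀"  # CJK and ASCII survive, the emoji cannot
lossy = text.encode("cp936", errors="replace").decode("cp936")
print(lossy)
assert "中文" in lossy   # representable characters survive
assert "😀" not in lossy  # the astral character was lost to "?"
```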
If you use NTFS, the storage of legacy-charset filenames can be
disabled (it is now disabled by default in Windows 10, and you could
disable it in Windows 7/8/8.1 on NTFS volumes), and the generation of
8.3 short filenames is no longer necessary; with FAT32, the LFN
extension for "long filenames" was made to use Unicode UTF-16
natively. Legacy charsets are just there for compatibility with
Windows XP when using external drives (such as USB) formatted with
FAT32.
On NTFS, only UTF-16 is needed; the NTFS volume already contains a
special hidden file named "\$UpCase" to support correct indexing and
sorting of case-insensitive filenames for searches and listing
directory contents, even if the Unicode version is later updated. The
conversion from UTF-16 to legacy charsets is made on the fly by the
kernel, using this mapping file when needed (because Unicode case
mappings can change between versions).
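The idea behind $UpCase is that case-insensitive matching is just
"map both names through a frozen upper-case table, then compare". A
minimal Python sketch of that idea, where str.upper() stands in for
the volume's per-volume $UpCase table (an assumption for illustration;
the real table is a fixed array written at format time):

```python
def ntfs_style_equal(name_a: str, name_b: str) -> bool:
    """Compare two filenames case-insensitively, the way NTFS
    conceptually does: map every character through an upper-case
    table, then compare the results code unit by code unit.
    str.upper() stands in for the volume's $UpCase table."""
    return name_a.upper() == name_b.upper()

print(ntfs_style_equal("Readme.TXT", "README.txt"))  # True
print(ntfs_style_equal("a.txt", "b.txt"))            # False
```

Because the table is stored on the volume, results stay consistent
even if the host OS later adopts a newer Unicode version.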
I see no real reason to continue using any Windows app compiled with
the legacy "ANSI" APIs of Win32. Everyone now compiles with "UNICODE"
(when using the Win32 API) and Unicode is also the default for the
.Net and UWP APIs. Now Windows 10 has started deprecating the Win32 API.
The Windows console is now fully compliant with Unicode. There just
remain internal codepages used in legacy drivers loaded at boot time,
but these are also being migrated to support Unicode natively (this is
now the case for all builtin Microsoft drivers and for drivers from
well-known vendors, as needed to get WHQL certification; and with the
security requirements of Windows 10, which demand signatures and WHQL
certification to support secure boot, manufacturers have no choice:
they must support Unicode or their devices won't be available at boot
time, notably storage and input devices, as well as display devices.
If these devices don't meet this requirement, Windows will only boot
with its builtin compatibility drivers, using software emulation, and
these devices will be slow, without acceleration, which will only be
turned on after the graphics environment is loaded, provided that the
user-mode helper drivers loaded afterwards are also compiled with
Unicode).
You actually no longer need any filesystems with legacy charsets
except for external storage drives used by small devices (such as USB
flash keys, or SD cards): but actually these devices do not even need
these charsets, and the filenames they use (e.g. for naming photos, or
for storing flashable firmwares) are reduced to ASCII only. The only
case is when you transfer some music/videos on a flash drive to play
on a TV or audio system, so that they display the filenames correctly
in their menu instead of just garbage boxes or "?" signs. Most of
these devices will only support FAT32, but possibly they may
recognize the LFN extension, which is encoded with 16-bit Unicode, so
that they actually won't use the legacy 8.3 names encoded with the
legacy 8-bit codepages. The legacy 8.3 names were anyway not
interoperable, because FAT did not explicitly store in its volume
metadata descriptor which codepage they were encoded with, so these
filenames were already interpreted differently depending on the
default system locale of the OS mounting the volume: modern OSes
ignore these legacy filenames if Unicode LFN filenames are present,
and the generation of new 8.3 filenames on these volumes is
notoriously unreliable if you do that on a FAT32 volume remounted
between different host systems; you can still run "CHKDSK" on these
devices to fix the mixed encoding of these 8.3 filenames according to
the LFN entries.
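The contrast between the two kinds of FAT directory entries can be
sketched concretely: LFN entries store the name as locale-independent
UTF-16LE code units, while the fallback 8.3 entry holds 8-bit bytes in
an unrecorded OEM codepage. A Python sketch (cp437 is merely an assumed
example of such an OEM codepage):

```python
name = "Моя музыка.mp3"  # a Cyrillic filename ("my music")

# LFN directory entries store the name as UTF-16LE code units,
# independent of any locale, so this always round-trips losslessly.
lfn_bytes = name.encode("utf-16-le")
assert lfn_bytes.decode("utf-16-le") == name

# The fallback 8.3 entry holds bytes in some unrecorded OEM codepage
# (cp437 assumed here): the Cyrillic letters simply cannot survive,
# which is why an 8.3 name read under the wrong locale shows garbage.
try:
    name.encode("cp437")
    print("representable in cp437")
except UnicodeEncodeError:
    print("not representable in cp437")
```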
So I see no interest in your development, except to support Windows
95/98/XP, whose support is now terminated by Microsoft and whose
security is now too much compromised, and to support the legacy
compatibility modes for Windows 7/8/8.1, whose mainstream support is
also terminated (even if they still have security patches).
We could make the same remark about legacy charsets used in Linux and
Unix and in legacy protocols (like FTP, or HTML4): they are deprecated.
Everyone should use Unicode by default (either UTF-8 or UTF-16).
Modern installations of Linux all use UTF-8 by default now, and legacy
charsets have limited support and cause bugs (which can turn into
security risks caused by unexpected filename clashes; and security
risks are so important today that users don't want to assume them, and
even manufacturers, OEMs, and OS providers don't want to assume them).
Expect all these legacy charsets to die now. Unicode is now used in
far more data than all the data available in all the legacy codepages
combined (this is even true in China now, where the mandatory GB18030
support in systems is scarcely used anymore, given that Unicode is
fully interoperable with GB18030 but has lower cost).
There's still the default "OEM" codepage of the Windows console, which
still uses the legacy charsets, but we should switch to codepage 65001
(UTF-8) instead of everything else. Everything can work with Unicode
only.
I guess this thing does help with ReactOS support. But I'm sure ReactOS
will catch up to the new stuff at some point and it'll no longer be
needed.
On Sun, Jan 20, 2019 at 12:13, Egor Skriptunoff
<email@example.com> wrote:
On Sat, Jan 19, 2019 at 5:10 PM Viacheslav Usov wrote:
On Thu, Jan 17, 2019 at 10:50 PM Egor Skriptunoff wrote:
If you are creating a portable Lua script (Linux/Windows/MacOS)
then you have a problem: standard Lua file functions
expect a filename encoded in UTF-8 on all "normal" OSes,
but on Windows a filename must be in some Windows-specific
encoding, depending on the locale.
There is a pure Lua solution to this problem.
There is no pure Lua solution to this problem.
You are aware of this fact and you mentioned a further
constraint in your code:
-- Please note that filenames must contain only symbols from
your Windows ANSI codepage (which depends on OS locale).
-- Unfortunately, it's impossible to work with a file having
arbitrary UTF-8 symbols in its name.
Practically, if your code page is Cyrillic, you cannot specify
a file with a Chinese name even though the file exists.
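That limitation can be shown outside Lua too. A Python sketch (Python
is used only as a neutral way to demonstrate the codepage behaviour;
cp1251 is the Windows Cyrillic "ANSI" codepage):

```python
# A Chinese filename cannot be expressed in the Cyrillic ANSI
# codepage (cp1251), so an ANSI-API open of it must fail, even
# though the file itself is perfectly valid on disk.
chinese_name = "文件.txt"
try:
    chinese_name.encode("cp1251")
    print("representable in cp1251")
except UnicodeEncodeError:
    print("not representable in cp1251")

# A Cyrillic filename, by contrast, round-trips fine:
assert "файл.txt".encode("cp1251").decode("cp1251") == "файл.txt"
```

This is exactly why the module's constraint limits filenames to the
symbols of the current ANSI codepage.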
From the perfectionism point of view, you're correct :-)
But in practice, most use cases are covered by my module.
For example, on all my computers there are no files containing
Chinese characters in their filenames.
Assuming you are not speaking Chinese, I want to ask you, do you
have such filenames on your machines? ;-)
My module allows a user to use their native language in filenames ON
THEIR OWN COMPUTER.
But it's not a good idea to have, for example, Chinese filenames
on a computer whose user doesn't speak Chinese.
Although it's technically possible in all modern OSes, it's
inconvenient for the user.
A user should be able to understand the meaning of a filename.
(If a filename is not supposed to be human-understandable, it
would better consist of digits/hexadecimals/GUIDs/etc.
instead of human-language words.)
P.S. Sorry, I wasn't honest enough.
It appears that I do have some filenames containing non-Cyrillic
symbols on my computer.
But anyway, such files shouldn't be taken seriously, as they
are all inside the "porn" folder :-)