- Subject: Re: is string.gmatch(), string.upper() 7-bit ascii only?
- From: sur-behoffski <sur_behoffski@...>
- Date: Sun, 10 Apr 2016 08:22:52 +0930
The wider topic of the C locale, ASCII and 7 bits is currently being
discussed by the GNU grep developers. Quoting message 26 in
23234@debbugs.gnu.org:
On 04/06/2016 02:04 PM, Eric Blake wrote:
> POSIX ... says that LC_ALL=C is _required_ to treat all 256 byte values as
> valid characters
Although that was the intent of POSIX, it's not what the current standard
says, and it's not what many popular platforms do. Problematic platforms
include Fedora 23, where mbrtowc reports an encoding error in the C locale
when given a byte outside the range 0-127. This affects many programs other
than 'grep'.
This bug in the standard is intended to be fixed in a future version of
POSIX (see <http://austingroupbugs.net/view.php?id=663#c2738>). I suppose
glibc and eventually Fedora will be fixed to conform to the new standard in
due course.
Perhaps grep should work around this problem on systems like Fedora 23 where
the underlying C library does not conform to the next version of POSIX. It
sounds like a new gnulib module or two might do the trick. This should fix
the problems that Björn mentions.
In the meantime grep -a is the way to go. Yes, it's not portable to non-GNU
grep, but there is no portable solution given the abovementioned POSIX
problems, so a GNU-grep-only workaround is all one can reasonably ask for.
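The mbrtowc behaviour described in the quoted message can be probed directly. The following is only a minimal sketch: the helper names byte_is_valid_char and count_rejected_high_bytes are made up for illustration and are not taken from grep or gnulib. It feeds each byte to mbrtowc in the "C" locale and counts how many of the bytes 128..255 are rejected.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if mbrtowc accepts the single byte b
   as a complete character in the current LC_CTYPE locale, else 0. */
int byte_is_valid_char(int b)
{
    assert(b >= 0 && b <= 255);

    unsigned char byte = (unsigned char) b;
    wchar_t wc;
    mbstate_t state;
    memset(&state, 0, sizeof state);   /* start in the initial state */

    size_t n = mbrtowc(&wc, (const char *) &byte, 1, &state);

    /* (size_t) -1 signals an encoding error, (size_t) -2 an incomplete
       multibyte sequence; anything else counts as accepted here. */
    return n != (size_t) -1 && n != (size_t) -2;
}

/* Hypothetical helper: counts the bytes in 128..255 that are rejected
   once the "C" locale has been selected. */
int count_rejected_high_bytes(void)
{
    setlocale(LC_ALL, "C");

    int rejected = 0;
    for (int b = 128; b <= 255; b++)
        if (!byte_is_valid_char(b))
            rejected++;
    return rejected;
}
```

A C library that treats all 256 byte values as valid characters in the C locale should give a count of 0, while one that behaves as described above for Fedora 23 should give a nonzero count. No expected value is shown because the result depends on the C library in use.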
Also, there are a number of hairy cases in GNU grep regarding Unicode and
upper/lower case handling, as grep tries to provide case-insensitive matching:
/* The set of wchar_t values C such that there's a useful locale
   somewhere where C != towupper (C) && C != towlower (towupper (C)).
   For example, 0x00B5 (U+00B5 MICRO SIGN) is in this table, because:
   towupper (0x00B5) == 0x039C (U+039C GREEK CAPITAL LETTER MU), and
   towlower (0x039C) == 0x03BC (U+03BC GREEK SMALL LETTER MU).
 */
GNU grep's definition of case-insensitive matching is effectively (pseudocode):
    #define CI_MATCHES(a, b) (towupper(a) == towupper(b))
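As a rough illustration of the two conditions above, here is a small C sketch. The function names ci_matches and case_round_trip_differs are hypothetical and are not grep's own; the code only restates the pseudocode and the condition from the quoted comment using the standard <wctype.h> functions.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical function form of the CI_MATCHES pseudocode: two wide
   characters match if they have the same upper-case mapping in the
   current LC_CTYPE locale. */
int ci_matches(wint_t a, wint_t b)
{
    return towupper(a) == towupper(b);
}

/* Hypothetical helper restating the condition in the quoted comment:
   true when c changes under towupper, and lowering the result does
   not lead back to c. */
int case_round_trip_differs(wint_t c)
{
    wint_t upper = towupper(c);
    return c != upper && c != towlower(upper);
}
```

Both results depend on the locale in effect. In the plain C locale only the ASCII letters are expected to have case mappings, so case_round_trip_differs(0x00B5) would only be expected to report true after selecting a Unicode-aware locale with setlocale(), and only on a platform where wchar_t holds Unicode code points (an assumption that holds for glibc but is not guaranteed everywhere).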
Even within the Basic Multilingual Plane (BMP), holes have deliberately been
left at various points in the code space. E.g., from the Wikipedia article
on ISO/IEC 10646:
> The system deliberately leaves many code points not assigned to characters,
> even in the BMP. It does this to allow for future expansion or to minimize
> conflicts with other encoding forms.
---------
So, bringing the focus back to Lua, which relies on the system's libc in
order to be portable: an ISO-8859-1 locale, rather than the C or POSIX locale,
might be useful for simple cases. However, my experience with anything other
than the C or English locales is very limited (both at the terminal-emulation
level and at the level of string handling within Lua), so I'll stop here and
let more experienced people speak.
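For what it's worth, the effect of the locale on byte-wise case conversion can be sketched in C. As far as I know, the reference Lua implementation's string.upper() calls the C library's toupper() on each byte. The helper names upper_bytes and try_ctype_locale below are made up for illustration, and a locale name such as "en_US.ISO-8859-1" is an assumption: the names differ between systems, and the locale may not be installed at all.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: upper-cases buf in place, one byte at a time,
   using whatever LC_CTYPE locale is currently in effect. */
void upper_bytes(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char) toupper(buf[i]);
}

/* Hypothetical helper: tries to switch LC_CTYPE to the named locale.
   Returns 1 on success, 0 if the system does not provide it. */
int try_ctype_locale(const char *name)
{
    return setlocale(LC_CTYPE, name) != NULL;
}
```

In the C locale only the ASCII letters a-z are converted, so a byte such as 0xE9 is left alone. If try_ctype_locale() succeeds with an ISO-8859-1 locale, the same byte (LATIN SMALL LETTER E WITH ACUTE in that encoding) would be expected to become 0xC9. From Lua, the corresponding switch would be os.setlocale() with the "ctype" category.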
cheers,
sur-behoffski (Brenton Hoff)
Programmer, Grouse Software.