Re: is string.gmatch(), string.upper() 7-bit ascii only?

lua-l archive


Subject: Re: is string.gmatch(), string.upper() 7-bit ascii only?
From: sur-behoffski <sur_behoffski@...>
Date: Sun, 10 Apr 2016 08:22:52 +0930

The wider topic of the C locale, ASCII and 7 bits is currently being
discussed on the GNU Grep bug tracker.  Quoting message 26 in bug 23234
(23234@debbugs.gnu.org):

    On 04/06/2016 02:04 PM, Eric Blake wrote:
    > POSIX ... says that LC_ALL=C is _required_ to treat all 256 byte values as
    > valid characters

    Although that was the intent of POSIX, it's not what the current standard
    says, and it's not what many popular platforms do. Problematic platforms
    include Fedora 23, where mbrtowc reports an encoding error in the C locale
    when given a byte outside the range 0-127. This affects many programs other
    than 'grep'.

    This bug in the standard is intended to be fixed in a future version of
    POSIX (see <http://austingroupbugs.net/view.php?id=663#c2738>). I suppose
    glibc and eventually Fedora will be fixed to conform to the new standard in
    due course.

    Perhaps grep should work around this problem on systems like Fedora 23 where
    the underlying C library does not conform to the next version of POSIX.  It
    sounds like a new gnulib module or two might do the trick. This should fix
    the problems that Björn mentions.

    In the meantime grep -a is the way to go. Yes, it's not portable to non-GNU
    grep, but there is no portable solution given the abovementioned POSIX
    problems, so a GNU-grep-only workaround is all one can reasonably ask for.
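
To see which behaviour a given platform has, the mbrtowc() point in the
quoted message can be probed directly.  The following is only a rough
sketch: the helper names probe_byte_in_c_locale() and count_rejected_bytes()
are made up for illustration, and the result depends entirely on the C
library in use.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Decode one byte with mbrtowc() in the "C" locale and return mbrtowc()'s
   result: 1 for an accepted non-NUL byte, 0 for the NUL byte,
   (size_t) -1 for an encoding error, (size_t) -2 for an incomplete
   sequence.  (Hypothetical helper, not part of grep or Lua.)  */
size_t probe_byte_in_c_locale(unsigned char byte, wchar_t *out)
{
    mbstate_t state;
    char buf[1];

    setlocale(LC_ALL, "C");
    memset(&state, 0, sizeof state);  /* initial conversion state */
    buf[0] = (char) byte;
    return mbrtowc(out, buf, 1, &state);
}

/* Count how many of the 256 byte values are not accepted as a complete
   character by mbrtowc() in the "C" locale.  */
int count_rejected_bytes(void)
{
    int rejected = 0;

    for (int b = 0; b < 256; b++) {
        wchar_t wc;
        size_t r = probe_byte_in_c_locale((unsigned char) b, &wc);

        if (r == (size_t) -1 || r == (size_t) -2)
            rejected++;
    }
    return rejected;
}
```

Calling count_rejected_bytes() from main() and printing the result gives 0
on a platform that treats all 256 byte values as valid characters, and a
nonzero count on a platform with the behaviour described above.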

Also, there are a number of hairy cases in GNU Grep regarding Unicode and
upper/lower case handling, as Grep tries to provide case-insensitive matching:

    /* The set of wchar_t values C such that there's a useful locale
        somewhere where C != towupper (C) && C != towlower (towupper (C)).

        For example, 0x00B5 (U+00B5 MICRO SIGN) is in this table, because:

        towupper (0x00B5) == 0x039C (U+039C GREEK CAPITAL LETTER MU), and
        towlower (0x039C) == 0x03BC (U+03BC GREEK SMALL LETTER MU).
    */

Grep's definition of case-insensitive matching is effectively (pseudocode):

    #define CI_MATCHES(a, b)  (towupper(a) == towupper(b))
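
As a minimal sketch (not Grep's actual code), the pseudocode and the
condition from the comment above can be written as two small C functions.
The names ci_matches() and lacks_case_round_trip() are invented here for
illustration.

```c
#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

/* Function form of the CI_MATCHES pseudocode: two wide characters match
   case-insensitively when they have the same upper-case mapping.  */
bool ci_matches(wint_t a, wint_t b)
{
    return towupper(a) == towupper(b);
}

/* True when C satisfies the condition in the quoted comment:
   C != towupper (C) && C != towlower (towupper (C)).
   That is, upper-casing changes C, and lower-casing the result does not
   lead back to C.  */
bool lacks_case_round_trip(wint_t c)
{
    wint_t upper = towupper(c);

    return c != upper && c != towlower(upper);
}
```

Results for non-ASCII characters depend on the locale selected with
setlocale().  In a locale that has the mappings listed in the comment,
lacks_case_round_trip(0x00B5) would be true, and since U+00B5 and U+03BC
would both upper-case to U+039C, ci_matches(0x00B5, 0x03BC) would be true
as well.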

Even within the Basic Multilingual Plane, holes have deliberately been left
unassigned at various points.  E.g., from the Wikipedia article on
ISO/IEC 10646:

    The system deliberately leaves many code points not assigned to characters,
    even in the BMP. It does this to allow for future expansion or to minimize
    conflicts with other encoding forms.

---------

So, bringing the focus back to Lua, which relies on the system's libc in order
to be portable: a locale that uses the ISO-8859-1 character set, rather than
the C or POSIX locale, might be useful for simple cases.  However, my
experience with anything other than C or English locales is very limited (both
at the terminal emulation level and at the within-Lua string handling level),
so I'll stop here, and let more experienced people speak.
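
For what it's worth, here is a rough C sketch of why the locale matters for
a byte-at-a-time upper-casing routine built on libc's toupper().  It is an
illustration only, not Lua's source; upper_bytes() and use_ctype_locale()
are made-up names, and the locale name mentioned afterwards is an assumption
that may not be installed on a given system.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Upper-case LEN bytes of S in place, one byte at a time, via toupper().
   What happens to bytes 128-255 is decided by the current LC_CTYPE
   locale.  */
void upper_bytes(char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        s[i] = (char) toupper((unsigned char) s[i]);
}

/* Try to switch LC_CTYPE to NAME.  Returns 1 on success, or 0 if the
   locale is not available on this system.  */
int use_ctype_locale(const char *name)
{
    return setlocale(LC_CTYPE, name) != NULL;
}
```

In the C locale, upper_bytes() turns the four bytes "caf\xE9" into
"CAF\xE9": the byte 0xE9 (e-acute in ISO-8859-1) is left alone.  If
use_ctype_locale("en_US.ISO-8859-1") succeeds (the name is an assumption and
varies between systems), I would expect the same call to turn 0xE9 into
0xC9.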

cheers,

sur-behoffski (Brenton Hoff)
Programmer, Grouse Software.
