(Re: utf8lib.c...) Case sensitivity, locales, and character encoding: GNU Grep!

Subject: (Re: utf8lib.c...) Case sensitivity, locales, and character encoding: GNU Grep!
From: behoffski <behoffski@...>
Date: Wed, 14 May 2014 17:34:07 +0930

G'day,

[Writing in partial response to utf8 discussions on this list, under
the Subject: Re: byteoffset() in lutf8lib.c from 5.3, work.  I read
the list via the digest, so message threading etc is a bit tricky.]

My main project at the moment is GNU Grep.  I hope eventually to add
some code speedups I found some years ago, although they may since have
been overtaken by the excellent work of others.

I found some of the core code to be too "Big and Hairy" to approach,
and I saw that the project's TODO list included a desire to refactor
some of the code into more manageable pieces, so I decided to tackle
that, using a Lua script as the heart of the approach.

I expected to need at least 6 months, and perhaps more than a year,
before my contribution could be deemed interesting/worthwhile by others.
Meanwhile, others would be busy changing the code in a more linear
fashion, so I needed to modularise the code non-invasively.  The Lua
script lets me break the sources into segments (sometimes lines; often
paragraphs; sometimes much larger chunks), and then re-assemble them in
the desired fashion, using Lua's string processing facilities to help
change the code where needed.  As the underlying sources change, I
update the segment identification to match the new code, and most uses
of the segments then carry on working without further effort.

The script, called "untangle", is itself released under the GPL3 license
and is one of the attachments to this list message:

        http://lists.gnu.org/archive/html/bug-grep/2014-04/msg00127.html

I'm preparing to release a resynchronised and improved version of the
script, when the next release of GNU Grep has been finalised and the
dust has settled a bit.  Still no idea if my contribution will gain any
traction, but I'm having a go.

Note that GNU Grep is GPL3, so you may need to avoid reading the code if
this conflicts with other requirements/commitments/conditions within
your environment.

----

Anyway, a significant portion of GNU Grep is about locales, both unibyte
and multibyte, and issues such as case sensitivity and input validation.

For case insensitivity, there's not just uppercase and lowercase, but
also titlecase!  ...and POSIX is apparently a bit unclear:

        http://debbugs.gnu.org/cgi/bugreport.cgi?bug=16919

so Grep currently follows the convention established by GNU Regex, that
is:

        Characters A and B are considered equal, case-wise, if:

                to_uppercase (A) == to_uppercase (B)
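
As a rough illustration of that rule, here is a minimal C sketch.  The
function name chars_equal_ignoring_case is made up for this message and
is not taken from the Grep or Regex sources; I'm also assuming that
towupper() from <wctype.h> can stand in for to_uppercase, which means
the result depends on the LC_CTYPE setting of the current locale.

```c
#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper, not taken from the GNU Grep or GNU Regex
   sources: two characters are treated as equal, case-wise, when their
   uppercase mappings are identical.  towupper() consults the current
   locale's LC_CTYPE category, so call setlocale() first if anything
   beyond the default "C" locale is wanted.  */
static bool
chars_equal_ignoring_case (wint_t a, wint_t b)
{
  return towupper (a) == towupper (b);
}
```

Because both sides are mapped to uppercase before comparing, a titlecase
character needs no special handling: it matches its uppercase and
lowercase relatives whenever the locale's tables map all three to the
same uppercase character.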

Hope this is helpful.

cheers,

behoffski (Brenton Hoff)
Programmer, Grouse Software
