Re: byteoffset() in lutf8lib.c from 5.3, work2

lua-l archive


Subject: Re: byteoffset() in lutf8lib.c from 5.3, work2
From: Sean Conner <sean@...>
Date: Tue, 13 May 2014 20:31:27 -0400

It was thus said that the Great Coda Highland once stated:
> On Tue, May 13, 2014 at 4:52 PM, Tim Hill <drtimhill@gmail.com> wrote:
> >
> > On May 13, 2014, at 4:45 PM, Sean Conner <sean@conman.org> wrote:
> >
> >> It was thus said that the Great Coroutines once stated:
> >>> On Tue, May 13, 2014 at 3:41 PM, Sean Conner <sean@conman.org> wrote:
> >>>
> >>>>  If you are curious, check out the source code to joe (Joe's Editor),
> >>>> specifically, the files i18n.c and utf8.c, to see just the amount of code
> >>>> required to maybe, hopefully, handle UTF-8.  I have no idea how well it
> >>>> deals with right-to-left languages.
> >>>
> >>> https://github.com/paul-schwendenman/joe-editor/blob/master/joe/i18n.c
> >>>
> >>> I am not a fan of the proliferation of wide characters :(
> >>
> >>  Well, if you want to handle checking for control characters, spaces, upper
> >> case, lower case, numbers, combining characters or punctuation ...
> >>
> >>  -spc (i18n.c and utf8.c compile to about 31k on a 32-bit system ... )
> >>
> >>
> >
> > Just editorializing for a moment, when it first appeared Unicode was
> > supposed to clean up the mess with codepages, all the various odd
> > multi-byte character hacks (shift-JIS anyone?) and make multi-lingual
> > applications far easier to code. Fast forward and I’m not sure that the
> > “cure” is any better than the original problem. Any standard that has a
> > “normalized” form that is in fact FOUR different forms is in trouble
> > imho.
> >
> > —Tim
> 
> I mildly disagree. While I agree that Unicode isn't perfect, I think
> it HAS successfully addressed the goals it set out to accomplish. In
> my opinion and experience, Unicode is better than any extant
> alternative.

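  In Turkish [1], the upper case version of 'i' is 'İ' (a capital 'I' with a
dot), while the lower case of 'I' is 'ı' (a lowercase 'i' without the dot).
With codepages, you stood a chance of knowing you were working with Turkish.
With Unicode?  The same code points for 'i' and 'I' serve every Latin-script
language, so the text alone does not tell you which rule applies.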

  -spc (I suppose you could check $LANG ... )
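
  To make the Turkish case concrete, here is a minimal sketch in C of case
mapping that depends on a language tag.  The names is_turkic(), upper_cp()
and lower_cp() are hypothetical helpers invented for this illustration (they
are not part of lutf8lib.c or joe), the tag is assumed to look like a POSIX
locale name such as "tr_TR.UTF-8", and only ASCII letters plus the two
Turkish i's are handled:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: does the tag name a language that uses the
 * dotted/dotless i pair?  Assumes tags like "tr", "tr_TR.UTF-8", "az-AZ". */
static int is_turkic(const char *lang)
{
  if (lang == NULL)
    return 0;
  if (strncmp(lang, "tr", 2) != 0 && strncmp(lang, "az", 2) != 0)
    return 0;
  /* The two-letter code must end here, so "trv_TW" does not match. */
  return lang[2] == '\0' || lang[2] == '_' || lang[2] == '-'
      || lang[2] == '.'  || lang[2] == '@';
}

/* Upper-case one code point (ASCII and the Turkish i's only). */
static uint32_t upper_cp(uint32_t cp, const char *lang)
{
  if (cp == 'i')
    return is_turkic(lang) ? 0x0130u : 'I';   /* U+0130 is 'İ' */
  if (cp == 0x0131u)                          /* U+0131 is 'ı' */
    return 'I';
  if (cp >= 'a' && cp <= 'z')
    return cp - ('a' - 'A');
  return cp;
}

/* Lower-case one code point (ASCII and the Turkish i's only). */
static uint32_t lower_cp(uint32_t cp, const char *lang)
{
  if (cp == 'I')
    return is_turkic(lang) ? 0x0131u : 'i';
  if (cp == 0x0130u)       /* simple (one-to-one) mapping of 'İ' is 'i' */
    return 'i';
  if (cp >= 'A' && cp <= 'Z')
    return cp + ('a' - 'A');
  return cp;
}
```

  A caller could pass getenv("LANG") as the tag, which is the only place the
language comes from here; the code points themselves give no hint.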

[1]	http://en.wikipedia.org/wiki/Dotted_and_dotless_I
