Re: byteoffset() in lutf8lib.c from 5.3, work2

lua-l archive


Subject: Re: byteoffset() in lutf8lib.c from 5.3, work2
From: Coda Highland <chighland@...>
Date: Tue, 13 May 2014 17:42:33 -0700

On Tue, May 13, 2014 at 5:31 PM, Sean Conner <sean@conman.org> wrote:
>   In Turkish [1], the upper case version of 'i' is 'İ' (a capital 'I' with a
> dot), while the lower case of 'I' is 'ı' (a lowercase 'i' without the dot).
> With codepages, you stood a chance of knowing you were working with Turkish.
> With Unicode?
>
>   -spc (I suppose you could check $LANG ... )
>
> [1]     http://en.wikipedia.org/wiki/Dotted_and_dotless_I

Except with code pages, you can't even guarantee that a given series
of bytes is being mapped to the right CHARACTERS unless you have
metadata. Unicode is at least a step ahead in that sense; there's no
ambiguity in that direction (though of course we've discussed the
ambiguity in the OTHER direction). Knowing that you're dealing with
Turkish text is still a matter of metadata; that hasn't changed.
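
To make both halves of that concrete, here is a rough Python sketch (not
anything from lutf8lib.c). It assumes the Windows code pages cp1254
(Turkish) and cp1252 (Western European) as the example encodings, and
turkish_upper() is a hypothetical helper written only for this
illustration:

```python
# The same byte names different characters under different code pages,
# so the code page itself is required metadata.
raw = b"\xdd"
as_turkish = raw.decode("cp1254")  # U+0130, capital I with dot above
as_western = raw.decode("cp1252")  # U+00DD, capital Y with acute

# In Unicode the code point says which character is meant, but the
# default case mapping is language-independent: 'i' upper-cases to 'I'.
default_upper = "i".upper()


def turkish_upper(text: str) -> str:
    """Hypothetical helper: upper-case text using the Turkish i rules.

    The caller has to know the text is Turkish; nothing in the string
    itself carries that information.
    """
    # Map dotted i to dotted capital I and dotless i to plain capital I,
    # then let the default mapping handle every other letter.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()
```

The decode calls need the code page name passed in, and turkish_upper()
only gives the right answer because the caller chose it; in both cases
the knowledge comes from outside the data.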

/s/ Adam

