- Subject: Re: proposal for reading individual characters from strings faster
- From: Sean Conner <sean@...>
- Date: Sat, 3 May 2014 16:30:07 -0400
It was thus said that the Great Tim Hill once stated:
>
> However, no “simple” feature comes without hidden costs. The back-quote
> syntax appears to isolate source code from character coding issues, but
> does it? One approach is to always assume UTF-8 encoding, which is
> consistent across platforms, but may differ from the local encoding. This
> means that `a` ~= string.byte("a") on (say) EBCDIC platforms. Another
> approach is to use the local platform encoding, but this also doesn’t work
> since the locale at compile time may differ from the locale at run-time
> (even if the code is run directly after compile).
It can even change at runtime!
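A quick sketch of what that means in practice (the locale name below is a
guess; which locales actually exist depends entirely on the host system):

    -- string.upper() defers to the C library's toupper(), which is
    -- locale-dependent, and os.setlocale() can swap locales mid-run.
    print(os.setlocale(nil))          -- current locale, "C" by default
    print(("\233"):upper())           -- "C" locale: byte 0xE9 is untouched
    os.setlocale("fr_FR.ISO8859-1")   -- hypothetical name; varies by OS
    print(("\233"):upper())           -- now 0xE9 (e-acute) may upcase to 0xC9

The same byte in the same string means two different things before and
after the setlocale() call, so a byte value baked in at compile time can't
be trusted.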
One project I've been working on [1] involves parsing email [2], which
requires a lot of character-set manipulation (not dealt with in [2]). The
collection of emails I pull from uses at least a dozen character sets, if
not more.
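To give a flavor of [2] without the full grammar (this is a stripped-down
sketch, not the code at that link; the real thing deals with folding,
comments, MIME encoded words and so on), splitting a "Name: value" header
line with LPeg looks roughly like:

    local lpeg = require "lpeg"
    local P, R, S, C = lpeg.P, lpeg.R, lpeg.S, lpeg.C

    -- field name: simplified here to alphanumerics and '-'
    local name   = C((R("az","AZ","09") + P"-")^1)
    -- field value: everything up to the end of the line
    local value  = C((1 - P"\r\n")^0)
    local header = name * P":" * S(" \t")^0 * value

    print(header:match("Subject: reading characters from strings"))
    --> Subject	reading characters from strings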
-spc
[1] Long term, when I get around to it, not really important, but a fun
diversion. That type of project.
[2] Obligatory email header parsing code:
https://github.com/spc476/LPeg-Parsers/blob/master/email.lua