- Subject: [Q] handling 0xC2A0 (space in utf8)
- From: "V S P" <toreason@...>
- Date: Thu, 16 Oct 2008 15:20:18 -0400
Hi,
I am new to the list.
I have searched the archives but did not find
(or perhaps did not quite understand) how to
handle UTF-8 characters in Lua and LPeg.
Here is the situation.
I wrote a very simple domain-specific language compiler
using LPeg and Lua. The source text comes from a web page
(I use the Dojo Toolkit's dijit.Editor to provide text editing), and the page is
encoded in UTF-8.
(I am new to Lua, compiler writing, and
web development, so I may be missing something fundamental.)
The text is sent via JSON to my PHP script, where I remove HTML tags
and replace <br /> with the end-of-line character (PHP_EOL):
/* Convert to UTF-8 before doing anything else, input is also utf-8
so this is just in case */
$utf8_text = iconv( 'utf-8', "utf-8", $raw_text );
/* Strip HTML tags */
$utf8_text = strip_tags( $utf8_text );
/* Decode HTML entities */
$utf8_text = html_entity_decode( $utf8_text, ENT_COMPAT, "UTF-8" );
Then I simply invoke
Lua using proc_open and pass the source text from the web page
via STDIN to Lua (note: it is not possible to open STDIN as a 'binary'
file).
In Lua, I have specified the following LPeg pattern for whitespace:
local space=lpeg.S('\r\n\f\t ')^1
I spent about two days figuring out why certain things were not
recognized by my compiler, and it turns out that it only happens if
I put a space preceding my identifiers in the web page.
So I traced it from the web page, to JSON, to PHP, to Lua,
and I see that I am getting 0xC2A0 from the web page when I enter
a 'space'.
I now think this is what trips up LPeg when it tries to match
whitespace. I would like to tell LPeg to also treat
0xC2A0 as a space.
I do not know how to do that, but I think it would be the ideal
solution.
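One way this might be done, offered as a minimal sketch rather than a confirmed answer: 0xC2A0 is the two-byte UTF-8 encoding of U+00A0 (no-break space), and a Lua string is just a sequence of bytes, so the pair can be written with decimal escapes as '\194\160' (0xC2 = 194, 0xA0 = 160). The names nbsp, ident and padded below are made up for illustration, and the code assumes the lpeg module is installed.

```lua
local lpeg = require("lpeg")

-- U+00A0 (no-break space) encoded in UTF-8 is the byte pair 0xC2 0xA0.
-- Lua 5.1 has no \x escapes, so the bytes are written in decimal.
local nbsp = lpeg.P("\194\160")

-- lpeg.S matches single bytes only, so the two-byte sequence is added
-- as an alternative with + instead of being placed inside the set.
local space = (lpeg.S("\r\n\f\t ") + nbsp)^1

-- Hypothetical identifier rule, present only to exercise the space pattern.
local ident = lpeg.C(lpeg.R("az", "AZ")^1)
local padded = space^0 * ident

print(lpeg.match(padded, "\194\160 foo"))  --> foo
```

Putting "\194\160" inside lpeg.S would instead make each of the two bytes a separate whitespace character, which could wrongly match half of some other UTF-8 character.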
Alternatively, maybe there is a way to replace 0xC2A0 with
an ASCII space (for my purposes, a space is a 'dead' symbol
used only to separate keywords and operators).
But again, I do not know how to do that either, as I am somewhat
confused about how hex notation, two-byte values, and strings relate.
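For the replacement route, a sketch assuming the whole source text is held in one Lua string: since Lua strings are byte sequences, the hex value 0xC2A0 is simply byte 0xC2 followed by byte 0xA0, and string.gsub can swap that pair for an ASCII space before parsing. The names NBSP, normalize_spaces and sample are hypothetical.

```lua
-- string.char builds a string from byte values; hex literals are fine here.
local NBSP = string.char(0xC2, 0xA0)

-- Replace every occurrence of the two-byte sequence with a plain space.
-- Neither byte is a magic character in Lua patterns, so no escaping is needed.
-- The extra parentheses drop gsub's second return value (the match count).
local function normalize_spaces(text)
  return (text:gsub(NBSP, " "))
end

local sample = "let" .. NBSP .. "x = 1"
print(normalize_spaces(sample))  --> let x = 1
```

In the real script the input would presumably come from io.read("*a") on STDIN instead of the sample string.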
I am using Lua 5.1.3 and LPeg 0.8x
On a side note: I really like Lua, and LPeg is very easy for my
simple brain to understand. Between Lua, PHP, and JavaScript, it all
seems like one language with closures, iterators, and dynamic
classes/methods,
just with different libraries (although Lua has the most elegant syntax
of the three :-), while PHP has many more functions and will soon have closures).
Thank you in advance
--
V S P
toreason@fastmail.fm