On Jun 15, 2013, at 2:13 PM, Pierre-Yves Gérardy wrote:

> On Sat, Jun 15, 2013 at 3:52 PM, Roberto Ierusalimschy
> <roberto@inf.puc-rio.br> wrote:
>> 
>> You can already easily implement this `getchar' in standard Lua (except
>> that it assumes a well-formed string):
>> 
>>  S = "∂ƒ"
>>  print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 1))  --> '∂', 4
>>  print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 4))  --> 'ƒ', 6
>>  print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 6))  --> nil
> 
> Thanks for this pattern trick, it helped me to improve my
> `getcodepoint()` routine (although I eventually found a faster
> method). A validation routine won't be of much help
> if you deal with a document that sports multiple encodings.

I tend to think, "strings are components of documents"; when using the XML Infoset as a model, encodings can be abstracted away. But concrete strings are different. In general, people don't have individual strings with multiple encodings; people have bugs.[1]

> If you want to validate the characters on the go and always get a position as
> second argument, you need something like this:
> 
> If the character is not valid, it returns `false, position`. At the
> end of the stream, it returns `nil, position + 1`.

I don't understand where "false" instead of an error would be useful. Once you've decided to iterate over a string as UTF-8, it is a surprise when the string turns out not to be UTF-8, and it's unlikely your code will do anything useful. There could be a separate utf8.isvalid(s, [byteoffset [, bytelen]]) for when you're testing.
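
A byte-level checker along those lines is easy enough to sketch in plain Lua (the name and signature are only my suggestion); this one also rejects overlong forms, the surrogates, and anything above U+10FFFF:

local function isvalid(s, i, j)
  i, j = i or 1, j or #s
  while i <= j do
    local c = s:byte(i)
    local n, lo, hi                            -- continuation count, bounds on first continuation
    if c < 0x80 then n = 0                     -- ASCII
    elseif c >= 0xC2 and c <= 0xDF then n = 1
    elseif c == 0xE0 then n, lo = 2, 0xA0      -- reject overlong 3-byte forms
    elseif c >= 0xE1 and c <= 0xEC then n = 2
    elseif c == 0xED then n, hi = 2, 0x9F      -- reject UTF-16 surrogates
    elseif c >= 0xEE and c <= 0xEF then n = 2
    elseif c == 0xF0 then n, lo = 3, 0x90      -- reject overlong 4-byte forms
    elseif c >= 0xF1 and c <= 0xF3 then n = 3
    elseif c == 0xF4 then n, hi = 3, 0x8F      -- cap at U+10FFFF
    else return false, i end                   -- stray continuation, 0xC0/0xC1, 0xF5 and up
    if i + n > j then return false, i end      -- truncated sequence
    for k = 1, n do
      local cc = s:byte(i + k)
      if cc < (k == 1 and lo or 0x80) or cc > (k == 1 and hi or 0xBF) then
        return false, i
      end
    end
    i = i + n + 1
  end
  return true
end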

I am one of those "assert everything" fascists, though. Code encountering "false" in place of an expected string often blows up anyway (although convenience functions which auto-coerce to string can hide that). The question is how promptly the error is signaled.

>            --UTF-16 surrogate code point checking left out for clarity.

...plus the stuff over U+10FFFF...

> This is not that complex, but still rather slow in Lua, and the same
> goes for getting the code point to perform a range query (useful to
> test if a code point is part of some alphabet).
> 
> To that end, you could provide a `utf8.range(char, lower, upper)`, though.

UTF-8 is constructed such that encoded code points sort in code point order under plain 8-bit strcmp. So you can replace that with

utf8 = utf8 or {}  -- no built-in utf8 table yet

-- all three arguments are UTF-8-encoded code points, compared as byte strings
function utf8.inrange(single_codepoint, lower_codepoint, upper_codepoint)
  return single_codepoint >= lower_codepoint and single_codepoint <= upper_codepoint
end

and if you compare with `< upper_codepoint_plus_one` instead, you don't even need to extract the code point from the longer string: because UTF-8 is a prefix-free code, the suffix starting at any code point boundary compares the same way the code point itself does, so you can test range membership at an arbitrary byte offset. All of these nice properties go to hell if invalid UTF-8 creeps in, though.
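
For example (a sketch; the function name is mine, and the \xNN escapes assume Lua 5.2):

-- Does the code point starting at byte offset i of s fall in
-- [lower, upper_plus_one)?  Compare the whole suffix; no extraction needed.
local function inrange_at(s, i, lower, upper_plus_one)
  local tail = s:sub(i)
  return tail >= lower and tail < upper_plus_one
end

-- Greek capitals live in U+0391..U+03A9, so upper-plus-one is U+03AA:
print(inrange_at("Ωmega", 1, "\xCE\x91", "\xCE\xAA"))  --> true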

Languages like Lua tend to be very slow when operating character by character. I think some kind of map/collect primitive for working with code points is needed, and it probably has to be in C for speed. Because so many functions on Unicode are sparse, something like map_table[page][offset] is useful, especially if those tables have metatables which can answer with a function and optionally populate the pages lazily.
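
A sketch of that shape (all names here are mine; is_letter stands in for whatever character database you actually have):

-- Sparse two-level lookup: map[page][offset], where each 256-entry
-- page is built on first touch by the metatable and then cached.
local function make_map(populate_page)
  return setmetatable({}, {
    __index = function(map, page)
      local t = populate_page(page)   -- build the page once
      rawset(map, page, t)            -- cache it so __index never fires again
      return t
    end,
  })
end

-- is_letter(cp) is hypothetical; plug in a real property source.
local letter_map = make_map(function(page)
  local t = {}
  for offset = 0, 255 do
    t[offset] = is_letter(page * 256 + offset)
  end
  return t
end)

local cp = 0x03A9  -- Ω
print(letter_map[math.floor(cp / 256)][cp % 256])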

Jay

[1]: If somebody hands you a URL, you can't round-trip the %-encoding through Unicode; it must be preserved as US ASCII. Casual URL manipulation is full of string-typing bugs. I wrote a web server which used %23 in URLs. Broken web proxies would unescape the string, notice that there was now a "#foo", and truncate the URL at the mistaken fragment identifier.