On 6/27/2017 6:56 PM, Duane Leslie wrote:
> Hi, I had a problem with the quoted string format producing strings that were not legal UTF-8 because it was not escaping non-ASCII characters, and then once I fixed that I wasn't able to read the strings back into a C program because C uses octal escapes and Lua uses decimal.

But string.format is not documented to produce a legal C string literal. See https://www.lua.org/manual/5.3/manual.html#pdf-string.format where it says "The q option formats a string between double quotes, using escape sequences when necessary to ensure that it can safely be read back by the Lua interpreter." Note that it explicitly does not mention that the result could be safely read by a C compiler.

Within string literals, Lua is already perfectly fine with UTF-8 content. It may also be fine with other extended ASCII forms, or with some other Unicode translation formats, leaving most of those details up to the system outside of Lua. But don't use UTF-7. Just don't.

> This patch ensures all control and non-ASCII characters are escaped, and uses the hexadecimal escape syntax instead of decimal to ensure compatibility between Lua and C.

Here you are using "Lua" to mean Lua 5.2 or later. The still widely used Lua 5.1 did not support hex escapes.

> Technically it is still not safe to pass the strings as literals directly into C, because in C the hexadecimal production is not automatically terminated at two characters, but I figured this was outside of the scope of the quoted string format specifier. I solve this instead by using `:gsub([[%f[\]\x%x%x]],'%0""')` to terminate the hexadecimal escapes (triggering C's string literal concatenation behaviour) at the point of export.

You would be far better served by writing a Lua function that generates a proper C string literal and calling that, instead of depending on string.format("%q") and additional processing.

Lua is designed to work well with C. It is also designed to be used by people who don't want to know anything about C or lower level programming issues. While the choice of base-10 for \ddd escapes is occasionally a source of friction when switching back and forth between the languages, it is no worse than the choice of 1-based array indexing and numerous other details that differ.

--
Ross Berteig                               Ross@CheshireEng.com
Cheshire Engineering Corp.           http://www.CheshireEng.com/
+1 626 303 1602
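
As one illustration of the suggestion above to write a dedicated Lua function rather than post-process string.format("%q"), here is a minimal sketch. The name `c_string_literal` is hypothetical and appears nowhere in the original thread, and the choice of three-digit octal escapes is an assumption made for this example: C reads at most three octal digits per escape, so no `""` splitting is needed afterwards. It does not attempt to handle C trigraph sequences, and it only uses calls that should also be available in Lua 5.1.

```lua
-- Hypothetical helper (not from the original thread): turn a Lua string
-- into text that a C compiler can read as a string literal.
local function c_string_literal(s)
  local body = s:gsub('.', function(ch)
    local b = ch:byte()
    if ch == '"' or ch == '\\' then
      -- Quote and backslash must themselves be escaped in C.
      return '\\' .. ch
    elseif b < 32 or b > 126 then
      -- Control and non-ASCII bytes become three-digit octal escapes.
      -- C stops reading an octal escape after three digits, so the
      -- following character can never be absorbed into the escape.
      return string.format('\\%03o', b)
    end
    -- Returning nil leaves printable ASCII characters unchanged.
  end)
  return '"' .. body .. '"'
end

-- "\195\169" is the UTF-8 encoding of U+00E9, written with Lua's
-- decimal escapes; the C output uses the octal form of the same bytes.
print(c_string_literal('caf\195\169'))  --> "caf\303\251"
print(c_string_literal('\1ab'))         --> "\001ab"
```

The second call shows the case that the hexadecimal form gets wrong: a C compiler would read `\x01ab` as a single over-long escape, while `\001ab` is unambiguous.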