Yes, but a string may *embed* null bytes that you don't want to match. Lua does keep a terminating NUL byte after each string's content, and that makes some matching a bit faster, but the presence of a null byte is not necessarily the start or end of a string. In fact the start of the string does not even need to be scanned, since patterns are matched only in the forward direction. So those NUL bytes are just guards that make it easy to pass the string as a classic null-terminated C string to some API without always requiring an extra buffer. (Such APIs won't work correctly if the string embeds null bytes: they will see only a truncated value and ignore the rest.)
In general you should avoid code that depends on null bytes: using the standard '$' anchor in patterns is recommended.
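
To illustrate the difference between an embedded NUL and the real end of a string, here is a minimal sketch. It assumes Lua 5.2 or later, where "\0" may appear directly in a pattern; the sample string is made up for the example.

```lua
-- Assumption: Lua 5.2 or later (patterns may contain "\0" as a regular character).
local s = "abc\0def"            -- 7 bytes, with a NUL embedded in the middle

-- The length counts the embedded NUL; nothing is truncated on the Lua side.
print(#s)                       --> 7

-- A literal "\0" in a pattern finds the embedded byte, not the end of the string.
local nul_start, nul_stop = s:find("\0")
print(nul_start, nul_stop)      --> 4  4

-- '$' anchors at the real end of the string, whatever bytes come before it.
local tail = s:match("%a+$")
print(tail)                     --> def
```

A C API that receives this string as a null-terminated pointer would only see "abc", which is the truncation problem described above.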

Also note that there are other packages that extend the string package to support UTF-8, other multibyte UTFs (such as UTF-16 or UTF-32, which need embedded null bytes), or other encodings (notably JIS- and CJK-based ones). Also, because Lua strings are just arrays of bytes with little or no interpretation, they are usable for storing binary-encoded objects (which is much more efficient and compact than using arrays of numbers for byte values), as Lua offers no other native support for "arbitrary byte buffers". That's why the Lua string package is so reduced in functionality (and patterns were also deliberately limited: for PCRE regexps you need an extra package, but it can safely be written using Lua strings as its internal storage).
Finally, "%z" in patterns is just an alias for "\000" (a literal notation that has no meta-meaning by itself in patterns). '$' does have such a meaning and matches zero characters, whereas "%z" or "\000" matches one byte of actual string content, which will be present in the returned match or capture groups.
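
As a small sketch of the "strings as byte buffers" point, the following uses only string.char and string.byte from the standard library; the byte values are arbitrary and chosen just for the example.

```lua
-- Build a 4-byte binary value, including NUL bytes, as an ordinary Lua string.
local buf = string.char(0x00, 0xFF, 0x10, 0x00)

-- The length is the byte count; NUL bytes are stored like any other byte.
print(#buf)                     --> 4

-- Read the bytes back as numbers.
local b1, b2, b3, b4 = buf:byte(1, -1)
print(b1, b2, b3, b4)           --> 0  255  16  0
```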
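
The difference between consuming a byte and anchoring can be sketched as follows, again assuming Lua 5.2 or later so that "\0" can be written directly in the pattern. The strings are hypothetical sample data.

```lua
-- Assumption: Lua 5.2 or later.
local with_trailing_nul = "key=value\0"   -- the last byte is an actual NUL
local plain = "key=value"                 -- no NUL in the content

-- "\0" consumes one byte of content, so that byte is part of the match.
local m1 = with_trailing_nul:match("value\0")
print(#m1)                      --> 6

-- '$' matches zero characters: the match holds only what precedes the end.
local m2 = plain:match("value$")
print(#m2)                      --> 5

-- "\0" does not match the end of a string that has no NUL in its content.
local m3 = plain:match("value\0")
print(m3)                       --> nil
```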

So I see no reason to use "%z" or "\000" for this case; it would be an unstable solution.

On Thu, Jan 21, 2021 at 20:07, Gé Weijers <ge@weijers.org> wrote:
On Thu, Jan 21, 2021 at 10:04 AM Philippe Verdy <verdyp@gmail.com> wrote:
>
> and here using %z (really deprecated?) or \000 would be incorrect: you don't want to match a nul character in the source text (which would be invalid anyway anywhere in a JavaScript source, or URL, or HTML using a single-byte character encoding). To match the end of string use the '$' anchor only.

The frontier pattern uses the null character to match the beginning or end
of a string; this is documented.
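
For readers unfamiliar with it, the frontier behaviour mentioned here can be sketched like this. It assumes Lua 5.2 or later, where %f is documented and both ends of the subject are handled as if they were '\0'; the sample text is made up.

```lua
-- Assumption: Lua 5.2 or later (documented %f frontier pattern).
local words = {}

-- %f[%w] matches the empty position where a word starts, %f[%W] where it ends.
-- The start and end of the subject count as '\0', which is not in %w,
-- so the first and last words are found without any special casing.
for w in ("hello world"):gmatch("%f[%w]%w+%f[%W]") do
  words[#words + 1] = w
end

print(#words, words[1], words[2])   --> 2  hello  world
```

The frontier matches an empty position, so, like '$', it adds no byte to the match.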
