lua-l archive



Currently, the definition of patterns (in the standard string library) mixes classes that have a single member character with all other classes.

It would be convenient to separate them into two categories:

----
Singleton:

A singleton is used to represent a single character that must be matched exactly. The following singletons are allowed:


Character class: 

A character class is used to represent a set of characters. A character class matches a single character in the input text if and only if it is any one of the characters represented in this set. The following combinations are allowed in describing a character class:
The interaction between classes and ranges is not defined. Therefore, patterns like [%a-z] or [a-%d] have no meaning. However, any singleton is allowed as an endpoint of a range, so [%%-a] has a meaning.
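How a given implementation actually reads a bracket such as [%%-a] can be probed rather than assumed. The sketch below is only an illustration: the function name probe_escaped_range is hypothetical, and the choice of "0" as a test character assumes an ASCII-compatible encoding, where "0" lies strictly between "%" and "a". No particular result is claimed for the second return value.

```lua
-- Hypothetical probe (not part of the string library).
local function probe_escaped_range()
  local set = "[%%-a]"
  -- Both endpoints should match whether the bracket is read as a range
  -- or as a list of separate characters.
  local endpoints_ok = string.find("%", set) ~= nil
                   and string.find("a", set) ~= nil
  -- Assumption: ASCII-compatible encoding, so "0" sorts between "%" and "a".
  -- If "0" matches, the bracket was read as a range.
  local read_as_range = string.find("0", set) ~= nil
  return endpoints_ok, read_as_range
end

print(probe_escaped_range())
```

If the second value is false, the implementation treated the escaped character and the following characters as separate set members rather than as range endpoints.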

However, which characters are included in ranges depends on the internal encoding of characters and their relative order.
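As a rough illustration of this dependence, the following sketch lists which single-byte characters a range covers on the running system by testing every byte value against it. The helper name range_members is hypothetical, and it assumes both endpoints are plain alphanumeric characters that need no escaping.

```lua
-- Hypothetical helper (not part of the string library): return a string
-- holding every single-byte character matched by the range lo-hi.
-- Assumption: lo and hi are alphanumeric, so they need no escaping.
local function range_members(lo, hi)
  local set = "[" .. lo .. "-" .. hi .. "]"
  local members = {}
  for code = 1, 255 do
    local ch = string.char(code)
    if string.find(ch, set) then
      members[#members + 1] = ch
    end
  end
  return table.concat(members)
end

-- On an ASCII-compatible encoding this prints 26; on an encoding where
-- the letters are not contiguous the count would be larger.
print(#range_members("a", "z"))
```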

For all classes represented by single letters (%a, %c, etc.), the corresponding uppercase letter represents the complement of the class. For instance, %S represents all non-space characters.
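A minimal sketch of the complement rule, using only string.find and string.match from the standard library. The helper name complement_holds is hypothetical; it checks that every single-byte character matches exactly one of %s and %S.

```lua
local text = "two  words"
-- %s matches a space character; %S matches any character that is not one.
local first_space = string.find(text, "%s")   -- position of the first space
local first_word = string.match(text, "%S+")  -- first run of non-spaces

-- Hypothetical check: each character belongs to %s or to %S, never both.
local function complement_holds()
  for code = 1, 255 do
    local ch = string.char(code)
    local in_class = string.find(ch, "%s") ~= nil
    local in_complement = string.find(ch, "%S") ~= nil
    if in_class == in_complement then
      return false
    end
  end
  return true
end

print(first_space, first_word, complement_holds())
```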

The definitions of letter, space, and other character groups depend on the current locale. In particular, the class [a-z] may not be equivalent to %l. In some locales whose encoding is not ASCII-compatible, ranges may include unexpected characters. For example, the class [a-z] is not guaranteed to include only the lowercase letters a to z if the encoding is any variant of EBCDIC.

----

What is the difference?
  • [%%-a] or [%z-%%] now have a meaning (just like they already have in existing implementations).
  • But there is another statement explaining that ranges are locale-dependent (e.g. ASCII versus EBCDIC), including ranges of basic letters such as a-z or A-Z (these ranges are included in the classes %l and %u respectively, which may include other letters).
  • However, I do not know of any encoding (or locale) in which the range of basic decimal digits 0-9 contains anything other than those ten digits (the range is itself included in the class %d, which may contain other digits in some locales), so the sets defined above for basic octal digits (using the range 0-7) should be portable across all locales and encodings.
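
The octal case in the last point can be sketched as follows. The helper name is_octal is hypothetical; it relies only on the range 0-7 and on the standard tonumber function with an explicit base.

```lua
-- Hypothetical helper: true if s is a non-empty string made only of the
-- basic octal digits, written as the range 0-7 rather than the class %d.
local function is_octal(s)
  return string.find(s, "^[0-7]+$") ~= nil
end

print(is_octal("0755"))   -- true
print(is_octal("0789"))   -- false: 8 and 9 are outside the range 0-7
print(tonumber("755", 8)) -- 493
```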