Currently, the description of patterns (in the standard string library) mixes classes that have a single member character with the other classes. I propose separating them as follows:
----
Singleton:
A singleton is used to represent a single character that must be matched exactly. The following singletons are allowed:
- x: (where x is not one of the magic characters in the string '^$()%.[]*+-?') represents the character x itself.
- %x: (where x is any non-alphanumeric character) represents the character x. This is the standard way to escape the magic characters. Any punctuation character (even the non magic) can be preceded by a '%' when used to represent itself in a pattern.
- %z: represents the character with representation 0.
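As an illustration of the escaping rule for singletons, here is a minimal sketch. The helper name escape_pattern is hypothetical (it is not part of the string library), and the results shown assume the default "C" locale.

```lua
-- Hypothetical helper: prefix every punctuation character with '%',
-- so that the resulting pattern matches the input text literally.
local function escape_pattern(text)
  -- "%p" matches one punctuation character; in the replacement string,
  -- "%%" is a literal '%' and "%0" is the matched character.
  -- The outer parentheses drop gsub's second result (the match count).
  return (text:gsub("%p", "%%%0"))
end

local literal = escape_pattern("1+1=2")   -- "1%+1%=2" in the C locale
local first, last = ("so 1+1=2 holds"):find(literal)
print(literal, first, last)
```

Escaping '=' is not required (it is not magic), but it is harmless, which is what makes this blanket approach work.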
Character class:
A character class is used to represent a set of characters. A character class matches a single character in the input text if and only if it is any one of the characters represented in this set. The following combinations are allowed in describing a character class:
- .: (a dot) represents all characters.
- %a: represents all letters.
- %c: represents all control characters.
- %d: represents all digits.
- %l: represents all lowercase letters.
- %p: represents all punctuation characters.
- %s: represents all space characters.
- %u: represents all uppercase letters.
- %w: represents all alphanumeric characters.
- %x: represents all hexadecimal digits.
- [set]: represents the class which is the union of all characters in set.
Singletons in set represent themselves.
Character classes described above can also be used as components in set, with the exception of . (the dot) which represents itself inside a set like other singletons.
For example, [%w_] (or [_%w]) represents all alphanumeric characters plus the underscore.
A range of characters can be specified by separating the end characters of the range with a '-'.
For example, [0-7] represents the octal digits.
Singletons (including the dot), character classes %x and ranges can be mixed in the same set. For example [0-7%l%-] represents the octal digits plus the lowercase letters plus the '-' character.
- [^set]: represents the complement of set, where set is interpreted as above.
The interaction between classes and ranges is not defined. Therefore, patterns like [%a-z] or [a-%d] have no meaning. However, any singleton is allowed in ranges, so [%%-a] has a meaning.
Which characters a range includes nevertheless depends on the internal encoding of characters and their relative order.
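The sets taken as examples in the list above can be tried out with a short sketch like the following. The sample strings and variable names are made up for illustration, and the results assume an ASCII-compatible encoding and the default "C" locale.

```lua
-- [%w_] : alphanumeric characters plus the underscore.
local ident = ("  my_var2 = 10"):match("[%w_]+")

-- [0-7] : a range covering the octal digits.
local octal = ("mode 0755!"):match("[0-7]+")

-- [0-7%l%-] : octal digits, lowercase letters, and a literal '-'.
local slug = ("ID: abc-07-xyz9"):match("[0-7%l%-]+")

-- [^%s] : complement of a set, here everything except space characters.
local word = ("  hello world"):match("[^%s]+")

-- Inside a set the dot is a singleton: [.] matches only a literal dot.
local dot_pos = ("a.b"):find("[.]")

print(ident, octal, slug, word, dot_pos)
```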
For all classes represented by single letters (%a, %c, etc.), the corresponding uppercase letter represents the complement of the class. For instance, %S represents all non-space characters.
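A small sketch of the uppercase complements, using made-up sample strings and assuming the default "C" locale:

```lua
local text = "two  words"

-- %S+ : the first run of non-space characters.
local first_word = text:match("%S+")

-- %S is expected to behave like the complemented set [^%s].
local same_as_set = text:match("%S+") == text:match("[^%s]+")

-- %D matches every non-digit, so removing those leaves only the digits.
-- (Assigning to a single local keeps only gsub's first result.)
local digits_only = ("tel: 555-0199"):gsub("%D", "")

print(first_word, same_as_set, digits_only)
```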
The definitions of letter, space, and other character groups depend on the current locale. In particular, the class [a-z] may not be equivalent to %l. In some locales whose encoding is not ASCII-compatible, the ranges may include unexpected characters. For example, the class [a-z] is not guaranteed to include only the lowercase letters a to z if the encoding is any variant of EBCDIC.
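To check what a given class or range actually contains on a particular system, one can enumerate all single-byte characters it matches. This is only a sketch: the helper name members is hypothetical, and the comparison shown is what I would expect in the "C" locale on an ASCII platform; other locales or encodings may give different results.

```lua
-- Hypothetical helper: return, in byte order, every single-byte
-- character matched by the given class or set.
local function members(class)
  local found = {}
  for byte = 0, 255 do
    local ch = string.char(byte)
    -- Anchor on both sides so the class must match the whole character.
    if ch:find("^" .. class .. "$") then
      found[#found + 1] = ch
    end
  end
  return table.concat(found)
end

os.setlocale("C")                     -- make the result reproducible
local range_az = members("[a-z]")     -- depends on the encoding
local class_l  = members("%l")        -- depends on the locale
local octal    = members("[0-7]")
print(range_az == class_l, #class_l, octal)
```

Running the same script after os.setlocale with another locale name is one way to see whether %l grows beyond [a-z] there.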
----
What is the difference?
- [%%-a] or [%z-%%] now have a meaning (just like they already have in existing implementations).
- But there is another statement explaining that ranges are locale-dependent (e.g. ASCII versus EBCDIC), including the ranges of basic letters a-z or A-Z (these ranges are included in the classes %l and %u respectively, which may include other letters).
- However, I know of no encoding (or locale) in which this dependency affects the range of basic decimal digits 0-9 (which is itself included in the class %d, possibly containing other digits in some locales), so the sets defined above for the basic octal digits (using the range 0-7) should be portable across all locales and encodings.