Nre Package Commands


NAME

nrematch - Match a regular expression against a string

SYNOPSIS

package require nre ?2.0?
nrematch ?switches? exp string ?matchVar? ?subMatchVar subMatchVar ...?

DESCRIPTION

Determines whether the regular expression exp matches part or all of string and returns 1 if it does, 0 if it doesn't.

If additional arguments are specified after string then they are treated as the names of variables in which to return information about which part(s) of string matched exp. MatchVar will be set to the range of string that matched all of exp. The first subMatchVar will contain the characters in string that matched the leftmost parenthesized subexpression within exp, the next subMatchVar will contain the characters that matched the next parenthesized subexpression to the right in exp, and so on.

Instead of using the standard regular expression package, nrematch uses the package described in this man page.

If the initial arguments to nrematch start with - then they are treated as switches. The following switches are currently supported:

-nocase
Causes upper-case characters in string to be treated as lower case during the matching process.

-indices
Changes what is stored in the subMatchVars. Instead of storing the matching characters from string, each variable will contain a list of two decimal strings giving the indices in string of the first and last characters in the matching range of characters.

-all
Instead of returning after a single match, all ranges in string that match exp are found. Returns the number of matches found. The matchVar and subMatchVars are set to an empty list, and as each match is found an element is appended to each variable's list. If the -indices switch is used then two elements are appended to each list for each match found.

--
Marks the end of switches. The argument following this one will be treated as exp even if it starts with a -.

If there are more subMatchVars than parenthesized subexpressions within exp, or if a particular subexpression in exp doesn't match the string (e.g. because it was in a portion of the expression that wasn't matched), then the corresponding subMatchVar will be set to ``-1 -1'' if -indices has been specified or to an empty string otherwise.
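
The variable handling and switches described above can be sketched as follows. This is a minimal illustration that assumes the nre 2.0 package is installed and loadable; the sample strings and variable names are made up for the example.

```tcl
package require nre 2.0

set line "width=640"

# Plain match: matchVar gets the whole match, subMatchVars get each group.
if {[nrematch {([a-z]+)=([0-9]+)} $line all key value]} {
    puts "$all: key=$key value=$value"    ;# width=640: key=width value=640
}

# -indices stores first/last character positions instead of text.
# The {} placeholder skips matchVar.
nrematch -indices {([a-z]+)=([0-9]+)} $line {} keySpan valueSpan
puts "$keySpan / $valueSpan"              ;# 0 4 / 6 8

# -all returns the number of matches and collects them in a list.
set count [nrematch -all {[0-9]+} "a1 b22 c333" numbers]
puts "$count: $numbers"                   ;# 3: 1 22 333
```

The patterns are enclosed in braces so that Tcl does not treat ``[]'' as command substitution.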

REGULAR EXPRESSIONS

Regular expressions are implemented using Henry Spencer's package (thanks, Henry!), and much of the description of regular expressions below is copied verbatim from his manual entry.

A regular expression is zero or more branches, separated by ``|''. It matches anything that matches one of the branches.

A branch is zero or more pieces, concatenated. It matches a match for the first, followed by a match for the second, etc.

A piece is an atom possibly followed by ``*'', ``+'', ``?'', or ``{x,y}'', which in turn might be followed by a ``?''. A ``*'' matches a sequence of 0 or more matches of the atom. A ``+'' matches a sequence of 1 or more matches of the atom. A ``?'' matches a sequence of 0 or 1 matches of the atom. A ``{x}'' matches a sequence of exactly x matches of the atom. A ``{x,}'' matches a sequence of x or more matches of the atom. A ``{x,y}'' matches a sequence of at least x and at most y matches of the atom. By default a piece will match as long a sequence as possible. However, if one of the piece constructs described above has a ``?'' after it then the piece will match as short a sequence as possible.

Note that the ``{x,y}'' repetition construct is only recognized if the p flag is set.
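
As a short sketch of the difference between the default and the ``?''-suffixed piece constructs, and of the p flag requirement for ``{x,y}'', consider the following. It assumes the nre 2.0 package is available; the sample strings are illustrative only.

```tcl
package require nre 2.0

set html {<b>bold</b>}

nrematch {<.*>}  $html greedy    ;# greedy is <b>bold</b>
nrematch {<.*?>} $html lazy      ;# lazy is <b>

# {x,y} is only recognized once the p flag is set.
nrematch {(?p)[0-9]{2,3}} "12345" run    ;# run is 123
```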

An atom is a regular expression in parentheses (matching a match for the regular expression), a range (see below), ``.'' (matching any single character), ``^'' (matching the null string at the beginning of the input string), ``$'' (matching the null string at the end of the input string), a ``\'' followed by a single character (matching that character, or matching something special if the p flag is used; see the FLAGS section for details), or a single character with no other significance (matching that character).

A range is a sequence of characters enclosed in ``[]''. It normally matches any single character from the sequence. If the sequence begins with ``^'', it matches any single character not from the rest of the sequence. If two characters in the sequence are separated by ``-'', this is shorthand for the full list of ASCII characters between them (e.g. ``[0-9]'' matches any decimal digit). To include a literal ``]'' in the sequence, make it the first character (following a possible ``^''). To include a literal ``-'', make it the first or last character.

A parentheses atom in which the character immediately after the ``('' is a ``?'' is a special construct with one of the following meanings:

``(?:''regexp``)'' is a shy group. It groups like ``()'' but doesn't capture the text for backreferences like ``()'' does. It matches if regexp matches.

``(?=''regexp``)'' is a non-capturing zero-width positive lookahead assertion. It matches if regexp matches. The matched text is not consumed.

``(?!''regexp``)'' is a non-capturing zero-width negative lookahead assertion. It matches if regexp does not match.

``(?#''any text``)'' is a comment. The entire atom is treated as an empty string.

``(?ipxm)'' is used to set flags. Any combination of the flag characters ``ipxm'' is allowed. The entire atom is treated as an empty string. See the FLAGS section for a description of each flag.

``(?|''range``)'' is an alternate syntax for a character range. Its benefit is that it does not use the Tcl special characters ``[]'' to enclose the range.
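
A few of these constructs are sketched below, assuming the nre 2.0 package is installed. The input strings are made up for the example, and the last command is shown without a result because it relies only on the description of ``(?|''range``)'' given above.

```tcl
package require nre 2.0

# (?:...) groups without capturing, so the first subMatchVar
# receives the text of the second parenthesized group.
nrematch {(?:ab)+(c+)} "xxababcc" all tail
puts "$all / $tail"    ;# ababcc / cc

# (?#...) is a comment and is treated as an empty string.
nrematch {a(?#optional sign)-?b} "a-b" signed
puts $signed           ;# a-b

# (?|...) writes a range without brackets, so the pattern needs no
# braces to protect it from Tcl command substitution.
nrematch (?|0-9)+ "room 101" digits
```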

FLAGS

Flags can be set using a ``(?''flag-char``)'' atom. Some commands that use regular expressions have options that set some of these same flags. For example the -nocase option sets the i flag. The advantage of having the flags in the regular expression itself is that they can then be used by any command without the need to add new command switches. It is best to set the flags at the very beginning of the regular expression; however they apply to the entire regular expression no matter where they appear.

The i flag causes case to be ignored when alphabetic characters are compared.

The m flag enables multi-line mode. The ``^'' atom is changed to match at the beginning of the string or the beginning of any line in the string. The ``$'' atom is changed to match at the end of the string or the end of any line in the string. The ``.'' atom is changed to match any character except ``\n''.

The x flag causes white space in the regular expression to be ignored and removed during compilation. To include literal white space as an atom to be matched, precede it with a backslash ``\''. White space is only ignored between atoms, pieces, branches, and regular expressions. It is not ignored in ranges or in any other complex atom. Comments are treated as white space; a comment starts with a ``#'' and continues to the end of the line.
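
The i, m, and x flags can be illustrated with the following sketch. It assumes the nre 2.0 package is installed; the variable names and sample text are hypothetical.

```tcl
package require nre 2.0

set text "first line\nsecond line"
set plain [nrematch {^second} $text]        ;# 0: ^ only matches at the start of the string
set multi [nrematch {(?m)^second} $text]    ;# 1: ^ also matches at the start of a line
set icase [nrematch {(?i)SECOND} $text]     ;# 1: case is ignored

# With x, white space and # comments in the pattern are ignored.
set versionRe {(?x)
    ([0-9]+)    # major number
    \.          # a literal period
    ([0-9]+)    # minor number
}
nrematch $versionRe "nre 2.0" {} major minor
puts "$major.$minor"    ;# 2.0
```

Keeping the flag inside the pattern means the same string can be handed to any command that accepts a regular expression.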

The p flag enables extra escape sequences and constructs to be recognized. See the BACKWARDS COMPATIBILITY section for why these constructs are not enabled by default. The following are enabled:

\w
Match a "word" character (alphanumeric plus "_"). The alphanumeric characters are determined using isalnum(3) which can vary depending on the locale.

\W
Match a non-word character.

\s
Match a whitespace character. The whitespace characters are determined using isspace(3) which can vary depending on the locale.

\S
Match a non-whitespace character.

\d
Match a digit character. The digit characters are determined using isdigit(3) which can vary depending on the locale.

\D
Match a non-digit character.

\b
Zero-width assertion matches a word boundary. Current character matches \w and previous character matches \W or current character matches \W and previous character matches \w. The position before the first character in the string and after the last character match \W.

\B
Zero-width assertion matches a non-word boundary. Current character matches \w and previous character matches \w or current character matches \W and previous character matches \W. The position before the first character in the string and after the last character match \W.

\<
Zero-width assertion matches start of word. Current character matches \w and previous character matches \W. The position before the first character in the string and after the last character match \W.

\>
Zero-width assertion matches end of word. Current character matches \W and previous character matches \w. The position before the first character in the string and after the last character match \W.

\A
Zero-width assertion matches only at beginning of string, even if the m flag is set.

\Z
Zero-width assertion matches only at end of string, even if the m flag is set.

\G
Zero-width assertion matches only where previous -all match left off.

\Q
Quote mode. All characters following are treated as literal text until a \E or the end of the regular expression.

\E
End quote mode.

\num
Backreference to the num'th captured substring. The value of num must not be greater than the number of captured substrings to the left of the backreference. The text from the backreference is inserted into the regular expression and is always treated as literal text.

\meta
If the ``\'' is followed by a regular expression meta character then the meta character is treated as literal text. The meta characters are: ``\*+?()|[]{}^$''. If ``\'' is followed by anything else the regexp compiler will raise an error.

{x,y}
This piece construct is a repetition operator and is described above in the piece paragraph.
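
Some of the p flag sequences listed above are sketched here. The example assumes the nre 2.0 package is installed, and every pattern starts with (?p) because none of these sequences is recognized otherwise. The sample strings are illustrative.

```tcl
package require nre 2.0

# \b, \w, \s, and the \1 backreference find a doubled word.
nrematch {(?p)\b(\w+)\s+\1\b} "it was the the best" pair word
puts "$pair / $word"    ;# the the / the

# \d with {x} repetition picks apart a date.
nrematch {(?p)(\d{4})-(\d{2})} "built 1997-08" {} year month
puts "$year $month"     ;# 1997 08

# \Q...\E quotes meta characters, so + is matched literally.
set quoted [nrematch {(?p)\Q1+1\E=2} "1+1=2"]    ;# 1
```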

CHOOSING AMONG ALTERNATIVE MATCHES

In general there may be more than one way to match a regular expression to an input string. For example, consider the command
nrematch  (a*)b*  aabaaabb  x  y
Considering only the rules given so far, x and y could end up with the values aabb and aa, aaab and aaa, ab and a, or any of several other combinations. To resolve this potential ambiguity nrematch chooses among alternatives using the following rules, applied in decreasing order of priority:

  1. If a regular expression could match two different parts of an input string then it will match the one that begins earliest.

  2. If a regular expression contains | operators then the leftmost matching sub-expression is chosen.

  3. In *, +, ?, and {x,y} constructs, longer matches are chosen in preference to shorter ones. These operators are often called greedy because they match the longest possible string that allows the entire regular expression to match. In *?, +?, ??, and {x,y}? constructs, shorter matches are chosen in preference to longer ones. These operators are often called lazy because they match the shortest possible string that allows the entire regular expression to match.

  4. In sequences of expression components the components are considered from left to right.

In the example from above, (a*)b* matches aab: the (a*) portion of the pattern is matched first and it consumes the leading aa; then the b* portion of the pattern consumes the next b. Or, consider the following example:
nrematch  (ab|a)(b*)c  abc  x  y  z
After this command x will be abc, y will be ab, and z will be an empty string. Rule 4 specifies that (ab|a) gets first shot at the input string and Rule 2 specifies that the ab sub-expression is checked before the a sub-expression. Thus the b has already been claimed before the (b*) component is checked and (b*) must match an empty string.
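
The two examples above, plus a lazy variant of the first, can be run as the following sketch. It assumes the nre 2.0 package is installed; the lazy variant is an added illustration of rules 1 and 3 and does not appear in the original examples.

```tcl
package require nre 2.0

nrematch {(a*)b*} aabaaabb x y
puts "x=$x y=$y"          ;# x=aab y=aa

nrematch {(ab|a)(b*)c} abc x y z
puts "x=$x y=$y z=$z"     ;# x=abc y=ab z=

# A lazy a*? prefers the shortest match, so the whole expression
# matches the empty string at the very start of the input.
nrematch {(a*?)b*} aabaaabb x y
puts "x=<$x> y=<$y>"      ;# x=<> y=<>
```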

LIMITS

The maximum number of capturing subexpressions ``()'' in a single regular expression is 255. This limit does not apply to the non-capturing ``(?:)''.

A compiled regular expression is limited in size to 32678 bytes. If during compilation it is discovered that the regular expression requires more memory then the operation will fail with the error: ``regexp too big''.

The counts in the repetition construct ``{x,y}'' must be greater than or equal to zero and less than or equal to 255.

The maximum number of unique ranges in a regular expression is 64.

BACKWARDS COMPATIBILITY

Regular expressions from previous releases of Tcl should behave exactly the same. The following new constructs:

(?...), *?, +?, and ??

would have caused compilation errors in older releases, so no existing regular expression can contain them and they are always recognized in new regular expressions.

All the other new constructs would have meant something else in older regular expressions. So they always have the old meaning unless you turn on one of the new flags. For example you need to start a regular expression with (?p) if you want to use the new ``\'' sequences or the ``{x,y}'' repetition construct.

PERFORMANCE INFORMATION

The first time a regular expression is used it is compiled into a Tcl object. The next time that object needs to be used as a regular expression the compilation step will not be needed if the object still exists and is still a regular expression. So if the regular expression is a constant string:
nrematch {abc|def|zeq} $str
then the first time the above command is executed the string constant object is converted to a regular expression object and will remain so, giving a performance boost.

However if the regular expression string is not constant:

nrematch "$W1|$W2|$W3" $str
then the string object will need to be recreated each time the above command executes.

If instead you stored the regular expression string into a variable then the regular expression object would remain and not need to be recreated each time:

set re "$W1|$W2|$W3"
proc foo {str} {
    global re
    nrematch $re $str
}
If it is a complex regular expression used in more than one place this can be a win in both time and space.

It is best to use (?i) instead of -nocase if you can because then the text of the regular expression object describes its state.

If you do not need the matchVar or a subMatchVar then you can set that argument to an empty string ``{}''. This tells nrematch to not bother setting a variable to that particular captured subexpression.
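
The last two suggestions can be combined in a sketch like the one below. It assumes the nre 2.0 package is installed; requestRe and parseRequest are hypothetical names chosen for this example.

```tcl
package require nre 2.0

# The pattern lives in one variable and carries its own (?i) flag,
# so no -nocase switch is needed at the call site.
set requestRe {(?i)^(get|post) +([^ ]+)}

proc parseRequest {line} {
    global requestRe
    # {} skips the matchVar; only the two groups are stored.
    if {[nrematch $requestRe $line {} method path]} {
        return [list $method $path]
    }
    return {}
}

puts [parseRequest "GET /index.html"]    ;# GET /index.html
```

Because the captured text comes from the input string, the method keeps the case in which it was written.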

BINARY CLEAN

The new regular expression compiler and matcher are binary clean. This means that it is ok for the regular expression and the string being matched to contain binary data including null bytes.

EXAMPLES

To match a number if not followed by a period:
nrematch {[0-9]+(?![.])} $str

To match a number if followed by something other than a period:

nrematch {[0-9]+(?=[^.])} $str

To match an item that contains only letters, but not all uppercase:

nrematch {^(?![A-Z]*$)[a-zA-Z]*$} $str

To see if a string contains both 'this' and 'that':

nrematch {^(?=.*?this)(?=.*?that)} $str

KEYWORDS

match, nre, regular expression, string

Last change: 2.0

Copyright © 1997 Darrel Schneider.