TSGrep.pas copyright (C) 2000 R. Collins rcoll@rtd.com LEGAL STUFF: This software is provided "as-is". This software comes without warranty or guarantee, explicit or implied. Use this software at your own risk. The author will not be liable for any damage to equipment, data, or information that may result while using this software. By using this software, you agree to the conditions stated above. This software may be used freely, provided the author's name and copyright statement remain a part of the source code. THE CODE: TSGrep is a "simple grep" component that does not rely on an external DLL or library. All functionality is contained within the TSGrep object. All regular-expression parsing, searching, and matching are contained within this one source-code file. Since it does not rely upon an external library, this makes it easier to export your program to other machines without dragging along extra files. However, since it is a simple search and match, perhaps not all the functions you need are implemented. If you have need of a function that is not implemented, and do not know how to include it yourself, e-mail me and I'll see if I can code it for you. Be aware: I reserve the right to charge a fee for my time (I also reserve the right to work for free). In either case, I work cheap. This component was written using Borland's Delphi 3, but should be source-code compatible with later versions of Delphi. I cannot state if the code will compile on Delphi 1 or Delphi 2. However, since you have the source code, you should be able to modify it to work on the earlier versions of Delphi. It is written with the understanding that other programmers will need to read it, and possibly modify it for their own specific needs. Comments are meant to be descriptive and informative; however, I am assuming an experienced software engineer level of experience. This code is not written for beginners. The TSGrep component is a descendant of TObject, which means it is a top-level object. It will have the basic behavior common to all objects in Delphi. KNOWN PROBLEMS and THINGS TO DO: The parsing and sorting routines are rather slow and inefficient. They are fine for short and simple searching, but for longer or complicated regular expressions, the parse and sort to build the search tree could be better. However, once the search tree has been build, the actual searching is pretty quick. There is currently no way to save a search tree (after parsing the regular expression) to a file. Saving a search tree that is used often or across several applications would considerably speed up initial parsing and sorting. Complicated regular expressions can result in a lot of "dead code" in the search tree. This is just wasted memory. Currently, there is no "garbage collection" procedure to return this memory to the system until the entire TSGrep object is released (see TSGrep.Free) and all memory is returned to the system. The TSGrep object only searches through strings. It would be good to have it also search through files or streams. REGULAR EXPRESSIONS Almost everyone is familiar with the famous MS-DOS style of searching for files *.TXT, or *.*. These are trivial regular expressions. In MS-DOS, the "*" character matches any number of any characters; the "?" matches any single character. A regular expression search engine simply expands on this concept. Various characters or strings are used to match different types of targets; in addition, the use of boolean operators AND "&" and OR "|" allow multiple possibilities to be found within the source string. When searching a source string for targets, there are two values that define the current search position: the ANCHOR and the BEAD. The anchor is defined to be the starting position of the search; i.e., this value 'anchors' the pattern matching to a specific starting point in the source string. By default all positions in the source string are tested (anchored) until the first match is found. Setting a value to the anchor insures that only targets found at that position are matched. The bead (sometimes called the cursor) is the current point in the source string where a match is declared a success. Taken together, the string slice SOURCE[ANCHOR..BEAD] define the substring where a matching target is found. Following are the pattern matching operators used by SGrep: ? question mark Matches one character of anything (MS-DOS style). * asterik Matches any number of anything (MS-DOS style). "xxx" double-quotes Matches literal characters inside the quotes. \000 back-slash Match literal character; if the number starts with 0, then the numeric code is octal. \999 back-slash Match literal character; if the number starts with 1-9, then the numeric code is in decimal. \xHH back-slash Match literal character; if the digit string starts with a lower-case "x", then the next 2 characters are hex digits. \n Match a literal line-feed. \r Match a literal carraige-return. \t Match a literal tab character. \f Match a literal form-feed character. [xxx] square brackets Match any single character within the brackets. You can use the dash "-" to span characters in sequence. For example, [1-9] is the same as [123456789]. [-xxx] square brackets negated Match any single character NOT within the brackets. If the first character after the opening "[" is a dash "-", then the character set is negated; i.e., any character not within the set is matched. You can still use a dash to span sequential characters; [-1-9] is the same as [-123456789], and will match any character that is not a digit 1 through 9. The set [-a-zA-Z] is the same as [-abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ] and will match any character that is not an alphabetic character. %x percent sign Special character span; where "x" specifies the type of spanning. %W match any number of white space characters (space or tab) %w match a single white space %N match any number of digits "0123456789" %n match any single digit "0123456789" %A match any number of alphabetics "a-z" or "A-Z" %a match any single alphabetic "a-z" or "A-Z" %X match any number of alphanumerics "0-9", "a-z" or "A-Z" %x match any single alphanumeric "0-9", "a-z", or "A-Z" %-x percent sign negated Special character span; where "x" specifies the type of spanning. In this case, the "-" negates the span, causing the pattern to match a target that does NOT consist of the span set. %W match any number characters that are NOT white space (space or tab) %w match a single character that is NOT white space %N match any number of characters that are NOT digits "0123456789" %n match any single character that is NOT a digit "0123456789" %A match any number of characters that are NOT "a-z" or "A-Z" %a match any single character that is NOY "a-z" or "A-Z" %X match any number of characters that are NOT "0-9", "a-z" or "A-Z" %x match any single character that is NOT "0-9", "a-z", or "A-Z" @nnn at Match bead position. The "nnn" is a decimal string specifying a position within the source string. The match succeeds if the current bead is at the given position. The source bead positions start at 0 (for beginning of line) and extend to the length of the string (for end of line matching). Specifying a bead position matches targets only if they start at that bead position. For example, the pattern @0 "Cat" matches the characters "Cat" only if they appear at the beginning of the source string. @-nnn at negated The bead position, counted from the end of the string. For example, "Cat" @-0 matches the character "Cat" only if they appear at the end of the source string. & ampersand Boolean operator AND. This forces the two expressions on either side of the & to be evaluated, and both must be TRUE to succeed. This operator is not normally needed, since a logical AND is the default conjunction of a series of expressions. | bar Boolean operator OR. Allows either of the two expressions on either side to be evaluated. If either is TRUE, then the match succeeds. For example "Cat" | "Dog" will match either the characters "Cat" or "Dog". The left expression is evaluated first; if it succeeds, then the right expression is skipped; if the left expression fails, then the right-side expression is tested. (xxx) parentheses Group expressions into a hierarchy. For example (%n%n%n "-" %n%n%n%n) | "Cat" will match either a string of characters that look like a phone number (123-4567), or the characters "Cat", whichever it finds first. {min:max} curly braces Defines a repeat count for the next expression. The next expression must be matched a minimum of "min" times, and no more than "max" times. If only one number is given, (for example {5}) then the next expression will be matched an exact number of times. For example {3} ("Cat" | "Dog") will match exactly 3 occurrences of either "Cat" or "Dog". !x shrike Writes output via the OnWrite event. The "x" decides what value is to be written: !A writes the current value of the anchor as a string !B writes the current value of the bead as a string !S writes the current matching substring (indexed anchor to bead) !T writes the current matching total string (indexed 0 to bead) See the OnWrite event for more information. INTERFACE constructor Create; Creates a new (empty) grep component. constructor CreateFromPattern(const Pattern: string); Creates a new grep component, and parses the pattern into a search tree. Once parsed, the pattern may be used multiple times with multiple source strings. The overhead of parsing the pattern is performed only once. procedure Free; Releases the memory used by a grep component, and returns all resources used by the grep component back to the system. This grep component may not be used again until it is re-created by a constructor call. function MatchFirst(Source: string): integer; Searches the source string for the first instance of the pattern, and returns the index of the match. The index of a string is counted as 1..n, where n is the length of the string. If the pattern is not found, 0 is returned. function MatchNext: integer; Used after a call to MatchFirst, returns the index of the next matching occurrence. If the pattern is not found, 0 is returned. function Execute(Patter, Source: string): boolean; Parses the pattern and initializes the search tree, then searches the source string for any matching pattern. Returns TRUE if the pattern is found, or FALSE if it is not found. procedure Dump_OpList(Strings: TStrings); Used for debugging purposes; writes a formatted copy of the search tree to the string list. property Tag: integer; (read write) A general purpose integer for use by the programmer. Typically used to uniquely identify a component. property Pattern: string; (read write) Defines the current pattern to search for. When set, the pattern is parsed and sorted into a search tree. The overhead of parsing and sorting is performed only once, although the pattern may be used to search several different source strings. property Source: string; (read write) Defines the string to be searched. No parsing or sorting is performed on the source string; it is simply scanned, character by character, to find a matching pattern. property Success: boolean; (read only) Returns the success TRUE or FALSE of the last search performed by MatchFirst, MatchNext, or Execute. When read within an event handler, returns the current state of the search, even though the search may not yet be finished. property Anchor: integer; (read write) Defines the beginning index in a source string from which to start a search. If a search is already started, it returns the current index of a source string that is being tested for a pattern match. property Bead: integer; (read only) During a pattern search, returns the current index of the character to be matched against the pattern. The bead is always greater than or equal to the anchor. property SubString: string; (read only) Returns the string that matched the pattern. When used in an event handler, the SubString is the current incomplete matching string. property OnError: TGrepEvent; (read write) Defines the procedure to call when an error occurs. Most errors occur during the parsing or sorting phase, and represent an invalid operator syntax. The OnError event provides a character string containing a diagnostic of the error. TgrepEvent = procedure(Sender: TObject; Message: string); property OnWrite: TGrepEvent; (read write) Provides an event for the write operations (!S, !T, !A, !B). The string passed in the event handler is the string written by the operator. TgrepEvent = procedure(Sender: TObject; Message: string);