Parsecc
Version 1.0
Help File Edition 2
froztica@yahoo.ca
Contents
Parsecc is an LL(1) compiler-compiler (ie. parser generator) that takes a grammar written using regular expressions. It is based on an earlier program named 'g' by D. Mason. LL(1) means scanning from left to right, producing a leftmost derivation, and using 1 lookahead token. An LL(1) grammar can be used for most languages and real-world applications.
Writing a parser can certainly be done manually, but it quickly becomes a tedious process, especially for complex languages. This program can be used to automatically generate a parser given a grammar that you specify. Some compiler-compiler tools implement the parser using parse tables. Compared to simple procedural parsing, parse tables have greater technical elegance and use less code size. However, parse tables are somewhat more complicated in their implementation.
The program generates a parser in Object Pascal for Delphi 5.
To find out more about compilers, you may wish to read 'Compilers: Principles, Techniques, and Tools' ('the dragon book') by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman.
A sample parsing project is included in the 'example' directory. The 'calc' project implements the classic calculator example that uses order of operations properly. Eg, multiply and divide operations are done before add and subtract operations. A sample grammar for this file is found in the 'calc.grm' file. To generate the parser source code for the calc project, you should do the following:
1. Start the Parsecc program.
2. On first-time startup, the program should have the following settings: '...\examples\calc.grm' as the input grammar, '...\examples\parser.pas' as the output file, and '2' nodes per leaf. Confirm that these settings are correct by selecting Project | Options.
3. Compile the grammar by selecting Run | Compile. After successfully executing this step, you should now see a newly created '...\examples\parser.pas' file.
Compiling The Calc Project
Once the 'parser.pas' file has been created, you can compile the calc project. If you have difficulty compiling the grammar, you may substitute 'parser0.pas' for 'parser.pas'.
If you still have difficulty compiling, just run the 'calc.exe' program to see a demonstration.
Running Calc
The sample calc program takes input such as:
x = 3 + 9 * 8
y = 500 + x
show y
When 'run' is pressed, the input is parsed and executed. You should see an appropriate result reported for each show statement. You may wish to try changing the input to see what the result is.
The program takes an input grammar ('yourapp.grm'), parses it, and generates a parser file as output ('yourparser.pas'). The nodes per leaf option indicates the maximum number of sub-nodes each node has. A value of 2 should do for a simple expression calculator, and a value of 4 should do for simple languages. For example, a production for a c-style for statement could look like:
ForStatement -> 'for' '(' (InitializationExpression)? (LoopExpression)? (IncrementExpression)? ')' StatementBlock;
The above production would require 4 sub-nodes (3 for the for-loop expressions, and 1 for the statement block).
The first production must be named 'S'. This stands for the 'start' of the grammar.
It is assumed you are writing a 2-pass compiler. In this case, note that the + (one-or-more) or * (repeat) operators
in your grammar will not be used effectively. They are included for completeness. The ? (decision) operator is
actually adequate for most needs. For example, to express a list of 1 to many statements, simply use:
StatementList -> Statement (StatementList)?;
You can embed code at the beginning of the file, in the middle (after the parse function declarations) and at the end. If you need to write a parse function manually (for functions such as TNumberNode where the node contains a number value), prefix the production expression with '$'. If the '$' is encountered, then code for that production will not be generated. It can be written in manually and embedded in the end section.
When the target application runs, a parse tree is generated using different classes of nodes. Each node
is a different object and has it's own 'CodeGen' behaviour.
Here is the full specification for the input grammar file:
S -> TokenSection GrammarSection CodeSection;
TokenSection -> 'tokens' TokenList;
TokenList -> TokenDeclaration (TokenList)?;
TokenDeclaration -> LiteralTok IdentTok;
GrammarSection -> 'grammar' ProductionList;
ProductionList -> Production (ProductionList)?;
Production -> ProdOrTermName '->' (SuppressCodegen)? ExprList ';';
ExprList -> Expr (ExprList)?;
Expr -> (Symbol | BracketExpr) (EmbeddedCode)?;
BracketExpr -> '(' OrList ')' RegularModifier;
OrList -> ExprList ('|' OrList)?;
RegularModifier -> (Decide | Repeat | OneOrMore)?;
CodeSection -> (TopSection)? (MiddleSection)? (BottomSection)?;
TopSection -> 'top' EmbeddedCode;
MiddleSection -> 'middle' EmbeddedCode;
BottomSection -> 'bottom' EmbeddedCode;
SuppressCodegen -> '$';
Decide -> '?';
Repeat -> '*';
OneOrMore -> '+';
Symbol -> (ProdOrTermName | Terminator);
ProdOrTermName -> IdentTok;
Terminator -> LiteralTok;
EmbeddedCode -> EmbedCodeTok;