01-06-2013, 03:24 PM
A COMPACT GUIDE TO LEX & YACC
A COMPACT GUIDE.pdf (Size: 126.8 KB / Downloads: 18)
Introduction
Before 1975 writing a compiler was a very time-consuming process. Then Lesk [1975] and Johnson [1975] published papers on lex and yacc. These utilities greatly simplify compiler writing. Implementation details for lex and yacc may be found in Aho [1986]. Lex and yacc are available from
The version from MKS is a high-quality commercial product that retails for about $500US. GNU software is free and includes flex and bison that are clonse for lex and yacc. Output from flex may be used in a commercial product, and, as of version 1.24, the same is true for bison. Cygwin is a 32-bit Windows ports of the GNU software. In fact Cygwin is a port of the Unix operating system to Windows and comes with compilers gcc and g++. To install simply download and run the setup executable. Under devel install bison, flex, gcc-g++, and make. Under utilities install diffutils. Under editors install vim. Lately I've been using flex and bison under the cygwin environment.
Lex
Theory
During the first phase the compiler reads the input and converts strings in the source to tokens. With regular expressions we can specify patterns to lex so it can generate code that will allow it to scan and match strings in the input. Each pattern specified in the input to lex has an associated action. Typically an action returns a token that represents the matched string for subsequent use by the parser. Initially we will simply print the matched string rather than return a token value.
The following represents a simple pattern, composed of a regular expression, that scans for identifiers. Lex will read this pattern and produce C code for a lexical analyzer that scans for identifiers.
Yacc
Theory
Grammars for yacc are described using a variant of Backus Naur Form (BNF). This technique, pioneered by John Backus and Peter Naur, was used to describe ALGOL60. A BNF grammar can be used to express context-free languages. Most constructs in modern programming languages can be represented in BNF. For example, the grammar for an expression that multiplies and adds numbers is
E -> E + E
E -> E * E
E -> id
Three productions have been specified. Terms that appear on the left-hand side (lhs) of a production, such as E (expression) are nonterminals. Terms such as id (identifier) are terminals (tokens returned by lex) and only appear on the right-hand side (rhs) of a production. This grammar specifies that an expression may be the sum of two expressions.