04-07-2012, 11:36 AM
Seminar on Lexical Analysis
Lexical Analysis.pdf (Size: 177.44 KB / Downloads: 67)
The Basics
Lexical analysis, or scanning, is the process in which the stream of characters making up the
source program is read from left to right and grouped into tokens. Tokens are sequences
of characters with a collective meaning. There are usually only a small number of token
categories for a programming language: constants (integer, double, char, string, etc.), operators
(arithmetic, relational, logical), punctuation, and reserved words.
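To make the categories concrete, here is a small sketch in C. The enum names and the classify helper are illustrative assumptions, not part of the seminar material; a real scanner decides the category while grouping characters rather than after the fact.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical token categories for a small language. */
typedef enum {
    TOK_INT_CONST,    /* integer constants such as 42 */
    TOK_OPERATOR,     /* arithmetic/relational operators */
    TOK_PUNCTUATION,  /* ; , ( ) { } */
    TOK_RESERVED,     /* reserved words such as "while" */
    TOK_IDENTIFIER,   /* user-defined names */
    TOK_UNKNOWN
} TokenType;

/* Classify an already-extracted lexeme into one of the categories above. */
TokenType classify(const char *lexeme) {
    static const char *reserved[] = { "if", "else", "while", "return" };
    if (lexeme[0] == '\0')
        return TOK_UNKNOWN;
    for (size_t i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
        if (strcmp(lexeme, reserved[i]) == 0)
            return TOK_RESERVED;
    if (isdigit((unsigned char)lexeme[0]))
        return TOK_INT_CONST;
    if (strchr("+-*/<>=", lexeme[0]) != NULL)
        return TOK_OPERATOR;
    if (strchr(";,(){}", lexeme[0]) != NULL)
        return TOK_PUNCTUATION;
    if (isalpha((unsigned char)lexeme[0]) || lexeme[0] == '_')
        return TOK_IDENTIFIER;
    return TOK_UNKNOWN;
}
```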
Scanner Implementation 1: Loop and Switch
There are two primary methods for implementing a scanner. The first is a program that
is hard-coded to perform the scanning tasks. The second uses regular expressions and
finite automata theory to model the scanning process.
A "loop & switch" implementation consists of a main loop that reads characters one by
one from the input file and uses a switch statement to process the character(s) just read.
The output is a list of tokens and lexemes from the source program. The following
program fragment shows a skeletal implementation of a simple loop and switch scanner.
The main program calls InitScanner and loops calling ScanOneToken until EOF.
ScanOneToken reads the next character from the file and switches on that character to decide
how to handle what comes next in the file. The return values from the scanner
can be passed on to the parser in the next phase.
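A minimal sketch of such a loop-and-switch scanner is shown below. The names InitScanner and ScanOneToken follow the description above, but the bodies, token types, and buffer sizes are assumptions for illustration, not the seminar's actual code.

```c
#include <ctype.h>
#include <stdio.h>

/* Illustrative token types for a tiny expression language. */
typedef enum { T_EOF, T_NUMBER, T_PLUS, T_SEMI, T_UNKNOWN } TokenType;

typedef struct {
    TokenType type;
    char lexeme[64];   /* the characters making up this token */
} Token;

static FILE *src;

void InitScanner(FILE *f) { src = f; }

/* Read one token: skip whitespace, then switch on the first character. */
Token ScanOneToken(void) {
    Token t = { T_UNKNOWN, "" };
    int c = fgetc(src);
    while (c == ' ' || c == '\t' || c == '\n')   /* skip whitespace */
        c = fgetc(src);
    switch (c) {
    case EOF:
        t.type = T_EOF;
        break;
    case '+':
        t.type = T_PLUS; t.lexeme[0] = '+'; t.lexeme[1] = '\0';
        break;
    case ';':
        t.type = T_SEMI; t.lexeme[0] = ';'; t.lexeme[1] = '\0';
        break;
    default:
        if (isdigit(c)) {          /* group consecutive digits into one token */
            int i = 0;
            while (isdigit(c) && i < 63) {
                t.lexeme[i++] = (char)c;
                c = fgetc(src);
            }
            t.lexeme[i] = '\0';
            ungetc(c, src);        /* push back the one-character lookahead */
            t.type = T_NUMBER;
        }
        break;
    }
    return t;
}
```

A driver would loop calling ScanOneToken until it returns T_EOF, handing each token to the parser.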
Scanner Implementation 2: Scanner Generator
The reason we have spent so much time looking at how to go from regular expressions
to finite automata is that this is exactly the process lex goes through in creating
a scanner. lex is a lexical-analyzer generator that takes as input a series of regular
expressions and builds a finite automaton and a driver program for it in C through the
mechanical steps shown above. Theory in practice!
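To give a flavor of what lex mechanically produces, here is a hand-written version of the same idea: a transition table for the regular expression [0-9]+ plus a driver loop that runs it. The state names, character classes, and function name are hypothetical; lex's generated tables are larger but have this same shape.

```c
#include <ctype.h>

/* DFA states and input character classes for [0-9]+ */
enum { S_START, S_DIGITS, S_DEAD, NUM_STATES };
enum { C_DIGIT, C_OTHER, NUM_CLASSES };

/* Transition table: delta[state][class] gives the next state. */
static const int delta[NUM_STATES][NUM_CLASSES] = {
    /* S_START  */ { S_DIGITS, S_DEAD },
    /* S_DIGITS */ { S_DIGITS, S_DEAD },
    /* S_DEAD   */ { S_DEAD,   S_DEAD },
};

/* Driver: run the DFA over the whole string and report whether it
   halts in the accepting state S_DIGITS. */
int matches_digits(const char *s) {
    int state = S_START;
    for (; *s != '\0'; s++) {
        int cls = isdigit((unsigned char)*s) ? C_DIGIT : C_OTHER;
        state = delta[state][cls];
    }
    return state == S_DIGITS;
}
```

The table is the "finite automaton" and the loop is the "driver program"; lex's job is to construct both automatically from the regular expressions you give it.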