All Pages All Books|
|
|||||
|
2. Lex
|
|||||
|
|
|||||
|
2.1 Theory
The first phase in a compiler reads the input source and converts strings in the source to tokens. Using regular expressions, we can specify patterns to lex that allow it to scan and match strings in the input. Each pattern in lex has an associated action. Typically an action returns a token, representing the matched string, for subsequent use by the parser. To begin with, however, we will simply print the matched string rather than return a token value. We may scan for identifiers using the regular expression
letter(letter|digit)*
This pattern matches a string of characters that begins with a single letter, and is followed by zero or more letters or digits. This example nicely illustrates operations allowed in regular expressions:
• repetition, expressed by the “*” operator
• alternation, expressed by the “|” operator
• concatenation
Any regular expression expressions may be expressed as a finite state automaton (FSA). We can represent an FSA using states, and transitions between states. There is one start state, and one or more final or accepting states.
letter or digit
|
|||||
|
|
|||||
|
|
![]() |
|
|||
|
|
|||||
|
start ^-^ letter L—3Jf other
|
/^^
|
||||
|
|
|||||
|
Figure 2-1: Finite State Automaton
In Figure 2-1, state 0 is the start state, and state 2 is the accepting state. As characters are read, we make a transition from one state to another. When the first letter is read, we transition to state 1. We remain in state 1 as more letters or digits are read. When we read a character other than a letter or digit, we transition to state 2, the accepting state. Any FSA may be expressed as a computer program. For example, our 3-state machine is easily programmed:
|
|||||
|
|
|||||
|
6
|
|||||
|
|
|||||
All Pages All Books