Topic 1 - Introduction to Compiling

[1.1] - Compilers

A compiler is a program that reads a program written in one language, the source language, and translates it into an equivalent program in another language, the target language. The translation process should also report the presence of errors in the source program. Compiling has two parts: analysis, which breaks the source program into constituent pieces and creates an intermediate representation, and synthesis, which constructs the desired target program from the intermediate representation.

[1.3] - Phases Of a Compiler

The typical compiler has a number of phases plus a symbol table manager and an error handler as shown below.

     Input Source Program
              |
              v
      Lexical Analyzer
              |
      Syntax Analyzer
              |
      Semantic Analyzer
              |
  Intermediate Code Generator
              |
      Code Optimizer
              |
      Code Generator
              |
              v
     Output Target Program

(The Symbol Table Manager and the Error Handler interact with all six phases.)

Lexical Analyzer: The lexical analyzer or scanner reads the characters in the source program from left to right and groups them into tokens, which are sequences of characters that have a collective meaning. For example, the non-space characters in the Pascal source statement:

position := initial + rate * 60

are grouped into 7 tokens as shown in the following table.

     identifier   assignop   identifier   addop   identifier   mulop   number
     position     :=         initial      +       rate         *       60

The first row in this table shows the token type of each token and the second row shows the lexeme or character string associated with it.
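
As a sketch, the scanning step above can be mimicked with a handful of regular expressions; the pattern list and the tokenize helper below are illustrative, not the course's actual scanner:

```python
import re

# Each pattern is tried in order at the current position; the first
# match becomes the next token. Whitespace is skipped, not tokenized.
TOKEN_PATTERNS = [
    ("number",     r"\d+"),
    ("assignop",   r":="),
    ("addop",      r"\+"),
    ("mulop",      r"\*"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("skip",       r"\s+"),
]

def tokenize(source):
    """Group the characters of `source` into (token type, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(source):
        for token_type, pattern in TOKEN_PATTERNS:
            match = re.match(pattern, source[pos:])
            if match:
                if token_type != "skip":
                    tokens.append((token_type, match.group(0)))
                pos += match.end()
                break
        else:
            raise SyntaxError(f"illegal character at position {pos}")
    return tokens

tokens = tokenize("position := initial + rate * 60")
# tokens holds the 7 (type, lexeme) pairs from the table above
```

Note that `:=` must be tried before any pattern that could match a bare `:`; ordering the patterns is one simple way a scanner resolves such overlaps.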

Symbol Table Manager: Identifiers name the variables and procedures in the source program, so the compiler needs a symbol table to record each identifier and collect information about it. For a variable the symbol table might record its type-expression (integer, real, etc.), its scope (where the variable can be used), and its location in run-time storage. For a procedure the symbol table might record the type-expressions of its arguments and the type-expression of its returned value.

The symbol table manager has a FIND function that returns a pointer to the record for an identifier when given its lexeme: compiler phases use this pointer to read and/or modify information about the identifier. FIND returns a NULL pointer if there is no record for a given lexeme. The INSERT function in the symbol table manager inserts a new record into the symbol table when given its lexeme.

In the example Pascal source statement there are three identifier tokens with the lexemes: position, initial and rate. When the lexical analyzer encounters each lexeme in the source string it uses the FIND function to find its record in the symbol table: if the record is absent then the lexical analyzer uses INSERT to create one. In either case, the pointer into the symbol table is attached to the identifier token so other compiler phases can use it to read and/or modify information about the identifier.
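
The FIND/INSERT interface can be sketched as follows; a plain dictionary stands in for whatever structure a real symbol table manager would use, and the record fields are just the examples mentioned above:

```python
class SymbolTable:
    """A toy symbol table manager with the FIND/INSERT interface."""

    def __init__(self):
        self._records = {}

    def find(self, lexeme):
        """Return the record for `lexeme`, or None (a NULL pointer) if absent."""
        return self._records.get(lexeme)

    def insert(self, lexeme):
        """Create, store and return a new record for `lexeme`."""
        record = {"lexeme": lexeme, "type": None, "scope": None, "location": None}
        self._records[lexeme] = record
        return record

# What the lexical analyzer does on meeting each identifier lexeme:
table = SymbolTable()
for lexeme in ["position", "initial", "rate"]:
    record = table.find(lexeme)
    if record is None:          # absent, so create a record for it
        record = table.insert(lexeme)
    # `record` is now the pointer attached to the identifier token
```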

For the example Pascal statement the lexical analyzer sends seven tokens to the syntax analyzer:

     id1  assignop  id2  addop  id3  mulop  num(60)

where id1, id2 and id3 represent identifier tokens with attached pointers to the symbol table entries for position, initial and rate, respectively, and num(60) represents a number token with the integer 60 attached to it.

Syntax Analyzer: The syntax analyzer or parser receives tokens from the lexical analyzer and groups them hierarchically into a tree. For the example Pascal statement it might produce the following syntax tree:

          assignop
         /        \
        /          \
     id1            addop
                   /     \
                  /       \
               id2         mulop
                          /     \
                         /       \
                      id3         num(60)
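
A tree like this is easy to represent with a small node class; the Node class and string labels below are illustrative, not the projects' actual representation:

```python
class Node:
    """A syntax-tree node: an operator label with child subtrees,
    or a leaf token with no children."""
    def __init__(self, label, *children):
        self.label = label
        self.children = list(children)

# The parser builds the tree above bottom-up as it groups tokens:
tree = Node("assignop",
            Node("id1"),
            Node("addop",
                 Node("id2"),
                 Node("mulop", Node("id3"), Node("num(60)"))))
```

The shape of the tree encodes precedence: mulop sits below addop, so the multiplication is performed first.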

Semantic Analyzer: The semantic analyzer gathers type information and checks the tree produced by the syntax analyzer for semantic errors. If rate is a real variable in the example Pascal source statement then the semantic analyzer might add a type conversion node, inttoreal, to the syntax tree to convert the integer 60 to a real quantity:

          assignop
         /        \
        /          \
     id1            addop
                   /     \
                  /       \
               id2         mulop
                          /     \
                         /       \
                      id3         inttoreal
                                      |
                                      |
                                   num(60)
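
The coercion rule can be sketched as a recursive walk over the tree; here trees are nested tuples, the `types` mapping plays the role of symbol-table type information, and the whole thing is one illustrative semantic rule rather than a full semantic analyzer:

```python
def node_type(tree, types):
    """Return 'integer' or 'real' for a subtree, given leaf types."""
    label, *children = tree
    if label == "inttoreal":
        return "real"
    if not children:                       # leaf: look up its declared type
        return types.get(label, "integer")
    kinds = [node_type(c, types) for c in children]
    return "real" if "real" in kinds else "integer"

def coerce(tree, types):
    """Insert an inttoreal node above any integer operand that is
    combined with a real operand."""
    label, *children = tree
    if not children:
        return tree
    children = [coerce(c, types) for c in children]
    kinds = [node_type(c, types) for c in children]
    if "real" in kinds:
        children = [("inttoreal", c) if k == "integer" else c
                    for c, k in zip(children, kinds)]
    return (label, *children)

# rate (id3) is declared real, so num(60) gets wrapped in inttoreal:
tree = ("assignop", ("id1",),
        ("addop", ("id2",),
                  ("mulop", ("id3",), ("num(60)",))))
typed = coerce(tree, {"id1": "real", "id2": "real", "id3": "real"})
```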

Intermediate Code Generator: After semantic analysis many compilers generate an intermediate representation of the source program that is both easy to produce and easy to translate into the target program. A variety of forms are used for intermediate code. One such form, three-address code, resembles code for a memory-to-memory machine: each instruction holds at most one operator and at most three operands (addresses), reading its operands from memory and writing its result into memory. The intermediate code generator usually has to create temporary locations to hold intermediate results. For the example Pascal statement three-address code might look like:

     temp1  :=  inttoreal(60)
     temp2  :=  id3 * temp1
     temp3  :=  id2 + temp2
     id1    :=  temp3

where temp1, temp2 and temp3 are the names of three temporary locations created by the intermediate code generator.
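
A post-order walk of the tree produces exactly this code: each operator node emits one instruction whose result goes into a fresh temporary. The sketch below assumes the nested-tuple tree form used earlier and is illustrative only:

```python
def gen(tree, code, temps):
    """Emit three-address code for `tree` into `code` and return the
    name (identifier, literal, or temporary) holding its value."""
    if isinstance(tree, str):                 # leaf: an id or a literal
        return tree
    op, *operands = tree
    names = [gen(operand, code, temps) for operand in operands]
    temps[0] += 1                             # allocate a fresh temporary
    temp = f"temp{temps[0]}"
    if op == "inttoreal":
        code.append(f"{temp} := inttoreal({names[0]})")
    else:
        symbol = {"addop": "+", "mulop": "*"}[op]
        code.append(f"{temp} := {names[0]} {symbol} {names[1]}")
    return temp

code = []
source_tree = ("addop", "id2", ("mulop", "id3", ("inttoreal", "60")))
code.append(f"id1 := {gen(source_tree, code, [0])}")
# code now holds the four instructions listed above
```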

Code Optimizer: The code optimizer examines the intermediate code and modifies it wherever it can make the code run faster. The intermediate code for the example Pascal statement can be improved in two places: (1) rather than call the inttoreal routine at run-time to convert 60 to a real number, the compiler can perform the conversion once at compile time; and (2) rather than compute the sum into temp3 and then copy it to id1, the sum can be written into id1 directly. After code optimization the example code looks like:

     temp2  :=  id3 * 60.0
     id1    :=  id2 + temp2
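
Both improvements can be sketched as simple passes over the instruction list; the string-based representation and the crude textual substitution below are for illustration only, not the way a production optimizer works:

```python
import re

def optimize(code):
    """Apply the two improvements described above to a list of
    `dest := expr` three-address instructions."""
    # Pass 1: fold inttoreal(n) into a real literal at compile time
    # and drop the conversion instruction entirely.
    constants, folded = {}, []
    for line in code:
        dest, expr = line.split(" := ")
        m = re.fullmatch(r"inttoreal\((\d+)\)", expr)
        if m:
            constants[dest] = m.group(1) + ".0"
            continue
        for temp, literal in constants.items():
            expr = expr.replace(temp, literal)   # crude substitution
        folded.append((dest, expr))
    # Pass 2: if the last instruction merely copies the previous
    # result, write that result straight into the final destination.
    if len(folded) >= 2 and folded[-1][1] == folded[-2][0]:
        folded[-2:] = [(folded[-1][0], folded[-2][1])]
    return [f"{dest} := {expr}" for dest, expr in folded]

optimized = optimize(["temp1 := inttoreal(60)",
                      "temp2 := id3 * temp1",
                      "temp3 := id2 + temp2",
                      "id1 := temp3"])
```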

Code Generator: The code generator generates machine code for the target machine from the optimized intermediate code. If the target machine has registers the target program for the example Pascal statement might look like:

     MOVF     id3,R2
     MULF     #60.0,R2
     MOVF     id2,R1
     ADDF     R2,R1
     MOVF     R1,id1
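
One way to produce such code is to load each instruction's left operand into a fresh register, apply the operator against the right operand, and remember which register holds each temporary. The sketch below assumes the hypothetical MOVF/ADDF/MULF instructions of the example (its register numbering differs from the listing above, but the shape is the same):

```python
def gen_target(code):
    """Translate `dest := left OP right` three-address instructions
    (OP in {+, *}) into register-machine code. A sketch only: no
    register reuse, spilling, or instruction selection."""
    ops = {"+": "ADDF", "*": "MULF"}
    in_reg, asm, next_reg = {}, [], 0
    for line in code:
        dest, expr = line.split(" := ")
        left, op, right = expr.split()
        next_reg += 1
        reg = f"R{next_reg}"
        asm.append(f"MOVF     {in_reg.get(left, left)},{reg}")
        operand = in_reg.get(right, right)
        if operand == right and not right[0].isalpha():
            operand = f"#{right}"        # a constant becomes an immediate
        asm.append(f"{ops[op]}     {operand},{reg}")
        in_reg[dest] = reg               # this register now holds `dest`
    asm.append(f"MOVF     {reg},{dest}") # store the final result
    return asm

asm = gen_target(["temp2 := id3 * 60.0", "id1 := id2 + temp2"])
```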

[1.5] - The Grouping of Phases

Front and Back Ends: The front end of a compiler includes all analysis phases and the intermediate code generator while the back end includes the code optimization and final code generation phases: the front end analyzes the source program and produces intermediate code while the back end synthesizes the target program from the intermediate code.

It's an advantage to use the same intermediate code representation for several different source languages and/or several different target languages: one can then combine any front end with any back end. For example Metrowerks developed a compiler with four front ends to analyze source programs written in Pascal, C, C++, or Java and four back ends to synthesize machine code for a Macintosh with a 68K processor, a Macintosh with a PowerPC processor, a PC using Windows 95, or a PC using Windows NT: it's really 16 different compilers even though only 4 front ends and 4 back ends were written.

This course (CS 4/53111) focuses on the front end of a compiler while a graduate-level course, Advanced Compilers (CS 6/73111), focuses on the back end. The coding projects in this course develop the front end of a compiler for a subset of the Pascal language: the techniques you learn in the coding projects can be easily applied to almost any other source language.

A naive approach to the front end of a compiler might run the phases serially:

  1. the whole source program is lexically analyzed to produce a long string of tokens;

  2. the whole string of tokens is then syntactically analyzed to produce a large tree;

  3. the whole tree is then semantically analyzed to produce another tree; and then

  4. that large tree is used to generate intermediate code.

The naive approach requires a large amount of storage to hold the several thousands of tokens and the large trees that might occur in a typical source program: the compiler would also run very slowly since each phase would have to write its result to a temporary disk file to be read by the next phase.

Modern compilers eliminate the storage problem using syntax-directed translation to interleave the actions of the phases. The syntax analyzer directs the whole process; the lexical analyzer is a subroutine that produces just one token each time it is called by the syntax analyzer; and the actions of the semantic analyzer and the intermediate code generator are built inside the syntax analyzer. At any time only a small fraction of the syntax tree is stored inside the computer so even a very large source program can be compiled without the need for temporary disk files.
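
The interleaving can be sketched with a tiny recursive-descent translator for statements of the form `id := expr` with + and *. The parser directs everything: the lexical analyzer is a generator that yields one token per call, and intermediate code is emitted during parsing without ever storing the whole token string or the whole tree. The grammar, class and method names are illustrative, not the projects' actual design:

```python
import re

def tokens(source):
    """The lexical analyzer as a subroutine: yields one token each
    time the parser asks for the next one, then None at end of input."""
    for lexeme in re.findall(r":=|\+|\*|\d+|[A-Za-z]\w*", source):
        yield lexeme
    yield None

class Parser:
    """Syntax-directed translation: parse and emit code in one pass."""

    def __init__(self, source):
        self.next_token = tokens(source).__next__
        self.lookahead = self.next_token()    # one token of lookahead
        self.code, self.ntemps = [], 0

    def match(self):
        token, self.lookahead = self.lookahead, self.next_token()
        return token

    def statement(self):                      # id := expr
        dest = self.match()
        self.match()                          # consume ':='
        self.code.append(f"{dest} := {self.expr()}")
        return self.code

    def expr(self):                           # term { + term }
        name = self.term()
        while self.lookahead == "+":
            self.match()
            name = self.emit(name, "+", self.term())
        return name

    def term(self):                           # factor { * factor }
        name = self.match()
        while self.lookahead == "*":
            self.match()
            name = self.emit(name, "*", self.match())
        return name

    def emit(self, left, op, right):
        """Emit one three-address instruction; return its temporary."""
        self.ntemps += 1
        temp = f"temp{self.ntemps}"
        self.code.append(f"{temp} := {left} {op} {right}")
        return temp

code = Parser("position := initial + rate * 60").statement()
```

Note that no syntax tree is ever built: each emit call fires as soon as the parser has grouped enough tokens, which is the essence of interleaving the phases.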

The coding projects use syntax-directed translation.


Kenneth E. Batcher - 8/10/2001