Creating a compiler for a subset of Pascal in Rascal - Syntax

Introduction

When designing a language, your very first step is to define what constitutes a syntactically valid program, in your language. For most, this involves writing a context-free grammar for this purpose and then writing a program that accepts / rejects input files by whether they respect the grammar.

Furthermore, if your program is a compiler, it will also need to create an internal representation of the program. There are actually a few of these, but the very first one that a compiler will create is the Concrete Syntax Tree (CST) of your input file. This is the representation of the exact text in your file (sometimes modulo whitespace / comments).

Creating the CST is the goal of today's article, so buckle in! You can follow the project on my git server and you can browse the source at this specific point in the project here.

Do note, I'm going to assume you know how to read a context-free grammar. It's relatively easy to understand via demonstration, and Wikipedia has some nice examples.

The Syntax

I'm going to discuss each aspect of the syntax of MiniPas, but if you have a bit of experience with EBNF, it might be faster to just read the syntax in full and come back to this section if there is any confusing components. You can find it here, in all it's fully commented glory.

The Top Level Sturcture

The starting rule in our syntax is as follows:

start syntax Program = 'program' Id ";" ConstDecl* VarDecl* SubProgDecl* CompoundStatement ".";

This is our starting rule (as indicated by the start keyword) and it says that a valid MiniPas program:

Starts with the name of the program (as captured by Id)
Continues with constant declarations (as captured by ConstDecl*)
Continues with variable declarations (as captured by VarDecl*)
Continues with sub-program declarations (as captured by SubProgDecl*)
And finishes with the main body of the program (as captured by CompoundStatement)

A few notes:

We use the term "sub-program", instead of functions as MiniPas has two notions of function:
- Functions, which take one or more arguments and return a value.
- Procedures, which take zero or more arguments and do not return a value.
If you're unfamiliar with EBNF, if you have a non-terminal A, then A* indicates that A can match zero or more strings that would be captured by A. For completeness, A+ allows one or more matches and A? allows zero or one matches.
Observe the strings between '' and "". These are terminals within our program and define syntactic elements. So a proper MiniPas program would start with:
```
program name;
```
Note that strings inside '' are case insensitive, whereas strings inside "" are not. So the following is exactly equivalent to the above:
```
PrOgRaM name;
```

Data Declarations

Our syntax allows for constants and variables:

syntax ConstDecl = 'const' Id "=" Number ";";
syntax VarDecl =  'var' IdList ":" TypeSpec ";"

with (potentially arrays of) integer and real numbered types:

syntax TypeSpec = BasicType | 'array' "[" Number ".." Number "]" 'of' BasicType;
syntax BasicType = 'integer' | 'real';

Observe that variable declarations allow a list of variables to be declared at once, and our arrays:

Can have index ranges that are not 0-indexed
Have sizes that are completely known at compile time

Further, observe that this syntax allows indexing by real numbers. We actually do not want this! We could define another syntactic (technically, lexical) element for indexing ... but this is not ideal. We will see how we can deal with this, in a future post on semantic analysis.

Sub-Programs

Our syntax allows for functions / procedures:

syntax SubProgDecl  = 'function' Id Parameters ":" BasicType ";" VarDecl* CompoundStatement ";" 
                    | 'procedure' Id Parameters? ";" VarDecl* CompoundStatement ";";

This broadly matches our description of functions / procedures from earlier. The CompoundStatement defines the body of the function, and you are allowed to define local variables before the body begins via VarDecl*.

Parameters are declared as follows:

syntax Parameters     = "(" ParameterList ")";
syntax ParameterList  = ParamList (";" ParamList)*;
syntax ParamList      = 'var'? IdList ":" TypeSpec;

As you can see, we can declare multiple parameters in a single declaration (ParamList), the collection of which (ParameterList) defines all our parameters. Perhaps more curiously, what is the 'var' doing here? Well, this is how MiniPas declares reference variables! Otherwise, arguments are passed by value.

Now, the Pacalers amongst you might have noticed an incongruency with regular Pascal here. You can declare the size of arrays in MiniPas, as part of the parameter type definition. There are only static arrays in MiniPas and arrays passed as a parameter in MiniPas must fit the size declared in the parameter.

However, the indexing range for arrays need not actually match that declared in the parameters. They are essentially "re-mapped", when used within the function. So you can pass a 0-indexed array (say a) into a function expecting a 1-indexed array, and inside the function a[1] refers to a[0] in the outer scope.

Statements

We can define (a sequence) of statements as follows:

syntax CompoundStatement  = 'begin' StatementList? 'end';
syntax StatementList      = Statement (";" Statement)*;
syntax Statement          = Lhs ":=" Expr
                          | SubProgCall                                
                          | CompoundStatement                             
                          | 'skip'
                          | 'if' Guard 'then' Statement 'else' Statement 
                          | 'while' Guard 'do' Statement
                          ;
syntax SubProgCall        = 'readln'"(" LhsList ")"
                          | 'writeln'"(" ExprList ")"
                          > Id ("(" ExprList ")")?
                          ;

And we have statements for (and I'm listing them in order of the definition above):

Assignments
Sub-Program calls
Nested statements
A no-operation
Conditional statements
Looping statements

A keen reader will observe that skip is a redundant statement, that can be replaced with an empty CompoundStatement. Also, one might wonder why not use StatementList instead of CompoundStatement.

Well, the former is actually an artifact of the Compiler Construction course, to allow empty bodies, without starting a new CompoundStatement for that. The latter is more interesting. It's actually to prevent ambiguities!

Suppose we used StatementList instead. Then we can match x := 10 in two ways:

Statement -> StatementList -> Statement (";" Statement)* 
  -> Statement -> Lhs ":=" Expr -> x ":=" Expr -> x ":=" 10
Statement -> Lhs ":=" Expr -> x ":=" Expr -> x ":=" 10

This gives rise to multiple parse trees, thus making our grammar ambiguous!

Also, really quickly, for completeness, we define Lhs, LhsList, which just restricts the expressions we can assign to (i.e: just variables and specific locations in an array):

syntax LhsList  = Lhs ("," Lhs)*;
syntax Lhs      = Id ("[" Expr "]")?;

Expressions

Finally, we present the syntax for expressions. Note that since we don't have Booleans are a type, Guard expressions have their own syntax.

syntax Guard    = "(" Guard ")"
                > Expr Relop Expr
                > 'not' Guard
                > left (Guard 'or' Guard  | Guard 'and' Guard)
                ;

Our expressions contain the standard arithmetic operations (as well as integer division / modulo operations).

syntax ExprList = Expr ("," Expr)*;
syntax Expr     = "(" Expr ")"                    
                | Id
                | Id "[" Expr (".." Expr)?"]"
                > Number                          
                | Id "(" ExprList ")"             
                > "-" Expr
                > left (
                    Expr  "*" Expr | Expr  "/" Expr 
                  | Expr  'div' Expr | Expr  'mod' Expr
                )
                > left (Expr  "+" Expr | Expr  "-" Expr)
                ;

Perhaps the most interesting line is Id "[" Expr (".." Expr)?"]". It allows for the for the following expressions:

Id: This is simple variable referencing
Id [Expr]: This is simple array indexing
Id [Expr .. Expr]: Now this is interesting! This is array slicing. It defines a sub-array of an array.

Note that our semantics will only allow operations upon basic types (and not array types), so slicing will only be useful for passing slices into functions.

Now, if you recall, arithmetic expressions have a notion of priorities. This is expressly encoded with the use of >. Furthermore, notions of associativity are encoded with the left keyword, which forces all alternatives within the parenthesis to match strings in a left associative manner.

Tokenizing

Tokenizing in Rascal is actually quite different in Rascal (at least to what I've seen). Usually, tokenizing involves using a lexer to scan through text and collect them into tokens, which is the atom of detail in which our parser works at.

As a brief example, if you had the following:

program my_program;

your lexer would probably return the following tokens: PROGRAM, ID, SEMICOLON. In this way your parser does not need to deal with the notion of characters at all.

However, there is no such phase in Rascal (at least in a user-facing manner. Perhaps in the backend, it does essentially create a lexer). Rather, they are simply declared, akin to syntactic elements:

keyword Reserved  = 'program' | 'const' | 'var' | 'array' | 'of' | 'integer' 
| 'real' | 'function' | 'procedure' | 'begin' | 'end'
| 'skip' | 'if' | 'then' | 'else' | 'while' | 'do' | 'not'
| 'or' | 'and' | 'readln' | 'writeln'
;
lexical Id 
  = ([a-z A-Z 0-9 _] !<< [a-z A-Z][a-z A-Z 0-9 _]*
  !>> [a-z A-Z 0-9 _]) \ Reserved;

The above says that an Id is a string starting with an alphabetic character ([a-zA-Z]), followed by any number of alpha-numeric characters or underscores. The !<< and !>> define follow and precede constraints, forcing the identifier not be preceded / followed by any alpha-numeric character or underscore. This essentially forces a "largest match" on an identifier.

Comments and Whitespace!

Rascal also manages comments / whitespace differently than other parsing / lexical tools that I've used before. In those tools, whenever I observed a whitespace character during tokenization, the lexer would simply skip it, acting as it wasn't there.

Similarly, for comments, any character part of a comment would be ignored by the lexer (and if you had nested comments, this could be a non-trivial process involving some level of state).

Rascal however, has a layout keyword, that defines what can "surround" our syntactic / lexical elements (i.e: comments and whitespace). To this end, Rascal invisibly (in terms of the obtained parse tree) places the layout non-terminal between any non-terminal / terminal in our syntacic definitions. You can read this as "Between any syntactic elements, gobble up layout elements".

Note, I stressed "syntactic" for a reason. layout does not "infect" our lexical definitions.

Here is our definition of our layout non-terminals:

lexical Comment = "{" ![}]* "}";

layout Standard 
= WhitespaceOrComment* 
!>> [
  \u0009-\u000D \u0020 \u0085 \u00A0 \u1680\u180E
  \u2000-\u200A \u2028 \u2029 \u202F \u205F \u3000
  {}
  ];

lexical WhitespaceOrComment 
= whitespace: Whitespace
| comment: Comment
;

Here, we define a comment as being anything inside curly braces, and for whitespace, we actually make use of the built-in definition of whitespace (by extending lang:std::Whitespace).

Observe! We do not allow nested comments :)

Much like Ids, we define our layout such that it gobbles up as much whitespace / comments as possible. This is indicated by our follow constraints, preventing any whitespace (the UTF character codes) or comment deliminators from appearing after a Standard match.

Why is this relevant? Well, consider the following program:

program commentTest;
{A sneaky comment}
{An even sneakier comment}
procedure p1 ; BEGIN END;

BEGIN
END.

Seems innocent enough. What would the Program terminal expand on this program? Maybe something like:

'program' Id  ";"" CompoundStatement "."

Well, recall that Standard will be interleaved between each symbol above (and because Program is a start terminal, it will also be placed as the first non-terminal). So really, our Program terminal is really defined as:

Standard 'program' Standard Id Standard ";" Standard ConstDecl* 
Standard VarDecl* Standard SubProgDecl* Standard CompoundStatement Standard "."

and so the expansion is as follows:

Standard 'program' Standard Id Standar ";" 
Standard Standard Standard Standard CompoundStatement Standard "."

Thus, the question becomes, which Standard should each comment be placed in? Well, our definition of Standard forces both comments to be in the first Standard, but if we didn't have our follow constraint, it could go in any which one (as long as the comments are placed such that they are still in the same order)!

This was a particularly tricky bug to fix (though it's actually documented verbatim here).

Final result

And we're done! We defined our syntax and how let's have a look at our mythical CST for the following tiny, tiny program

program robin;

const cover = 1;

const robins_skill = 0;

const dogukan = 2;

var player : integer;

{ Prints 1 if Robin wins, 0 if it's a draw, -1 if robin loses }
procedure smash(at : integer; opponents_skill : integer) ;
BEGIN
  if (robins_skill > opponents_skill) then
    writeln(1)
  else
    if (robins_skill = opponents_skill) then
      writeln(0)
    else 
      writeln(-1)
END;

BEGIN
  player := dogukan;
  smash(cover, player)
END.

The CST representation can be found here. If you look at it, it's ... well, pretty difficult to read. It contains a lot of details that we don't really care about.

For example:

The breakdown of how an Id is matched
Syntax details that aren't very relevant (then, do, etc.)
Various other things I cannot write down right now, as I'm going to be late for my next meeting :)

We wish to abstract these details away to make it easier to process, and this will actually be our next step.

See you in the next post!

Acknowledgements

Thank you to my friends who read this post and spotted my typos :')