Creating a compiler for a subset of Pascal in Rascal - Syntax
Introduction
When designing a language, your very first step is to define what constitutes a syntactically valid program, in your language. For most, this involves writing a context-free grammar for this purpose and then writing a program that accepts / rejects input files by whether they respect the grammar.
Furthermore, if your program is a compiler, it will also need to create an internal representation of the program. There are actually a few of these, but the very first one that a compiler will create is the Concrete Syntax Tree (CST) of your input file. This is the representation of the exact text in your file (sometimes modulo whitespace / comments).
Creating the CST is the goal of today's article, so buckle in! You can follow the project on my git server and you can browse the source at this specific point in the project here.
Do note, I'm going to assume you know how to read a context-free grammar. It's relatively easy to understand via demonstration, and Wikipedia has some nice examples.
The Syntax
I'm going to discuss each aspect of the syntax of MiniPas, but if you have a bit of experience with EBNF, it might be faster to just read the syntax in full and come back to this section if there is any confusing components. You can find it here, in all it's fully commented glory.
The Top Level Sturcture
The starting rule in our syntax is as follows:
start syntax Program = 'program' Id ";" ConstDecl* VarDecl* SubProgDecl* CompoundStatement ".";This is our starting rule (as indicated by the
start
keyword) and it says that a valid MiniPas program:
- Starts with the name of the program (as captured by
Id
) - Continues with constant declarations (as captured by
ConstDecl*
) - Continues with variable declarations (as captured by
VarDecl*
) - Continues with sub-program declarations (as captured by
SubProgDecl*
) - And finishes with the main body of the program (as captured by
CompoundStatement
)
A few notes:
-
We use the term "sub-program", instead of functions as MiniPas has two notions of function:
- Functions, which take one or more arguments and return a value.
- Procedures, which take zero or more arguments and do not return a value.
-
If you're unfamiliar with EBNF, if you have a non-terminal
A
, thenA*
indicates thatA
can match zero or more strings that would be captured byA
. For completeness,A+
allows one or more matches andA?
allows zero or one matches. -
Observe the strings between
''
and""
. These are terminals within our program and define syntactic elements. So a proper MiniPas program would start with:program name;
Note that strings inside''
are case insensitive, whereas strings inside""
are not. So the following is exactly equivalent to the above:PrOgRaM name;
Data Declarations
Our syntax allows for constants and variables:
syntax ConstDecl = 'const' Id "=" Number ";"; syntax VarDecl = 'var' IdList ":" TypeSpec ";"with (potentially arrays of) integer and real numbered types:
syntax TypeSpec = BasicType | 'array' "[" Number ".." Number "]" 'of' BasicType; syntax BasicType = 'integer' | 'real';Observe that variable declarations allow a list of variables to be declared at once, and our arrays:
- Can have index ranges that are not 0-indexed
- Have sizes that are completely known at compile time
Further, observe that this syntax allows indexing by real numbers. We actually do not want this! We could define another syntactic (technically, lexical) element for indexing ... but this is not ideal. We will see how we can deal with this, in a future post on semantic analysis.
Sub-Programs
Our syntax allows for functions / procedures:
syntax SubProgDecl = 'function' Id Parameters ":" BasicType ";" VarDecl* CompoundStatement ";" | 'procedure' Id Parameters? ";" VarDecl* CompoundStatement ";";This broadly matches our description of functions / procedures from earlier. The
CompoundStatement
defines the body of the function, and you are allowed to define local variables before the body begins via
VarDecl*
.
Parameters are declared as follows:
syntax Parameters = "(" ParameterList ")"; syntax ParameterList = ParamList (";" ParamList)*; syntax ParamList = 'var'? IdList ":" TypeSpec;As you can see, we can declare multiple parameters in a single declaration (
ParamList
), the collection of
which (ParameterList
) defines all our parameters. Perhaps more curiously, what is the
'var'
doing here? Well, this is how MiniPas declares reference variables! Otherwise, arguments are passed
by value.
Now, the Pacalers amongst you might have noticed an incongruency with regular Pascal here. You can declare the size of arrays in MiniPas, as part of the parameter type definition. There are only static arrays in MiniPas and arrays passed as a parameter in MiniPas must fit the size declared in the parameter.
However, the indexing range for arrays need not actually match that declared in the parameters. They are essentially
"re-mapped", when used within
the function. So you can pass a 0-indexed array (say a
) into a function expecting a 1-indexed array,
and inside the function a[1]
refers to a[0]
in the outer scope.
Statements
We can define (a sequence) of statements as follows:
syntax CompoundStatement = 'begin' StatementList? 'end'; syntax StatementList = Statement (";" Statement)*; syntax Statement = Lhs ":=" Expr | SubProgCall | CompoundStatement | 'skip' | 'if' Guard 'then' Statement 'else' Statement | 'while' Guard 'do' Statement ; syntax SubProgCall = 'readln'"(" LhsList ")" | 'writeln'"(" ExprList ")" > Id ("(" ExprList ")")? ;And we have statements for (and I'm listing them in order of the definition above):
- Assignments
- Sub-Program calls
- Nested statements
- A no-operation
- Conditional statements
- Looping statements
A keen reader will observe that skip
is a redundant statement, that can be replaced with an empty
CompoundStatement
. Also, one might wonder why not use StatementList
instead of
CompoundStatement
.
Well, the former is actually an artifact of the Compiler Construction course, to allow
empty bodies, without starting a new CompoundStatement
for that. The latter is more interesting. It's
actually to prevent ambiguities!
Suppose we used StatementList
instead. Then we can match x := 10
in two ways:
Statement -> StatementList -> Statement (";" Statement)* -> Statement -> Lhs ":=" Expr -> x ":=" Expr -> x ":=" 10 Statement -> Lhs ":=" Expr -> x ":=" Expr -> x ":=" 10This gives rise to multiple parse trees, thus making our grammar ambiguous!
Also, really quickly, for completeness, we define Lhs, LhsList
, which just restricts the expressions we
can assign to
(i.e: just variables and specific locations in an array):
syntax LhsList = Lhs ("," Lhs)*; syntax Lhs = Id ("[" Expr "]")?;
Expressions
Finally, we present the syntax for expressions. Note that since we don't have Booleans are a type, Guard expressions have their own syntax.
syntax Guard = "(" Guard ")" > Expr Relop Expr > 'not' Guard > left (Guard 'or' Guard | Guard 'and' Guard) ;Our expressions contain the standard arithmetic operations (as well as integer division / modulo operations).
syntax ExprList = Expr ("," Expr)*; syntax Expr = "(" Expr ")" | Id | Id "[" Expr (".." Expr)?"]" > Number | Id "(" ExprList ")" > "-" Expr > left ( Expr "*" Expr | Expr "/" Expr | Expr 'div' Expr | Expr 'mod' Expr ) > left (Expr "+" Expr | Expr "-" Expr) ;Perhaps the most interesting line is
Id "[" Expr (".." Expr)?"]"
. It allows for the for the following
expressions:
Id
: This is simple variable referencingId [Expr]
: This is simple array indexingId [Expr .. Expr]
: Now this is interesting! This is array slicing. It defines a sub-array of an array.
Now, if you recall, arithmetic expressions have a notion of priorities. This is expressly encoded with the use of
>
. Furthermore, notions of associativity are encoded with the left
keyword, which forces
all alternatives within the parenthesis to match strings in a left associative manner.
Tokenizing
Tokenizing in Rascal is actually quite different in Rascal (at least to what I've seen). Usually, tokenizing involves using a lexer to scan through text and collect them into tokens, which is the atom of detail in which our parser works at.
As a brief example, if you had the following:
program my_program;your lexer would probably return the following tokens:
PROGRAM
, ID
, SEMICOLON
.
In
this way your parser does not need to deal with the notion of characters at all.
However, there is no such phase in Rascal (at least in a user-facing manner. Perhaps in the backend, it does essentially create a lexer). Rather, they are simply declared, akin to syntactic elements:
keyword Reserved = 'program' | 'const' | 'var' | 'array' | 'of' | 'integer' | 'real' | 'function' | 'procedure' | 'begin' | 'end' | 'skip' | 'if' | 'then' | 'else' | 'while' | 'do' | 'not' | 'or' | 'and' | 'readln' | 'writeln' ; lexical Id = ([a-z A-Z 0-9 _] !<< [a-z A-Z][a-z A-Z 0-9 _]* !>> [a-z A-Z 0-9 _]) \ Reserved;The above says that an Id is a string starting with an alphabetic character (
[a-zA-Z]
), followed by any
number of alpha-numeric characters or underscores. The !<<
and !>>
define follow and precede
constraints, forcing the identifier not be preceded / followed by any alpha-numeric character or underscore. This
essentially
forces a "largest match" on an identifier.
Comments and Whitespace!
Rascal also manages comments / whitespace differently than other parsing / lexical tools that I've used before. In those tools, whenever I observed a whitespace character during tokenization, the lexer would simply skip it, acting as it wasn't there.
Similarly, for comments, any character part of a comment would be ignored by the lexer (and if you had nested comments, this could be a non-trivial process involving some level of state).
Rascal however, has a layout
keyword, that defines what can "surround" our syntactic / lexical elements
(i.e: comments and whitespace). To this end, Rascal invisibly (in terms of the obtained parse tree) places the
layout
non-terminal between any non-terminal / terminal in our syntacic definitions. You can
read this as "Between any syntactic elements, gobble up layout elements".
Note, I stressed "syntactic" for a reason. layout
does not "infect" our lexical definitions.
Here is our definition of our layout non-terminals:
lexical Comment = "{" ![}]* "}"; layout Standard = WhitespaceOrComment* !>> [ \u0009-\u000D \u0020 \u0085 \u00A0 \u1680\u180E \u2000-\u200A \u2028 \u2029 \u202F \u205F \u3000 {} ]; lexical WhitespaceOrComment = whitespace: Whitespace | comment: Comment ;Here, we define a comment as being anything inside curly braces, and for whitespace, we actually make use of the built-in definition of whitespace (by extending
lang:std::Whitespace
).
Observe! We do not allow nested comments :)
Much like Ids, we define our layout
such that it gobbles up as much whitespace / comments as possible.
This is indicated by
our follow constraints, preventing any whitespace (the UTF character codes) or comment deliminators from appearing
after a Standard
match.
Why is this relevant? Well, consider the following program:
program commentTest; {A sneaky comment} {An even sneakier comment} procedure p1 ; BEGIN END; BEGIN END.Seems innocent enough. What would the
Program
terminal expand on this program? Maybe something like:
'program' Id ";"" CompoundStatement "."Well, recall that
Standard
will be interleaved between each symbol above (and because
Program
is a start
terminal, it will also be placed as the first non-terminal). So really,
our Program
terminal is really defined as:
Standard 'program' Standard Id Standard ";" Standard ConstDecl* Standard VarDecl* Standard SubProgDecl* Standard CompoundStatement Standard "."and so the expansion is as follows:
Standard 'program' Standard Id Standar ";" Standard Standard Standard Standard CompoundStatement Standard "."Thus, the question becomes, which
Standard
should each comment be placed in? Well, our definition of
Standard
forces
both comments to be in the first Standard
, but if we didn't have our follow constraint, it could go in
any which one (as long as the comments are placed such that they are still in the same order)!
This was a particularly tricky bug to fix (though it's actually documented verbatim here).
Final result
And we're done! We defined our syntax and how let's have a look at our mythical CST for the following tiny, tiny program
program robin; const cover = 1; const robins_skill = 0; const dogukan = 2; var player : integer; { Prints 1 if Robin wins, 0 if it's a draw, -1 if robin loses } procedure smash(at : integer; opponents_skill : integer) ; BEGIN if (robins_skill > opponents_skill) then writeln(1) else if (robins_skill = opponents_skill) then writeln(0) else writeln(-1) END; BEGIN player := dogukan; smash(cover, player) END.
The CST representation can be found here. If you look at it, it's ... well, pretty difficult to read. It contains a lot of details that we don't really care about.
For example:
- The breakdown of how an Id is matched
- Syntax details that aren't very relevant (
then
,do
, etc.) - Various other things I cannot write down right now, as I'm going to be late for my next meeting :)
We wish to abstract these details away to make it easier to process, and this will actually be our next step.
See you in the next post!
Acknowledgements
Thank you to my friends who read this post and spotted my typos :')