[spec] Phases of translation

Stefanos Baziotis sdi1600105 at di.uoa.gr
Mon May 20 23:10:55 UTC 2019


On Monday, 20 May 2019 at 22:17:07 UTC, Dibyendu Majumdar wrote:
> On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:
>
>> The crucial thing to know is that the tokenizing is 
>> independent of parsing, and parsing is independent of semantic 
>> analysis.
>>
>
> I am trying to understand this aspect - I found the small write 
> up in the Intro section not very clear. Would be great if you 
> could do a session on how things work - maybe video cast?
>

For the first part, "tokenizing is independent of parsing", one way
to think of it is that to write a lexer (the sub-program that does
the tokenizing), you don't need a parser. You could write it as an
independent program. That makes sense: to split off any token from
the input ('int', '+', '123', etc.) you don't ever need to build an
AST. Put differently, you don't need any info about the structure
of the program.
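
To make that concrete, here is a minimal lexer sketch in D. The
token kinds and names are made up for illustration (this is not
dmd's lexer); the point is that it turns characters into tokens
using nothing but the characters themselves, with no grammar or AST
in sight.

    import std.ascii : isAlpha, isAlphaNum, isDigit, isWhite;

    enum TokKind { number, identifier, plus, eof }
    struct Token { TokKind kind; string text; }

    // Splits the input into tokens by looking at characters only;
    // it knows nothing about expressions, statements or types.
    Token[] lex(string src)
    {
        Token[] toks;
        size_t i = 0;
        while (i < src.length)
        {
            immutable c = src[i];
            if (c.isWhite) { ++i; continue; }
            if (c == '+')
            {
                toks ~= Token(TokKind.plus, "+");
                ++i;
                continue;
            }
            if (c.isDigit)
            {
                const start = i;
                while (i < src.length && src[i].isDigit) ++i;
                toks ~= Token(TokKind.number, src[start .. i]);
                continue;
            }
            if (c.isAlpha || c == '_')
            {
                const start = i;
                while (i < src.length
                       && (src[i].isAlphaNum || src[i] == '_'))
                    ++i;
                toks ~= Token(TokKind.identifier, src[start .. i]);
                continue;
            }
            ++i; // this sketch just skips anything it doesn't know
        }
        return toks ~ Token(TokKind.eof, "");
    }

Running lex("a + 123") gives identifier, plus, number, eof, and at
no point did we need a parser or an AST.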

For the second part, a parser can build an AST (or, more precisely,
can decide which grammatical rule the input falls into) just from
the kinds of the tokens (e.g. you see a number token, like 123, then
a '+', so you must be parsing a binary expression).
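
Continuing the sketch (again, hypothetical names, not dmd's): a tiny
recursive-descent parser for "number + number + ..." only ever asks
"what kind of token is next?", never "what does this identifier
mean?". It reuses the Token/TokKind types and the eof-terminated
token array from the lexer sketch above.

    // AST node for this sketch: a number leaf or a binary '+'.
    class Node
    {
        string text;       // literal text, or "+" for binary nodes
        Node left, right;  // set only for binary '+' nodes
    }

    Node parsePrimary(ref Token[] toks)
    {
        assert(toks[0].kind == TokKind.number);
        auto n = new Node;
        n.text = toks[0].text;
        toks = toks[1 .. $];
        return n;
    }

    // Seeing a '+' after a primary is enough to pick the rule
    // "additive expression"; no symbol table is consulted.
    Node parseExpr(ref Token[] toks)
    {
        auto left = parsePrimary(toks);
        while (toks[0].kind == TokKind.plus)
        {
            toks = toks[1 .. $];          // consume '+'
            auto sum = new Node;
            sum.text = "+";
            sum.left = left;
            sum.right = parsePrimary(toks);
            left = sum;
        }
        return left;
    }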

I think it was mentioned somewhere that in C++ parsing is
_dependent_ on semantic analysis. To understand this better, think
of semantic analysis as the act of giving meaning to the entities
of the program (hence the term semantic).

For example, if I write: int a;
Well, 'a' is a token, whose kind we could say is an identifier. But
it doesn't have any meaning on its own. Knowing that 'a' is a
variable with type 'int' gives us more info. We can now tell
whether it can be part of an expression such as: a + 1.
If 'a' were a string, that expression would be invalid, and we
found the error because we knew the meaning of 'a'.
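
In D terms, that is nothing more than the example above spelled
out:

    void main()
    {
        int a;
        auto x = a + 1;    // fine: 'a' means "variable of type int"

        string s;
        // auto y = s + 1; // error: incompatible types for '+';
        //                 // caught because we know what 's' means
    }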

Now, what is an example of parsing being dependent on the meaning
of tokens? Well, C (and C++) of course.
Consider this expression: (a)*b
If 'a' is a variable, then this is really the multiplication: a*b
But if I have written this:
typedef int a;
somewhere above, then the same expression suddenly means "cast the
dereference of 'b' to the type 'a' (which is int in this case)".

This is known as the lexer hack [1] (actually, the lexer hack is 
the solution
to the problem).

The important thing to understand here is that we can't parse the
expression just by knowing what kind of token 'a' is. We need
additional info about it, info that is provided by semantic
analysis (in the form of the symbol table, a table that contains
info about the symbols of the program).
Such grammars are known as context-sensitive (i.e. not
context-free), because you need some context to deduce the
grammatical rule.
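
Here is roughly the kind of decision a C parser is forced to make
for "(a)*b", sketched in D with a toy symbol table. The names are
made up for illustration; real C front ends do this differently,
but the dependence on the symbol table is the same.

    import std.stdio : writeln;

    enum SymKind { variable, typeName }

    // Toy symbol table, as semantic analysis might populate it.
    SymKind[string] symtab;

    // With token kinds alone, "(a)*b" is ambiguous; the parser
    // has to ask what 'a' means before it can pick a rule.
    string classify(string a)
    {
        if (symtab.get(a, SymKind.variable) == SymKind.typeName)
            return "cast of *b to the type " ~ a;
        return "multiplication: a * b";
    }

    void main()
    {
        symtab["a"] = SymKind.variable;
        writeln(classify("a"));  // multiplication: a * b

        symtab["a"] = SymKind.typeName; // as if "typedef int a;"
        writeln(classify("a"));  // cast of *b to the type a
    }

If the grammar never needs such a lookup, the parser and the symbol
table never have to meet, which is the kind of separation described
above.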

Note that now, suddenly, there is no clear distinction between the
parsing phase and the semantic analysis phase.

Last but not least, having the phases cleanly separated has other
important implications beyond comprehensibility. For example, the
compiler can be parallelized more easily: while the lexer tokenizes
file A, the parser can be parsing file B and semantic analysis can
run on file C.

[1] https://en.wikipedia.org/wiki/The_lexer_hack

