Pegged, From EBNF to PEG

Tue Mar 13 09:17:31 PDT 2012

On 12.03.2012 17:45, bls wrote:
> On 03/13/2012 04:28 AM, Dmitry Olshansky wrote:
>> On 12.03.2012 16:43, bls wrote:
>>> On 03/10/2012 03:28 PM, Philippe Sigaud wrote:
>>>> Hello,
>>>>
>>>> I created a new Github project, Pegged, a Parsing Expression Grammar
>>>> (PEG) generator in D.
>>>>
>>>> https://github.com/PhilippeSigaud/Pegged
>>>>
>>>> docs: https://github.com/PhilippeSigaud/Pegged/wiki
>>>
>>> Just WOW!
>>>
>>> Nice to have on your WIKI would be a EBNF to PEG sheet.
>>>
>>> Wirth EBNF Pegged
>>> A = BC. A <- B C
>>> A = B|C. A <- C / C
>>
>> Maybe A <- B / C. And even then it's not exactly equivalent if the
>> grammar was ambiguous.
>> Imagine: B <- a, C <- aa
> PEG is pretty new to me. Can you elaborate a bit ?

PEG defines order of alternatives, that is pretty much like a top-down 
recursive descent parser would parse it. Alternatives are tried from 
left to right, if first one fails, it tries next and so on.
In an example I give B is always picked first and so C is never ever 
looked at.

Somewhat less artificial example:
Literal <- IntL| FloatL
FloatL <- [0-9]+(.[0-9]+)?
IntL <- [0-9]+

If you change it to: Literal <- FloatL| IntL then integer literals would 
get parsed as floating point.

>>
>
> My mistake.. cleaned up stuff..
>
> Pegged Wirth EBNF
>
> Sequence
> A <- B C A = BC.
>
> B or C
> A <- B / C A = B|C.
>
> Zero or one B
> A <- B? A = [B].
>
> Zero or more Bs
> A <- B* A = {B}.
>
> One or more Bs
> A <- B+ Not available
>
> PEG description of EBNF
>
> EBNF <- Procuction+
> Production <- Identifier '=' Expression '.'
> Expression <- Term ( '|' Term)*
> Term <- Factor Factor*
> Factor <- Identifier / Literal / '[' Expression ']' / '{' Expression '}'
> / '(' Expression ')'
> lowerCase <- [a-z]
> upperCase <- [A-Z]
> Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*

Why not:
Identifier <- [a-zA-Z]+

> Literal <- ("'" .+ "'") / ('"' .+ '"')
>
This needs escaping. Plain '.+' in pattern asks for trouble 99% of time.

> Still not sure if this is correct. Especially :
> Term <- Factor Factor*
>
>
> Another thing I never really understand is the "production" order, In
> other words : Why not top down ..
> Start :
> lowerCase <- [a-z]
> upperCase <- [A-Z]
> Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*

> ....
>
> End :
> EBNF <- Procuction+
>
> where End is Root..

In fact grammars are usually devised the other way around, e.g.
Start:
  Program -> ...
Ehm... what the whole program is exactly ? Ok, let it be Declaration* 
for now. What kind of declarations do we have? ... and so on. Latter 
grammars get tweaked and extended numerous times.

At any rate production order has no effect on the grammar, it's still 
the same. The only thing of importance is what non-terminal considered 
final (or start if you are LL-centric).

>
> TIA, Bjoern

-- 
Dmitry Olshansky