[dox] Fixing the lexical rule for BinaryInteger

Andre Artus andre.artus at gmail.com
Fri Aug 16 19:02:40 PDT 2013


On Saturday, 17 August 2013 at 00:51:57 UTC, H. S. Teoh wrote:
> On Sat, Aug 17, 2013 at 01:03:35AM +0200, Brian Schott wrote:
>> On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:
> [...]
>> >2. Your BinaryInteger and HexadecimalInteger only allow for one of
>> >the following (reduced) cases:
>> >
>> >0b1__ : works
>> >0b_1_ : fails
>> >0b__1 : fails
>> 
>> It's my opinion that the compiler should reject all of these because
>> I think of the underscore as a separator between digits, but I'm
>> constantly fighting the "spec, dmd, and idiom all disagree" issue.
> [...]
>
> I remember reading this part of the spec on dlang.org, and I wonder if
> it was worded the way it is just for simplicity, because to specify
> something like "_ must appear between digits" involves some complicated
> BNF rules, which maybe seems like overkill for a single literal.
>
> But sometimes it is good to be precise, if we want to enforce "proper"
> conventions for underscores:
>
> <binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>
>
> <binaryDigits> ::= <binaryDigit> <binaryDigits>
> 		| <binaryDigit>
>
> <underscoreBinaryDigits> ::= ""
> 		| "_" <binaryDigits>
> 		| "_" <binaryDigits> <underscoreBinaryDigits>
>
> <binaryDigit> ::= "0"
> 		| "1"
>
> This BNF spec forces "_" to only appear between two binary 
> digits, and never more than a single _ in a row.

Yup, that's the issue. Coding the actual behaviour by hand, or 
doing it with a regular expression, is close to trivial.
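
To illustrate the "by hand" route, here is a minimal sketch in Python 
(not the dmd lexer's code; the function name scan_binary_literal is made 
up for this example). It assumes the behaviour discussed in this thread: 
after "0b", any mix of 0, 1 and _ is accepted as long as at least one 
digit appears.

```python
def scan_binary_literal(text, start=0):
    """Return the index just past a binary literal at text[start:], or -1.

    Assumed rule (hypothetical helper, mirrors the behaviour described in
    this thread): "0b" followed by any mix of '0', '1' and '_' that
    contains at least one digit.
    """
    if not text.startswith("0b", start):
        return -1
    pos = start + 2
    seen_digit = False
    # Consume every character that may appear in the literal body.
    while pos < len(text) and text[pos] in ("0", "1", "_"):
        if text[pos] != "_":
            seen_digit = True
        pos += 1
    # A prefix followed only by underscores (or by nothing) is not a literal.
    return pos if seen_digit else -1
```

The scanner stops at the first character that is not 0, 1 or _, so what 
to do with something like "0b12" is left to the caller.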

> You can also make your parser only
> pick up <binaryDigit> when performing semantic on binary literals, so
> the other stuff is ignored and only serves to enforce syntax.

Pushing it up to the parser is an option in implementation, but I 
don't see that making the specification easier (it's 3:40 in the 
morning here, so I am very likely not thinking too clearly about 
this).
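
For what it's worth, the "only pick up the digits" idea could look 
roughly like this in Python. This is a sketch under the assumption that 
the lexeme has already been accepted by whatever lexical rule is in 
force; binary_literal_value is a hypothetical helper name.

```python
def binary_literal_value(lexeme):
    """Compute the value of a binary literal, ignoring every underscore.

    Hypothetical helper: assumes `lexeme` starts with "0b" and was already
    accepted by the lexer; only the '0'/'1' characters contribute.
    """
    if not lexeme.startswith("0b"):
        raise ValueError("not a binary literal: %r" % lexeme)
    digits = [ch for ch in lexeme[2:] if ch in ("0", "1")]
    if not digits:
        raise ValueError("binary literal has no digits: %r" % lexeme)
    # Underscores are pure separators, so they never affect the value.
    return int("".join(digits), 2)
```

With this, 0b0011_1101 and 0b____1___ both get a value no matter where 
the underscores sit; the placement rules stay a purely lexical matter.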

>
> I'd be surprised if there's any D code out there that doesn't fit this
> spec, to be honest.

It's not what I would call best practice, but the following is 
possible in the current compiler:

> 	auto myBin1 = 0b0011_1101; // Sane
> 	auto myBin2 = 0b_______1; // Trouble, myBin2 == 1
> 	auto myBin3 = 0b____1___; // Trouble, myBin3 == 1

Which means that tools built against the documented spec are going 
to choke on these weird cases. Personally, I would prefer that the 
more questionable forms were not allowed, as they potentially 
defeat the goal of improving clarity. But that's a breaking 
change.
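
To make the mismatch concrete, here is a small Python sketch of a tool 
that follows the strict "underscore only between digits" BNF quoted 
earlier, translated into a regular expression. The translation and the 
name is_strict_binary_literal are mine, not taken from the spec or from 
any existing tool.

```python
import re

# Strict rule, assumed from the quoted BNF: "0b", one or more digits, then
# zero or more groups of a single '_' followed by one or more digits.
STRICT_BINARY = re.compile(r"0b[01]+(?:_[01]+)*")

def is_strict_binary_literal(lexeme):
    """True if the whole lexeme matches the strict separator-only rule."""
    return STRICT_BINARY.fullmatch(lexeme) is not None

# The three literals from the D snippet above.
for lexeme in ("0b0011_1101", "0b_______1", "0b____1___"):
    verdict = "accepted" if is_strict_binary_literal(lexeme) else "rejected"
    print(lexeme, verdict)
```

A tool written this way accepts only the first literal, even though the 
compiler takes all three.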

>
> But if you want to accept "strange" literals like 0b__1__, you could do
> something like:
>
> <binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit> <underscoreBinaryDigits>
>
> <underscoreBinaryDigits> ::= "_"
> 		| "_" <underscoreBinaryDigits>
> 		| <binaryDigit>
> 		| <binaryDigit> <underscoreBinaryDigits>
> 		| ""
>
> <binaryDigit> ::= "0"
> 		| "1"
>
> The odd form of the rule for <binaryLiteral> is to ensure that there's
> at least one binary digit in the string, whereas
> <underscoreBinaryDigits> is just a wildcard anything-goes rule that
> takes any combination of 0, 1, and _, including the empty string.
>

The rule that matches the DMD compiler is actually very easy to 
do in ANTLR4, i.e.

BinaryLiteral   : '0b' [_01]* [01] [_01]* ;


I'm a bit too tired to fully pay attention, but it seems you are 
saying that "0b" (with no digits after the prefix) should match, which I 
believe it should not (although I admit to not testing this). If 
it does, then I would consider that a bug.
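
As a quick sanity check, the ANTLR rule above carries over almost 
character for character to a Python regular expression. This is only a 
sketch (the name is_permissive_binary_literal is made up), and it says 
what the pattern does, not what dmd does.

```python
import re

# Same shape as the ANTLR rule: '0b' [_01]* [01] [_01]*
# The mandatory [01] in the middle is what requires at least one digit.
PERMISSIVE_BINARY = re.compile(r"0b[_01]*[01][_01]*")

def is_permissive_binary_literal(lexeme):
    """True if the whole lexeme matches the permissive rule."""
    return PERMISSIVE_BINARY.fullmatch(lexeme) is not None

# A bare prefix, a prefix with only underscores, and two odd-looking literals.
for lexeme in ("0b", "0b__", "0b__1__", "0b1"):
    verdict = "matches" if is_permissive_binary_literal(lexeme) else "no match"
    print(lexeme, verdict)
```

With this pattern, a bare "0b" and "0b__" are not matched, because the 
middle [01] has nothing to consume.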

It's not a problem implementing the rule; I am more concerned 
with documenting it in a clear and unambiguous way so that people 
building tools from it can get it right. BNF isn't always the 
easiest way to do so, but it's what is being used.

