[dox] Fixing the lexical rule for BinaryInteger

Fri Aug 16 19:02:40 PDT 2013

On Saturday, 17 August 2013 at 00:51:57 UTC, H. S. Teoh wrote:
> On Sat, Aug 17, 2013 at 01:03:35AM +0200, Brian Schott wrote:
>> On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:
> [...]
>> >2. Your BinaryInteger and HexadecimalInteger only allow for one of
>> >the following (reduced) cases:
>> >
>> >0b1__ : works
>> >0b_1_ : fails
>> >0b__1 : fails
>> 
>> It's my opinion that the compiler should reject all of these because
>> I think of the underscore as a separator between digits, but I'm
>> constantly fighting the "spec, dmd, and idiom all disagree" issue.
> [...]
>
> I remember reading this part of the spec on dlang.org, and I wonder if
> it was worded the way it is just for simplicity, because to specify
> something like "_ must appear between digits" involves some complicated
> BNF rules, which maybe seems like overkill for a single literal.
>
> But sometimes it is good to be precise, if we want to enforce 
> "proper"
> conventions for underscores:
>
> <binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>
>
> <binaryDigits> ::= <binaryDigit> <binaryDigits>
> 		| <binaryDigit>
>
> <underscoreBinaryDigits> ::= ""
> 		| "_" <binaryDigits>
> 		| "_" <binaryDigits> <underscoreBinaryDigits>
>
> <binaryDigit> ::= "0"
> 		| "1"
>
> This BNF spec forces "_" to only appear between two binary 
> digits, and never more than a single _ in a row.

Yup, that's the issue. Coding the actual behaviour by hand, or 
doing it with a regular expression, is close to trivial.
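
As a minimal sketch of the regular-expression route (Python is used 
purely for illustration, and the name is_strict_binary is made up for 
this example), the strict "underscore only between digits" rule quoted 
above could look like this. It assumes only the lowercase "0b" prefix 
that appears in the BNF:

```python
import re

# Strict rule from the quoted BNF: "0b", one or more digits, then any
# number of groups made of exactly one "_" followed by one or more digits.
STRICT_BINARY = re.compile(r"0b[01]+(?:_[01]+)*")


def is_strict_binary(text: str) -> bool:
    """Return True if the whole string fits the strict separator rule."""
    return STRICT_BINARY.fullmatch(text) is not None
```

With this pattern "0b0011_1101" is accepted, while "0b1__", "0b_1_" 
and "0b__1" are all rejected.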

> You can also make your parser only
> pick up <binaryDigit> when performing semantic on binary 
> literals, so
> the other stuff is ignored and only serves to enforce syntax.

Pushing it up to the parser is an option in implementation, but I 
don't see that making the specification easier (it's 3:40 in the 
morning here, so I am very likely not thinking too clearly about 
this).

>
> I'd be surprised if there's any D code out there that doesn't 
> fit this
> spec, to be honest.

It's not what I would call best practice, but the following is 
possible in the current compiler:

> 	auto myBin1 = 0b0011_1101; // Sane
> 	auto myBin2 = 0b_______1; // Trouble, myBin2 == 1
> 	auto myBin3 = 0b____1___; // Trouble, myBin3 == 1

Which means tools built against the documented spec are going 
to choke on these weird cases. Personally I would prefer it if the 
more questionable options were not allowed, as they potentially 
defeat the goal of improving clarity. But that's a breaking 
change.

>
> But if you want to accept "strange" literals like 0b__1__, you 
> could do
> something like:
>
> <binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit> <underscoreBinaryDigits>
>
> <underscoreBinaryDigits> ::= "_"
> 		| "_" <underscoreBinaryDigits>
> 		| <binaryDigit>
> 		| <binaryDigit> <underscoreBinaryDigits>
> 		| ""
>
> <binaryDigit> ::= "0"
> 		| "1"
>
> The odd form of the rule for <binaryLiteral> is to ensure that there's
> at least one binary digit in the string, whereas
> <underscoreBinaryDigits> is just a wildcard anything-goes rule that
> takes any combination of 0, 1, and _, including the empty string.
>

The rule that matches the DMD compiler is actually very easy to 
do in ANTLR4, i.e.

BinaryLiteral   : '0b' [_01]* [01] [_01]* ;

I'm a bit too tired to fully pay attention, but it seems you are 
saying that "0b" (no additional numbers) should match, which I 
believe it should not (although I admit to not testing this). If 
it does then I would consider that a bug.
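
For anyone who wants to check the edge cases without firing up ANTLR, 
here is a rough Python equivalent of the rule above. It is a sketch 
that mirrors the ANTLR pattern as written, not something verified 
against DMD itself, and binary_literal_value is a hypothetical helper 
name:

```python
import re
from typing import Optional

# Same shape as the ANTLR rule: any mix of "_", "0", "1" around at
# least one mandatory binary digit.
PERMISSIVE_BINARY = re.compile(r"0b[_01]*[01][_01]*")


def binary_literal_value(text: str) -> Optional[int]:
    """Return the literal's integer value, or None if it does not match."""
    if PERMISSIVE_BINARY.fullmatch(text) is None:
        return None
    # Underscores carry no value, so drop them before converting.
    return int(text[2:].replace("_", ""), 2)
```

Under this pattern a bare "0b" and an underscore-only "0b___" both 
fail to match, because the middle [01] needs at least one digit, 
while "0b_______1" and "0b____1___" both evaluate to 1.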

It's not a problem implementing the rule, I am more concerned 
with documenting it in a clear and unambiguous way so that people 
building tools from it can get it right. BNF isn't always the 
easiest way to do so, but it's what is being used.