[dox] Fixing the lexical rule for BinaryInteger

Sat Aug 17 14:29:03 PDT 2013

>>>>> Andre Artus wrote:
>>>>> 2. Your BinaryInteger and HexadecimalInteger only allow for
>>>>> one of the following (reduced) cases:
>>>>> 
>>>>> 0b1__ : works
>>>>> 0b_1_ : fails
>>>>> 0b__1 : fails

>>>> Brian Schott wrote:
>>>> It's my opinion that the compiler should reject all of these 
>>>> because I think of the underscore as a separator between 
>>>> digits,
>>>> but I'm constantly fighting the "spec, dmd, and idiom all 
>>>> disagree" issue.

I agree with you, Brian: all three of these constructions run 
contrary to the goal of making the code clearer. I would not be 
too surprised if a significant number of programmers read them as 
different numbers, at least until they paused to take it in.

>>> [...]
>>> H. S. Teoh wrote:
>>> 
>>> I remember reading this part of the spec on dlang.org, and I 
>>> wonder if it was worded the way it is just for simplicity, 
>>> because to specify something like "_ must appear between 
>>> digits" involves some complicated BNF rules, which maybe 
>>> seems like overkill for a single literal.

I think that you are right.

>>> H. S. Teoh wrote:
>>> 
>>> But sometimes it is good to be precise, if we want to enforce
>>> "proper" conventions for underscores:
>>> 
>>> <binaryLiteral>  ::= "0b" <binaryDigits>
>>>                     <underscoreBinaryDigits>
>>> 
>>> <binaryDigits>   ::= <binaryDigit> <binaryDigits>
>>>                    | <binaryDigit>
>>> 
>>> <underscoreBinaryDigits>
>>>                  ::= ""
>>>                    | "_" <binaryDigits>
>>>                    | "_" <binaryDigits>
>>>                      <underscoreBinaryDigits>
>>> 
>>> <binaryDigit>    ::= "0"
>>>                    | "1"
>>> 
>>> This BNF spec forces "_" to only appear between two binary 
>>> digits, and never more than a single _ in a row.
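
For comparison, the strict BNF above appears to correspond to a fairly short regular expression. The following is only a sketch, using Python's `re` module as a stand-in for a real lexer; the names `STRICT_BINARY` and `is_strict_binary` are made up for illustration, and only the lowercase `0b` prefix is handled.

```python
import re

# Assumed regex equivalent of the strict BNF quoted above:
# one or more binary digits, followed by zero or more groups made of
# exactly one "_" and one or more binary digits.
STRICT_BINARY = re.compile(r"0b[01]+(?:_[01]+)*")


def is_strict_binary(text: str) -> bool:
    """Return True if the whole string matches the strict rule."""
    return STRICT_BINARY.fullmatch(text) is not None


# Print which sample literals the strict rule accepts.
for literal in ["0b0011_1101", "0b1", "0b1__", "0b_1_", "0b__1", "0b"]:
    print(literal, is_strict_binary(literal))
```

Under this rule only the first two samples are accepted; every form with a leading, trailing, or doubled underscore is rejected, as is a bare `0b`.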

>> Andre Artus wrote:
>> Yup, that's the issue. Coding the actual behaviour by hand, or 
>> doing it with a regular expression, is close to trivial.

>>> H. S. Teoh wrote:
>>> You can also make your parser only pick up <binaryDigit> when
>>> performing semantic on binary literals, so the other stuff is 
>>> ignored and only serves to enforce syntax.

>> Andre Artus wrote:
>> Pushing it up to the parser is an option in implementation, 
>> but I don't see that making the specification easier (it's 
>> 3:40 in the morning here, so I am very likely not thinking too 
>> clearly about this).

> H. S. Teoh wrote:
> I didn't mean to push it up to the parser.

Sorry, I misunderstood. I had been up for over 21 hours at the 
time I wrote that, so it was getting difficult for me to 
concentrate. I got the impression you were saying that the parser 
would be responsible for extracting the binary digits.
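
To make the idea of "only picking up the binary digits" concrete, here is a small sketch of computing a literal's value while ignoring the underscores. It is written in Python purely for illustration; `binary_literal_value` is a hypothetical helper, not anything from DMD, and it assumes the lowercase `0b` prefix and the permissive behaviour described in this thread.

```python
def binary_literal_value(lexeme: str) -> int:
    """Compute the value of a binary literal, skipping underscores.

    Hypothetical helper: accepts any mix of 0, 1 and _ after "0b",
    as long as at least one digit is present.
    """
    if not lexeme.startswith("0b"):
        raise ValueError("expected a '0b' prefix: " + lexeme)
    body = lexeme[2:]
    if any(ch not in "01_" for ch in body):
        raise ValueError("unexpected character in: " + lexeme)
    digits = body.replace("_", "")  # underscores carry no value
    if not digits:
        raise ValueError("no binary digits in: " + lexeme)
    return int(digits, 2)


# Print the value each sample literal evaluates to.
for literal in ["0b0011_1101", "0b_______1", "0b____1___"]:
    print(literal, binary_literal_value(literal))
```

The underscores affect only whether the lexeme is accepted, never the resulting value, which is why the last two samples both evaluate to 1.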

> H. S. Teoh wrote:
> I was just using BNF to show that it's possible to specify the 
> behaviour precisely.
> And also that it's rather convoluted just for something as 
> intuitively straightforward as an integer literal. Which is a 
> likely reason why the current specs are a bit blurry about what 
> should/shouldn't be allowed.

I don't think I've seen lexemes defined using (a variant of) BNF 
before; most often some form of regular expression is used. One 
could significantly cut down and clarify the page describing the 
lexical syntax by employing simple regular expressions.

>>> H. S. Teoh wrote:
>>> I'd be surprised if there's any D code out there that doesn't 
>>> fit this spec, to be honest.

>> Andre Artus wrote:
>> It's not what I would call best practice, but the following is
>> possible in the current compiler:
>> 
>>> 	auto myBin1 = 0b0011_1101; // Sane
>>> 	auto myBin2 = 0b_______1; // Trouble, myBin2 == 1
>>> 	auto myBin3 = 0b____1___; // Trouble, myBin3 == 1
>> 
>> Which means tools built against the documented spec are 
>> going to choke on these weird cases. Personally I would prefer 
>> if the more questionable options were not allowed as they 
>> potentially defeat the goal of improving clarity. But, that's 
>> a breaking change.

> H. S. Teoh wrote:
> I know that, but I'm saying that hardly *any* code would break 
> if we made DMD reject things like this. I don't think anybody 
> in their right mind would write code like that. (Unless they 
> were competing in the IODCC... :-P)

I agree that the compiler should probably break that code; I 
believe some breaking changes are good when they help the 
programmer fix potential bugs. But then I am also someone who 
compiles with "Treat warnings as errors".

> H. S. Teoh wrote:
> The issue here is that when specs / DMD / TDPL don't agree, 
> then it's not always clear which among the three are wrong. 
> Perhaps *all* of them are wrong. Just because DMD accepts 
> invalid code doesn't mean it should be part of the specs, for 
> example. It could constitute a DMD bug.

It would be good to get some clarification on this.

>>> H. S. Teoh wrote:
>>> But if you want to accept "strange" literals like 0b__1__, 
>>> you could do something like:
>>> 
>>> <binaryLiteral>          ::= "0b" <underscoreBinaryDigits>
>>>                              <binaryDigit>
>>>                              <underscoreBinaryDigits>
>>> 
>>> <underscoreBinaryDigits> ::= "_"
>>>                            | "_" <underscoreBinaryDigits>
>>>                            | <binaryDigit>
>>>                            | <binaryDigit>
>>>                              <underscoreBinaryDigits>
>>>                            | ""
>>> 
>>> <binaryDigit>            ::= "0"
>>>                            | "1"
>>> 
>>> The odd form of the rule for <binaryLiteral> is to ensure that
>>> there's at least one binary digit in the string, whereas
>>> <underscoreBinaryDigits> is just a wildcard anything-goes 
>>> rule that takes any combination of 0, 1, and _, including the
>>> empty string.

>> Andre Artus wrote:
>> The rule that matches the DMD compiler is actually very easy 
>> to do in ANTLR4, i.e.
>> 
>> BinaryLiteral   : '0b' [_01]* [01] [_01]* ;
>> 
>> I'm a bit too tired to fully pay attention, but it seems you 
>> are saying that "0b" (no additional numbers) should match, 
>> which I believe it should not (although I admit to not testing 
>> this). If it does then I would consider that a bug.

> H. S. Teoh wrote:
> No, the BNF rules I wrote are equivalent to your ANTLR4 spec. 
> Which is equivalent to the regex I posted later.

I should have paid better attention, as I missed the 
<binaryDigit> in <binaryLiteral>. To be honest I was having a 
hard time focusing due to lack of sleep and a pervading stench of 
paint fumes seeping in from the adjacent building.
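
As a quick sanity check on that point, the ANTLR4 rule quoted above can be transcribed into an ordinary regular expression. This is a sketch only, in Python rather than ANTLR or D; `PERMISSIVE_BINARY` and `is_permissive_binary` are illustrative names of my own.

```python
import re

# Transcription of the quoted ANTLR4 rule:
#   BinaryLiteral : '0b' [_01]* [01] [_01]* ;
# The middle [01] is what forces at least one binary digit.
PERMISSIVE_BINARY = re.compile(r"0b[_01]*[01][_01]*")


def is_permissive_binary(text: str) -> bool:
    """Return True if the whole string matches the permissive rule."""
    return PERMISSIVE_BINARY.fullmatch(text) is not None


# Print which sample literals the permissive rule accepts.
for literal in ["0b0011_1101", "0b__1__", "0b____1___", "0b", "0b____"]:
    print(literal, is_permissive_binary(literal))
```

A bare `0b`, or `0b` followed only by underscores, does not match, because the rule requires at least one digit somewhere after the prefix.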

>> Andre Artus wrote:
>> It's not a problem implementing the rule, I am more concerned 
>> with documenting it in a clear and unambiguous way so that 
>> people building tools from it can get it right. BNF isn't 
>> always the easiest way to do so, but it's what's being used.

> H. S. Teoh wrote:
> Well, you could bug Walter about what *should* be accepted,

I'm not sure how to go about that.

> H. S. Teoh wrote:
> and if he agrees to restrict it to having _ only between two 
> digits, then you'd file a bug against DMD.

Well, if we could get a ruling on this, then it could also cover 
HexadecimalInteger, as it has similar behaviour in DMD.

The current specification for DecimalInteger also allows a 
trailing sequence of underscores. It also does not include the 
sign as part of the token value.

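Possible regex alternatives (note that I do not include the sign, 
as per the current spec). The first allows any number of 
underscores between two digits:

(0|[1-9]([_]*[0-9])*)

or, arguably better, at most one underscore between two digits:

(0|[1-9]([_]?[0-9])*)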
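
To see how the two DecimalInteger patterns differ in practice, they can be run side by side. This is a rough sketch using Python's `re` module as a stand-in; the names `MULTI_UNDERSCORE`, `SINGLE_UNDERSCORE` and `classify` are only for illustration, and this is not how DMD lexes.

```python
import re

# The two candidate patterns from above, copied verbatim.
MULTI_UNDERSCORE = re.compile(r"(0|[1-9]([_]*[0-9])*)")
SINGLE_UNDERSCORE = re.compile(r"(0|[1-9]([_]?[0-9])*)")


def classify(text: str) -> tuple:
    """Return (matches first pattern, matches second pattern)."""
    return (
        MULTI_UNDERSCORE.fullmatch(text) is not None,
        SINGLE_UNDERSCORE.fullmatch(text) is not None,
    )


# Print how each pattern treats a few sample inputs.
for literal in ["0", "1_000", "1__000", "1000_", "_1"]:
    print(literal, classify(literal))
```

Both patterns reject a trailing or leading underscore; they differ only on runs such as `1__000`, which the first accepts and the second rejects.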

> H. S. Teoh wrote:
> Again, I seriously doubt that such a change would cause any 
> code breakage, because writing 0b1 as 0b____1____ is just so 
> ridiculous that any such code *should* be broken.

Agreed.