[dox] Fixing the lexical rule for BinaryInteger
Andre Artus
andre.artus at gmail.com
Sat Aug 17 14:29:03 PDT 2013
>>>>> Andre Artus wrote:
>>>>> 2. Your BinaryInteger and HexadecimalInteger only allow for
>>>>> one of the following (reduced) cases:
>>>>>
>>>>> 0b1__ : works
>>>>> 0b_1_ : fails
>>>>> 0b__1 : fails
>>>> Brian Schott wrote:
>>>> It's my opinion that the compiler should reject all of these
>>>> because I think of the underscore as a separator between
>>>> digits,
>>>> but I'm constantly fighting the "spec, dmd, and idiom all
>>>> disagree" issue.
I agree with you, Brian: all three of these constructions run
contrary to the goal of making the code clearer. I would not be
too surprised if a significant number of programmers read them
as different numbers, at least until they paused to take it in.
>>> [...]
>>> H. S. Teoh wrote:
>>>
>>> I remember reading this part of the spec on dlang.org, and I
>>> wonder if it was worded the way it is just for simplicity,
>>> because to specify something like "_ must appear between
>>> digits" involves some complicated BNF rules, which maybe
>>> seems like overkill for a single literal.
I think that you are right.
>>> H. S. Teoh wrote:
>>>
>>> But sometimes it is good to be precise, if we want to enforce
>>> "proper" conventions for underscores:
>>>
>>> <binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>
>>>
>>> <binaryDigits> ::= <binaryDigit> <binaryDigits>
>>>                  | <binaryDigit>
>>>
>>> <underscoreBinaryDigits> ::= ""
>>>                            | "_" <binaryDigits>
>>>                            | "_" <binaryDigits> <underscoreBinaryDigits>
>>>
>>> <binaryDigit> ::= "0"
>>>                 | "1"
>>>
>>> This BNF spec forces "_" to only appear between two binary
>>> digits, and never more than a single _ in a row.
>> Andre Artus wrote:
>> Yup, that's the issue. Coding the actual behaviour by hand, or
>> doing it with a regular expression, is close to trivial.
>>> H. S. Teoh wrote:
>>> You can also make your parser only pick up <binaryDigit> when
>>> performing semantic on binary literals, so the other stuff is
>>> ignored and only serves to enforce syntax.
>> Andre Artus wrote:
>> Pushing it up to the parser is an option in implementation,
>> but I don't see that making the specification easier (it's
>> 3:40 in the morning here, so I am very likely not thinking too
>> clearly about this).
> H. S. Teoh wrote:
> I didn't mean to push it up to the parser.
Sorry, I misunderstood. I had been up for over 21 hours at the
time I wrote that, so it was getting a bit difficult for me to
concentrate. I got the impression you were saying that the parser
would be responsible for extracting the binary digits.
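To make concrete what I thought was being suggested, here is a
minimal Python sketch (not DMD's actual lexer code) in which the
value is computed from the binary digits alone and the underscores
are simply skipped. The function name binary_literal_value is
hypothetical, and the sketch assumes the lexeme has already been
accepted as a binary literal.

```python
def binary_literal_value(lexeme: str) -> int:
    """Compute the value of a binary literal, ignoring underscores.

    Assumption: the lexer has already accepted the lexeme, so it
    starts with "0b" and contains at least one binary digit. The
    checks below only guard against misuse of this helper.
    """
    if not lexeme.startswith("0b"):
        raise ValueError("not a binary literal: " + lexeme)
    # Drop the separators; only the digits contribute to the value.
    digits = lexeme[2:].replace("_", "")
    if not digits or any(c not in "01" for c in digits):
        raise ValueError("malformed binary literal: " + lexeme)
    return int(digits, 2)
```

With this, where the underscores sit makes no difference to the
value; only the syntax rule decides whether they are allowed.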
> H. S. Teoh wrote:
> I was just using BNF to show that it's possible to specify the
> behaviour precisely.
> And also that it's rather convoluted just for something as
> intuitively straightforward as an integer literal. Which is a
> likely reason why the current specs are a bit blurry about what
> should/shouldn't be allowed.
I don't think I've seen lexemes defined using (a variant of) BNF
before; most often some form of regular expression is used. One
could shorten and clarify the page describing the lexical syntax
significantly by employing simple regular expressions.
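As an illustration of how compact that can be, the strict rule
H. S. Teoh spelled out in BNF (a single underscore, only between
two digits) fits in one regular expression. This is a sketch in
Python for checking the idea, not wording proposed for the spec,
and it is not what DMD currently accepts. The names STRICT_BINARY
and is_strict_binary are hypothetical.

```python
import re

# Strict form: "0b", a run of digits, then zero or more groups of
# exactly one underscore followed by a run of digits.
STRICT_BINARY = re.compile(r"0b[01]+(?:_[01]+)*")


def is_strict_binary(lexeme: str) -> bool:
    """Return True if the whole lexeme matches the strict rule."""
    return STRICT_BINARY.fullmatch(lexeme) is not None
```

Under this rule 0b0011_1101 is accepted, while 0b1__, 0b_1_ and
0b__1 are all rejected.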
>>> H. S. Teoh wrote:
>>> I'd be surprised if there's any D code out there that doesn't
>>> fit this spec, to be honest.
>> Andre Artus wrote:
>> It's not what I would call best practice, but the following is
>> possible in the current compiler:
>>
>>     auto myBin1 = 0b0011_1101; // Sane
>>     auto myBin2 = 0b_______1; // Trouble, myBin2 == 1
>>     auto myBin3 = 0b____1___; // Trouble, myBin3 == 1
>>
>> Which means tools built against the documented spec are
>> going to choke on these weird cases. Personally I would prefer
>> if the more questionable options were not allowed as they
>> potentially defeat the goal of improving clarity. But, that's
>> a breaking change.
> H. S. Teoh wrote:
> I know that, but I'm saying that hardly *any* code would break
> if we made DMD reject things like this. I don't think anybody
> in their right mind would write code like that. (Unless they
> were competing in the IODCC... :-P)
I agree that the compiler should probably break that code. I
believe some breaking changes are good when they help the
programmer fix potential bugs. But I am also someone who compiles
with "Treat warnings as errors".
> H. S. Teoh wrote:
> The issue here is that when specs / DMD / TDPL don't agree,
> then it's not always clear which among the three are wrong.
> Perhaps *all* of them are wrong. Just because DMD accepts
> invalid code doesn't mean it should be part of the specs, for
> example. It could constitute a DMD bug.
It would be good to get some clarification on this.
>>> H. S. Teoh wrote:
>>> But if you want to accept "strange" literals like 0b__1__,
>>> you could do something like:
>>>
>>> <binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit>
>>>                     <underscoreBinaryDigits>
>>>
>>> <underscoreBinaryDigits> ::= "_"
>>>                            | "_" <underscoreBinaryDigits>
>>>                            | <binaryDigit>
>>>                            | <binaryDigit> <underscoreBinaryDigits>
>>>                            | ""
>>>
>>> <binaryDigit> ::= "0"
>>>                 | "1"
>>>
>>> The odd form of the rule for <binaryLiteral> is to ensure that
>>> there's at least one binary digit in the string, whereas
>>> <underscoreBinaryDigits> is just a wildcard anything-goes
>>> rule that takes any combination of 0, 1, and _, including the
>>> empty string.
>> Andre Artus wrote:
>> The rule that matches the DMD compiler is actually very easy
>> to do in ANTLR4, i.e.
>>
>> BinaryLiteral : '0b' [_01]* [01] [_01]* ;
>>
>> I'm a bit too tired to fully pay attention, but it seems you
>> are saying that "0b" (no additional numbers) should match,
>> which I believe it should not (although I admit to not testing
>> this). If it does then I would consider that a bug.
> H. S. Teoh wrote:
> No, the BNF rules I wrote are equivalent to your ANTLR4 spec.
> Which is equivalent to the regex I posted later.
I should have paid better attention, as I missed the
<binaryDigit> in <binaryLiteral>. To be honest I was having a
hard time focusing due to lack of sleep and a pervading stench of
paint fumes seeping in from the adjacent building.
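For what it's worth, transcribing the ANTLR4 rule into a Python
regular expression (a sketch for checking the grammar, not DMD's
lexer) confirms that a bare "0b" is rejected, because of the
mandatory [01] in the middle. The names PERMISSIVE_BINARY and
is_permissive_binary are hypothetical.

```python
import re

# Same shape as the ANTLR4 rule: '0b' [_01]* [01] [_01]*
# Any mix of digits and underscores, as long as there is at
# least one binary digit somewhere after the prefix.
PERMISSIVE_BINARY = re.compile(r"0b[_01]*[01][_01]*")


def is_permissive_binary(lexeme: str) -> bool:
    """Return True if the whole lexeme matches the permissive rule."""
    return PERMISSIVE_BINARY.fullmatch(lexeme) is not None
```

So 0b__1__ and 0b0011_1101 both match, while 0b and 0b___ do
not.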
>> Andre Artus wrote:
>> It's not a problem implementing the rule, I am more concerned
>> with documenting it in a clear and unambiguous way so that
>> people building tools from it can get it right. BNF isn't
>> always the easiest way to do so, but it's what's being used.
> H. S. Teoh wrote:
> Well, you could bug Walter about what *should* be accepted,
I'm not sure how to go about that.
> H. S. Teoh wrote:
> and if he agrees to restrict it to having _ only between two
> digits, then you'd file a bug against DMD.
Well if we could get a ruling on this then we could include
HexadecimalInteger in the ruling as it has similar behaviour in
DMD.
The current specification for DecimalInteger also allows a
trailing sequence of underscores. It also does not include the
sign as part of the token value.
Possible regex alternatives (note that I do not include the
sign, as per the current spec):
(0|[1-9]([_]*[0-9])*)
or arguably better
(0|[1-9]([_]?[0-9])*)
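To show how the two differ, here is a small Python sketch that
runs both expressions against whole lexemes. The helper name
accepts and the two pattern names are hypothetical; the patterns
themselves are copied unchanged from above.

```python
import re

# First alternative: any number of underscores between digits.
DECIMAL_ANY_UNDERSCORES = re.compile(r"(0|[1-9]([_]*[0-9])*)")
# Second alternative: at most one underscore between digits.
DECIMAL_SINGLE_UNDERSCORE = re.compile(r"(0|[1-9]([_]?[0-9])*)")


def accepts(pattern: "re.Pattern[str]", lexeme: str) -> bool:
    """Return True if the pattern matches the entire lexeme."""
    return pattern.fullmatch(lexeme) is not None
```

Both accept 0 and 1_000 and both reject a trailing underscore as
in 1000_. Only the first accepts 1__000.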
> H. S. Teoh wrote:
> Again, I seriously doubt that such a change would cause any
> code breakage, because writing 0b1 as 0b____1____ is just so
> ridiculous that any such code *should* be broken.
Agreed.