[dox] Fixing the lexical rule for BinaryInteger

Sat Aug 17 09:40:41 PDT 2013

On Sat, Aug 17, 2013 at 04:02:40AM +0200, Andre Artus wrote:
> On Saturday, 17 August 2013 at 00:51:57 UTC, H. S. Teoh wrote:
> >On Sat, Aug 17, 2013 at 01:03:35AM +0200, Brian Schott wrote:
> >>On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:
> >[...]
> >>>2. Your BinaryInteger and HexadecimalInteger only allow for one of
> >>>the following (reduced) cases:
> >>>
> >>>0b1__ : works
> >>>0b_1_ : fails
> >>>0b__1 : fails
> >>
> >>It's my opinion that the compiler should reject all of these because
> >>I think of the underscore as a separator between digits, but I'm
> >>constantly fighting the "spec, dmd, and idiom all disagree"
> >>issue.
> >[...]
> >
> >I remember reading this part of the spec on dlang.org, and I wonder
> >if it was worded the way it is just for simplicity, because to
> >specify something like "_ must appear between digits" involves some
> >complicated BNF rules, which maybe seems like overkill for a single
> >literal.
> >
> >But sometimes it is good to be precise, if we want to enforce
> >"proper" conventions for underscores:
> >
> ><binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>
> >
> ><binaryDigits> ::= <binaryDigit> <binaryDigits>
> >		| <binaryDigit>
> >
> ><underscoreBinaryDigits> ::= ""
> >		| "_" <binaryDigits>
> >		| "_" <binaryDigits> <underscoreBinaryDigits>
> >
> ><binaryDigit> ::= "0"
> >		| "1"
> >
> >This BNF spec forces "_" to only appear between two binary digits,
> >and never more than a single _ in a row.
> 
> Yup, that's the issue. Coding the actual behaviour by hand, or doing
> it with a regular expression, is close to trivial.
> 
> >You can also make your parser only pick up <binaryDigit> when
> >performing semantic analysis on binary literals, so the other stuff
> >is ignored and only serves to enforce syntax.
> 
> Pushing it up to the parser is an option in implementation, but I
> don't see that making the specification easier (it's 3:40 in the
> morning here, so I am very likely not thinking too clearly about
> this).

I didn't mean to push it up to the parser. I was just using BNF to show
that it's possible to specify the behaviour precisely. And also that
it's rather convoluted just for something as intuitively straightforward
as an integer literal. Which is a likely reason why the current specs
are a bit blurry about what should/shouldn't be allowed.
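
To make the strict rule (the first BNF quoted above) concrete, here is a
minimal sketch in Python rather than D, assuming the rule is exactly that
BNF. The names STRICT, is_strict_binary_literal, and binary_value are
made up for illustration. binary_value mirrors the idea of looking only
at the digits when computing the value, and it assumes its input has
already been validated.

```python
import re

# Strict rule: "0b", one or more binary digits, then zero or more groups
# of exactly one "_" followed by one or more binary digits.
STRICT = re.compile(r"0b[01]+(?:_[01]+)*")


def is_strict_binary_literal(text):
    """Return True if text is a literal with "_" only between two digits."""
    return STRICT.fullmatch(text) is not None


def binary_value(text):
    """Compute the value from the digits alone; underscores are dropped.

    Assumes text was already accepted by some lexical rule.
    """
    return int(text[2:].replace("_", ""), 2)


if __name__ == "__main__":
    for literal in ["0b0011_1101", "0b1__", "0b_1_", "0b__1"]:
        print(literal, is_strict_binary_literal(literal))
```

Under this rule only the first of the four sample literals is accepted;
the other three are rejected because an underscore is leading, trailing,
or doubled.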

> >I'd be surprised if there's any D code out there that doesn't fit
> >this spec, to be honest.
> 
> It's not what I would call best practice, but the following is
> possible in the current compiler:
> 
> >	auto myBin1 = 0b0011_1101; // Sane
> >	auto myBin2 = 0b_______1; // Trouble, myBin2 == 1
> >	auto myBin3 = 0b____1___; // Trouble, myBin3 == 1
> 
> Which means tools built against the documented spec are going to
> choke on these weird cases. Personally I would prefer if the more
> questionable options were not allowed as they potentially defeat the
> goal of improving clarity. But, that's a breaking change.

I know that, but I'm saying that hardly *any* code would break if we
made DMD reject things like this. I don't think anybody in their right
mind would write code like that. (Unless they were competing in the
IODCC... :-P)

The issue here is that when specs / DMD / TDPL don't agree, then it's
not always clear which of the three is wrong. Perhaps *all* of them
are wrong. Just because DMD accepts invalid code doesn't mean it should
be part of the specs, for example. It could constitute a DMD bug.

> >But if you want to accept "strange" literals like 0b__1__, you could
> >do something like:
> >
> ><binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit> <underscoreBinaryDigits>
> >
> ><underscoreBinaryDigits> ::= "_"
> >		| "_" <underscoreBinaryDigits>
> >		| <binaryDigit>
> >		| <binaryDigit> <underscoreBinaryDigits>
> >		| ""
> >
> ><binaryDigit> ::= "0"
> >		| "1"
> >
> >The odd form of the rule for <binaryLiteral> is to ensure that
> >there's at least one binary digit in the string, whereas
> ><underscoreBinaryDigits> is just a wildcard anything-goes rule that
> >takes any combination of 0, 1, and _, including the empty string.
> >
> 
> The rule that matches the DMD compiler is actually very easy to do
> in ANTLR4, i.e.
> 
> BinaryLiteral   : '0b' [_01]* [01] [_01]* ;
> 
> 
> I'm a bit too tired to fully pay attention, but it seems you are
> saying that "0b" (no additional numbers) should match, which I believe
> it should not (although I admit to not testing this). If it does then
> I would consider that a bug.

No, the BNF rules I wrote are equivalent to your ANTLR4 spec. Which is
equivalent to the regex I posted later.
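
As a sanity check on that equivalence, here is a rough Python sketch (not
D, and not DMD's actual lexer) that transcribes the ANTLR4 rule as a
regex and the permissive BNF as a hand-written matcher, then compares
them on every short string over the characters _, 0, 1, and 2. All
function names are hypothetical. An exhaustive check up to a small length
is only evidence, not a proof.

```python
import re
from itertools import product

# The ANTLR4 rule  '0b' [_01]* [01] [_01]*  written as a Python regex.
PERMISSIVE = re.compile(r"0b[_01]*[01][_01]*")


def matches_regex(text):
    return PERMISSIVE.fullmatch(text) is not None


def matches_bnf(text):
    """Hand transcription of the permissive BNF.

    <binaryLiteral> is "0b", a wildcard run, one <binaryDigit>, and another
    wildcard run, where a wildcard run is any mix of "_", "0", "1"
    (possibly empty).
    """
    if not text.startswith("0b"):
        return False
    body = text[2:]

    def wildcard(part):
        return all(ch in "_01" for ch in part)

    # Try every position as the one mandatory <binaryDigit>.
    return any(
        body[i] in "01" and wildcard(body[:i]) and wildcard(body[i + 1:])
        for i in range(len(body))
    )


def rules_agree(max_len=5):
    """Compare both matchers on all bodies up to max_len characters."""
    for n in range(max_len + 1):
        for chars in product("_012", repeat=n):
            text = "0b" + "".join(chars)
            if matches_regex(text) != matches_bnf(text):
                return False
    return True
```

Both matchers reject a bare "0b", since each requires at least one binary
digit after the prefix.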

> It's not a problem implementing the rule, I am more concerned with
> documenting it in a clear and unambiguous way so that people
> building tools from it can get it right. BNF isn't always the
> easiest way to do so, but it's what's being used.

Well, you could bug Walter about what *should* be accepted, and if he
agrees to restrict it to having _ only between two digits, then you'd
file a bug against DMD. Again, I seriously doubt that such a change
would cause any code breakage, because writing 0b1 as 0b____1____ is
just so ridiculous that any such code *should* be broken.

T

-- 
Prosperity breeds contempt, and poverty breeds consent. -- Suck.com