D port of dmd: Lexer, Parser, AND CodeGenerator fully operational

Thu Mar 8 10:03:12 PST 2012

On 08.03.2012 11:48, Jonathan M Davis wrote:
> On Thursday, March 08, 2012 08:21:17 Zach the Mystic wrote:
>> On Thursday, 8 March 2012 at 04:56:07 UTC, Jonathan M Davis wrote:
>>> If you took it from ddmd, then it's definitely going to have to
>>> be GPL.
>>>
>>> Now, there is interest in having a D parser and lexer in
>>> Phobos. I don't know
>>> if your version will fit the bill (e.g. it must have a
>>> range-based API), but we
>>> need one at some point. The original idea was to more or less
>>> directly port
>>> dmd's lexer and parser with some adjustments to the API as
>>> necessary
>>> (primarily to make it range-based). But no one has had the time
>>> to complete
>>> such a project yet (I originally volunteered to do it, but I
>>> just haven't had
>>> the time).
>>>
>>> When that project was proposed, Walter agreed to let that port
>>> be Boost rather
>>> than GPL (since he holds the copyright and the port would be
>>> going in Phobos,
>>> which uses boost).
>>>
>>> The problem with what you have (even if the API and
>>> implementation were
>>> perfect) is that it comes from ddmd, which had other
>>> contributors working on
>>> it. So, you would have to get permission from not only Walter
>>> but all of the
>>> relevant ddmd contributors. If you were able to do _that_, and it
>>> could get
>>> past the review process, then what you've done could be put
>>> into Phobos. But
>>> that requires that you take the time and effort to take care of
>>> getting the
>>> appropriate permissions, making sure that the API and
>>> implementation are
>>> acceptable for Phobos, and putting it through the Phobos review
>>> process. It
>>> would be great if you could do that though.
>>>
>>> - Jonathan M Davis
>>
>> This is great news. I was really worried that the license was
>> etched in stone. I'll need help finding out who owns the code,
>> plus legal advice if the process is more than just getting a
>> simple confirmation email from each of the original authors.
>>
>> I have some comments I feel are very interesting regarding the
>> lexer and pointers. There are no pointers in any of the code
>> besides the lexer, so I think that will be very satisfying to
>> you. Now I don't know everything about ranges, but if you simply
>> mean dynamic arrays, then yes, everything except the lexer uses
>> arrays when necessary, although there's simply a lot of code
>> which doesn't need them because most of the objects are really
>> just lists of members, many of which are not arrays.
>>
>> About the lexer, one thing I realized about the Wild-West pointer
>> style as I was porting it is that it must be blazing fast. To my
>> understanding, to call p.popFront() requires two operations, ++p;
>> followed by --p.length; plus possibly array bounds checking, I
>> don't know.
>>
>> ++p is all that the current lexer needs. It used to only check
>> for EOF at each junction, but since I'm parsing little chunks of
>> code instead of whole files now, it checks "if ( p>= endBuf )"
>> at the beginning of each token scan, which gets pretty close to
>> not going out of bounds, since most tokens aren't very long. That
>> lexer is a tribute to very fast programming of an old school
>> which will go away if it changes. Still, I can sense a tidal wave
>> of RANGES coming, and I fear I'll just have to bid the little
>> thing goodbye! :-(
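
To make the cost comparison above concrete, here is a minimal sketch, not code from dmd or ddmd. The names scanIdentPtr, scanIdentSlice and isIdentChar are hypothetical, and unlike the lexer described above this version tests the end pointer on every character rather than once per token. The pointer loop advances with a single ++p; the slice loop has to move the start and shrink the length on each step.

```d
import std.ascii : isAlphaNum;

// Hypothetical predicate: ASCII identifier characters only.
bool isIdentChar(char c)
{
    return c == '_' || isAlphaNum(c);
}

// Pointer style: one increment per character, end checked against endBuf.
size_t scanIdentPtr(const(char)* p, const(char)* endBuf)
{
    auto start = p;
    while (p < endBuf && isIdentChar(*p))
        ++p;
    return cast(size_t)(p - start);
}

// Slice style: each step adjusts both the start and the length.
// Slicing is used directly so that no UTF decoding takes place.
size_t scanIdentSlice(const(char)[] s)
{
    size_t n;
    while (s.length != 0 && isIdentChar(s[0]))
    {
        s = s[1 .. $];
        ++n;
    }
    return n;
}

void main()
{
    string src = "foo_1 = 2;";
    assert(scanIdentPtr(src.ptr, src.ptr + src.length) == 5);
    assert(scanIdentSlice(src) == 5);
}
```

Whether the difference is measurable after optimization is something to benchmark rather than assume.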
>
> A range is not necessarily a dynamic array, though a dynamic array is a range.
> The lexer is going to need to take a range of dchar (which may or may not be
> an array), and it's probably going to need to return a range of tokens. The
> parser would then take a range of tokens and then output the AST in some form
> or other - it probably couldn't be a range, but I'm not sure. And while the
> lexer would need to operate on generic ranges of dchar, it would probably have
> to be special-cased for strings in a number of places in order to make it
> faster (e.g. checking the first char in a string rather than using front when
> it's known that the value being checked against is an ASCII character and will
> therefore fit in a single char - front has to decode the next character, which
> is less efficient).

Simply put, the decision on decoding should belong to the lexer. Thus
strings should be wrapped as input ranges of char, wchar and dchar
respectively.
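
A minimal sketch of that idea, under some assumptions: it uses std.utf.byCodeUnit from current Phobos (which may not be available in older releases) to present a string as a range of char, and skipBlanks is a hypothetical helper, not part of any existing lexer. Calling front on a plain string decodes to dchar, while the wrapped range hands the lexer raw code units, so ASCII checks need no decoding.

```d
import std.range.primitives : empty, front, popFront;
import std.utf : byCodeUnit;

// Hypothetical helper: skips leading spaces and tabs on any input range
// whose elements compare against ASCII characters; returns the count.
size_t skipBlanks(R)(ref R input)
{
    size_t n;
    while (!input.empty && (input.front == ' ' || input.front == '\t'))
    {
        input.popFront();
        ++n;
    }
    return n;
}

void main()
{
    string src = "  \tint x;";

    // front on a narrow string decodes a whole code point (4-byte dchar).
    static assert(typeof(src.front).sizeof == 4);

    // Wrapped as a range of code units, front is a single 1-byte char.
    auto units = src.byCodeUnit;
    static assert(typeof(units.front).sizeof == 1);

    assert(skipBlanks(units) == 3);
    assert(units.front == 'i');
}
```

The lexer can then decode explicitly only where it has to, for example inside string literals or non-ASCII identifiers.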

>
> So, if you're not familiar with ranges, you probably have a fair bit of
> learning ahead of you, and you're probably going to have to make a number of
> changes to your lexer and parser (though the majority of it will probably be
> able to stay intact). Unfortunately, a proper article and tutorial on them is
> currently lacking in spite of the fact that Phobos uses them heavily.
> Fortunately however, in a book that Ali Çehreli is writing on D, he has a
> chapter on ranges that should help get you started:
>
> http://ddili.org/ders/d.en/ranges.html
>
> But I'd suggest that you play around with ranges a fair bit (especially with
> strings) before trying to change what you have to use them. std.algorithm in
> particular makes heavy use of ranges. And it wouldn't surprise me at all if
> some portions of your lexer and parser really should be using some of Phobos'
> functions but aren't currently, because the code is originally a port from C++. You
> should also make sure that you understand the basics of Unicode fairly well -
> especially how they pertain to char, wchar, and dchar - since that will
> affect your ability to correctly translate code to use ranges as well as
> properly optimize them.
>
> It would probably help if other D developers who are more familiar with ranges
> took a look at what you have and maybe even helped you start adjusting your
> code, but I don't know how many will both have the time and be interested. If
> I have time, I'll probably start poking at it, but I don't know that I'll have
> time any time soon, much as I'd like to.
>
> Regardless, you need to familiarize yourself with ranges if you want to get
> the lexer and parser ready for inclusion in Phobos. And you really should
> familiarize yourself with them anyway, since they're heavily used in D code in
> general. Not being able to use ranges in D would be like not being able to use
> iterators in C++. You can program in it, but you'd be fairly crippled -
> particularly when dealing with the standard library.
>
> - Jonathan M Davis

-- 
Dmitry Olshansky