Dscanner - It exists

Marco Leise Marco.Leise at gmx.de
Wed Aug 1 13:34:14 PDT 2012


On Wed, 01 Aug 2012 19:58:46 +0200,
"Brian Schott" <briancschott at gmail.com> wrote:

> On Wednesday, 1 August 2012 at 17:36:16 UTC, Walter Bright wrote:
> >
> > I suggest proposing the D lexer as an addition to Phobos. But 
> > if that is done, its interface would need to accept a range as 
> > input, and its output should be a range of tokens.
> 
> It used to be range-based, but the performance was terrible. The 
> inability to use slicing on a forward-range of characters and the 
> gigantic block on KCachegrind labeled "std.utf.decode" were the 
> reasons that I chose this approach. I wish I had saved the 
> measurements on this....

I can understand that. I was reading a dictionary file with readText().splitLines() and wondered why Unicode decoding was being performed. Unfortunately, ranges over strings work on decoded Unicode code points, while most structured text files are delimited by ASCII characters. These file formats are probably just old or were designed with some false sense of compatibility in mind, but it was also clear to their inventors that parsing is easier and faster when single-byte characters delimit the tokens.
But we have talked about UTF-8 vs. ASCII and foreach vs. ranges before. I still hope for some super-smart solution that doesn't need a book of documentation and allows some kind of ASCII-equivalent range. I've heard that foreach over a UTF-8 string with a dchar loop variable implicitly decodes the string. While this is useful, it is not self-explanatory and takes some reading up on the topic.
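
To make the difference concrete, here is a minimal sketch of what I mean. The function names (countCodePoints, countCodeUnits, splitOnNewline) are made up for illustration and are not part of Phobos or Dscanner; the only library call is std.string.representation, which views a string as raw bytes. It assumes the source file is saved as UTF-8.

```d
import std.stdio : writeln;
import std.string : representation;

// Hypothetical helper: foreach with a dchar loop variable decodes
// the UTF-8 string implicitly, so this counts code points.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

// Hypothetical helper: with a char loop variable nothing is decoded,
// so this counts UTF-8 code units (bytes).
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (char c; s)
        ++n;
    return n;
}

// Hypothetical helper: split on the ASCII newline by scanning raw bytes.
// This is safe for UTF-8 because the byte 0x0A never occurs inside a
// multi-byte sequence. The results are slices of the input, not copies.
string[] splitOnNewline(string text)
{
    string[] lines;
    size_t start = 0;
    foreach (i, ubyte b; text.representation)
    {
        if (b == '\n')
        {
            lines ~= text[start .. i];
            start = i + 1;
        }
    }
    if (start < text.length)
        lines ~= text[start .. $];
    return lines;
}

void main()
{
    string s = "größe\nmaß";
    writeln(countCodePoints(s)); // decoded view
    writeln(countCodeUnits(s));  // raw byte view
    writeln(splitOnNewline(s));
}
```

The byte-wise split is roughly the "ASCII-equivalent" view I was hoping for: the delimiter search never touches std.utf.decode, and the non-ASCII text between delimiters is passed through untouched.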

-- 
Marco



More information about the Digitalmars-d-announce mailing list