Let's stop parser Hell

Wed Aug 1 02:21:27 PDT 2012

On Wednesday, August 01, 2012 11:14:52 Jacob Carlborg wrote:
> On 2012-08-01 08:11, Jonathan M Davis wrote:
> > I'm not using regexes at all. It's using string mixins to reduce code
> > duplication, but it's effectively hand-written. If I do it right, it
> > should be _very_ difficult to make it any faster than it's going to be.
> > It even specifically avoids decoding Unicode characters and operates on
> > ASCII characters as much as possible.
> 
> That's a good idea. Most code can be treated as ASCII (I assume most
> people code in English). It would basically only be string literals
> containing characters outside the ASCII table.

What's of particular importance is the fact that _all_ of the language 
constructs are ASCII. So, Unicode comes in exclusively with identifiers, string 
literals, char literals, and whitespace. And with those, ASCII is still going 
to be far more common, so coding it in a way that makes ASCII faster at a slight 
cost to performance for Unicode is perfectly acceptable.
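
To make the ASCII-first idea concrete, here is a rough sketch of scanning an
identifier with a fast path for ASCII bytes and a decoding slow path for
everything else. It is written in C++ purely for illustration and is not the
lexer under discussion. All of the names (identifierLength, decodeUtf8,
isIdentifierCodePoint) are hypothetical, the decoder is simplified (it does
not reject overlong encodings or surrogates), and the non-ASCII predicate is
a placeholder for whatever table the language actually specifies.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Slow path: decode the multi-byte UTF-8 sequence starting at src[i].
// Returns its length in bytes and stores the code point in cp, or returns 0
// if the sequence is malformed or truncated. Simplified for illustration.
inline std::size_t decodeUtf8(std::string_view src, std::size_t i,
                              std::uint32_t& cp) {
    const auto b0 = static_cast<unsigned char>(src[i]);
    std::size_t len = 0;
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else return 0;  // stray continuation byte or invalid lead byte
    if (i + len > src.size()) return 0;
    for (std::size_t k = 1; k < len; ++k) {
        const auto b = static_cast<unsigned char>(src[i + k]);
        if ((b & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (b & 0x3F);
    }
    return len;
}

// ASSUMPTION: placeholder predicate. A real lexer would consult the
// language's own table of code points allowed in identifiers.
inline bool isIdentifierCodePoint(std::uint32_t cp) { return cp >= 0xA0; }

// ASCII identifier characters; digits are not allowed in first position.
inline bool isAsciiIdentChar(unsigned char c, bool first) {
    if (c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
        return true;
    return !first && c >= '0' && c <= '9';
}

// Returns the length in bytes of the identifier at the start of src,
// or 0 if src does not start with one.
inline std::size_t identifierLength(std::string_view src) {
    std::size_t i = 0;
    while (i < src.size()) {
        const auto c = static_cast<unsigned char>(src[i]);
        if (c < 0x80) {
            // Fast path: a single byte, a few comparisons, no decoding.
            if (!isAsciiIdentChar(c, i == 0)) break;
            ++i;
            continue;
        }
        // Slow path: only reached for non-ASCII input.
        std::uint32_t cp = 0;
        const std::size_t len = decodeUtf8(src, i, cp);
        if (len == 0 || !isIdentifierCodePoint(cp)) break;
        i += len;
    }
    return i;
}
```

For purely ASCII source the loop never leaves the first branch, so the cost
of decoding is only paid by the code that actually uses non-ASCII
characters.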

> BTW, have you seen this:
> 
> http://woboq.com/blog/utf-8-processing-using-simd.html

No, I'll have to take a look. I know pretty much nothing about SIMD though. 
I've only heard of it, because Walter implemented some SIMD stuff in dmd not 
too long ago.

- Jonathan M Davis