Use SIMD to accelerate comment lexing

Thu Jun 4 14:44:46 PDT 2015

On Thursday, 4 June 2015 at 18:39:02 UTC, Walter Bright wrote:
> On 6/3/2015 7:05 PM, deadalnix wrote:
>> On Wednesday, 3 June 2015 at 22:50:52 UTC, Walter Bright wrote:
>>> On 6/2/2015 5:45 PM, deadalnix wrote:
>>>> You go though character and look for a '/'. When you hit 
>>>> one, you check if the
>>>> character before it is a *, and if so, you have the end of 
>>>> the comment. There is
>>>> obviously various edges cases to take into account, but that 
>>>> is the general
>>>> idea.
>>> Line numbers have to be kept track of as well.
>>
>> They retrieve line number lazily when needed, with various 
>> mechanism to speedup
>> the lookup.
>
> Hmm. There's no way to get the line number without counting 
> LFs, and that means searching for them.

Yes, the first time you query a line number, clang builds metadata 
about newlines by going through the file's content and finding the 
position of each newline. The process uses vector operations as well.
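
To make the idea concrete, here is a minimal C++ sketch of such a lazily built table. It is not clang's actual code: the class name LazyLineTable and its members are made up for illustration, and it leans on std::memchr, which many C libraries implement with SIMD, instead of hand-written vector code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not clang's class): records where each line starts,
// but only the first time somebody actually asks for a line number.
class LazyLineTable {
public:
    explicit LazyLineTable(std::string text) : text_(std::move(text)) {}

    // Returns the 1-based line number of the byte at `offset`.
    std::size_t lineOf(std::uint32_t offset) {
        assert(offset <= text_.size());
        if (lineStarts_.empty()) build();  // pay for the scan on first use only
        // The number of line starts <= offset is the 1-based line number.
        auto it = std::upper_bound(lineStarts_.begin(), lineStarts_.end(), offset);
        return static_cast<std::size_t>(it - lineStarts_.begin());
    }

    // True once the newline scan has been performed.
    bool built() const { return !lineStarts_.empty(); }

private:
    void build() {
        lineStarts_.push_back(0);  // line 1 starts at offset 0
        const char* begin = text_.data();
        const char* end = begin + text_.size();
        const char* p = begin;
        // One pass over the whole buffer, looking only for '\n'.
        while ((p = static_cast<const char*>(
                    std::memchr(p, '\n', static_cast<std::size_t>(end - p))))) {
            ++p;  // the next line starts right after the newline
            lineStarts_.push_back(static_cast<std::uint32_t>(p - begin));
        }
    }

    std::string text_;
    std::vector<std::uint32_t> lineStarts_;  // byte offset of each line start
};
```

The lexer itself never touches this table. If nobody calls lineOf, the scan never happens.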

Apparently, they think it is better to do it that way for various 
reasons:
  - Position tracking is more compact (and a position is embedded in 
every expression, declaration, and more), which reduces the memory 
footprint by quite a lot.
  - It makes the lexer simpler and faster.
  - You don't need to track newlines if you don't use them. If 
you don't emit debug info in C++ and have no errors, most line 
numbers are never used (not sure about D, because various language 
facilities like bounds checking use line numbers, but that is a 
win in C++).
  - Debug info emission has a predictable access pattern, and 
the algorithms that find a line number from an offset in the file 
are special-cased to handle it.
  - Finding newlines can be vectorized over the whole file. It cannot 
be vectorized when done in parallel with lexing.
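
As an illustration of the special-casing point, here is a hedged C++ sketch of an offset-to-line lookup that first checks the line found by the previous query and the line right after it, and only then falls back to a binary search. This is an assumption about what such a fast path could look like, not clang's implementation; CachedLineLookup and fastHits are hypothetical names, and the table of line starts is taken as already built.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: positions are plain 32-bit offsets, and the line
// number is recovered on demand from a sorted table of line-start offsets.
class CachedLineLookup {
public:
    explicit CachedLineLookup(std::vector<std::uint32_t> lineStarts)
        : starts_(std::move(lineStarts)) {
        assert(!starts_.empty() && starts_[0] == 0);
    }

    // Returns the 1-based line number containing `offset`.
    std::size_t lineOf(std::uint32_t offset) {
        // Fast path: mostly-sequential queries land on the line found last
        // time or on the one immediately after it.
        for (std::size_t i = last_; i < starts_.size() && i <= last_ + 1; ++i) {
            if (contains(i, offset)) {
                ++fastHits_;
                last_ = i;
                return i + 1;
            }
        }
        // Slow path: binary search over all line starts.
        auto it = std::upper_bound(starts_.begin(), starts_.end(), offset);
        last_ = static_cast<std::size_t>(it - starts_.begin()) - 1;
        return last_ + 1;
    }

    // Number of queries answered without a binary search.
    std::size_t fastHits() const { return fastHits_; }

private:
    // True if `offset` lies inside the 0-based line index `line`.
    bool contains(std::size_t line, std::uint32_t offset) const {
        bool afterStart = starts_[line] <= offset;
        bool beforeNext = line + 1 == starts_.size() || offset < starts_[line + 1];
        return afterStart && beforeNext;
    }

    std::vector<std::uint32_t> starts_;  // sorted byte offsets of line starts
    std::size_t last_ = 0;               // 0-based line index of the last answer
    std::size_t fastHits_ = 0;
};
```

Storing one 32-bit offset per AST node instead of a file/line/column triple is where the memory saving mentioned in the first point would come from.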

Once again, I'm not sure this is a win in D, because we need line 
numbers more than in C++, but it seems to be a win in C++.