struct vs class for a simple token in my d lexer

Roman D. Boiko rb at d-coding.com
Mon May 14 09:39:37 PDT 2012


On Monday, 14 May 2012 at 15:53:34 UTC, Tobias Pankrath wrote:
> Quoting your post in another thread:
>
> On Monday, 14 May 2012 at 15:10:25 UTC, Roman D. Boiko wrote:
>> Making it a class would give several benefits:
>
>> * avoid worrying about allocating a big array of tokens. 
>> E.g., on a 64-bit OS the largest module in Phobos (IIRC, 
>> std.datetime) consumes 13.5MB in an array of almost 500K 
>> tokens. It would require a 4 times smaller chunk of contiguous 
>> memory if it were an array of class references, because each 
>> would consume only 8 bytes instead of 32.
>
> You'll still have to count the space the tokens claim on the 
> heap. So it's basically the 500k tokens plus 500k references. 
> I'm not sure why you would need such a big array of tokens, 
> though.
>
> Aren't they produced by the lexer to be directly consumed and 
> discarded by the parser?
I use a sorted array of tokens for efficient O(log N) lookup by 
index (the index of a token's first code unit). (Since tokens are 
created in increasing order of start indices, no further sorting 
is needed.) Lookup is used for two purposes:
* find the token corresponding to the cursor location (e.g., for 
auto-complete)
* combined with an O(log M) lookup in the ordered array of each 
line's first code-unit index, calculate the Location (line number 
and column number) for the start / end of a token on demand (they 
are not pre-calculated because they are not used frequently); 
this approach also makes it easy to calculate the Location either 
taking into account special token sequences (#line 3 "ab/c.d") or 
ignoring them.
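
The cursor lookup described above can be sketched as a plain 
binary search over the sorted start indices. This is a minimal 
illustration; the names (Token, tokenIndexAt) and fields are 
mine, not from the original post:

```d
// Hedged sketch: tokens are stored in increasing order of startIndex,
// so the token covering a given code unit is found with an O(log N)
// binary search for the last startIndex <= index.
struct Token
{
    size_t startIndex; // index of the token's first code unit
    string text;
}

size_t tokenIndexAt(const(Token)[] tokens, size_t index)
{
    size_t lo = 0, hi = tokens.length;
    while (hi - lo > 1)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (tokens[mid].startIndex <= index)
            lo = mid; // candidate: starts at or before the cursor
        else
            hi = mid;
    }
    return lo; // last token whose startIndex <= index
}

unittest
{
    auto toks = [Token(0, "int"), Token(4, "x"),
                 Token(6, "="), Token(8, "1")];
    assert(tokenIndexAt(toks, 5) == 1); // cursor inside "x"
    assert(tokenIndexAt(toks, 9) == 3); // cursor in the last token
}
```

The same search over the array of line-start indices yields the 
line number; the column then falls out as the difference of the 
two indices.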

>> * allow subclassing, for example, for storing strongly typed 
>> literal values; this flexibility could also facilitate future 
>> extensibility (but it's difficult to predict which kind of 
>> extension may be needed)
>
> If performance matters, why would you subclass and risk a 
> virtual method call for something as basic as tokens?
I agree, but I'm not sure. That's why I created this thread.

>> * there would be no need to copy data from tokens into AST, 
>> passing an object would be enough (again, copy 8 instead of 32 
>> bytes); the same applies to passing into methods - no need to 
>> pass by ref to minimise overhead
>
> I'm using string to store source content in tokens. Because of 
> the way string in D works, there is no need for data copies.
So do I. But the string field alone is still 16 bytes (half of 
my token size).
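
For context on those 16 bytes: on a 64-bit target a D string is a 
(length, pointer) slice, and slicing produces a view into the 
same buffer rather than a copy. A minimal illustration:

```d
void main()
{
    // On 64-bit, string is a slice: 8-byte length + 8-byte pointer.
    static assert(size_t.sizeof != 8 || string.sizeof == 16);

    string source = "int x = 1;";
    string tok = source[0 .. 3]; // "int"

    // The slice is a view into source; no character data was copied.
    assert(tok == "int");
    assert(tok.ptr == source.ptr);
}
```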

>> These considerations are mostly about performance. I think 
>> there is also some impact on design, but couldn't find 
>> anything significant (given that currently I see a token as 
>> merely a datastructure without associated behavior).
>
> IMO tokens are value types.
A value type might be implemented as either a struct or a class.
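
To make the struct-vs-class trade-off concrete, here is a 
hypothetical 32-byte token layout; the field set is my guess, not 
the one from the post:

```d
// Hypothetical field set; the real token layout may differ.
// As a struct, the token is a 32-byte value on 64-bit targets:
struct TokenS
{
    string text;       // 16 bytes (length + pointer)
    size_t startIndex; //  8 bytes
    ushort kind;       //  2 bytes
    ushort flags;      //  2 bytes, then 4 bytes of padding
}

static assert(size_t.sizeof != 8 || TokenS.sizeof == 32);

// As a class, each array slot shrinks to an 8-byte reference, but
// every instance also carries per-object heap overhead (vtable
// pointer and monitor field) on top of the payload.
class TokenC
{
    string text;
    size_t startIndex;
    ushort kind;
    ushort flags;
}
```

So the class variant shrinks the contiguous array by 4x, as noted 
earlier in the thread, but moves the payload plus per-object 
overhead onto the heap.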


More information about the Digitalmars-d-learn mailing list