VLERange: a range in between BidirectionalRange and RandomAccessRange

Mon Jan 17 10:37:37 PST 2011

On 01/17/2011 04:00 PM, Andrei Alexandrescu wrote:
> On 1/17/11 6:44 AM, Steven Schveighoffer wrote:
>> We need to get some real numbers together. I'll see what I can create
>> for a type, but someone else needs to supply the input :) I'm on short
>> supply of unicode data, and any attempts I've made to create some result
>> in failure. I have one example of one composed character in this thread
>> that I can cling to, but in order to supply some real numbers, we need a
>> large amount of data.
>
> Oh, one more thing. You don't need a lot of Unicode text containing
> combining characters to write benchmarks. (You do need it for testing
> purposes.) Most text won't contain combining characters anyway, so after
> you implement graphemes, just benchmark them on regular text.

Correct. For this reason, correctness testing and performance testing
should not use the same source text at all.
It is impossible to define a typical or representative source text (who
would judge?). But at the very minimum, source texts for performance
measurement should mix languages as diverse as possible, including some
material from those known to be problematic and/or atypical (English,
Korean, Hebrew...).
The following file (extracted and assembled from ICU data sets) is just that:
https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/data/unicode.txt

Content:
12 natural languages
34767 bytes = UTF-8 code units
--> 20133 code points
--> 22033 normal codes (code points after NFD decomposition)
--> 19205 piles = true characters (graphemes)
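
For anyone who wants to reproduce this kind of count on their own text,
here is a rough sketch in Python (not the code that produced the numbers
above). The function name text_stats and its helpers are made up for
illustration. The pile count is only an approximation, assuming that a
pile continues over any mark (general category M*) and over the Hangul
vowel and trailing-consonant jamo that NFD produces; it is not a full
UAX #29 grapheme segmenter, so results may differ slightly from the
figures above.

```python
import unicodedata


def _is_hangul_v_or_t(ch):
    # Assumption: only the conjoining jamo that NFD yields for precomposed
    # syllables are handled: vowels U+1161..U+1175, trailing U+11A8..U+11C2.
    cp = ord(ch)
    return 0x1161 <= cp <= 0x1175 or 0x11A8 <= cp <= 0x11C2


def _extends_pile(ch):
    # A code point extends the current pile if it is a mark (Mn, Mc, Me)
    # or a Hangul vowel / trailing-consonant jamo.
    return unicodedata.category(ch).startswith("M") or _is_hangul_v_or_t(ch)


def text_stats(text):
    """Return the four counts listed above for a Python str."""
    nfd = unicodedata.normalize("NFD", text)
    # A new pile starts at the first code point and at every code point
    # that does not extend the previous one.
    piles = sum(1 for i, ch in enumerate(nfd) if i == 0 or not _extends_pile(ch))
    return {
        "code_units": len(text.encode("utf-8")),  # UTF-8 bytes
        "code_points": len(text),                 # as stored
        "nfd_codes": len(nfd),                    # after NFD decomposition
        "piles": piles,                           # approximate graphemes
    }


if __name__ == "__main__":
    # Precomposed e-acute, one Hangul syllable, plain ASCII 'a'.
    print(text_stats("\u00e9\ud55ca"))
```

To run it on a whole file, read the file with encoding="utf-8" and pass
the resulting string to text_stats.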

Denis
_________________
vita es estrany
spir.wikidot.com