VLERange: a range in between BidirectionalRange and RandomAccessRange
spir
denis.spir at gmail.com
Mon Jan 17 10:37:37 PST 2011
On 01/17/2011 04:00 PM, Andrei Alexandrescu wrote:
> On 1/17/11 6:44 AM, Steven Schveighoffer wrote:
>> We need to get some real numbers together. I'll see what I can create
>> for a type, but someone else needs to supply the input :) I'm on short
>> supply of unicode data, and any attempts I've made to create some result
>> in failure. I have one example of one composed character in this thread
>> that I can cling to, but in order to supply some real numbers, we need a
>> large amount of data.
>
> Oh, one more thing. You don't need a lot of Unicode text containing
> combining characters to write benchmarks. (You do need it for testing
> purposes.) Most text won't contain combining characters anyway, so after
> you implement graphemes, just benchmark them on regular text.
Correct. For this reason, we do not use the same source at all for
correctness and performance testing.
It is impossible to define a typical or representative source (who
judges?). But at the very minimum, source texts for perf measurement should
mix languages as diverse as possible, including some material from the
ones known to be problematic and/or atypical (English, Korean, Hebrew...).
The following (ripped and composed from ICU data sets) is just that:
https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/data/unicode.txt
Content:
12 natural languages
34767 bytes = UTF-8 code units
--> 20133 code points
--> 22033 normal codes (NFD decomposed)
--> 19205 piles = true characters
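
For anyone who wants to reproduce this kind of count on their own text, here is a rough sketch in Python (not D, and not the code that produced the numbers above). It relies only on the standard `unicodedata` module. The names `is_extender` and `text_stats` are made up for illustration, and the "pile" count is a heuristic assumption: a code point is taken to start a new pile unless it is a combining mark or a conjoining Hangul vowel/trailing consonant. That is not full grapheme segmentation as specified in UAX #29, so results on other scripts may differ from a real implementation.

```python
import unicodedata


def is_extender(ch):
    """Heuristic (assumption): does ch attach to the preceding code point?"""
    # Combining marks: nonzero canonical combining class, or any Mark category.
    if unicodedata.combining(ch) != 0 or unicodedata.category(ch).startswith("M"):
        return True
    # Conjoining Hangul vowels and trailing consonants, which NFD produces
    # when it decomposes precomposed Korean syllables.
    return 0x1160 <= ord(ch) <= 0x11FF


def text_stats(text):
    """Return the four counts listed above for a Python str."""
    nfd = unicodedata.normalize("NFD", text)
    # A pile starts at every NFD code point that is not an extender.
    piles = sum(1 for i, ch in enumerate(nfd) if i == 0 or not is_extender(ch))
    return {
        "bytes": len(text.encode("utf-8")),  # UTF-8 code units
        "code_points": len(text),            # code points as given
        "nfd_codes": len(nfd),               # code points after NFD
        "piles": piles,                      # approximate true characters
    }


if __name__ == "__main__":
    # U+00E9 (e with acute), U+D55C (a Hangul syllable), and ASCII 'a'.
    print(text_stats("\u00e9\ud55ca"))
    # {'bytes': 6, 'code_points': 3, 'nfd_codes': 6, 'piles': 3}
```

The tiny sample shows the same pattern as the figures above: NFD increases the number of codes, while the number of piles is no larger than the number of code points.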
Denis
_________________
vita es estrany
spir.wikidot.com