Faster Command Line Tools in D
Patrick Schluter via Digitalmars-d-announce
digitalmars-d-announce at puremagic.com
Tue May 30 22:09:47 PDT 2017
On Tuesday, 30 May 2017 at 22:31:50 UTC, Steven Schveighoffer
wrote:
> On 5/30/17 5:57 PM, Patrick Schluter wrote:
>> On Tuesday, 30 May 2017 at 21:18:42 UTC, Steven Schveighoffer
>> wrote:
>>> On 5/26/17 11:20 AM, John Colvin wrote:
>>>> On Friday, 26 May 2017 at 14:41:39 UTC, John Colvin wrote:
>>>>> [...]
>>>>
>>>> This version also has the advantage of being (discounting any bugs
>>>> in iopipe) correct for arbitrary unicode in all common UTF encodings.
>>>
>>> I worked a lot on making sure this works properly. However, it's
>>> possible that there are some lingering issues.
>>>
>>> I also did not spend much time optimizing these paths (whereas I
>>> spent a ton of time getting the utf8 line parsing as fast as it could
>>> be). Partly because finding things other than utf8 in the wild is
>>> rare, and partly because I have nothing to compare it with to know
>>> what is possible :)
>>
>> If you want UCS-2 (aka UTF-16 without surrogates) data I can give you
>> gigabytes of files in tmx format.
>
> The data I can (and have) generated from UTF-8 data. I have
> tested my byLine parser to make sure it properly splits on
> "interesting" code points in all widths. UTF-16 data without
> surrogates should probably work fine. I haven't tuned it though
> like I tuned the UTF-8 version. Is there a memchr for wide
> characters? ;)
>
> What I really haven't done is compared my line parsing code
> with multi-code-unit delimiters against one that can do the
> same thing. I know Phobos and C FILE * really can't do it. I
> haven't really looked at all in C++, so I should probably look
> there before giving up.
>
> -Steve
In any case, you can download the dataset from [1] if you like.
There are several 100 MB zip files containing a collection of
tmx (Translation Memory eXchange) files with European
legislation. The files contain multi-alignment texts in up to 24
languages. The files are encoded in UCS-2 little-endian. I know
for a fact (because I compiled the data) that they don't contain
characters outside of the BMP. The data is public and can be used
freely (as in beer).
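Since the files are UCS-2 little-endian with nothing outside the BMP, every character is exactly one 16-bit code unit, so a line splitter can scan code units directly and never has to deal with surrogate pairs. A minimal sketch in D of what that could look like follows. The function names (decodeUcs2LE, indexOfUnit, splitLines) are made up for illustration and have nothing to do with iopipe or the Java app, and the sample bytes are hand-written, not taken from the dataset.

```d
/// Decode UCS-2 little-endian bytes into 16-bit code units.
/// Hypothetical helper; assumes an even number of bytes.
wchar[] decodeUcs2LE(const(ubyte)[] bytes)
{
    assert(bytes.length % 2 == 0, "UCS-2 data must have an even byte count");
    auto units = new wchar[bytes.length / 2];
    foreach (i, ref u; units)
    {
        // Low byte comes first in little-endian order.
        u = cast(wchar)(bytes[2 * i] | (bytes[2 * i + 1] << 8));
    }
    return units;
}

/// A naive "memchr for 16-bit units": index of the first `needle`,
/// or hay.length if it does not occur. Iterates code units, no decoding.
size_t indexOfUnit(const(wchar)[] hay, wchar needle)
{
    foreach (i, u; hay)
    {
        if (u == needle)
            return i;
    }
    return hay.length;
}

/// Split on line feeds. The returned slices point into `text`.
const(wchar)[][] splitLines(const(wchar)[] text)
{
    const(wchar)[][] lines;
    while (text.length)
    {
        immutable n = indexOfUnit(text, '\n');
        lines ~= text[0 .. n];
        text = n < text.length ? text[n + 1 .. $] : null;
    }
    return lines;
}

void main()
{
    // "a\nbc" written out by hand as UCS-2 little-endian bytes.
    immutable ubyte[] raw = [0x61, 0x00, 0x0A, 0x00, 0x62, 0x00, 0x63, 0x00];
    auto lines = splitLines(decodeUcs2LE(raw));
    assert(lines.length == 2);
    assert(lines[0] == "a"w);
    assert(lines[1] == "bc"w);
}
```

A real reader would also have to cope with a possible byte-order mark and with carriage returns before the line feed, neither of which this sketch handles.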
When I get some time, I will try to port the Java app that is
distributed with it to D (partially done already).
[1]: https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory