Of possible interest: fast UTF8 validation

Thu May 17 18:23:05 UTC 2018

On Thursday, 17 May 2018 at 15:37:01 UTC, Andrei Alexandrescu 
wrote:
> On 05/17/2018 09:14 AM, Patrick Schluter wrote:
>> I'm in charge at the European Commission of the biggest 
>> translation memory in the world.
>
> Impressive! Is that the Europarl?

No, Euramis, the central translation memory developed by the 
Commission and also used by the other institutions. The database 
contains more than a billion segments from parallel texts and is 
afaik the biggest of its kind. One of the big strengths of the 
Euramis TM is its multi-target language store. This allows fuzzy 
searches in all combinations, including indirect translations 
(i.e. if a document written in English was translated into Romanian 
and into Maltese, it is then possible to search for alignments 
between ro and mt). It's not the only system to do that, but at 
that volume it is quite unique.
Every year we also publish an extract of it, taken from the 
published legislation [1] of the Official Journal, so that it can 
be used by the research community. All the machine translation 
engines use it. It is one of the most accessed data collections on 
the European Open Data portal [2].

The very uncommon thing about the backend software of EURAMIS is 
that it is written in C. Pure unadulterated C. I'm trying to 
introduce D, but with the strange (to put it politely) 
configurations our servers have, it is quite challenging.

[1]: 
https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory
[2]: http://data.europa.eu/euodp/fr/data