Of possible interest: fast UTF8 validation
Patrick Schluter
Patrick.Schluter at bbox.fr
Thu May 17 18:23:05 UTC 2018
On Thursday, 17 May 2018 at 15:37:01 UTC, Andrei Alexandrescu
wrote:
> On 05/17/2018 09:14 AM, Patrick Schluter wrote:
>> I'm in charge at the European Commission of the biggest
>> translation memory in the world.
>
> Impressive! Is that the Europarl?
No, Euramis: the central translation memory developed by the
Commission and also used by the other institutions. The database
contains more than a billion segments from parallel texts and is
afaik the biggest of its kind. One of the big strengths of the
Euramis TM is its multi-target language store, which allows fuzzy
searches in all language combinations, including indirect
translations (e.g. if a document written in English was
translated into Romanian and into Maltese, it is then possible to
search for alignments between ro and mt). It's not the only
system to do that, but at that volume it is quite unique.
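To make the indirect-alignment idea concrete, here is a minimal
toy sketch in Python. It is not how Euramis is implemented (the
real backend is C and far more involved); the class name
MultiTargetStore, its methods, the placeholder segment texts and
the use of difflib for fuzzy scoring are all assumptions made
purely for illustration. Each unit keeps every language version
of one segment together, so ro and mt can be paired even though
both were translated from en.

```python
from difflib import SequenceMatcher


class MultiTargetStore:
    """Hypothetical toy store: one unit = one segment in several languages."""

    def __init__(self):
        self.units = []  # each unit is a dict: language code -> segment text

    def add(self, translations):
        self.units.append(dict(translations))

    def pairs(self, lang_a, lang_b):
        # Any two languages present in the same unit form an alignment,
        # regardless of which language the document was originally written in.
        return [(u[lang_a], u[lang_b])
                for u in self.units
                if lang_a in u and lang_b in u]

    def search(self, query, src, tgt, threshold=0.8):
        # Fuzzy search: score the query against every src-language segment
        # and keep the aligned tgt-language segment when the score is high enough.
        hits = []
        for src_text, tgt_text in self.pairs(src, tgt):
            score = SequenceMatcher(None, query, src_text).ratio()
            if score >= threshold:
                hits.append((score, src_text, tgt_text))
        return sorted(hits, reverse=True)


store = MultiTargetStore()
# Placeholder texts stand in for real segments.
store.add({"en": "en segment one", "ro": "ro segment one", "mt": "mt segment one"})
store.add({"en": "en segment two", "ro": "ro segment two"})  # no Maltese version

ro_mt = store.pairs("ro", "mt")
results = store.search("ro segment one", "ro", "mt")
```

A linear scan like this obviously would not work on a billion
segments; the sketch only shows why storing all targets together
makes ro-mt lookups possible without a direct ro-mt translation.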
We also publish every year an extract of it, drawn from the
published legislation [1] in the Official Journal, so that it can
be used by the research community. All the machine translation
engines use it. It is one of the most accessed data collections
on the European Open Data portal [2].
The very uncommon thing about the backend software of EURAMIS is
that it is written in C. Pure unadulterated C. I'm trying to
introduce D, but with the strange (to put it politely)
configurations our servers have, it is quite challenging.
[1]: https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory
[2]: http://data.europa.eu/euodp/fr/data