[OT] All your medians are belong to me

Patrick Schluter via Digitalmars-d digitalmars-d at puremagic.com
Mon Nov 21 12:36:34 PST 2016


On Monday, 21 November 2016 at 18:39:26 UTC, Andrei Alexandrescu 
wrote:
> On 11/21/2016 01:18 PM, jmh530 wrote:
>> I would just generate a bunch of integers randomly and use 
>> that, but I
>> don't know if you specifically need to work with strings.
>
> I have that, too, but was looking for some real data as well. 
> It would be a nice addition. -- Andrei

I don't really know what kind of data you would need, but there 
are the European Union's Language Technology Resources corpora, 
made available to the research community. There are several 
different data sets in different formats (documents, alignments, 
XML) and in all European languages that can be used for 
experiments and real-world use. The data is public domain and 
free to use. The DGT-TM data set is compiled by me and updated 
yearly. It consists of around 12 billion characters, or 1.8 
billion words, or 111 million segments, in 28 languages.

https://ec.europa.eu/jrc/en/language-technologies

