[OT] All your medians are belong to me

jmh530 via Digitalmars-d digitalmars-d at puremagic.com
Mon Nov 21 10:18:53 PST 2016


On Monday, 21 November 2016 at 17:39:40 UTC, Andrei Alexandrescu 
wrote:
> Hey folks, I'm working on a paper for fast median computation 
> and https://issues.dlang.org/show_bug.cgi?id=16517 came to 
> mind. I see the Google ngram corpus has occurrences of n-grams 
> per year. Is data aggregated for all years available somewhere? 
> I'd like to compute e.g. "the word (1-gram) with the median 
> frequency across all English books" so I don't need the 
> frequencies per year, only totals.
>
> Of course I can download the entire corpus and then do some 
> processing, but that would take a long time.
>
> Also, if you can think of any large corpus that would be 
> pertinent for median computation, please let me know!
>
>
> Thanks,
>
> Andrei

You might find the following worthwhile.

http://opendata.stackexchange.com/questions/6114/dataset-for-english-words-of-dictionary-for-a-nlp-project

I would just generate a bunch of random integers and use those, 
but I don't know if you specifically need to work with strings. A 
minimal sketch of what I mean is below.
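For what it's worth, something along these lines, using Phobos' 
topN (quickselect) to pick the median of random integers. The 
array size and value range are arbitrary placeholders:

import std.algorithm : topN;
import std.random : uniform;
import std.stdio : writeln;

void main()
{
    // Fill an array with random integers; size and range are arbitrary.
    auto a = new int[](1_000_001);
    foreach (ref x; a)
        x = uniform(0, int.max);

    // topN is Phobos' quickselect: after the call, a[mid] holds the
    // element that would land at index mid if the array were sorted,
    // i.e. the median for an odd-length array.
    immutable mid = a.length / 2;
    topN(a, mid);
    writeln("median = ", a[mid]);
}

Since topN runs in expected linear time, that should scale to 
corpus-sized inputs too.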

