[OT] All your medians are belong to me

Andrei Alexandrescu via Digitalmars-d <digitalmars-d at puremagic.com>
Mon Nov 21 09:39:40 PST 2016


Hey folks, I'm working on a paper on fast median computation, and 
https://issues.dlang.org/show_bug.cgi?id=16517 came to mind. I see the 
Google ngram corpus has occurrence counts of n-grams per year. Is the 
data aggregated across all years available somewhere? I'd like to compute 
e.g. "the word (1-gram) with the median frequency across all English 
books", so I don't need the frequencies per year, only the totals.
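
For concreteness, the aggregation and median pick I have in mind is 
roughly the sketch below (plain D, not the algorithm from the paper). It 
assumes the 1-gram files are tab-separated records of word, year, 
match_count, volume_count, read from stdin; adjust the parsing if the 
actual layout differs.

import std.algorithm : map, splitter, topN;
import std.array : array;
import std.conv : to;
import std.stdio : stdin, writefln;
import std.typecons : tuple;

void main()
{
    ulong[string] totals; // word -> total match_count across all years

    foreach (line; stdin.byLineCopy)
    {
        // Assumed fields: word, year, match_count, volume_count
        auto fields = line.splitter('\t').array;
        if (fields.length < 3) continue;
        totals[fields[0]] += fields[2].to!ulong;
    }
    if (totals.length == 0) return;

    // Pair up (word, total) and put the median element in its sorted
    // position with topN; no need to sort the whole thing.
    auto pairs = totals.byKeyValue.map!(kv => tuple(kv.key, kv.value)).array;
    auto mid = pairs.length / 2;
    pairs.topN!((a, b) => a[1] < b[1])(mid);

    writefln("median-frequency word: %s (%s total occurrences)",
        pairs[mid][0], pairs[mid][1]);
}

topN only partitions around the nth position instead of fully sorting, 
which is all the median needs here.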

Of course I can download the entire corpus and then do some processing, 
but that would take a long time.

Also, if you can think of any large corpus that would be pertinent for 
median computation, please let me know!


Thanks,

Andrei

