Hi Guys,
We were recently playing around with NGrams to see if they could give us any advantage in NLG and other areas. Unfortunately, for Caesar at least, they don’t seem to offer any benefit in this discipline.
However, we now have a pre-processed behemoth of NGram data, and if it is useful to any of you guys, we will put it online for you to download.
The original Google NGram datasets are rather large (around 40TB for everything, even compressed), and unless you want to know the per-year frequencies of these ngrams, most of that data is redundant.
We pulled everything and processed it into a simple “nGram -> total count” format that reduces the size from terabytes to gigabytes per collection. That makes it manageable for those on more modest connections.
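For anyone curious what that collapsing step looks like, here is a rough sketch in Python. It is not our actual pipeline, just an illustration. It assumes each input line is tab-separated with the ngram first, the year second and the match count third (any further columns are ignored); the function name `total_counts` and the sample rows are made up for the example.

```python
from collections import Counter
from typing import Iterable


def total_counts(lines: Iterable[str]) -> Counter:
    """Sum the per-year match counts into one total per ngram.

    Assumed line layout (an assumption, not a confirmed spec):
        ngram <TAB> year <TAB> match_count [<TAB> other columns...]
    """
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ngram, match_count = fields[0], fields[2]
        totals[ngram] += int(match_count)  # the year column is dropped
    return totals


# Made-up sample rows, only to show the shape of the data.
sample = [
    "hello world\t1999\t3\t2\n",
    "hello world\t2000\t5\t4\n",
    "good day\t2000\t7\t6\n",
]

totals = total_counts(sample)
for ngram, count in totals.items():
    print(f"{ngram}\t{count}")
```

Holding every ngram in one `Counter` is fine for a sample like this, but it would not fit in memory for the full collections. If the input rows for each ngram sit next to each other, you can instead keep a single running total and write it out whenever the ngram changes.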
The only simplified dataset like this we could find anywhere was the LDS resource, and they wanted $150 for it!
Shout out, and if there’s enough of you, we’ll throw it all on a server for your downloading pleasure.