
Processed Google NGrams useful to anybody?
 
 

Hi Guys,

We were recently playing around with NGrams to see if there was any advantage they could give us in NLG and other areas.  Unfortunately, for Caesar at least, they don’t seem to offer any benefit in this discipline.

However, we now have a pre-processed behemoth of NGram data, and if it is useful to any of you guys, we will put it online for you to download.

The original Google NGram datasets are rather large (around 40TB compressed for everything), and unless you want the per-year frequencies of these ngrams, most of that data is redundant.

We pulled everything and processed it into a simple “nGram -> total count” format that reduces the size from TB to GB per collection.  That makes it manageable for those on more modest connections.
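
For anyone wondering what that collapse amounts to, here is a minimal sketch in Python. It assumes the raw files use a tab-separated "ngram, year, match_count, volume_count" layout; the function name and the per-year sample rows are made up for illustration, and this is not necessarily how the processing was actually done.

```python
from collections import defaultdict

def collapse_year_counts(lines):
    """Sum match_count over all years for each ngram.

    Assumed input layout per line (an assumption, not confirmed here):
        ngram \t year \t match_count \t volume_count
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ngram, match_count = fields[0], fields[2]
        totals[ngram] += int(match_count)
    return dict(totals)

# Illustrative rows only: the per-year figures are invented.
raw = [
    "as Tenite\t1950\t40\t12\n",
    "as Tenite\t1951\t25\t9\n",
    "Asian front\t1943\t1177\t300\n",
]
totals = collapse_year_counts(raw)
for ngram, total in totals.items():
    # Emit the simplified "nGram \t count" form
    print(f"{ngram}\t{total}")
```

Holding everything in one dictionary only works for small inputs; for the full collections a streaming pass over sorted files would likely be needed.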

We couldn’t find a simplified dataset like this anywhere other than the LDC resource, and they wanted $150 for it!

Shout out, and if there’s enough of you, we’ll throw it all on a server for your downloading pleasure.

 

 
  [ # 1 ]

I’d be interested. My bot uses ngrams to assist in POS tagging and this might be useful. Can you give some more details about what the data looks like? (An example of how the ngrams themselves are structured would be nice.)

 

 
  [ # 2 ]

Sure.

ascore_DET of_ADP 43
asserting statutory 149
as_ADV complicated 51340
as Masonic_ADJ 1617
as Tenite 65
as_ADP patrilineal 2196
as_ADP transversals 94
Assumptions used 1420
as Format_ADJ 220
aster_ADP continuing_VERB 52
as selfindulgent_ADJ 438
asymmetric relationships_NOUN 1552
Assam_NOUN ....._NUM 43
Asian front 1177
as_ADP precinct 2044
as Sealink 60
assessee otherwise_ADV 67
Association cookbook_NOUN 59
as logographs_NOUN 165
As_ADP refreshments 49
as_ADP mercurous_NOUN 1250
Asian grouping_VERB 194

They are POS tagged already where possible. The start and end of sentences are also tagged with _START_ and _END_ labels within the NGrams.  The files themselves are just tab delimited: nGram \t count \n.  Words within the nGrams are space delimited.
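
A rough sketch of reading one line of that format in Python, in case it helps. The function name is hypothetical, and the tag set is an assumption: the sample above shows DET, ADP, ADV, ADJ, VERB, NOUN and NUM, and the remaining entries are guesses at the rest of the tag set.

```python
# Assumed tag set; only some of these appear in the sample above.
POS_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
            "ADP", "NUM", "CONJ", "PRT", "X", "."}
SENTENCE_MARKERS = {"_START_", "_END_"}

def parse_ngram_line(line):
    """Parse 'nGram \t count \n' into ([(word, tag_or_None), ...], count)."""
    ngram, count = line.rstrip("\n").rsplit("\t", 1)
    tokens = []
    for tok in ngram.split(" "):
        if tok in SENTENCE_MARKERS:
            tokens.append((tok, None))  # sentence boundary label, no POS tag
            continue
        # Split on the LAST underscore so words containing '_' survive
        word, sep, tag = tok.rpartition("_")
        if sep and word and tag in POS_TAGS:
            tokens.append((word, tag))
        else:
            tokens.append((tok, None))  # untagged word
    return tokens, int(count)

tokens, count = parse_ngram_line("as_ADP patrilineal\t2196\n")
print(tokens, count)
```

Untagged words come back with None as the tag, so a POS tagger can tell "no tag available" apart from a real tag.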

 

 