

English verbs
 
 

This week I’ve been working on a natural language parser for English, specifically on verb forms. So far I’ve covered number, person, tense, aspect and polarity, which has resulted in hundreds of distinct forms for each verb, and many of these forms can be written in a number of different ways (e.g. shall not, will not, won’t, shan’t)!

I still have voice, mood, verbals and transitivity to do.

Here’s a small sample:

(ask :repetitive :past-perfect :third :plural t) ((((had) (kept))) ((on (asking))))
(ask :repetitive :past-perfect :third :plural t) ((((had) (kept))) (asking))
(ask :repetitive :past-perfect :third :plural :not) (((((hadn’t (had not))) (kept))) ((on (asking))))
(ask :repetitive :past-perfect :third :plural :not) (((((hadn’t (had not))) (kept))) (asking))
(ask :inchoative :past-perfect :third :plural t) ((((had) (got))) (asking))
(ask :inchoative :past-perfect :third :plural t) ((((had) (got))) (asked))
(ask :inchoative :past-perfect :third :plural t) ((((had) (been))) ((going ((to (ask))))))
(ask :inchoative :past-perfect :third :plural :not) (((((hadn’t (had not))) (got))) (asking))
(ask :inchoative :past-perfect :third :plural :not) (((((hadn’t (had not))) (got))) (asked))
(ask :inchoative :past-perfect :third :plural :not) (((((hadn’t (had not))) (been))) ((going ((to (ask))))))
(ask :progressive :past-perfect :third :plural t) ((((had) (been))) (asking))
(ask :progressive :past-perfect :third :plural :not) (((((hadn’t (had not))) (been))) (asking))
(ask :repetitive :future :third :plural t) ((((will) (keep)) ((shall) (keep))) ((on (asking))))
(ask :repetitive :future :third :plural t) ((((will) (keep)) ((shall) (keep))) (asking))
(ask :repetitive :future :third :plural :not) ((((won’t (will not)) (keep)) ((shan’t (shall not)) (keep))) ((on (asking))))
(ask :repetitive :future :third :plural :not) ((((won’t (will not)) (keep)) ((shan’t (shall not)) (keep))) (asking))
(ask :inchoative :future :third :plural t) ((((will) (get)) ((shall) (get))) (asking))
(ask :inchoative :future :third :plural t) ((((will) (get)) ((shall) (get))) (asked))
(ask :inchoative :future :third :plural t) ((((will) (be)) ((shall) (be))) ((going ((to (ask))))))
(ask :inchoative :future :third :plural :not) ((((won’t (will not)) (get)) ((shan’t (shall not)) (get))) (asking))
(ask :inchoative :future :third :plural :not) ((((won’t (will not)) (get)) ((shan’t (shall not)) (get))) (asked))
(ask :inchoative :future :third :plural :not) ((((won’t (will not)) (be)) ((shan’t (shall not)) (be))) ((going ((to (ask))))))
(ask :progressive :future :third :plural t) ((((will) (be)) ((shall) (be))) (asking))
(ask :progressive :future :third :plural :not) ((((won’t (will not)) (be)) ((shan’t (shall not)) (be))) (asking))

 

 
  [ # 1 ]

O.O

That’s a lot of “code” for just one word, with still more to go, and I see (or rather, don’t see) a few other variations that could also probably be included. Grammar isn’t my “strong suit”, as far as being able to enumerate semantics in that manner, but my brain seems able, and it tells me that some things are missing. smile

For example, in American English, “had got asked”, while possibly “proper grammar”, is more or less unused, while “had been asked” is much more common. Also, the contraction for “shall not” is rarely used, and the contraction for “will not” is used, instead. Personally, I like the word Shan’t. It’s got character. smile

I don’t envy you the task of mapping the semantics of words in this way. It looks to be a really big challenge. Of course I have no doubt that you’re more than equal to the task, but putting myself in your shoes with regard to this project has caused my brain to singe about the edges.

 

 
  [ # 2 ]

I would like a program I could ask: “What is the past tense of be?” and it would return “was”. “What is the present participle of ‘ask’?” -> “asking”. “What is the past tense of ‘go’?” -> “went” and so on.

Then I could include it as an agent, and have other agents query it when they needed information about a verb for whatever processing they were doing.

I would also like the agent to be able to learn new forms and change existing forms. So I could teach it “the past tense of ‘lay’ is ‘laid’, the past participle of ‘lie’ is ‘lain’ or ‘lied’”, in case it didn’t know those or had them wrong.
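Something like this toy Python sketch is the kind of interface I have in mind (all of the class and method names here are made up for illustration, not any real library):

```python
# A toy verb-form agent: query known forms, and teach it new or
# corrected ones. Unknown verbs fall back on the regular rules.

class VerbAgent:
    def __init__(self):
        # seed with a few irregular forms
        self.forms = {
            ("be", "past tense"): "was",
            ("go", "past tense"): "went",
            ("ask", "present participle"): "asking",
        }

    def query(self, verb, form):
        # stored (irregular) forms win; otherwise apply a default rule
        if (verb, form) in self.forms:
            return self.forms[(verb, form)]
        if form == "past tense":
            return verb + "ed"
        if form == "present participle":
            return verb + "ing"
        return None

    def teach(self, verb, form, value):
        # learning overrides anything previously known
        self.forms[(verb, form)] = value

agent = VerbAgent()
print(agent.query("go", "past tense"))    # went
agent.teach("lay", "past tense", "laid")
print(agent.query("lay", "past tense"))   # laid
print(agent.query("walk", "past tense"))  # walked (default rule)
```

Other agents could then call `query`, and corrections from the user would go through `teach`.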

 

 
  [ # 3 ]

Andrew, it seems to me that many of these conjugations and tenses are common for most verbs. I wonder if it is worthwhile producing an exhaustive list for each verb, or if the verb need only reference which rules it has been seen adhering to (either from scanning text or from human interaction). At any rate, I applaud your effort. This seems like a massive undertaking. smile

I read a paper a while ago (I can search again for it if you’re interested) that discussed the way that people learn grammatical forms. The main question the paper sought to address was why people could learn new grammatical rules so quickly, with only a small set of examples. For example, some verbs are only found followed by specific prepositions. Some only appear in infinitive phrases while others only in participial. Some only appear in certain contexts.

The authors concluded that people encountering a new word would use whatever the most common grammatical forms would dictate until they encounter counter-examples or correction. The barrier for “belief” in the new rule is very low, and people then tend to only use the learned rule (and any other rules known to be associated with the learned rule).

 

 
  [ # 4 ]

Isn’t this a question of strong and weak verbs? The latter simply follow a default set of rules; strong verbs need to be learned. Also, the more time that passes and the more outside influences a language has, the faster ‘strong’ verbs disappear.

I’d also leave out all the ‘be’, ‘have’, ‘go’ and ‘do’ conjugations. The same goes for ‘not’, which is only needed for the strong verbs.

 

 
  [ # 5 ]
Dave Morton - Nov 25, 2011:

That’s a lot of “code” for just one word, with still more to go, and I see (or rather, don’t see) a few other variations that could also probably be included. Grammar isn’t my “strong suit”, as far as being able to enumerate semantics in that manner, but my brain seems able, and it tells me that some things are missing.

For example, in American English, “had got asked”, while possibly “proper grammar”, is more or less unused, while “had been asked” is much more common. Also, the contraction for “shall not” is rarely used, and the contraction for “will not” is used, instead. Personally, I like the word Shan’t. It’s got character.

I don’t envy you the task of mapping the semantics of words in this way. It looks to be a really big challenge. Of course I have no doubt that you’re more than equal to the task, but putting myself in your shoes with regard to this project has caused my brain to singe about the edges.

I’ve attached a more complete list for the verb “ask” with over 400 distinct uses. There are still many more to add, as I haven’t yet considered voice (active, passive), all the different subjunctive forms (e.g. might, could, must, must not, etc.), or verbals (e.g. being asked, likes asking). There are also a few interrogative forms missing, such as those using who, which, what, why, where and when.

The goal with this list is to be as exhaustive as possible. Certainly there are many formats which are no longer used, such as the example you gave (“had got asked” rather than “had been asked”) but it is necessary to include them all so the parser can cope with anything meaningful that it is given. I’ll also be including many ungrammatical forms and malformed phrases for the same reason (e.g. ain’t).

As for the difficulty of compiling the list, it is actually a lot easier than you might think.

C R Hunt - Nov 25, 2011:

Andrew, it seems to me that many of these conjugations and tenses are common for most verbs. I wonder if it is worthwhile producing an exhaustive list for each verb, or if the verb need only reference which rules it has been seen adhering to (either from scanning text or from human interaction). At any rate, I applaud your effort. This seems like a massive undertaking.

I am only creating this list for one verb, and I am not even creating it manually. I’ve written some Common Lisp software which takes all the rules for composing English sentences (there is a finite number of rules) and generates an example for each distinct case. I can then inspect the list for errors and make suitable corrections to the program.

Once I have a complete list of correct phrases, I can then use the same program that generates the phrases to generate the rules for parsing them. This also defines all the lookup tables that ascribe meaning to each statement, so really, none of this is all that difficult to do.
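To make that generate-then-invert step concrete, here is a toy Python version of the idea (my implementation is in Common Lisp, and the rule set here is drastically simplified for illustration):

```python
# Toy generate-and-invert: enumerate surface forms from composition
# rules, then flip the table so the parser can look the meanings up.

def generate(verb, stem_ing, stem_ed):
    # a tiny rule set: grammatical features -> surface words
    return {
        ("future", "affirmative"): ("will", verb),
        ("future", "negative"): ("won't", verb),
        ("past-perfect", "affirmative"): ("had", stem_ed),
        ("progressive", "affirmative"): ("is", stem_ing),
    }

forms = generate("ask", "asking", "asked")

# invert: surface phrase -> features, i.e. the parser's lookup table
parse_table = {words: feats for feats, words in forms.items()}

print(parse_table[("won't", "ask")])   # ('future', 'negative')
print(parse_table[("had", "asked")])   # ('past-perfect', 'affirmative')
```

The same rules that produce each phrase thus define, for free, the table that ascribes meaning to it when parsing.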

 

File Attachments
ask-variations.text  (File Size: 8KB - Downloads: 162)
 

 
  [ # 6 ]
Andrew Smith - Nov 25, 2011:

I am only creating this list for one verb, and I am not even creating it manually. I’ve written some Common Lisp software which takes all the rules for composing English sentences (there is a finite number of rules) and generates an example for each distinct case. I can then inspect the list for errors and make suitable corrections to the program.

Once I have a complete list of correct phrases, I can then use the same program that generates the phrases to generate the rules for parsing them. This also defines all the lookup tables that ascribe meaning to each statement, so really, none of this is all that difficult to do.

Ah, I understand now. Thanks for the clarification. smile

If your parser will work on the level of phrases, I’m curious how you plan to approach recognizing them in text. Obviously there will be overlap—words that can belong to one phrase or another or to the main sentence, as well as phrases that act as one part of speech within another phrase. Right now I’m recognizing these by taking all grammatical combinations and then doing multiple passes to pull out phrases within phrases, etc. (Then making sure there’s at least one main sentence that remains!)

You can imagine how many parses one ends up with given a sentence of some complexity. The real challenge has been throwing away parses efficiently. I haven’t even toyed yet with ranking the remains based on context. (Particularly, how plausible is a given subject-verb or subject-verb-DO combination.) You could probably incorporate this much more swiftly, given your work with YAGO2. Any thoughts on this?

 

 
  [ # 7 ]

Dear Andrew,

A few comments/observations. My apologies in advance if what I say is not relevant to what you are trying to accomplish:

(1) Regular verbs can be completely specified by their three principal parts plus an “ing” form. For example: sink/sank/sunk/sinking allows you to generate the present, simple past, simple future, perfect tenses, gerund, and participle. Irregular verb specifications can be limited to listing those cases which disagree with the regular format.

(2) A deeper look at English indicates that the organization in (1) is not actually as complete as it first appears. For example, the simple future tense is specified by “will”, as in “will sink”. But is this really distinct from “can sink”, “may sink”, “could sink”, etc.? In other words, there is a host of modal auxiliaries pertaining to future time and possibility that are interchangeable with “will” and seem just as worthy of being classified as a “tense”.

(3) Once you open the Pandora’s box of words being inserted into verb phrases, it brings up the annoying (though legal) possibility that input sentences will break up said phrases with various adverbs, prepositional phrases, and other annoyances, e.g. “I will most certainly come to dinner.” or “I have with all my heart tried to win.” The point is that a parser has no small difficulty in determining whether “I have” is the end of the verb, as in “I have two hands”, or just the beginning of a longer verb construction whose tail lies somewhere down the road in the sentence in question.

Such is the lot of the natural language programmer ... good luck!

 

 
  [ # 8 ]

Thanks Eulalio, your comments are entirely relevant. I’ll try to describe some of the rules that I’m using here. It would have been easier to just paste the Common Lisp code as it is quite succinct, but it doesn’t format well in this medium and some folks are scared of parentheses.

For English, every verb has a number of principal parts:

infinitive                            be     eat     walk
present-participle                    being  eating  walking
past-participle                       been   eaten   walked
present-third-singular                is     eats    walks
preterite (=past-participle)          was    ate     =walked
past-plural (=preterite)              were   =ate    =walked
present-plural (=infinitive)          are    =eat    =walk
present-first-singular (=infinitive)  am     =eat    =walk

The principal parts of regular verbs comprise only four distinct inflections (walk, walking, walked, walks) whereas irregular verbs have five (eat, eating, eaten, eats, ate). Only one verb “to be” has eight. Simple present tense and simple past tense can be conjugated using the principal parts alone:

present first singular   =present-first-singular  i eat
present second singular  =present-plural          you eat
present third singular   =present-third-singular  it eats
present first plural     =present-plural          we eat
present second plural    =present-plural          you eat
present third plural     =present-plural          they eat

past first singular      =preterite               i ate
past second singular     =past-plural             you ate
past third singular      =preterite               it ate
past first plural        =past-plural             we ate
past second plural       =past-plural             you ate
past third plural        =past-plural             they ate

Other tenses are composed of conjugations of an auxiliary verb combined with a principal-part of the main verb.

future           =will infinitive                              i will eat
past-perfect     =have(past,person,number) past-participle     he had eaten
present-perfect  =have(present,person,number) past-participle  they have eaten
future-perfect   =have(future,person,number) past-participle   you will have eaten
past-future      =be(past,person,number) to infinitive         you were to eat
past-future      =would infinitive                             you would eat

This much covers the following grammatical categories:

person(first,second,third)
number(singular,plural)
tense(past,present,future,past-perfect,present-perfect,future-perfect,past-future)

I’ve also researched and compiled rules (which I am still refining and debugging) for the additional grammatical categories:

aspect(emphatic,progressive,inchoative,repetitive)
mood(indicative,imperative,interrogative,subjunctive)
voice(active,passive)
polarity(affirmative,negative)

And then there are constructs called verbals which turn verbs into nouns (e.g. I like eating, He provided encouragement). Haven’t started coding those yet so I’m not sure how they’ll turn out.

 

 

 
  [ # 9 ]
C R Hunt - Nov 25, 2011:

If your parser will work on the level of phrases, I’m curious how you plan to approach recognizing them in text. Obviously there will be overlap—words that can belong to one phrase or another or to the main sentence, as well as phrases that act as one part of speech within another phrase. Right now I’m recognizing these by taking all grammatical combinations and then doing multiple passes to pull out phrases within phrases, etc. (Then making sure there’s at least one main sentence that remains!)

You can imagine how many parses one ends up with given a sentence of some complexity. The real challenge has been throwing away parses efficiently. I haven’t even toyed yet with ranking the remains based on context. (Particularly, how plausible is a given subject-verb or subject-verb-DO combination.) You could probably incorporate this much more swiftly, given your work with YAGO2. Any thoughts on this?

I’m using the GLR parser that I’ve been researching and developing for the past few years. You may recall that I posted about it a few months back. I’ve attached the latest versions of an example grammar file for discourse analysis and its XML output to this post.

The parser efficiently handles context-free grammars expressed in Chomsky normal form, which is a kind of algebra for languages. This means that the entire parsing process can be encapsulated in a collection of abstract formulae, completely independent of the programming that provides the functionality. The grammar definition is compiled into sets of lookup tables and C code which perform the actual parsing at run time.

The parser is extremely efficient. It examines the input one character at a time and determines everything that the input could possibly represent. Then it gets the next character and eliminates from the list of candidates anything that could no longer match the input, and so on. It never has to examine an input character or consider any given possible candidate more than once.

As far as performance and capacity are concerned, I’ve tested it successfully with grammars containing hundreds of thousands of rules parsing text files many megabytes in size, which it can process in a matter of seconds. The attached example, a ten thousand word story, only takes a few milliseconds to break down into XML.
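The elimination step can be pictured with a toy Python version: keep the set of live candidates, and prune it as each character arrives (this only illustrates the principle, not the actual GLR machinery):

```python
# Toy incremental recogniser: after each character, keep only the
# candidates that could still match the input seen so far.

CANDIDATES = ["will", "won't", "walk", "was", "were", "shall", "shan't"]

def match_prefix(text):
    live = list(CANDIDATES)
    for i, ch in enumerate(text):
        # each candidate is compared against each character only once
        live = [c for c in live if i < len(c) and c[i] == ch]
    return live

print(match_prefix("wa"))    # ['walk', 'was']
print(match_prefix("shan"))  # ["shan't"]
```

The real parser does the same kind of narrowing, but over grammar rules rather than a flat word list, so the work per input character stays bounded.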

 

File Attachments
eng.grammar  (File Size: 5KB - Downloads: 58)
the-country-of-the-blind.xml.zip  (File Size: 22KB - Downloads: 120)
 

 
  [ # 10 ]

It appears that the file attachment function is failing, Andrew. I’ll submit a bug report on this matter. The zip file downloaded just fine, but the grammar file failed to do so (giving me a blank page several times), so it’s likely due to a failure of the attachment script to determine a correct mime-type for the file. If you can try to zip the file up, and then attach the zip file, that would probably work, or you can try changing the extension to .txt, which would also serve to help diagnose the problem.

 

 
  [ # 11 ]
Dave Morton - Nov 26, 2011:

It appears that the file attachment function is failing, Andrew. I’ll submit a bug report on this matter. The zip file downloaded just fine, but the grammar file failed to do so (giving me a blank page several times), so it’s likely due to a failure of the attachment script to determine a correct mime-type for the file. If you can try to zip the file up, and then attach the zip file, that would probably work (or changing the extension to .txt may do the trick, too).

Hmm. I had noticed that it wouldn’t handle files properly unless they had a commonly recognised extension. Then I promptly forgot about it when I tried to attach files to that last post. I had to zip the XML file because it was so big, and if I’d had the other half of my brain available at the time, would have put both files in the archive.

File Attachments
parsing-example.zip  (File Size: 23KB - Downloads: 125)
 

 
  [ # 12 ]

Having looked at the XML output, I have but one question, illustrated by an excerpt as an example:

<Sentence>"<Opening_Speech>Good-bye!</Opening_Speech>" <Joining_Phrase

Just out of curiosity, why are the quotes outside the opening_speech tags, rather than inside? smile

 

 
  [ # 13 ]

In order to test my theories about the bug in the attachment script, I’m attaching the grammar file as a text (.txt) document.

[edit]
As I suspected, the file can be downloaded. I’ve already submitted a bug report, so this should help the staff to figure things out.
[/edit]

File Attachments
eng.grammar.txt  (File Size: 5KB - Downloads: 142)
 

 
  [ # 14 ]
Dave Morton - Nov 26, 2011:

Having looked at the XML output, I have but one question, illustrated by an excerpt as an example:

<Sentence>"<Opening_Speech>Good-bye!</Opening_Speech>" <Joining_Phrase

Just out of curiosity, why are the quotes outside the opening_speech tags, rather than inside? smile

The quotes are outside the tags because the quotes aren’t part of the speech, unless of course the speech is being quoted within someone’s speech, as happens in one place in the story.

In a sense, the quotes are a kind of markup tag in natural language.

From a practical point of view, if you are writing a stylesheet transformation that extracts all the text from the file and formats it to make it prettier, you might want to put all the speech in curly quotes instead of straight quotes. Much better if the straight quotes have already been removed for you, isn’t it? smile

 

 

 
  [ # 15 ]

Fair enough. That answer satisfies my curiosity. smile Thanks.

 
