|
|
Senior member
Total posts: 473
Joined: Aug 28, 2010
|
This week I’ve been working on a natural language parser for English, specifically on verb forms. So far I’ve covered number, person, tense, aspect and polarity and it has resulted in hundreds of distinct forms for each verb, and many of these forms can be written in a number of different ways (e.g. shall not, will not, won’t, shan’t)!
I still have voice, mood, verbals and transitivity to do.
Here’s a small sample:
(ask :repetitive :past-perfect :third :plural t) ((((had) (kept))) ((on (asking))))
(ask :repetitive :past-perfect :third :plural t) ((((had) (kept))) (asking))
(ask :repetitive :past-perfect :third :plural :not) (((((hadn’t (had not))) (kept))) ((on (asking))))
(ask :repetitive :past-perfect :third :plural :not) (((((hadn’t (had not))) (kept))) (asking))
(ask :inchoative :past-perfect :third :plural t) ((((had) (got))) (asking))
(ask :inchoative :past-perfect :third :plural t) ((((had) (got))) (asked))
(ask :inchoative :past-perfect :third :plural t) ((((had) (been))) ((going ((to (ask))))))
(ask :inchoative :past-perfect :third :plural :not) (((((hadn’t (had not))) (got))) (asking))
(ask :inchoative :past-perfect :third :plural :not) (((((hadn’t (had not))) (got))) (asked))
(ask :inchoative :past-perfect :third :plural :not) (((((hadn’t (had not))) (been))) ((going ((to (ask))))))
(ask :progressive :past-perfect :third :plural t) ((((had) (been))) (asking))
(ask :progressive :past-perfect :third :plural :not) (((((hadn’t (had not))) (been))) (asking))
(ask :repetitive :future :third :plural t) ((((will) (keep)) ((shall) (keep))) ((on (asking))))
(ask :repetitive :future :third :plural t) ((((will) (keep)) ((shall) (keep))) (asking))
(ask :repetitive :future :third :plural :not) ((((won’t (will not)) (keep)) ((shan’t (shall not)) (keep))) ((on (asking))))
(ask :repetitive :future :third :plural :not) ((((won’t (will not)) (keep)) ((shan’t (shall not)) (keep))) (asking))
(ask :inchoative :future :third :plural t) ((((will) (get)) ((shall) (get))) (asking))
(ask :inchoative :future :third :plural t) ((((will) (get)) ((shall) (get))) (asked))
(ask :inchoative :future :third :plural t) ((((will) (be)) ((shall) (be))) ((going ((to (ask))))))
(ask :inchoative :future :third :plural :not) ((((won’t (will not)) (get)) ((shan’t (shall not)) (get))) (asking))
(ask :inchoative :future :third :plural :not) ((((won’t (will not)) (get)) ((shan’t (shall not)) (get))) (asked))
(ask :inchoative :future :third :plural :not) ((((won’t (will not)) (be)) ((shan’t (shall not)) (be))) ((going ((to (ask))))))
(ask :progressive :future :third :plural t) ((((will) (be)) ((shall) (be))) (asking))
(ask :progressive :future :third :plural :not) ((((won’t (will not)) (be)) ((shan’t (shall not)) (be))) (asking))
|
|
|
|
|
Posted: Nov 25, 2011 |
[ # 1 ]
|
|
Administrator
Total posts: 3111
Joined: Jun 14, 2010
|
O.O
That’s a lot of “code” for just one word, with still more to go, and I see (or rather, don’t see) a few other variations that could also probably be included. Grammar isn’t my “strong suit”, as far as being able to enumerate semantics in that manner, but my brain seems able, and it tells me that some things are missing.
For example, in American English, “had got asked”, while possibly “proper grammar”, is more or less unused, while “had been asked” is much more common. Also, the contraction for “shall not” is rarely used, and the contraction for “will not” is used, instead. Personally, I like the word Shan’t. It’s got character.
I don’t envy you the task of mapping the semantics of words in this way. It looks to be a really big challenge. Of course I have no doubt that you’re more than equal to the task, but putting myself in your shoes with regard to this project has caused my brain to singe about the edges.
|
|
|
|
|
Posted: Nov 25, 2011 |
[ # 2 ]
|
|
Senior member
Total posts: 147
Joined: Oct 30, 2010
|
I would like a program I could ask: “What is the past tense of be?” and it would return “was”. “What is the present participle of ‘ask’?” -> “asking”. “What is the past tense of ‘go’?” -> “went” and so on.
Then I could include it as an agent, and have other agents query it when they needed information about a verb for whatever processing they were doing.
I would also like the agent to be able to learn new forms and change existing forms. So I could teach it “the past tense of ‘lay’ is ‘laid’, the past participle of ‘lie’ is ‘lain’ or ‘lied’”, in case it didn’t know those or had them wrong.
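A minimal sketch of such a query agent, in Python. The class name `VerbLexicon` and the methods `teach` and `ask` are made up for illustration and are not part of any existing program in this thread. It only shows the idea of a lookup table that can be corrected at run time.

```python
class VerbLexicon:
    """Tiny teachable store of verb forms, keyed by (verb, form name)."""

    def __init__(self):
        self._forms = {}

    def teach(self, verb, form, *values):
        # Teaching replaces whatever was stored before,
        # so a wrong entry can be corrected simply by teaching it again.
        self._forms[(verb, form)] = list(values)

    def ask(self, verb, form):
        # A list is returned because some forms have more than one accepted value.
        return self._forms.get((verb, form), [])


lexicon = VerbLexicon()
lexicon.teach("be", "past", "was", "were")
lexicon.teach("ask", "present-participle", "asking")
lexicon.teach("go", "past", "went")
lexicon.teach("lay", "past", "layed")   # deliberately wrong entry
lexicon.teach("lay", "past", "laid")    # the correction overwrites it
lexicon.teach("lie", "past-participle", "lain", "lied")

print(lexicon.ask("go", "past"))
print(lexicon.ask("lie", "past-participle"))
```

Other agents would call `ask` and get back an empty list for anything the lexicon has not been taught yet.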
|
|
|
|
|
Posted: Nov 25, 2011 |
[ # 3 ]
|
|
Senior member
Total posts: 623
Joined: Aug 24, 2010
|
Andrew, it seems to me that many of these conjugations and tenses are common for most verbs. I wonder if it is worthwhile producing an exhaustive list for each verb, or if the verb need only reference which rules it has been seen adhering to (either from scanning text or from human interaction). At any rate, I applaud your effort. This seems like a massive undertaking.
I read a paper a while ago (I can search again for it if you’re interested) that discussed the way that people learn grammatical forms. The main question the paper sought to address was why people could learn new grammatical rules so quickly, with only a small set of examples. For example, some verbs are only found followed by specific prepositions. Some only appear in infinitive phrases while others only in participial. Some only appear in certain contexts.
The authors concluded that people encountering a new word would use whatever the most common grammatical forms would dictate until they encounter counter-examples or correction. The barrier for “belief” in the new rule is very low, and people then tend to only use the learned rule (and any other rules known to be associated with the learned rule).
|
|
|
|
|
Posted: Nov 25, 2011 |
[ # 4 ]
|
|
Senior member
Total posts: 697
Joined: Aug 5, 2010
|
Isn’t this a question of strong and weak verbs? The latter simply follow a default set of rules. Strong verbs need to be learned. Also, the more time that passes and the more outside influences a language has, the faster ‘strong’ verbs disappear.
I’d also leave out all the ‘be’, ‘have’, ‘go’, ‘do’ conjugations. Same for not, which is only needed for the strong verbs.
|
|
|
|
|
Posted: Nov 25, 2011 |
[ # 5 ]
|
|
Senior member
Total posts: 473
Joined: Aug 28, 2010
|
Dave Morton - Nov 25, 2011:
That’s a lot of “code” for just one word, with still more to go, and I see (or rather, don’t see) a few other variations that could also probably be included. Grammar isn’t my “strong suit”, as far as being able to enumerate semantics in that manner, but my brain seems able, and it tells me that some things are missing.
For example, in American English, “had got asked”, while possibly “proper grammar”, is more or less unused, while “had been asked” is much more common. Also, the contraction for “shall not” is rarely used, and the contraction for “will not” is used, instead. Personally, I like the word Shan’t. It’s got character.
I don’t envy you the task of mapping the semantics of words in this way. It looks to be a really big challenge. Of course I have no doubt that you’re more than equal to the task, but putting myself in your shoes with regard to this project has caused my brain to singe about the edges.
I’ve attached a more complete list for the verb “ask” with over 400 distinct uses. There are still many more to add as I haven’t yet considered voice (active, passive) or all the different subjunctive formats (e.g. might, could, must, must not, etc) or verbals (e.g. being asked, likes asking). There are also a few interrogative formats missing such as those using who, which, what, why, where and when.
The goal with this list is to be as exhaustive as possible. Certainly there are many formats which are no longer used, such as the example you gave (“had got asked” rather than “had been asked”) but it is necessary to include them all so the parser can cope with anything meaningful that it is given. I’ll also be including many ungrammatical forms and malformed phrases for the same reason (e.g. ain’t).
As for the difficulty of compiling the list, it is actually a lot easier than you might think.
C R Hunt - Nov 25, 2011: Andrew, it seems to me that many of these conjugations and tenses are common for most verbs. I wonder if it is worthwhile producing an exhaustive list for each verb, or if the verb need only reference which rules it has been seen adhering to (either from scanning text or from human interaction). At any rate, I applaud your effort. This seems like a massive undertaking.
I am only creating this list for one verb, and I am not even creating it manually. I’ve written some Common Lisp software which takes all the rules for composing English sentences (there is a finite number of rules) and generates an example for each distinct case. I can then inspect the list for errors and make suitable corrections to the program.
Once I have a complete list of correct phrases, I can then use the same program that generates the phrases to generate the rules for parsing them. This also defines all the lookup tables that ascribe meaning to each statement, so really, none of this is all that difficult to do.
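The generate-then-inspect approach can be sketched roughly as follows. This is Python rather than the Common Lisp mentioned above, and it covers only the progressive forms from the sample in the first post. The table `AUXILIARIES` and the function `progressive_forms` are hypothetical names, and the output is flat strings rather than the nested lists the real program produces.

```python
from itertools import product

# Auxiliary chains for the progressive aspect, copied from the sample
# in the first post (third person plural only).
AUXILIARIES = {
    ("past-perfect", "t"): ["had been"],
    ("past-perfect", "not"): ["hadn't been", "had not been"],
    ("future", "t"): ["will be", "shall be"],
    ("future", "not"): ["won't be", "will not be", "shan't be", "shall not be"],
}


def progressive_forms(present_participle):
    """Yield (tense, polarity, phrase) for every combination in the table."""
    for tense, polarity in product(["past-perfect", "future"], ["t", "not"]):
        for auxiliary in AUXILIARIES[(tense, polarity)]:
            yield tense, polarity, f"{auxiliary} {present_participle}"


forms = list(progressive_forms("asking"))
for tense, polarity, phrase in forms:
    print(tense, polarity, phrase)
```

Every feature combination yields one line per spelling variant, so the generated list can be read through for mistakes in the rules.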
|
|
|
|
|
Posted: Nov 25, 2011 |
[ # 6 ]
|
|
Senior member
Total posts: 623
Joined: Aug 24, 2010
|
Andrew Smith - Nov 25, 2011: I am only creating this list for one verb, and I am not even creating it manually. I’ve written some Common Lisp software which takes all the rules for composing English sentences (there is a finite number of rules) and generates an example for each distinct case. I can then inspect the list for errors and make suitable corrections to the program.
Once I have a complete list of correct phrases, I can then use the same program that generates the phrases to generate the rules for parsing them. This also defines all the lookup tables that ascribe meaning to each statement, so really, none of this is all that difficult to do.
Ah, I understand now. Thanks for the clarification.
If your parser will work on the level of phrases, I’m curious how you plan to approach recognizing them in text. Obviously there will be overlap—words that can belong to one phrase or another or to the main sentence, as well as phrases that act as one part of speech within another phrase. Right now I’m recognizing these by taking all grammatical combinations and then doing multiple passes to pull out phrases within phrases, etc. (Then making sure there’s at least one main sentence that remains!)
You can imagine how many parses one ends up with given a sentence of some complexity. The real challenge has been throwing away parses efficiently. I haven’t even toyed yet with ranking the remains based on context. (Particularly, how plausible is a given subject-verb or subject-verb-DO combination.) You could probably incorporate this much more swiftly, given your work with YAGO2. Any thoughts on this?
|
|
|
|
|
Posted: Nov 25, 2011 |
[ # 7 ]
|
|
Member
Total posts: 19
Joined: May 6, 2011
|
Dear Andrew,
A few comments/observations. My apologies in advance if what I say is not relevant to what you are trying to accomplish:
(1) Regular verbs can be completely specified by their three principal parts plus an “ing” form. For example: sink/sank/sunk/sinking allows you to generate present, simple past, simple future, perfect tenses, gerund, and participle. Irregular verb specifications can be limited to listing those cases which disagree with the regular format.
(2) A deeper understanding of English indicates that the organization in (1) is not actually as complete as it first appears. For example, the simple future tense is specified by “will” as in “will sink”. But is this really distinct from “can sink”, “may sink”, “could sink”, etc.? In other words, there are a host of specialized adverbs that pertain to future time and possibility that are interchangeable with “will” and seem as worthy of being classified as a “tense”.
(3) Once you open the Pandora’s box of words being inserted into verb phrases, it brings up the annoying (though legal) possibility that input sentences will break up said phrases with various adverbs, prepositional phrases, and other annoyances. e.g., “I will most certainly come to dinner.” or “I have with all my heart tried to win.” The point is that a parser has no small difficulty in determining whether “I have” is the end of the verb as in “I have two hands” or is just the beginning of a longer verb construction whose tail lies down the road somewhere in the sentence in question.
Such is the lot of the natural language programmer ... good luck!
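Point (1) above, where only the cases that disagree with the regular pattern need listing, might look something like this in Python. The names `regular_parts`, `IRREGULAR` and `parts` are invented for the example. The regular rule is deliberately naive: it drops a final “e” and ignores consonant doubling and other spelling rules.

```python
def regular_parts(base):
    """Naive regular inflection; ignores spelling rules such as consonant doubling."""
    stem = base[:-1] if base.endswith("e") else base
    return {
        "infinitive": base,
        "past": stem + "ed",
        "past-participle": stem + "ed",
        "ing": stem + "ing",
    }


# Only the parts that differ from the regular pattern are listed.
IRREGULAR = {
    "sink": {"past": "sank", "past-participle": "sunk"},
    "go": {"past": "went", "past-participle": "gone"},
}


def parts(base):
    """Start from the regular forms, then overlay any listed exceptions."""
    merged = regular_parts(base)
    merged.update(IRREGULAR.get(base, {}))
    return merged


print(parts("ask"))
print(parts("sink"))
```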
|
|
|
|
|
Posted: Nov 26, 2011 |
[ # 8 ]
|
|
Senior member
Total posts: 473
Joined: Aug 28, 2010
|
Thanks Eulalio, your comments are entirely relevant. I’ll try to describe some of the rules that I’m using here. It would have been easier to just paste the Common Lisp code as it is quite succinct, but it doesn’t format well in this medium and some folks are scared of parentheses.
For English, every verb has a number of principal parts:
infinitive                              ; be     eat     walk
present-participle                      ; being  eating  walking
past-participle                         ; been   eaten   walked
present-third-singular                  ; is     eats    walks
preterite =past-participle              ; was    ate     =walked
past-plural =preterite                  ; were   =ate    =walked
present-plural =infinitive              ; are    =eat    =walk
present-first-singular =infinitive      ; am     =eat    =walk
The principal parts of regular verbs comprise only four distinct inflections (walk, walking, walked, walks) whereas irregular verbs have five (eat, eating, eaten, eats, ate). Only one verb, “to be”, has eight. Simple present tense and simple past tense can be conjugated using the principal parts alone:
present first singular =present-first-singular ; i eat
present second singular =present-plural ; you eat
present third singular =present-third-singular ; it eats
present first plural =present-plural ; we eat
present second plural =present-plural ; you eat
present third plural =present-plural ; they eat

past first singular =preterite ; i ate
past second singular =past-plural ; you ate
past third singular =preterite ; it ate
past first plural =past-plural ; we ate
past second plural =past-plural ; you ate
past third plural =past-plural ; they ate
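For anyone who prefers code to tables, here is a rough Python sketch of the principal parts and the simple-tense rules above (not the Common Lisp original, which was not posted). The names `DEFAULTS`, `part` and `simple` are invented for this example. The “=” defaults are modelled as a fallback chain that is followed whenever a verb does not list a part explicitly.

```python
# Fallback chain: a part that a verb does not list defaults to another part.
DEFAULTS = {
    "preterite": "past-participle",
    "past-plural": "preterite",
    "present-plural": "infinitive",
    "present-first-singular": "infinitive",
}

WALK = {"infinitive": "walk", "present-participle": "walking",
        "past-participle": "walked", "present-third-singular": "walks"}
EAT = {"infinitive": "eat", "present-participle": "eating",
       "past-participle": "eaten", "present-third-singular": "eats",
       "preterite": "ate"}
BE = {"infinitive": "be", "present-participle": "being",
      "past-participle": "been", "present-third-singular": "is",
      "preterite": "was", "past-plural": "were",
      "present-plural": "are", "present-first-singular": "am"}


def part(verb, name):
    """Look up a principal part, following the '=' defaults when it is absent."""
    while name not in verb:
        name = DEFAULTS[name]
    return verb[name]


def simple(verb, tense, person, number):
    """Conjugate the simple present or simple past from the principal parts."""
    singular = number == "singular"
    if tense == "present":
        if singular and person == "first":
            return part(verb, "present-first-singular")
        if singular and person == "third":
            return part(verb, "present-third-singular")
        return part(verb, "present-plural")
    if tense == "past":
        if singular and person in ("first", "third"):
            return part(verb, "preterite")
        return part(verb, "past-plural")
    raise ValueError(f"not a simple tense: {tense}")


print(simple(EAT, "present", "third", "singular"))
print(simple(BE, "past", "second", "singular"))
```

A regular verb needs only four entries, and “to be” is the only one that fills in all eight.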
Other tenses are composed of conjugations of an auxiliary verb combined with a principal-part of the main verb.
future =will infinitive ; i will eat
past-perfect =have(past,person,number) past-participle ; he had eaten
present-perfect =have(present,person,number) past-participle ; they have eaten
future-perfect =have(future,person,number) past-participle ; you will have eaten
past-future =be(past,person,number) to infinitive ; you were to eat
past-future =would infinitive ; you would eat
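The compound-tense rules can be sketched in Python along these lines. The helper names `have`, `be_past` and `compound` are assumptions made for this example, and only the two auxiliaries needed by the rules above are conjugated. A list is returned because past-future has two alternative constructions.

```python
def have(tense, person, number):
    """Conjugate the auxiliary 'have' for the three tenses used below."""
    if tense == "past":
        return "had"
    if tense == "future":
        return "will have"
    # present tense
    return "has" if (person, number) == ("third", "singular") else "have"


def be_past(person, number):
    """Past tense of 'be': was for first/third singular, were otherwise."""
    if number == "singular" and person in ("first", "third"):
        return "was"
    return "were"


def compound(tense, person, number, infinitive, past_participle):
    """Return every phrase the rules give for one compound tense."""
    if tense == "future":
        return [f"will {infinitive}"]
    if tense == "past-perfect":
        return [f"{have('past', person, number)} {past_participle}"]
    if tense == "present-perfect":
        return [f"{have('present', person, number)} {past_participle}"]
    if tense == "future-perfect":
        return [f"{have('future', person, number)} {past_participle}"]
    if tense == "past-future":
        return [f"{be_past(person, number)} to {infinitive}",
                f"would {infinitive}"]
    raise ValueError(f"unknown tense: {tense}")


print(compound("past-perfect", "third", "singular", "eat", "eaten"))
print(compound("past-future", "second", "singular", "eat", "eaten"))
```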
This much covers the following grammatical categories:
person(first,second,third)
number(singular,plural)
tense(past,present,future,past-perfect,present-perfect,future-perfect,past-future)
I’ve also researched and compiled rules (which I am still refining and debugging) for the additional grammatical categories:
aspect(emphatic,progressive,inchoative,repetitive)
mood(indicative,imperative,interrogative,subjunctive)
voice(active,passive)
polarity(affirmative,negative)
And then there are constructs called verbals which turn verbs into nouns (e.g. I like eating, He provided encouragement). Haven’t started coding those yet so I’m not sure how they’ll turn out.
|
|
|
|
|
Posted: Nov 26, 2011 |
[ # 9 ]
|
|
Senior member
Total posts: 473
Joined: Aug 28, 2010
|
C R Hunt - Nov 25, 2011:
If your parser will work on the level of phrases, I’m curious how you plan to approach recognizing them in text. Obviously there will be overlap—words that can belong to one phrase or another or to the main sentence, as well as phrases that act as one part of speech within another phrase. Right now I’m recognizing these by taking all grammatical combinations and then doing multiple passes to pull out phrases within phrases, etc. (Then making sure there’s at least one main sentence that remains!)
You can imagine how many parses one ends up with given a sentence of some complexity. The real challenge has been throwing away parses efficiently. I haven’t even toyed yet with ranking the remains based on context. (Particularly, how plausible is a given subject-verb or subject-verb-DO combination.) You could probably incorporate this much more swiftly, given your work with YAGO2. Any thoughts on this?
I’m using the GLR parser that I’ve been researching and developing for the past few years. You may recall that I posted about it a few months back. I’ve attached the latest versions of an example grammar file for discourse analysis and its XML output to this post.
The parser efficiently handles context-free grammars expressed in Chomsky normal form, which is a kind of algebra for languages. This means that the entire parsing process can be encapsulated in a collection of abstract formulae and made completely independent of the programming that provides the functionality. The grammar definition is compiled into sets of lookup tables and C code, which perform the actual parsing at run time.
The parser is extremely efficient. It examines the input one character at a time and determines everything that the input could possibly represent. Then it gets the next character and eliminates from the list of candidates anything that could no longer match the input, and so on. It never has to examine an input character or consider any given possible candidate more than once.
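The elimination idea on its own can be illustrated with a toy in Python. This is not a GLR parser and has none of its efficiency: it matches against a fixed list of candidate phrases instead of grammar rules, and it rechecks every surviving candidate at each step. The function name `match_incrementally` and the phrase list are made up for the illustration.

```python
def match_incrementally(candidates, text):
    """Feed text one character at a time, discarding candidates that stop matching.

    Returns the list of surviving candidates after each character.
    """
    alive = list(candidates)
    history = []
    for position, char in enumerate(text):
        # Keep only candidates that are long enough and agree at this position.
        alive = [c for c in alive if position < len(c) and c[position] == char]
        history.append(list(alive))
    return history


PHRASES = ["will ask", "will be asking", "won't ask", "shall ask"]
history = match_incrementally(PHRASES, "will b")
for step, survivors in enumerate(history, start=1):
    print(step, survivors)
```

Each input character is read once, and a candidate that has been dropped is never looked at again.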
As far as performance and capacity are concerned, I’ve tested it successfully with grammars containing hundreds of thousands of rules parsing text files many megabytes in size, which it can process in a matter of seconds. The attached example, a ten-thousand-word story, takes only a few milliseconds to break down into XML.
|
|
|
|
|
Posted: Nov 26, 2011 |
[ # 10 ]
|
|
Administrator
Total posts: 3111
Joined: Jun 14, 2010
|
It appears that the file attachment function is failing, Andrew. I’ll submit a bug report on this matter. The zip file downloaded just fine, but the grammar file failed to do so (giving me a blank page several times), so it’s likely due to a failure of the attachment script to determine a correct mime-type for the file. If you can try to zip the file up, and then attach the zip file, that would probably work, or you can try changing the extension to .txt, which would also serve to help diagnose the problem.
|
|
|
|
|
Posted: Nov 26, 2011 |
[ # 11 ]
|
|
Senior member
Total posts: 473
Joined: Aug 28, 2010
|
Dave Morton - Nov 26, 2011: It appears that the file attachment function is failing, Andrew. I’ll submit a bug report on this matter. The zip file downloaded just fine, but the grammar file failed to do so (giving me a blank page several times), so it’s likely due to a failure of the attachment script to determine a correct mime-type for the file. If you can try to zip the file up, and then attach the zip file, that would probably work (or changing the extension to .txt may do the trick, too).
Hmm. I had noticed that it wouldn’t handle files properly unless they had a commonly recognised extension. Then I promptly forgot about it when I tried to attach files to that last post. I had to zip the XML file because it was so big, and if I’d had the other half of my brain available at the time, would have put both files in the archive.
|
|
|
|
|
Posted: Nov 26, 2011 |
[ # 12 ]
|
|
Administrator
Total posts: 3111
Joined: Jun 14, 2010
|
Having looked at the XML output, I have but one question, illustrated by an excerpt as an example:
<Sentence>"<Opening_Speech>Good-bye!</Opening_Speech>" <Joining_Phrase>
Just out of curiosity, why are the quotes outside the opening_speech tags, rather than inside?
|
|
|
|
|
Posted: Nov 26, 2011 |
[ # 13 ]
|
|
Administrator
Total posts: 3111
Joined: Jun 14, 2010
|
In order to test my theories about the bug in the attachment script, I’m attaching the grammar file as a text (.txt) document.
[edit]
As I suspected, the file can be downloaded. I’ve already submitted a bug report, so this should help the staff to figure things out.
[/edit]
|
|
|
|
|
Posted: Nov 26, 2011 |
[ # 14 ]
|
|
Senior member
Total posts: 473
Joined: Aug 28, 2010
|
Dave Morton - Nov 26, 2011: Having looked at the XML output, I have but one question, illustrated by an excerpt as an example:
<Sentence>"<Opening_Speech>Good-bye!</Opening_Speech>" <Joining_Phrase>
Just out of curiosity, why are the quotes outside the opening_speech tags, rather than inside?
The quotes are outside the tags because the quotes aren’t part of the speech, unless of course the speech is being quoted within someone’s speech, as happens in one place in the story.
In a sense, the quotes are a kind of markup tag in natural language.
From a practical point of view, if you are writing a stylesheet transformation that extracts all the text from the file and formats it to make it prettier, you might want to put all the speech in curly quotes instead of straight quotes. It’s much better if the straight quotes have already been kept out of the speech for you, isn’t it?
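A small Python sketch of that kind of transformation, using the standard `xml.etree.ElementTree` module instead of a real stylesheet. The sample sentence is an assumption: the `Opening_Speech` part comes from the excerpt quoted above, but the content of `Joining_Phrase` is invented, and the real output may use other speech tags that are not handled here.

```python
import xml.etree.ElementTree as ET

# Assumption: only this speech tag is handled; the real output may have more.
SPEECH_TAGS = {"Opening_Speech"}


def render(element):
    """Flatten an element to text, wrapping speech elements in curly quotes."""
    # Straight quotes sit outside the speech tags, so they are simply dropped.
    pieces = [(element.text or "").replace('"', "")]
    for child in element:
        inner = render(child)
        if child.tag in SPEECH_TAGS:
            inner = "\u201c" + inner + "\u201d"
        pieces.append(inner)
        pieces.append((child.tail or "").replace('"', ""))
    return "".join(pieces)


SAMPLE = ('<Sentence>"<Opening_Speech>Good-bye!</Opening_Speech>" '
          '<Joining_Phrase>said Tom.</Joining_Phrase></Sentence>')
sentence = ET.fromstring(SAMPLE)
print(render(sentence))
```

Because the quotes are outside the speech element, the speech text never needs to be cleaned before it is wrapped. This sketch would also strip quotes from speech quoted inside speech, which a real transformation would have to treat separately.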
|
|
|
|
|
Posted: Nov 26, 2011 |
[ # 15 ]
|
|
Administrator
Total posts: 3111
Joined: Jun 14, 2010
|
Fair enough. That answer satisfies my curiosity. Thanks.
|
|
|
|