Eulalio: On the first page of this thread I show some examples of the complex sentence splitter (CSS) at work. In general, the program creates a pattern using two input strings, (1) the complex sentence and (2) a string containing all the simple sentences that the complex sentence conveys. It does this by the following method:
a) Identify words common to both strings.
b) Word groups that always appear together in both strings may be grouped together under a single tag, but only if the group is also one of the following:
- a noun phrase (containing adverbs, adjectives, and nouns)
- a verb phrase (containing adverbs and verbs)
- an adjective “phrase” (containing adverbs and adjectives)
- all of one type (all adverbs, all adjectives, etc.)
c) The tag identifier of the word/word group indicates the principal part of speech (in rank order: noun, verb, adjective, adverb) and a unique number, ordered by occurrence in the complex sentence.
d) All occurrences of each common word/word group are replaced in each string by the tag. The new strings comprise the pattern.
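To make steps (a), (c), and (d) concrete, here is a minimal Python sketch. It is not the actual CSS code: the toy lexicon `TOY_POS`, the helper names, and the single running counter for tag numbers are all assumptions for illustration, and the grouping in step (b) is left out, so every tag covers exactly one word.

```python
import re

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
TOY_POS = {"dog": "n", "cat": "n", "chased": "v", "barked": "v"}

def tokenize(text):
    """Split text into lowercase words and single punctuation marks."""
    return re.findall(r"[a-z]+|[^\sa-z]", text.lower())

def make_pattern(complex_sentence, simple_sentences):
    """Return (complex_pattern, simple_pattern) with common words tagged."""
    complex_tokens = tokenize(complex_sentence)
    simple_tokens = tokenize(simple_sentences)

    # Step (a): words common to both strings.
    common = set(complex_tokens) & set(simple_tokens)

    # Step (c): tag = part of speech + number, in order of first
    # occurrence in the complex sentence. Assumption: words missing
    # from the toy lexicon stay literal.
    tags = {}
    for word in complex_tokens:
        pos = TOY_POS.get(word)
        if word in common and pos and word not in tags:
            tags[word] = f"%%{pos}{len(tags)}"

    # Step (d): replace every occurrence in both strings.
    def substitute(tokens):
        return " ".join(tags.get(t, t) for t in tokens)

    return substitute(complex_tokens), substitute(simple_tokens)

complex_pattern, simple_pattern = make_pattern(
    "The dog chased the cat and barked.",
    "The dog chased the cat. The dog barked.",
)
print(complex_pattern)  # the %%n0 %%v1 the %%n2 and %%v3 .
print(simple_pattern)   # the %%n0 %%v1 the %%n2 . the %%n0 %%v3 .
```

Note that "and" is not shared by the two strings, so it stays in the complex pattern as a literal word.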
To match a new sentence against the resulting pattern, the program uses a combination of regular expressions and part-of-speech (POS) identification to try to map the new sentence to the pattern. If the sentence can be mapped, then each tag in the pattern is assigned to a word/word group in the new sentence. The simple sentence pattern is then filled with these words/word groups and the resulting simple sentences are returned.
As you can see, new patterns must be taught explicitly (by inputting the simple sentences). But the system is fairly robust and extendable to applications beyond complex sentence mapping. For example, I intend to build tools to use the CSS to identify not only entire complex sentences, but word groupings within a sentence that convey an independent idea. So, for instance, if the user inputs the sentence,
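The matching step could look roughly like the Python sketch below. Again, this is an illustration under assumptions rather than the real CSS: `TOY_POS`, `pattern_to_regex`, and `split_sentence` are hypothetical names, each tag is assumed to bind a single word, and a repeated tag must bind the same word each time.

```python
import re

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
TOY_POS = {"bird": "n", "worm": "n", "ate": "v", "sang": "v"}

TAG = re.compile(r"%%([nvar])(\d+)")

def pattern_to_regex(pattern):
    """Turn a tagged pattern into a regex with one named group per tag."""
    seen = set()
    parts = []
    for token in pattern.split():
        m = TAG.fullmatch(token)
        if m is None:
            parts.append(re.escape(token))      # literal word or punctuation
            continue
        name = m.group(1) + m.group(2)
        if name in seen:
            parts.append(f"(?P={name})")        # repeated tag: same word again
        else:
            seen.add(name)
            parts.append(f"(?P<{name}>[a-z]+)")
    return re.compile(r"\s+".join(parts))

def split_sentence(sentence, complex_pattern, simple_pattern):
    """Return the filled simple sentences, or None if the mapping fails."""
    tokens = " ".join(re.findall(r"[a-z]+|[^\sa-z]", sentence.lower()))
    m = pattern_to_regex(complex_pattern).fullmatch(tokens)
    if m is None:
        return None                             # regex shape does not fit
    bindings = m.groupdict()
    # POS check: the first letter of each tag name is the expected type.
    for name, word in bindings.items():
        if TOY_POS.get(word) != name[0]:
            return None
    return TAG.sub(lambda t: bindings[t.group(1) + t.group(2)], simple_pattern)

result = split_sentence(
    "The bird ate the worm and sang.",
    "the %%n0 %%v1 the %%n2 and %%v3 .",
    "the %%n0 %%v1 the %%n2 . the %%n0 %%v3 .",
)
print(result)  # the bird ate the worm . the bird sang .
```

Rejecting on the regex first and only then checking parts of speech is one way to discard a non-matching pattern cheaply.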
“The robot’s body is made of metal,”
the bot could use a learned pattern,
“%%n0 ’s %%n1” -> “%%n0 has a %%n1 .”
to learn the sentence “The robot has a body.”
Bam, possessives learned.
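As a rough Python illustration of that idea (a sketch only, since these tools are not built yet): the regex below is a hand-written stand-in for the learned possessive pattern, it searches inside the sentence instead of matching all of it, and it assumes a noun group is an optional determiner plus one word. `POSSESSIVE` and `learn_possessive` are hypothetical names, and no part-of-speech check is done here.

```python
import re

# Hand-written stand-in for the learned pattern "%%n0 's %%n1".
# Assumption: a noun group is an optional determiner plus one word.
POSSESSIVE = re.compile(
    r"(?P<n0>(?:(?:the|a|an)\s+)?[a-z]+)['’]s\s+(?P<n1>[a-z]+)",
    re.IGNORECASE,
)

def learn_possessive(sentence):
    """Return a '<owner> has a <thing>.' sentence, or None if no match."""
    m = POSSESSIVE.search(sentence)   # search: match a part of the sentence
    if m is None:
        return None
    owner = m.group("n0")
    return f"{owner[0].upper()}{owner[1:]} has a {m.group('n1')}."

print(learn_possessive("The robot’s body is made of metal,"))
# The robot has a body.
```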
As for the speed of the CSS, I can’t make any definitive comments at this point. The largest list of patterns the bot has tried to search through was no more than 10. The search time was negligible, but this could change once the pattern list grows much larger. I’ve tried to design the CSS to reject non-matching patterns as quickly as possible to spare computation time, but I’ve yet to really tax the system.
Victor: The quotes were for emphasis, since I couldn’t remember the italics tags off the top of my head. But now I won’t forget them.