Hi folks,
I just started working on/pondering one aspect, and would love to hear the thoughts/suggestions/critique of any fellow coders here. It’s about ‘lexical’ or ‘syntactic’ parsing with pattern phrases. (I am trying to avoid the use of corpora for ‘POS tagging’ as long as possible.) This is a VERY long post, so it’s likely that nobody will bother to read it, but it still serves to organize my thoughts.
All right, so this is in the context of my still-in-its-infancy chatbot Yoko, who works by matching user input to ‘pattern phrases’, which in turn are linked to a ‘meaning + parameters’, which is then processed by looking things up or storing them in her knowledge database, etc.
So as a concrete real-world example: the user inputs ‘are cats animals?’, this matches (the generated regex-version of) the ‘are [somethings] [somethings]’ pattern, which is linked in turn to the meaning QUESTION_HAS_CLASS_PARENTCLASS which is then processed from there.
Currently these pattern phrases are defined in a bunch of .json files as follows:
{
"pattern" : "are [somethings] [somethings]",
"examplephrase" : "are cats animals?",
"meaning" : "HAS_CLASS_PARENTCLASS",
"params" : {"class" : "1", "parentclass" : "2"}
}
(the “1” and “2” here indicate which [something] maps to which parameter for the further processing, the rest should be pretty obvious)
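To make the matching step concrete, here is a minimal Python sketch of how a block like the one above could be compiled to a regex and matched. This is not Yoko's actual code: the names `compile_pattern` and `match_block` are made up for illustration, and the two placeholder regexes are assumptions ([somethings] as "a word ending in s", [something] as "any single word").

```python
import re

# Assumed placeholder regexes (not Yoko's real ones): [somethings] is "a word
# ending in s", [something] is "any single word".
PLACEHOLDERS = {
    "[somethings]": r"(\w+s)",
    "[something]": r"(\w+)",
}

def compile_pattern(pattern):
    """Turn a pattern phrase into a regex with one capture group per placeholder."""
    parts = []
    for token in pattern.split():
        # Placeholders become capture groups, everything else is literal text.
        parts.append(PLACEHOLDERS.get(token, re.escape(token)))
    # Allow flexible whitespace and one optional trailing '?', '!' or '.'.
    return re.compile(r"^\s*" + r"\s+".join(parts) + r"\s*[?!.]?\s*$", re.IGNORECASE)

def match_block(block, user_input):
    """Return (meaning, params) if the input matches the block, else None."""
    m = compile_pattern(block["pattern"]).match(user_input)
    if m is None:
        return None
    # "params" maps a parameter name to the 1-based index of its placeholder.
    params = {name: m.group(int(index)) for name, index in block["params"].items()}
    return block["meaning"], params

block = {
    "pattern": "are [somethings] [somethings]",
    "examplephrase": "are cats animals?",
    "meaning": "HAS_CLASS_PARENTCLASS",
    "params": {"class": "1", "parentclass": "2"},
}

print(match_block(block, block["examplephrase"]))
```

With these assumptions, the example phrase yields the meaning plus `class` = "cats" and `parentclass` = "animals", while ‘is a cat an animal?’ does not match, which is exactly why a second block is needed for it.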
Anyway, this is quickly getting out of hand, as I’m starting to reach the 100s range of these, and I’m still nowhere close to capturing even a fraction of the possible phrase structures with them. This captures ‘are cats animals’, but for ‘is a cat an animal’ I need another block like this, etc…
I’ve considered many improvements for this verbosity, and already have a bunch of ad-hoc ones in place. To stick with the cats/animals thing, one example would be a ‘sub pattern’ that matches both ‘a something is’ and ‘somethings are’. But the singular/plural distinction affects other parts of the same phrase in so many ways that I prefer sticking to patterns for entire phrases for now.
Or another one would be to write the parameter meaning right inline: “a [something|class] is a [something|parentclass]”, but I don’t like how this hurts readability.
(already, and in the future, there is of course some separate logic for processing more complex phrases like ‘cats are animals that can meow’. But now I’m focusing on ‘atomic’ ones)
Ok, here’s how I think I’ll have my new pattern definitions look:
{
"patterns" : [
"are [classes] [parentclasses]",
"is (a/each/every) [class] (always)? a [parentclass]",
"(I wonder/do you know) (if/whether) [classes] (always)? [parentclasses]"
],
"meaning" : "HAS_CLASS_PARENTCLASS"
}
All the wildcard and multiple choice stuff should look familiar to people with AIML/chatscript/regex experience, but those are just there to make this as realistic as possible. They are taken straight from Yoko, but that’s not the new and exciting part.
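For readers less familiar with that kind of syntax, here is a rough Python sketch of how the ‘(a/b)’ choices and ‘(x)?’ optionals could be translated to a regex. It is only an illustration under my own assumptions, not Yoko's implementation: the token grammar, the function names `pattern_to_regex` and `parse`, and the catch-all `\w+` used for every [parameter] are all hypothetical.

```python
import re

# Assumed token grammar: "(a/b c)" choice group with an optional trailing "?",
# "[name]" placeholder, or a plain literal word.
TOKEN = re.compile(r"\(([^)]+)\)(\?)?|\[(\w+)\]|([^\s()\[\]]+)")

def pattern_to_regex(pattern, placeholder_regex=r"\w+"):
    """Compile a pattern phrase with choice groups and optional parts to a regex."""
    pieces = []
    for choices, optional, placeholder, literal in TOKEN.findall(pattern):
        if choices:
            # "(I wonder/do you know)" -> (?:I\s+wonder|do\s+you\s+know)
            alternatives = [
                r"\s+".join(re.escape(word) for word in alt.split())
                for alt in choices.split("/")
            ]
            piece = "(?:" + "|".join(alternatives) + ")"
        elif placeholder:
            # Every placeholder becomes a named group; the regex is a stand-in.
            piece = "(?P<" + placeholder + ">" + placeholder_regex + ")"
        else:
            piece = re.escape(literal)
        # Each piece carries its own leading whitespace, so an optional piece
        # can vanish together with the space in front of it.
        pieces.append(r"(?:\s+" + piece + ")" + ("?" if optional else ""))
    return re.compile("^" + "".join(pieces) + r"\s*[?!.]?\s*$", re.IGNORECASE)

def parse(pattern, text):
    """Return the captured parameters as a dict, or None if there is no match."""
    # Prepend a space so the first piece finds its leading whitespace too.
    m = pattern_to_regex(pattern).match(" " + text.strip())
    return m.groupdict() if m else None

pattern = "is (a/each/every) [class] (always)? a [parentclass]"
print(parse(pattern, "is every cat always a mammal?"))
print(parse(pattern, "is a cat a mammal?"))
```

Both inputs give `class` = "cat" and `parentclass` = "mammal". Note that the literal ‘a’ in the pattern means ‘is a cat an animal?’ would not match as written; that would need something like ‘(a/an)’.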
The new part is that there is now ONE block where previously there would have been THREE, because they are all associated with the same meaning. Moreover, there is no [something] anymore, with a separate line to map each [something] to a meaningful parameter, but rather the parameters THEMSELVES contain both the ‘lexical form’ information and the semantic (parameter) meaning, while the whole is still readable.
I love that the pattern phrase itself contains all this syntactic+parameter meaning, the definition is so much more compact, while it all remains very readable, to the point where I consider not bothering with example phrases anymore (though they are great for unit tests!).
Sadly this improvement will also come at a cost of slightly increased complexity in a way: instead of almost everything in the pattern phrase being a [something] or [somethings], there will now be many more cases:
BEFORE:
a [something] is a [something] -> CLASS_HAS_PARENT_CLASS (‘a cat is an animal’)
a [something] can [something] -> CLASS_HAS_ACTION (‘a cat can meow’)
AFTER:
a [class] is a [parentclass]
a [class] can [performaction]
... The extra complexity is that I will introduce a separate mapping that translates each of those [class]-type parameters into TWO others: a lexical one and a semantic one. The added benefit is that the lexical one will be more precise than before, which will allow me in the future to incorporate irregular plurals, verb conjugations, etc.
Here is the additional mapping that accompanies the new-style patterns, again in JSON because I love it:
[
{
"semanticparameter" : "class",
"patternoccurances" : [
{"patternparameter" : "classes", "lexicalform" : "noun_plural"},
{"patternparameter" : "class", "lexicalform" : "noun_singular"}
]
},
{
"semanticparameter" : "secondaryaction",
"patternoccurances" : [
{"patternparameter" : "performinganotheraction", "lexicalform" : "verb_gerund"},
{"patternparameter" : "performsanotheraction", "lexicalform" : "verb_presentsimple_3rd"}
]
},
{
"semanticparameter" : "secondaryclass",
"patternoccurances" : [
{"patternparameter" : "otherclasses", "lexicalform" : "noun_plural"},
{"patternparameter" : "otherclass", "lexicalform" : "noun_singular"}
]
}
]
As a legend of the above: ‘semanticparameter’ is what will be taken as input to the semantic processing functions (that are directly tied to Yoko’s knowledge), ‘patternparameter’ is what will occur in patternphrases, and ‘lexicalform’ is what will be translated into regexes for the actual pattern matching.
... so, this gives me the power of my more descriptive+compact pattern phrases, AND the flexibility of more regexes (and grammar-specific exceptions for what a regex can’t capture) for each of those lexical forms (currently ‘plural noun’ and ‘3rd person present verb’ are both simply the same [somethings] regex, which more or less matches ‘whatever ends with an s’).
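To show how the pieces could fit together, here is a small Python sketch that resolves each pattern parameter through the mapping into a semantic parameter plus a lexical-form regex. Everything beyond the mapping itself is an assumption for illustration: the regexes in `LEXICAL_REGEX`, the helper names `compile_pattern` and `parse`, and the two test patterns (which use [otherclass]/[otherclasses] because those appear in the mapping above).

```python
import re

# The mapping from the post (key spelling kept exactly as in the JSON).
MAPPING = [
    {"semanticparameter": "class",
     "patternoccurances": [
         {"patternparameter": "classes", "lexicalform": "noun_plural"},
         {"patternparameter": "class", "lexicalform": "noun_singular"}]},
    {"semanticparameter": "secondaryaction",
     "patternoccurances": [
         {"patternparameter": "performinganotheraction", "lexicalform": "verb_gerund"},
         {"patternparameter": "performsanotheraction", "lexicalform": "verb_presentsimple_3rd"}]},
    {"semanticparameter": "secondaryclass",
     "patternoccurances": [
         {"patternparameter": "otherclasses", "lexicalform": "noun_plural"},
         {"patternparameter": "otherclass", "lexicalform": "noun_singular"}]},
]

# Assumed regex per lexical form; deliberately naive, to be refined later
# (irregular plurals, conjugations, ...).
LEXICAL_REGEX = {
    "noun_plural": r"\w+s",
    "noun_singular": r"\w+",
    "verb_gerund": r"\w+ing",
    "verb_presentsimple_3rd": r"\w+s",
}

# Flatten the mapping: patternparameter -> (semanticparameter, lexicalform).
LOOKUP = {
    occ["patternparameter"]: (entry["semanticparameter"], occ["lexicalform"])
    for entry in MAPPING
    for occ in entry["patternoccurances"]
}

def compile_pattern(pattern):
    """Compile a pattern phrase; each [parameter] becomes a group named after
    its semantic parameter, using the regex of its lexical form."""
    parts = []
    for token in pattern.split():
        m = re.fullmatch(r"\[(\w+)\]", token)
        if m:
            semantic, lexical = LOOKUP[m.group(1)]
            parts.append("(?P<%s>%s)" % (semantic, LEXICAL_REGEX[lexical]))
        else:
            parts.append(re.escape(token))
    return re.compile(r"^\s*" + r"\s+".join(parts) + r"\s*[?!.]?\s*$", re.IGNORECASE)

def parse(pattern, text):
    """Return {semanticparameter: matched text}, or None if there is no match."""
    m = compile_pattern(pattern).match(text)
    return m.groupdict() if m else None

print(parse("are [classes] [otherclasses]", "are cats animals?"))
print(parse("is a [class] a [otherclass]", "is a cat a mammal?"))
```

Both patterns end up filling the same semantic parameters (`class` and `secondaryclass`), so the processing code never needs to know which surface form matched. Two things this sketch leaves out: normalizing ‘cats’ and ‘cat’ to the same stored value, and patterns that use the same semantic parameter twice (duplicate group names would raise an error in `re`).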
YOU MADE IT THROUGH THE POST! What do you think? Am I overlooking something? Is the introduction of the separate extra mapping not worth the more compact pattern phrases?