|
|
Member
Total posts: 6
Joined: Jul 1, 2010
|
I’m working on a chatbot that will try to add to its knowledge as it talks to people. A major challenge for me has been to figure out how to structure the database. This is what I’ve come up with so far if anyone would like to comment.
http://nac53.tripod.com/
A quick explanation:
Each subject starts with a dot. The data associated with a subject is enclosed in curly braces. Everything else is ignored. The data consists of a list of parameter names followed by parameter values. Most parameter names should be easy to figure out. “def” isn’t so straightforward. It refers to the category that the subject is part of. The categories are shown at the top of the file. The chatbot will store information about “specific entities”, mainly, the individual people it talks to. This information is added to the bottom of the file and is given an id# as a subject name.
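For anyone who wants to experiment with the layout described above, here is a minimal Python sketch of a reader for it. It is only a guess at the details: it assumes parameters are written as name:value pairs separated by whitespace, and the "temp" parameter and the sample text are made up for illustration. Entries that are missing the leading dot or the closing brace are simply skipped.

```python
import re

# Assumed syntax: ".subject{name:value name2:value2}".
# Anything outside such an entry is ignored, as described in the post.
ENTRY = re.compile(r"\.([^\s{}]+)\s*\{([^{}]*)\}")


def parse_knowledge(text):
    """Return {subject: {parameter name: parameter value}} for well-formed entries."""
    entries = {}
    for subject, body in ENTRY.findall(text):
        params = {}
        for pair in body.split():
            name, sep, value = pair.partition(":")
            if sep:  # keep only tokens that really look like name:value
                params[name] = value
        entries[subject] = params
    return entries


# Hypothetical sample file; "temp" is an invented parameter name.
sample = """
categories: device drink   (ignored, outside any entry)
.phone{def:device}
.coffee{def:drink temp:hot}
"""

knowledge = parse_knowledge(sample)
print(knowledge["coffee"]["def"])
```

Keeping each subject as a dictionary of parameters makes it easy to append new "specific entity" records at run time and write them back out in the same format.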
So, since I’m new to databases, does it look like I’m on the right track, or are there better ways of doing this?
Wayne
|
|
|
|
|
Posted: Jul 7, 2010 |
[ # 1 ]
|
|
Administrator
Total posts: 3111
Joined: Jun 14, 2010
|
That’s a rather interesting approach to classifying certain things. I would imagine, however, that it will take quite a while to get enough stuff classified to be useful. Still, I’m rooting for you.
I did notice that there were what seemed to be typos in the file. I found this:
tower
phone{def:device
radio{def:device
computer{def:device
.coffee{
near the end of the file.
All in all, this gives me another avenue of investigation for Morti’s next generation of chat engine. Thanks for sharing.
|
|
|
|
|
Posted: Jul 7, 2010 |
[ # 2 ]
|
|
Senior member
Total posts: 974
Joined: Oct 21, 2009
|
I think you have the right approach; I have chosen a very similar way of encoding information about words used by my CLUES engine. I don’t think standard relational database design, although highly effective for specialized purposes, will do the trick for a chatbot’s database. The reason is that definitions of words, word ambiguity, and the complexity of generating many parse trees from one user input make SQL-type DBs not a very effective choice, imho.
My bot’s engine (CLUES - complex language understanding execution system) uses key/value pairs to define word meanings, and a tree structure of key/value pairs to assign metadata to the (sometimes thousands of) parse trees generated from analysis of user input. It then uses that data to determine which parse tree represents the most likely meaning the user intended to convey.
|
|
|
|
|
Posted: Jul 10, 2010 |
[ # 3 ]
|
|
Member
Total posts: 6
Joined: Jul 1, 2010
|
Dave Morton - Jul 7, 2010: That’s a rather interesting approach to classifying certain things. I would imagine, however, that it will take quite a while to get enough stuff classified to be useful. Still, I’m rooting for you.
I did notice that there were what seemed to be typos in the file. I found this:
tower
phone{def:device
radio{def:device
computer{def:device
.coffee{
near the end of the file.
All in all, this gives me another avenue of investigation for Morti’s next generation of chat engine. Thanks for sharing.
I wanted to make sure the database is well structured before I take the time to add a lot of data. When I’m satisfied with it, I may put the chatbot on the web and users should be able to teach it quite a bit. At least the chatbot will have a simple frame of reference.
Some of the data is incomplete (a.k.a. typos) as I just quickly tossed stuff in there as it crossed my mind.
Wayne
|
|
|
|
|
Posted: Jul 12, 2010 |
[ # 4 ]
|
|
Member
Total posts: 6
Joined: Jul 1, 2010
|
Victor Shulist - Jul 7, 2010: I think you have the right approach; I have chosen a very similar way of encoding information about words used by my CLUES engine. I don’t think standard relational database design, although highly effective for specialized purposes, will do the trick for a chatbot’s database. The reason is that definitions of words, word ambiguity, and the complexity of generating many parse trees from one user input make SQL-type DBs not a very effective choice, imho.
My bot’s engine (CLUES - complex language understanding execution system) uses key/value pairs to define word meanings, and a tree structure of key/value pairs to assign metadata to the (sometimes thousands of) parse trees generated from analysis of user input. It then uses that data to determine which parse tree represents the most likely meaning the user intended to convey.
So it’s looking like I haven’t likely been wasting my time. That’s good to hear.
I haven’t heard of key/value pairs or parse trees. I’ll have to read up on that.
I wouldn’t mind taking a look at part of your database file just so I can compare.
Wayne
|
|
|
|
|
Posted: Jul 12, 2010 |
[ # 5 ]
|
|
Senior member
Total posts: 974
Joined: Oct 21, 2009
|
I have several files, and within a year or two there will be perhaps hundreds or thousands of files (plain text) in the entire db. For faster lookup, I break down the files based on the first 2 characters. For example, all information about “dog” or “door” would go into a directory named “do”. Inside that directory are 2 files, one for all the grammar properties and one for ‘world knowledge’.
so do/grammer-props.txt would contain “pos = common-noun dog” (part of speech,pos)
and do/world-props.txt would contain things like “has-legs = true; num-legs = 4”
Nouns of more than one word, like “new york”, would be in directory “ne2”, because the first 2 letters are “ne” and it is a 2-word term.
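The directory scheme above can be sketched in a few lines of Python. This is an illustration only, not the CLUES code: the function names are hypothetical, and lowercasing the term before taking its first 2 characters is an assumption. The file names are the ones given in the post.

```python
from pathlib import PurePosixPath


def bucket_for(term):
    """Directory name for a term: first 2 characters, plus the word count if it has several words."""
    words = term.lower().split()  # lowercasing is an assumption
    if not words:
        raise ValueError("empty term")
    bucket = words[0][:2]
    if len(words) > 1:
        bucket += str(len(words))
    return bucket


def files_for(term):
    """Paths of the grammar and world-knowledge files that would hold this term."""
    folder = PurePosixPath(bucket_for(term))
    return folder / "grammer-props.txt", folder / "world-props.txt"


for term in ("dog", "door", "New York"):
    grammar_file, world_file = files_for(term)
    print(term, "->", grammar_file, "and", world_file)
```

The point of the bucketing is that a lookup only has to open two small files instead of scanning one large one.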
|
|
|
|
|
Posted: Jul 12, 2010 |
[ # 6 ]
|
|
Senior member
Total posts: 257
Joined: Jan 2, 2010
|
Victor,
I would be very interested in looking at a couple of text files: grammar properties and world knowledge. Is that possible?
Regards,
Chuck
|
|
|
|
|
Posted: Jul 13, 2010 |
[ # 7 ]
|
|
Senior member
Total posts: 974
Joined: Oct 21, 2009
|
Oh yes, I have grammar knowledge and what I call ‘world knowledge’ working together so that my bot can find the parse tree that the user “really meant”.
So far my bot’s engine does 3 stages of processing (about 3 other stages to be developed later this year and next year).
stage 1 - text segmentation and morphology (example knowing that “lightly” is an adverb and relates to adjective “light”).
stage 2 - generate all possible grammatical ways of parsing user input (based of course on GRAMMAR rules)
stage 3 - of all those (sometimes hundreds of) parse trees, use real world knowledge of those words to pick the parse trees which represent what the user really meant (or probably meant).
Example: “I shot an elephant in my pajamas”
CLUES generates (at stage 2) all of:
1) “in my pajamas” is modifying “I”
2) “in my pajamas” is modifying “shot”
3) “in my pajamas” is modifying “elephant”
it generates those 3 possibilities based on grammar knowledge
then, with world knowledge it promotes certain parse trees.
Based on the fact that “I” is a person, “pajamas” are clothing, and people wear clothing, it promotes hypothesis #1.
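A toy Python version of that world-knowledge check might look like the following. The tables and the function name are invented for illustration and say nothing about how CLUES actually stores its facts.

```python
# Hypothetical world-knowledge tables, invented for this example.
IS_A = {"I": "person", "pajamas": "clothing", "elephant": "animal"}
WEARS = {("person", "clothing")}  # "people wear clothing"


def wears(head, garment):
    """True if the category of `head` is known to wear the category of `garment`."""
    return (IS_A.get(head), IS_A.get(garment)) in WEARS


# The three words that "in my pajamas" could attach to (cases 1 to 3).
candidates = ["I", "shot", "elephant"]
promoted = [head for head in candidates if wears(head, "pajamas")]
print(promoted)
```

With these tables only the attachment to “I” is promoted, because nothing says that animals or actions wear clothing.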
|
|
|
|
|
Posted: Jul 13, 2010 |
[ # 8 ]
|
|
Administrator
Total posts: 3111
Joined: Jun 14, 2010
|
“I shot an elephant (that was) in my pajamas.” {case #3}
Um… Ow?
I can see a LOT of potential here, Victor. Keep up the great work!
|
|
|
|
|
Posted: Jul 13, 2010 |
[ # 9 ]
|
|
Senior member
Total posts: 257
Joined: Jan 2, 2010
|
Vic,
I’ve been reading a tutorial on Natural Language Processing today. It’s an interesting read and has got me thinking more deliberately about analyzing sentences. I even took a timeout tonight to read an overview of AIML online.
So can I see a couple of actual files? I’m very curious how you’ve formatted this information. Perhaps you can post a link or email something if you don’t mind. I’ve seen AIML but am curious how you and others might format knowledge. This is a task I must undertake soon.
Thanks,
Chuck
|
|
|
|
|
Posted: Jul 13, 2010 |
[ # 10 ]
|
|
Senior member
Total posts: 974
Joined: Oct 21, 2009
|
The format I use, believe it or not, for storing information about words is extremely simple! The ASCII text file goes like this:
Each line is…
<term> <space> name=value | name2=value2 ...etc
Now <term> may be one word like “Elephant” or more than one word (like “New York”).
one or more spaces delimit the word and information about it, and each name=value pair is delimited by v-bar.
an example of name=value would be pos=noun (part of speech) so to tell CLUES that a ‘chair’ is a common-noun…
ch/grammer.txt ——————————-
chair pos=common-noun
Now I have inheritance also - you can tell CLUES that if something is a common-noun it is also just a noun
that is in inherit.txt:
pos=common-noun > pos=noun
tells it if a word is a common-noun it is also a noun.
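Here is a small Python sketch of how lines in that format, plus the inheritance rules, could be read. It is an illustration under assumptions rather than the CLUES implementation: the function names are hypothetical, the term is taken to be everything before the first token containing “=”, and facts are stored as (name, value) pairs so that a word can be both pos=common-noun and pos=noun.

```python
def _pair(text):
    """Split 'name=value' into a (name, value) tuple."""
    name, _, value = text.partition("=")
    return name.strip(), value.strip()


def parse_line(line):
    """Parse '<term> name=value | name2=value2' into (term, set of facts)."""
    tokens = line.split()
    # Assumption: the term ends where the first token containing '=' begins.
    first = next((i for i, tok in enumerate(tokens) if "=" in tok), len(tokens))
    term = " ".join(tokens[:first])
    rest = " ".join(tokens[first:])
    facts = {_pair(part) for part in rest.split("|") if part.strip()}
    return term, facts


def parse_rule(line):
    """Parse an inheritance rule such as 'pos=common-noun > pos=noun'."""
    left, _, right = line.partition(">")
    return _pair(left), _pair(right)


def expand(facts, rules):
    """Apply inheritance rules repeatedly until no new facts appear."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for cause, effect in rules:
            if cause in facts and effect not in facts:
                facts.add(effect)
                changed = True
    return facts


term, facts = parse_line("chair pos=common-noun")
rules = [parse_rule("pos=common-noun > pos=noun")]
print(term, sorted(expand(facts, rules)))
```

Because expand keeps looping until nothing changes, chains of rules (A implies B, B implies C) are followed as well.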
Dave,
““I shot an elephant (that was) in my pajamas.” {case #3}”
Yes, it does consider that case… until it learns that it “likes the other case more” ... if it has reason to like the other case more (and one reason is that it knows that people *generally* wear clothing more than animals do, and it knows elephants are animals).
CLUES works by “generally what is more likely” - no absolutes, and nothing is ever known 100% for sure .. gee.. that kind of reminds me of real life, doesn’t it?
Actually.. CLUES works by “mostly no absolutes” (otherwise that would be absolute).
and MOST of the time you don’t know anything for 100% certainty.
Yes, things like 1+1=2 are 100% certain and absolute, but arithmetic isn’t real life.
|
|
|
|
|
Posted: Jul 13, 2010 |
[ # 11 ]
|
|
Administrator
Total posts: 3111
Joined: Jun 14, 2010
|
By your response, Victor, am I to assume that you use some mechanism such as weighting to determine which case is most likely to be chosen? Or do you use another type of scoring system altogether?
|
|
|
|
|
Posted: Jul 13, 2010 |
[ # 12 ]
|
|
Senior member
Total posts: 974
Joined: Oct 21, 2009
|
So after stage 2 processing generates all **grammatically possible** parse trees (yes, a very CPU intensive process), whether they make sense or are ridiculous (“elephant in pajamas”), stage 3 applies known things about the world to each of them.
So.. stage 3 processing says, hum, I see case #1 involves a person (“I” - it generalizes that “I” is a person), knows pajamas are clothing based on word database, and concludes that it “makes sense” for “I” to be modified by the *prepositional phrase* “in pajamas”. So the parse tree for case #1 gets a “merit point”.
If the other 2 cases above do not get any merit points, they fall back and the system “likes” case #1 the most - and goes with that, and thus assumes that is what ‘the user really meant’.
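The merit-point idea can be sketched in Python roughly as follows. The class, the check function and the data are all hypothetical stand-ins chosen for this example; the real stage 3 presumably consults the word database rather than a hard-coded set.

```python
from dataclasses import dataclass


@dataclass
class ParseTree:
    label: str
    head: str       # the word that "in my pajamas" attaches to in this tree
    merit: int = 0


def rank(trees, checks):
    """Give each tree one merit point per world-knowledge check it passes, best first."""
    for tree in trees:
        tree.merit = sum(1 for check in checks if check(tree))
    return sorted(trees, key=lambda t: t.merit, reverse=True)


# Hypothetical world knowledge: which heads are people (people wear clothing).
PEOPLE = {"I"}


def person_in_clothing(tree):
    return tree.head in PEOPLE


trees = [
    ParseTree("case 1", "I"),
    ParseTree("case 2", "shot"),
    ParseTree("case 3", "elephant"),
]
best = rank(trees, [person_in_clothing])[0]
print(best.label, best.merit)
```

Adding more check functions to the list lets several independent pieces of world knowledge contribute points to the same tree, so the choice is a matter of “more likely” rather than a hard yes or no.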
Now yes, “I” could be a bot.. if we have it talk to a bot… that is where other stages (yet to be written) will come in, stage 4 processing (“context sensitive - stateful conversation”) - where it will take into consideration that specific conversation, what was said in it, what information it knows about who or what it is talking to.
|
|
|
|
|
Posted: Jul 14, 2010 |
[ # 13 ]
|
|
Senior member
Total posts: 257
Joined: Jan 2, 2010
|
Vic,
Thanks for the overview and the basic formatting.
I’m impressed with stage 2 and 3. I’m curious, have you got this bit functioning decently? Or is it still being tweaked?
Dave,
Vic speaks of ‘merit’ points…is that what you meant by ‘weighting’?
Regards,
Chuck
|
|
|
|
|
Posted: Jul 14, 2010 |
[ # 14 ]
|
|
Senior member
Total posts: 974
Joined: Oct 21, 2009
|
Yes, I would say I do have it functioning fairly decently - all testing so far, including tests with “complex” sentences (English grammar defines complex sentence as one in which there is one main clause and one or more subordinate clauses) show it to be *extremely* promising !
I am absolutely convinced that any chatbot that hopes to win a Turing test must first be based on grammar, combined with knowledge of the world - first generate all possible grammar interpretations, then filter based on what words have valid associations based on the world knowledge.
I have run into some practical issues, but a “proof of concept” is functioning. By practical issues, I mean, how I will organize all code and data to make it manageable. Right now code and data are rather tightly coupled.
|
|
|
|
|
Posted: Aug 27, 2010 |
[ # 15 ]
|
|
Senior member
Total posts: 623
Joined: Aug 24, 2010
|
Hi there, everyone. I’m new here, and I have been having a blast looking through this forum and finding so many chatbot enthusiasts. I’ve been working on a project myself, off and on, for the past four years. Looking forward to contributing to the forum, and learning from you all.
Victor Shulist - Jul 13, 2010:
stage 3 - of all those (sometimes hundreds of) parse trees, use real world knowledge of those words to pick the parse trees which represent what the user really meant (or probably meant).
I used to have my parser do the same thing, generating often hundreds of grammatical interpretations depending on the complexity of the input sentence. But it was just so slow. And often chose inaccurately. Part of this was a small set of learned grammatical forms to judge the new sentence against, but I feel like as that database grew, the time factor would only get worse. How long does it take your parser to make a decision about sentence grammar? How much does it depend on the length/complexity of the input (what’s the spread)?
The fact that I’m working with Python probably doesn’t speed up the process either. But I’d really like to go back to generating multiple options and culling, rather than my current process of culling as I go.
|
|
|
|