

Learning chatbot
 
 

My friends and I are making a chatbot as part of a 40-day summer project at our institute. Ten of those days are left now. So far, we’ve done an online course on Machine Learning and have also read major parts of the Natural Language Toolkit (NLTK) book in Python. The current bot works this way:

When it receives input, it checks an SQL database of sentences to see whether it has encountered an exact match before. If yes, it responds according to the recorded response. If not, it tags the words of the sentence by part of speech and stores them in another SQL table, “Words.” We have assigned a weight to each English word by taking the reciprocal of its number of occurrences across several NLTK corpora, multiplied by a factor chosen according to the word’s part of speech. The bot then uses these weights to find the best match among the previously encountered inputs and prints the corresponding output.
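For concreteness, here is a minimal sketch of that kind of weighting, assuming NLTK’s Brown corpus as the frequency source; the POS_WEIGHTS factors below are made-up placeholders, not our project’s actual values.

from collections import Counter

import nltk
from nltk.corpus import brown

# Needs: nltk.download('brown'), nltk.download('punkt'), nltk.download('averaged_perceptron_tagger')

FREQ = Counter(w.lower() for w in brown.words())

POS_WEIGHTS = {'NN': 3.0, 'VB': 2.0, 'JJ': 1.5}  # hypothetical per-tag factors
DEFAULT_WEIGHT = 1.0

def word_weight(word, tag):
    # Reciprocal corpus frequency, scaled by a part-of-speech factor.
    freq = FREQ.get(word.lower(), 1)  # unseen words get the largest reciprocal
    return POS_WEIGHTS.get(tag[:2], DEFAULT_WEIGHT) / freq

def sentence_weights(sentence):
    tokens = nltk.word_tokenize(sentence)
    return {w: word_weight(w, t) for w, t in nltk.pos_tag(tokens)}

print(sentence_weights("Where is the nearest station?"))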

This is not giving us good results, though. One of our seniors suggested that we focus on making the bot learn, from a huge database, what kinds of sentences can come as input and how to respond to them. We have no clue how to make that happen. How would the bot learn this automatically? Any suggestions?

 

 
  [ # 1 ]

You kind of lost me at “reciprocal”, but if you’re using anything along the lines of neural nets, the poor results will be almost entirely due to having too few appropriate inputs, not so much due to your matching algorithms.
It sounds like you have been focusing on finding near-exact matches to a limited set of inputs. I think you’ll either need a broader, less strict way of matching against a limited set of inputs, or you should keep your strict matching algorithm and just get your hands on a huge set of inputs, as your senior suggests.

You might improve your algorithms a little by using N-grams to score sequential word matches higher than mere part-of-speech matches, or you could even try dropping the part-of-speech scoring altogether, because the same word with the same meaning can take different part-of-speech roles across paraphrasings of an input. I’d focus on gathering more inputs instead, though.
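If you want to try the N-gram idea, a rough sketch with NLTK could look like this; the stored sentences are just placeholders.

from nltk import word_tokenize
from nltk.util import ngrams

def bigram_overlap(a, b):
    # Count the word bigrams two sentences share.
    ba = set(ngrams(word_tokenize(a.lower()), 2))
    bb = set(ngrams(word_tokenize(b.lower()), 2))
    return len(ba & bb)

stored = ["what is your name", "where do you live", "how old are you"]
user_input = "tell me where you live"
print(max(stored, key=lambda s: bigram_overlap(user_input, s)))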

The question is where to get them. You could take Cleverbot’s approach: open up your chatbot to a large audience and record their consecutive inputs as both input and output. Other ways would be to pillage scripts, forums, social media or messenger chat logs, with all the accompanying privacy issues. Some of the chatbot owners here might be willing to lend you some of their logs, possibly for a fee, if you’re not going to make permanent use of them.

 

 
  [ # 2 ]
Nishit Asnani - Jun 14, 2015:

My friends and I are making a chatbot as part of a 40-day summer project at our institute. Ten of those days are left now. So far, we’ve done an online course on Machine Learning and have also read major parts of the Natural Language Toolkit (NLTK) book in Python. The current bot works this way:

When it receives input, it checks an SQL database of sentences to see whether it has encountered an exact match before. If yes, it responds according to the recorded response. If not, it tags the words of the sentence by part of speech and stores them in another SQL table, “Words.” We have assigned a weight to each English word by taking the reciprocal of its number of occurrences across several NLTK corpora, multiplied by a factor chosen according to the word’s part of speech. The bot then uses these weights to find the best match among the previously encountered inputs and prints the corresponding output.
This is not giving us good results, though.

Sounds like your flow is:
IF: exact match -> respond
ELSE: use SUM((corpus/response TF-IDF) * (word POS weight)) to select a response

Problems arise when the corpus does not match your inputs (and yours doesn’t), and responses should not be chosen by the words that appear in them. As an example, the input “Who are you?” would typically not generate a response that includes the words who, are, or you.

Nishit Asnani - Jun 14, 2015:

One of our seniors suggested that we focus on making the bot learn, from a huge database, what kinds of sentences can come as input and how to respond to them. We have no clue how to make that happen. How would the bot learn this automatically? Any suggestions?

10 days is not much time in the chatbot world, and if it were easy to build a bot from a big database, we would have many more bots. It can take a lot of compute power to process the words into a usable form. Deep learning via word2vec’s skip-gram and CBOW models, using either hierarchical softmax or negative sampling, is one way to explore this, but it will not give you a chatbot: https://radimrehurek.com/gensim/models/word2vec.html
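If you do poke at that link, a minimal gensim sketch (gensim 4.x API) looks like the following; the toy sentences are placeholders, and a real model needs a large corpus.

from gensim.models import Word2Vec

sentences = [
    ["hello", "how", "are", "you"],
    ["where", "do", "you", "live"],
    ["what", "is", "your", "name"],
]

# sg=1 selects skip-gram (sg=0 is CBOW); hs=1 uses hierarchical softmax,
# while hs=0 with negative=5 would use negative sampling instead.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, hs=1)

print(model.wv.most_similar("you", topn=3))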

Suggestions:
In the time you have left, you need lots of input/response pairs. Convert an existing AIML set into your format for direct matches.
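A rough sketch of that conversion, assuming plain AIML 1.x files without namespaces and a hypothetical “Sentences” table; the file, database, and column names are placeholders, so adjust them to your actual schema.

import sqlite3
import xml.etree.ElementTree as ET

tree = ET.parse("std-hello.aiml")            # placeholder file name
pairs = []
for category in tree.getroot().iter("category"):
    pattern = "".join(category.find("pattern").itertext()).strip()
    template = "".join(category.find("template").itertext()).strip()
    pairs.append((pattern.lower(), template))

conn = sqlite3.connect("chatbot.db")         # placeholder database name
conn.executemany("INSERT INTO Sentences (input, response) VALUES (?, ?)", pairs)
conn.commit()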

Expand your direct matches into a fuzzy set by using NLTK/WordNet and looking up synonyms of your input phrases to match outputs.
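A sketch of that synonym lookup with NLTK’s WordNet interface (requires nltk.download('wordnet')); the overlap score here is illustrative, not a drop-in for your existing algorithm.

from nltk.corpus import wordnet as wn

def synonyms(word):
    # Collect lemma names across all synsets of a word.
    names = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            names.add(lemma.name().replace("_", " ").lower())
    return names

def fuzzy_overlap(input_words, stored_words):
    # Count stored words matched by an input word or one of its synonyms.
    expanded = set(input_words)
    for w in input_words:
        expanded |= synonyms(w)
    return len(expanded & set(stored_words))

print(fuzzy_overlap(["hello", "buddy"], ["hi", "pal"]))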


When finished, you should also consider publishing/open sourcing your code so others in the community can learn from your approach.

 

 

 
  [ # 3 ]

We are in the final stage of our project. We have three algorithms working simultaneously to find the best approximate match for inputs that aren’t currently in the database: one picks consecutive matching words, another uses WordNet for fuzzy matching, and the third is the one I posted above. Just when we were happy to have made a good working chatbot, we ran into serious trouble.

For the past couple of days, we have been working on growing our database. On increasing the entries from 10 to a mere 100, the time the chatbot takes to reply has gone up manifold; it now takes about 55 seconds to reply. Imagine what will happen when our database reaches the thousands.

I shut down each algorithm one by one, and the time dropped. After removing all of them, only the part-of-speech tagger and one other function were acting on the input, yet the program still takes about 12 seconds for each input that’s not an exact match. So here we are, stuck. We don’t know what to do now.

Please help as we have to complete our project.

 

 
  [ # 4 ]
Nishit Asnani - Jun 14, 2015:

My friends and I are making a chatbot as part of a 40-day summer project at our institute. Ten of those days are left now.

So that leaves just today to do this?! Many of us here have been working on our bots for years. To try to make one in a day is a pointless task.

I would stick with the 10 entries and explain to your instructor the approach you took in order to gain any marks from this project.

 

 
  [ # 5 ]

Slow processing is typically caused by the number of times you access a file or database, or by network connection overhead. I would suspect the problem is the number of SQL calls, or the server they run on. The more you can do on the client side, the faster things should run.

http://dbscience.blogspot.nl/2009/02/are-mysql-stored-procedures-slow.html
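As a rough illustration of doing more on the client side, assuming SQLite and a hypothetical “Sentences” table: read the stored rows once per session and score them in memory, instead of issuing a query per candidate.

import sqlite3

conn = sqlite3.connect("chatbot.db")         # placeholder database name
SENTENCES = conn.execute("SELECT input, response FROM Sentences").fetchall()

def respond(user_input, score):
    # score(a, b) is whatever matching function the bot already uses.
    best_input, best_response = max(SENTENCES, key=lambda row: score(user_input, row[0]))
    return best_response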

 

 
  [ # 6 ]

As I tried to warn you, “It can take a lot of compute power to process the words into a usable form.”

Increasing the size of your database should actually improve response times, since more inputs will get an exact match and skip the expensive scoring.


To increase speed: stop lists, regex

If you don’t know what these are, you can’t learn about and code them in 1 day.
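For reference, a quick sketch of both: strip stop words so fewer words get scored, and tokenize with a compiled regex instead of a heavier tagger pass. The example sentence is arbitrary, and the stop list comes from NLTK (requires nltk.download('stopwords')).

import re

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))
TOKEN_RE = re.compile(r"[a-z']+")

def content_words(sentence):
    tokens = TOKEN_RE.findall(sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(content_words("Who is the director of that old movie?"))  # ['director', 'old', 'movie']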

Post your algorithms/code in a couple of days and we can tell you where you went wrong.

 

 
  [ # 7 ]

If you really are storing data in SQL and then querying the tables one or more times per response, then you should double-check your indexes; index your weight column, for a start. It would be better to load your data into a SQL temp table and query that instead, or better yet, load it all into an array in memory. Is your code compiled or interpreted? You should be able to get better performance, because 1,000 records really is not that much. Maybe you can post some of the code sections you have identified as slow.
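A sketch of the index suggestion, assuming SQLite and the “Words” table with a “weight” column described earlier in the thread; the database and column names are assumptions.

import sqlite3

conn = sqlite3.connect("chatbot.db")         # placeholder database name

# One-time: let the database look rows up by weight instead of scanning the whole table.
conn.execute("CREATE INDEX IF NOT EXISTS idx_words_weight ON Words (weight)")
conn.commit()

# Check that queries filtering or sorting on weight actually use the index.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT word, weight FROM Words ORDER BY weight DESC LIMIT 50"
).fetchall())

# Or skip per-request queries entirely and keep the weights in memory.
WORD_WEIGHTS = dict(conn.execute("SELECT word, weight FROM Words").fetchall())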

 

 