Ok, so, enough of the starters; time for the main meal, with a bite of meat to it.
The past few days have seen some actual work on AI (shock horror!) and I thought it was an apt time to spill my findings and some initial results.
The scope of the past couple of days has simply been to interact with the KDB (knowledge database) by posing some simple questions in an attempt to retrieve the correct answer.
By simple, I mean really simple, such as “Who is Barack Obama?” or “What is an Airbus A380?”. I put together a set of these simple questions covering people, places, aircraft, cars and other everyday objects.
To start with there’s no real “intelligence” in place, as the first step is to develop some search and select algorithms that can return a set of potential answers, which can then be processed further for semantic relations and all that jazz. However, baby steps first.
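For the curious, that candidate selection step is currently little more than matching the entity phrase in a question against entry labels in the KDB. Here's a minimal sketch of the idea, assuming a local SQL store (the table and column names are made up for illustration, not ALF's actual internals):

```python
import re
import sqlite3

def extract_entity_phrase(question):
    """Pull the entity phrase out of a 'Who is X?' / 'What is a/an X?'
    style question, e.g. 'Who is Barack Obama?' -> 'Barack Obama'."""
    m = re.match(r"(?i)(?:who|what)\s+is\s+(?:(?:an|a|the)\s+)?(.+?)\??$",
                 question.strip())
    return m.group(1) if m else question

def candidate_entries(db, question, limit=20):
    """Return KDB entries whose label loosely matches the entity phrase.
    The 'entries' table and its columns are hypothetical stand-ins for
    however the KDB is actually stored."""
    phrase = extract_entity_phrase(question)
    cur = db.execute(
        "SELECT id, label, abstract FROM entries WHERE label LIKE ? LIMIT ?",
        ("%" + phrase + "%", limit),
    )
    return cur.fetchall()

# e.g. db = sqlite3.connect("kdb.sqlite")
#      candidate_entries(db, "Who is Barack Obama?")
```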
Initially the results were DIRE. I got a correct hit rate of around 5%, and I’m fairly sure that 5% was sheer luck; most of the “correct” answers I was looking for were buried down in 10th–20th place in the returned set.
I spent a bit of time “tuning” the selection algorithms to return a better set, but it soon dawned on me that I was heading down the road of “hacking” the results to return what I wanted. Not really a good solution, and the improvement was pretty small anyway.
The problem with such large datasets is that there are many entries very similar to each other, especially by name or other common properties, so simple, ambiguous questions like the ones I was posing would return “bad” sets.
I say “bad” sets, but in actual fact they are all correct, as there isn’t just one Barack Obama in the world. However, the “default” answer to a simple question like “Who is Barack Obama?” is of course “President of the USA”.
ALF was missing “common sense”, or more specifically, “common default responses” to semi-ambiguous questions such as the ones I was posing. With more detail in the question, the potential set shrinks significantly, and the correct answer hit rate improves drastically.
As a large portion of my KDB is currently made up of DBPedia data, which itself is extracted from Wikipedia, I settled on a fairly simple method of giving ALF a means to calculate this “common default response” variable.
There is a great resource at http://dammit.lt/wikistats/ which provides hourly dumps of hit counts for every Wikipedia page viewed within the past hour, and after looking at the data it was quite simple to map these page hits to the DBPedia RDF entries in the database.
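For anyone wanting to try the same trick: those hourly files are plain text, one line per page, roughly in the form "project page_title hits bytes", and because DBPedia resource URIs are built from Wikipedia page titles, mapping the hits onto the RDF entries is close to a straight string join. A rough sketch of the import, assuming that line format (the function names and normalisation here are illustrative, not ALF's actual import code):

```python
from collections import defaultdict

def load_page_hits(path, project="en"):
    """Parse one wikistats hourly dump into {page_title: hit_count}.
    Assumes the whitespace-separated 'project title hits bytes' line
    format used by the dumps at dammit.lt/wikistats/."""
    hits = defaultdict(int)
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3 or parts[0] != project:
                continue
            title, count = parts[1], parts[2]
            if count.isdigit():
                hits[title] += int(count)
    return dict(hits)

def dbpedia_resource(title):
    """Hypothetical helper: Wikipedia titles (underscores for spaces)
    line up with the tail of DBPedia resource URIs, so mapping a page
    hit onto its RDF entry is basically string concatenation."""
    return "http://dbpedia.org/resource/" + title
```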
After importing a bunch of these data dumps into ALF, I modified the candidate selection algorithms to take the hit figures into account when ranking the returned set.
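The change itself is conceptually small: rather than ranking candidates purely on how well their label matches the question, the score now gets blended with a popularity term derived from the hit counts. Something along these lines captures the idea (the weighting and log-scaling are an illustrative sketch, not the exact formula ALF uses):

```python
import math

def rank_candidates(candidates, page_hits, alpha=0.7):
    """Re-rank candidate entries by blending the text-match score with a
    normalised, log-scaled popularity term from the page-hit counts.
    'candidates' is a list of (title, match_score) pairs with match_score
    in [0, 1]; alpha is just an illustrative weighting, not ALF's tuning."""
    if not candidates:
        return []
    logs = {title: math.log1p(page_hits.get(title, 0))
            for title, _ in candidates}
    max_log = max(logs.values()) or 1.0

    def blended(item):
        title, match_score = item
        return alpha * match_score + (1 - alpha) * (logs[title] / max_log)

    return sorted(candidates, key=blended, reverse=True)
```

In this sketch the log-scaling is there so that a hugely popular page nudges its entry up the set without completely drowning out a better textual match.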
Running my questions again afterwards saw an immediate improvement in the answer hit ratio, to somewhere around 50%.
Today I’ve mainly been studying the algorithm a little deeper and writing out a lot of equations, and as I type this, ALF is achieving around 80% on the same questions.
Overall the current algorithms are still quite crude, and there’s a lot I haven’t taken into account when selecting these potential response sets for a particular question. No doubt it will be an ongoing evolution of the algorithms as I push on further, but I thought it would make a nice post to update you all with.
So on that note…..I’ll be getting back to it
Adios