So I thought it was time for a bit of an update on progress… so here goes.
BTW, most of this post will hardly even touch upon AI; for the past couple of weeks I have been working on other areas which, while relevant to the ALF end goal, aren't really anything to do with AI development. However, the topics covered are certainly things that many will have to think about when developing an AI that is closer to "strong" than a mere chatbot.
Anyway, since the last update quite a bit has happened with ALF. Continuing on from where I left off last time I posted, the pronoun work was finished and I got cracking on the concept of "time".
It took quite a bit of "time" to get ALF to understand this correctly, as there are a number of rules governing how time in a sentence should be interpreted. Everything is relative to the current time, which of course is always moving.
For example, if it is Wednesday and a subject states "I went to the park last Monday!", the subject means 9 days ago: the Monday before the Monday that just passed, even though the Monday just passed is, strictly speaking, last Monday.
Similarly, if the same subject states "I went to the park on Monday", the subject means the Monday just passed, 2 days ago.
Another tricky one is the statement "I am going to the park on Monday". As this is future tense, if it is Wednesday then the subject means the coming Monday, 5 days away. Confusingly, if the subject stated "I am going to the park next Monday", the subject STILL means the coming Monday, 5 days away.
The same rules apply for months, years etc., so it's very much one size fits all. Phew.
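To make those rules concrete, here is a minimal Python sketch of that kind of resolution logic. This is not ALF's implementation, just the "last / on / next + weekday" rules above expressed with the standard datetime module:

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def resolve_weekday(phrase, weekday, today, future_tense=False):
    """Resolve 'last Monday' / 'on Monday' / 'next Monday' to a concrete date.

    phrase       -- 'last', 'on' or 'next'
    weekday      -- weekday name mentioned in the sentence
    today        -- the current date everything is relative to
    future_tense -- True for future-tense sentences ('I am going...')
    """
    target = WEEKDAYS.index(weekday.lower())

    if future_tense:
        # Future tense: 'on Monday' and 'next Monday' both mean the coming one.
        ahead = (target - today.weekday()) % 7 or 7
        return today + timedelta(days=ahead)

    # Past tense: find the most recently passed occurrence of that weekday.
    back = (today.weekday() - target) % 7 or 7
    just_passed = today - timedelta(days=back)
    if phrase == "last":
        # 'last Monday' = the Monday before the one that just passed.
        return just_passed - timedelta(days=7)
    return just_passed  # plain 'on Monday' = the one just passed

# The Wednesday example from above (2012-05-16 happens to be a Wednesday):
today = date(2012, 5, 16)
print(resolve_weekday("last", "Monday", today))                   # 9 days ago
print(resolve_weekday("on", "Monday", today))                     # 2 days ago
print(resolve_weekday("on", "Monday", today, future_tense=True))  # 5 days ahead
```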
From here I decided it was probably a good time to look at some large-scale knowledge acquisition and the ability to recall. What a job this has turned out to be!!
After working on a number of very large-scale, commercial-grade projects at one of my companies, I am well versed in the pitfalls of underestimating the effects of scaling. On a number of occasions an architecture has been put together, used for development, then scaled up to commercial grade, only for things that were overlooked or underestimated to force a redesign of some kind.
I wanted to avoid this, and thus decided to jump in at the deep end and start at the kind of scale I expect to reach with regard to data handling and management.
To help highlight any architectural problems (hardware and software), I decided to start off with full mirrors of the DBpedia, Freebase and YAGO databases. Final database size was around 100GB.
Starting crude, these were simply dumped into a MySQL database running on a fairly hefty piece of hardware in its own right: a quad-core Intel i7, 16GB of RAM, and 2 x 500GB SAS drives in RAID 0.
The object was to see how much load such a database placed on respectable hardware, and to collect some baseline timings on a bunch of queries. As I suspected, it ran like a dog!
Next I wanted to see what kind of hardware would be required to achieve a "usable" knowledge base on a single box. By "usable", I was aiming for a ~10 minute query time for a fairly complex query. As with all databases, CPU means nothing; RAM is king and fast disks are queens.
Results on a query of "list all Bruce Willis filmography, TV appearances etc." ended up at 6m 32s after all possible tweaks and tuning on MySQL. Same 100GB DB, hardware as above but with 32GB of RAM installed and 6 x 320GB 15,000RPM SATA drives in a RAID 10 config.
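The post doesn't show the schema, but to make the scale problem concrete, here is a rough sketch of what the naive version might look like: one giant (subject, predicate, object) table queried through MySQL Connector/Python. The table name, predicates and connection details are placeholders, not the real setup:

```python
import mysql.connector  # assumed driver: MySQL Connector/Python

# Hypothetical naive layout: one huge triples(subject, predicate, object) table.
# A request like "list all Bruce Willis filmography, TV appearances etc." turns
# into self-joins over that table, which is painful at hundreds of millions of rows.
FILMOGRAPHY_SQL = """
SELECT film.subject AS film_uri, title.object AS film_title
FROM   triples AS film
JOIN   triples AS title
       ON  title.subject   = film.subject
       AND title.predicate = 'rdfs:label'
WHERE  film.predicate = 'dbo:starring'
  AND  film.object    = 'dbpedia:Bruce_Willis';
"""

def run_query(sql):
    # Placeholder credentials, not the setup from the post.
    conn = mysql.connector.connect(host="localhost", user="alf",
                                   password="secret", database="kb")
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for uri, label in run_query(FILMOGRAPHY_SQL):
        print(uri, "-", label)
```

With every URI stored as a full, repeated string on every row, even a modest query like this means several passes over an enormous table, which goes a long way towards explaining the multi-minute timings.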
That brings us to a few days back.
I was researching the possible use of SPARQL as a query engine over a MySQL backend. The research was done using Jena which, if you don't know it, is an RDF ontology library that comes bundled with its own SPARQL engine, ARQ, and can interface with a number of different DB systems.
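To give a flavour of what querying that way looks like, here is a tiny Python stand-in using rdflib rather than Jena (the actual research was done with Jena on the JVM); the sample file and URIs are illustrative only:

```python
from rdflib import Graph

# Load a (small!) N-Triples file and run a SPARQL query against it.
# The pattern with Jena is the same: parse the data, then hand a SPARQL
# string to the bundled engine instead of writing SQL joins by hand.
g = Graph()
g.parse("sample.nt", format="nt")

FILMOGRAPHY_SPARQL = """
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?film ?title WHERE {
    ?film dbo:starring <http://dbpedia.org/resource/Bruce_Willis> .
    ?film rdfs:label   ?title .
}
"""

for film, title in g.query(FILMOGRAPHY_SPARQL):
    print(film, "-", title)
```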
All was going well until I decided to try this out on the hardware above, and a database import through Jena with the various ontologies ensued. The first shock was that the DB size had increased to around 150GB, and when running the same query as above, well, I gave up after 30 mins of waiting. Something else was needed.
After spending some time poking around in Jena and the DB to see what it was doing, it turned out that there was a great deal of redundancy in the database. I won't go into details, but to give an idea, the very same database, compressed with WinRAR's "fastest" method, came out at a paltry 26GB.
I decided to ditch SPARQL and stick with what I know best: MySQL. The idea is a custom middle layer that will provide a lot of what SPARQL provides, with a very simple backend MySQL database performing simple queries.
To try to preserve the RDF relationships between elements and keep the size down, I have written a converter that takes standard RDF or N-Triples data and converts it to a less redundant form. The size of the source files after being passed through the converter is around 60% of the original, yet they still contain all the information of the original sources. As a result, the DB size dropped to around 50GB (essentially half).
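For anyone curious, one common way to squeeze out that kind of redundancy is dictionary encoding: store each distinct term once and refer to it everywhere else by a small integer ID. The sketch below shows that generic technique, not the actual converter:

```python
import re

# Very loose N-Triples pattern: <subject> <predicate> <object or "literal"> .
TRIPLE_RE = re.compile(r'^(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$')

def dictionary_encode(ntriples_path, terms_path, triples_path):
    """Replace every distinct RDF term with an integer ID.

    Writes two files: a term dictionary (id<TAB>term) and the triples
    re-expressed as three IDs per line, so heavily repeated URIs such
    as rdf:type are stored only once.
    """
    term_ids = {}

    def term_id(term):
        if term not in term_ids:
            term_ids[term] = len(term_ids)
        return term_ids[term]

    with open(ntriples_path, encoding="utf-8") as src, \
         open(triples_path, "w", encoding="utf-8") as out:
        for line in src:
            m = TRIPLE_RE.match(line)
            if not m:
                continue  # skip blank lines and comments
            s, p, o = m.groups()
            out.write(f"{term_id(s)}\t{term_id(p)}\t{term_id(o)}\n")

    with open(terms_path, "w", encoding="utf-8") as out:
        for term, tid in term_ids.items():
            out.write(f"{tid}\t{term}\n")

# dictionary_encode("dbpedia.nt", "terms.tsv", "triples.tsv")
```

At 100GB the term dictionary itself wouldn't fit comfortably in RAM, so a real converter needs an on-disk term store, but the principle is the same.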
Furthermore, the "Bruce Willis" query above now executes in just over 5 minutes, which is a good improvement just from juggling the data a little.
Next I moved on to hardware. Whilst 5 minutes is an improvement over the 6 minutes (and the hour!), I have a target of resolving that query in 5 seconds or less, so the next port of call is hardware, and that is what I have been doing yesterday and today.
Luckily for me, one of my businesses deals in various consumer electronics, computers and components being among the many things that we sell. So it seemed only right that, it being my stock, I should raid the warehouse for the stuff I needed. So I did, and brought home a big box full of goodies!
The idea was to build a cluster that would serve 2 purposes:
1. Primarily as a distributed DB, where the large database can be split amongst many nodes and some queries can run in parallel (see the sketch after this list). As they say, many hands make light work.
2. Additional processing power for the main “ALF Box” should it be required in the future.
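For purpose 1, the general pattern is a scatter/gather: fire a query (or a per-shard variant of it) at every node at once and merge whatever comes back. Here is a rough Python sketch of that idea; the node names, credentials and shard layout are placeholders rather than the real configuration:

```python
from concurrent.futures import ThreadPoolExecutor

import mysql.connector  # assumed driver: MySQL Connector/Python

NODES = ["node1", "node2", "node3", "node4"]  # the four cluster boxes

def query_node(host, sql):
    """Run the query against one node's shard of the database."""
    conn = mysql.connector.connect(host=host, user="alf",
                                   password="secret", database="kb_shard")
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()

def scatter_gather(sql):
    """Fan the query out to every node in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        partials = pool.map(lambda host: query_node(host, sql), NODES)
    merged = []
    for rows in partials:
        merged.extend(rows)
    return merged
```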
So far I have put together 4 nodes and assembled them in a vertical fashion. I am now testing the idea behind this, and if it proves promising, I have further nodes to build and add to the cluster.
Each node is a quad-core Intel i5 and will have 16GB of RAM. At present each node has a single 320GB 15,000RPM SAS drive attached and is connected to a gigabit switch.
In the attached pictures there is also a black box next to the node stack; this is the main "ALF Box" that will do all the processing of data from the DB, NLP, logic etc. The spec of this is an 8-core AMD Opteron with 32GB of RAM and 4 x 500GB SATA drives in RAID 10.
All of the above is probably overkill for now, but as I had the means to go all out, I thought why the hell not.
I'll update with the results as and when I've got it all plugged together and talking to itself.
Adios.