Hey folks,
I posted in another thread here the other day and mentioned I would create a topic covering the details of CAESAR once I’d figured out what I can and can’t say about it. So here I am.
First, a bit of background. Some of the old timers around here (*cheeky grin*) might remember I posted around 18+ months ago about a project called ALF that my colleagues and I had been working on for some time. It went through a number of iterations, and we finally decided to take what we had learned and pursue an actual set of real goals.
Enter CAESAR.
While CAESAR isn’t strictly a chatbot project, the chatbot aspects come for free, and well, you guys like to hear about this stuff right?
So what is it?
CAESAR is what I believe will be the first general purpose, dare I say it, strong AI (assuming we are successful): one that can reason, operate without environment limitations, tackle general problems, and provide solutions for them.
Sounds far-fetched, I know, but this is so far the culmination of around 4 years of work by the team, and I myself have been chasing AI for years. In school I used to build little robots and code up BBC Micros (showing my age) to control and interact with environments, so it’s safe to say I’ve been heading towards this point for many, many years.
That said, even in the event that strong AI isn’t achieved, the technology, even this far in, is very useful, especially in the domain of the semantic web. So even if we fail, we succeed… paradoxical, I know.
What have we done so far?
Currently there are a number of systems in place that support the main “guts” of CAESAR; the aim is for it to learn facts and information unsupervised.
Crawler
Our first port of call was data, lots and lots of data. For this we developed a web crawler with properties specific to CAESAR: it churns through the web, executes on-page JavaScript while crawling, and performs some preprocessing on the page data for performance. Many crawlers out there today do not process the on-page JavaScript, and thus information can be lost, quite a bit of it in fact, which we deemed unacceptable.
So far we have 200M pages and rising. That may not seem like a great deal, but a lot of these pages are crawled daily, as content changes constantly, and for now at least the data that 200M set provides is more than ample for development.
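If you’ve never played with a JS-executing crawler, the core fetch step looks roughly like the sketch below. This uses Playwright purely as a stand-in (it’s not our actual stack, and the function name is invented for the post); the point is just that you capture the DOM *after* scripts have run, rather than the raw response.

```python
# Minimal sketch of a JavaScript-executing fetch step, using Playwright as a
# stand-in for illustration only (pip install playwright && playwright install).
from playwright.sync_api import sync_playwright

def fetch_rendered(url: str) -> str:
    """Fetch a page and return its HTML *after* on-page JavaScript has run."""
    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for JS-driven requests to settle
        html = page.content()                     # serialized DOM, not the raw response body
        browser.close()
    return html

if __name__ == "__main__":
    print(len(fetch_rendered("https://example.com")))
```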
Post Crawl Processing - Structured Documents
Once a page has been crawled it is processed into what we call a “structured document”, where particles of interest are highlighted, assembled into a manageable tree, and stored for later processing. Elements such as headings, paragraphs, sentences, bullet lists, and hyperlinks are all processed. Data that is deemed unimportant is discarded; other data is simply flagged as a “potential” information element.
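Conceptually (and very much simplified), the structuring step boils down to something like this. The real pipeline does far more, and the element list and “potential” flag here are just placeholders for illustration:

```python
# Rough illustration of the "structured document" idea: pull out the element
# types of interest, keep them in a flat tree-ish list, and discard the rest.
from bs4 import BeautifulSoup

INTERESTING = ["h1", "h2", "h3", "p", "li", "a"]

def to_structured_doc(html: str):
    soup = BeautifulSoup(html, "html.parser")
    doc = []
    for el in soup.find_all(INTERESTING):
        text = el.get_text(" ", strip=True)
        if not text:
            continue  # empty / unimportant nodes are discarded
        doc.append({
            "tag": el.name,
            "text": text,
            "href": el.get("href") if el.name == "a" else None,
            "potential": el.name in ("li", "a"),  # flagged for later, not processed now
        })
    return doc
```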
Extraction
The extraction process is one of the most complicated and involved parts of the system, second only to the Rationalizer, and is still undergoing tweaks and improvements after 3 years of work.
This link in the chain takes the structured documents and passes the text of each structured element through our NLP module. This module cleans up the text, performs spelling corrections, and spits out a number of different data sets: POS tagging, typed dependency trees, phrasal structures, NER, co-reference, sentence/paragraph polarities, and other needed data.
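To give a rough idea of the kind of datasets that come out of this step, here’s a tiny sketch using spaCy as a stand-in. Our module is in-house, so treat this as purely illustrative of the output shape, not of our implementation:

```python
# Illustrative only: the sort of per-element datasets the extraction step emits,
# approximated here with spaCy (pip install spacy, then download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def extract(text: str) -> dict:
    doc = nlp(text)
    return {
        "pos":   [(t.text, t.pos_) for t in doc],               # POS tagging
        "deps":  [(t.text, t.dep_, t.head.text) for t in doc],  # typed dependencies
        "ner":   [(e.text, e.label_) for e in doc.ents],        # named entities
        "sents": [s.text for s in doc.sents],                   # sentence splits
    }

print(extract("Caesar crossed the Rubicon in 49 BC."))
```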
Rationalizer
The Rationalizer is the key component of our extraction system, and I have to guard its inner workings (this is a commercial project after all). This component processes the datasets produced by the extractor and organises the information into subject, object, action, and property chains.
This then provides a common dataset which can be applied to any language, scenario, problem, or input, giving you a reliable, structured, dependable representation.
This process can be “run in reverse” and you will be presented with the original input (albeit with better structure, grammar, and spelling, depending on your English skills).
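I can’t show the Rationalizer itself, but to illustrate just the *shape* of a subject/action/object/property chain, here’s a toy triple extractor over a dependency parse. This is nothing like our actual method, only a flavour of what the output format looks like:

```python
# Toy subject/action/object extraction from a dependency parse, purely to show
# the shape of the output -- NOT the Rationalizer's method.
import spacy

nlp = spacy.load("en_core_web_sm")

def toy_rationalize(text: str):
    chains = []
    for sent in nlp(text).sents:
        for tok in sent:
            if tok.pos_ != "VERB":
                continue
            subj  = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
            obj   = [c for c in tok.children if c.dep_ in ("dobj", "obj", "attr")]
            props = [c for c in tok.children if c.dep_ == "prep"]
            if subj and obj:
                chains.append({
                    "subject":    subj[0].text,
                    "action":     tok.lemma_,
                    "object":     obj[0].text,
                    "properties": [p.text for p in props],
                })
    return chains

print(toy_rationalize("The crawler fetches pages and executes their scripts."))
```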
Typed Relations & Classifications
Once you have the Rationalizer output, classifying objects and determining relations between them becomes quite simple. New relations and classifications are created where needed; old ones are updated if new facts have arisen that force a change.
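In the abstract it looks something like the store below (names invented for the post, not our actual API): relations are keyed on (subject, relation, object), classifications on entity, and both can be revised as new facts arrive.

```python
# Sketch of a typed relation / classification store; everything here is a
# simplified stand-in for illustration.
from collections import defaultdict

class RelationStore:
    def __init__(self):
        self.relations = defaultdict(set)  # relation name -> {(subject, object), ...}
        self.classes = defaultdict(set)    # entity -> {class, ...}

    def assert_relation(self, subj, rel, obj):
        self.relations[rel].add((subj, obj))

    def classify(self, entity, cls):
        self.classes[entity].add(cls)

    def reclassify(self, entity, old_cls, new_cls):
        """Update a classification when new facts force a change."""
        self.classes[entity].discard(old_cls)
        self.classes[entity].add(new_cls)

store = RelationStore()
store.classify("Caesar", "Person")
store.assert_relation("Caesar", "crossed", "Rubicon")
```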
Inference & Deduction
With enough data, unsupervised Horn clause generation from your Rationalizer and typed relation/classification data is surprisingly simple, with few general seed rules required. From there, “common sense” inferences such as “if A is C and B is A, therefore B is C” are possible with minimal processing and effort.
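That particular example is just transitivity, i.e. the Horn clause is_a(X, Z) :- is_a(X, Y), is_a(Y, Z), which you can forward-chain over a fact base in a few lines, as sketched below. In CAESAR the interesting part is that most such clauses are generated unsupervised rather than hand-written like this one:

```python
# Minimal forward-chaining sketch of the "common sense" rule from the text:
# is_a(X, Z) :- is_a(X, Y), is_a(Y, Z). Hand-written here; in CAESAR such
# rules are mostly generated unsupervised from the Rationalizer output.

def transitive_closure(facts):
    """facts: set of (x, 'is_a', y) triples; returns the closure under transitivity."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (x, _, y) in list(derived):
            for (y2, _, z) in list(derived):
                if y == y2 and (x, "is_a", z) not in derived:
                    derived.add((x, "is_a", z))
                    changed = True
    return derived

facts = {("B", "is_a", "A"), ("A", "is_a", "C")}
print(transitive_closure(facts))  # now also contains ("B", "is_a", "C")
```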
——
Well, that’s a good hour spent. I’ll add further details of what we have tomorrow or in a few days’ time, as I should really get back to the grindstone.
Any questions, observations, etc., post them up and I’ll reply where I can.