I’ve been meaning to start a thread on my project for a while. Exciting to finally start it!
ALEX stands for Artificial Learning by Erudition eXperiment, and as the name implies, the major focus is learning via natural language. “Chatting” is a later goal. Most of my work is focused on how to design a program that parses/organizes natural language input and uses the organized input to improve its NLP capabilities.
The big goals for ALEX are divided into stages as follows:
Stage 0: A bot that can turn English input into a structured knowledge base. The bot should be able to use this knowledge base to deduce new facts from known facts (simple logic), and new parsing rules from examples of correct parses.
Stage 1: A bot that can read Simple English Wikipedia entries and construct summaries of the articles. This requires a contextual understanding of an article’s subject in order to deduce which pieces of information are the most significant, which sentences/paragraphs represent a generalization of an idea and which represent specific instances of that generalization, etc.
Stage 2: Use NL input to organize knowledge base facts into “stories” that incorporate temporal and spatial information to organize the facts. The “story” may be no more interesting than “how to pour a glass of milk”, but it will serve to add more context to NL input.
Stage 3: Develop an NL interface for querying the knowledge base and stories. This is where chatting comes in.
Stage 1 is no small feat and I’m not much interested in the finer aspects of chatting, or even dealing with aberrant grammar/spelling, etc., until I’ve accomplished this to some degree.
I’m in the process of overhauling a few parts of the parser and interface, so there won’t be sample i/o until April. (Or at least, the i/o won’t be in complete “input -> final output” form.) In the meantime, here’s a play-by-play of how the parser works. Remember that it is designed to convert NL input into a factual knowledge base.
—
Step 1: Parts of speech (POS) tagging.
I use the NLTK implementation of WordNet to tag verbs, adverbs, and adjectives. ALEX has built-in word lists to handle articles, prepositions, and conjunctions. Proper nouns and pronouns are handled directly, but most nouns are assigned by a process of elimination. (All verbs, adverbs, and adjectives are also potential nouns.)
I do not distinguish between types of nouns (proper nouns, pronouns, etc.) nor types of verbs at this stage. All possessive determiners (her, my, their, etc.) are tagged as articles because they behave like articles.
I also keep a cache of sentences with their associated POS tags so that commonly encountered sentences can skip this step.
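To make Step 1 concrete, here’s a minimal sketch of the idea. The tag names and word lists are placeholders rather than ALEX’s internals, and it assumes NLTK with the WordNet corpus downloaded (nltk.download('wordnet')):

```python
from functools import lru_cache
from nltk.corpus import wordnet as wn

# Stand-in word lists; ALEX's real built-in lists are much longer.
ARTICLES = {"a", "an", "the", "her", "my", "their"}  # possessives tagged as articles
PREPOSITIONS = {"in", "on", "at", "with", "of", "to"}
CONJUNCTIONS = {"and", "or", "but"}

def possible_tags(word):
    """Every POS a word could take; 'noun' is assigned by elimination."""
    word = word.lower()
    if word in ARTICLES:
        return {"article"}
    if word in PREPOSITIONS:
        return {"preposition"}
    if word in CONJUNCTIONS:
        return {"conjunction"}
    tags = set()
    for synset in wn.synsets(word):
        pos = synset.pos()  # one of 'n', 'v', 'a', 's', 'r'
        if pos == "v":
            tags.add("verb")
        elif pos == "r":
            tags.add("adverb")
        elif pos in ("a", "s"):  # adjectives, incl. satellite adjectives
            tags.add("adjective")
    tags.add("noun")  # all verbs, adverbs, and adjectives are also potential nouns
    return tags

@lru_cache(maxsize=None)  # stands in for the sentence -> POS-tags cache
def tag_sentence(sentence):
    return tuple(possible_tags(word) for word in sentence.split())
```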
Step 2: Naive POS discrimination.
Step 1 can leave tens to hundreds of tagging combinations (I call them “grams”: not quite grammars), depending on the complexity of the sentence. Some of these can be eliminated because they violate simple grammar rules. For example, “her” is tagged as both an article and a noun, but all articles must be followed by nouns, so any gram with a “dangling” article is removed from the list of possible grams.
The POS discriminator ranks surviving grams using a database of correct grams. Whenever ALEX learns that it has parsed a sentence correctly, the gram for that sentence gets chopped up into gram chunks of three or more POS tags and stored in the database. New grams are ranked by how many of these chunks they contain, with more “points” for longer gram chunks.
It’s a naive system to be sure, but the correct gram is consistently ranked in the top 10, and generally in the top 5. I’m always trying to think of little additions that will improve the ranking scheme.
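For the curious, the chunk scoring looks roughly like this in code. The exact point values are my guess at “more points for longer gram chunks” (here a chunk is simply worth its length):

```python
known_chunks = set()  # grows every time a parse is confirmed correct

def gram_chunks(gram, min_len=3):
    """All contiguous POS subsequences of length >= min_len."""
    return {tuple(gram[i:i + n])
            for n in range(min_len, len(gram) + 1)
            for i in range(len(gram) - n + 1)}

def learn_correct_gram(gram):
    known_chunks.update(gram_chunks(gram))

def rank_grams(candidate_grams):
    """Best-scoring grams first; a known chunk scores its own length."""
    def score(gram):
        return sum(len(chunk) for chunk in gram_chunks(gram)
                   if chunk in known_chunks)
    return sorted(candidate_grams, key=score, reverse=True)
```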
Step 3: Complex sentence splitting.
I absolutely loathe conjunctions. I may build methods for handling them directly in the future, but for now, ALEX has a sophisticated pattern-matching scheme that maps any sentence containing conjunctions (what I call a “complex sentence”) onto one or more simple sentences without conjunctions. Special characters indicate the relationships the simple sentences have with each other. I’ll go more into this later.
Based on examples, ALEX builds rules (generalized maps) for turning any sentence with a similar gram into a set of simple sentences. Only simple sentences will be parsed correctly in subsequent steps.
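Here’s a toy illustration of the target behavior, not ALEX’s learned maps; the “&” is a placeholder for whatever special character marks the relationship:

```python
def split_complex(tagged_sentence):
    """Naively split a [(word, tag), ...] sentence at coordinating conjunctions.

    The learned rules have to do far more than this: "John and Mary ran"
    needs the verb distributed over both subjects, which a blind split misses.
    """
    simple, current = [], []
    for word, tag in tagged_sentence:
        if tag == "conjunction":
            simple.append(("&", current))
            current = []
        else:
            current.append(word)
    simple.append(("&", current))
    return simple

# split_complex([("John", "noun"), ("ran", "verb"), ("and", "conjunction"),
#                ("Mary", "noun"), ("walked", "verb")])
# -> [("&", ["John", "ran"]), ("&", ["Mary", "walked"])]
```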
Step 4: Phrase culling.
This set of tools isolates clauses and phrases from the main sentence. Every clause and phrase is broken out into a separate sentence, even prepositional phrases (which get the dummy verb “to be”). Any sentence can be a “condition” of another sentence, linked by what I collectively call “joining words” that describe its relationship to the parent sentence. Joining words can be conjunctions, prepositions, subordinating conjunctions, etc.
Participial phrases are currently assigned the joining word “while” to indicate the timing with reference to the sentence they modify (“parent sentence”), though this might change. Some types of phrases have no joining word, in which case I assign “and”. This might become more sophisticated as needed, but it works well for now.
Sometimes it’s a trick to attribute the correct subject to dependent clauses, which cannot function independently, in order to turn them into proper sentences. This is where previous experience comes into play: the multiple possibilities are ranked by their probabilities.
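To picture the output of this step, take “Walking home, John saw a dog.” A hypothetical rendering (field names and layout are mine, not ALEX’s exact format):

```python
# Parent sentence, with the culled phrase attached as a condition.
parent = {
    "id": "T1",
    "sentence": "John saw a dog",
    "conditions": [("while", "T2")],  # joining word + ID of the phrase's tree
}

# The participial phrase, promoted to a proper sentence. Its subject is
# missing in the input, so "John" is attributed from the parent sentence,
# ranked as the most probable candidate.
phrase = {
    "id": "T2",
    "sentence": "John was walking home",
    "conditions": [],
}
```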
Step 5: Phrase attribution.
All phrases modify some other phrase or the main sentence. (See Victor’s infamous elephant in pajamas example.) Using examples already structured in the knowledge base, the bot will assign all phrases as “conditions” of the appropriate parent sentence. Even the direct object (DO) and indirect object (IDO) of the sentence will be stored in the knowledge base as a separate sentence (with the dummy verb “to be”). Thus the DO sentence can also be the parent of a phrase.
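To see why attribution matters, take the pajamas sentence itself: in “I shot an elephant in my pajamas”, the prepositional phrase (turned into its own sentence with the dummy verb “to be”) can become a condition of either parent. A hypothetical sketch:

```python
# Reading 1: the phrase conditions the main sentence (the shooter wore them).
main = {"id": "T1", "conditions": [("in", "T3")]}     # T3 = "I was in my pajamas"

# Reading 2: the phrase conditions the direct object's tree (the joke reading).
do_tree = {"id": "T2", "conditions": [("in", "T3")]}  # T3 = "the elephant was in my pajamas"
```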
This part is currently being written and I’ll discuss it in greater detail once I’ve fleshed out my algorithms more.
Step 6: Parse tree formation.
Based on simple grammar rules, each sentence is divided into a Python dictionary that contains the following (a sketch of the full structure follows the list below):
1) Subject
2) Main Verb
3) Verb Phrase (gerunds/infinitives)
4) Adverbs (modifying the verb/verb phrase)
5) Adjectives (includes adverbs that modify the adjectives)
6) Direct Object (and dependents, see below)
7) Indirect Object (and dependents, see below)
8) Conditions
Condition lists contain members with two elements: the joining word, indicating how the condition is related to the parent sentence, and an ID (special token) identifying which tree is the condition.
Each direct and indirect object is the subject of its own parse tree. In other parse trees, they are referenced by their IDs. Each parse tree also contains:
9) Its own unique ID
10) A list of IDs that reference this ID
11) The probability that the tree is true.
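Here’s a sketch of what a pair of trees might look like for “The hungry dog quickly ate a red apple”, with illustrative key names (the real dictionary layout may differ):

```python
tree = {
    "id": "T7",                # 9) its own unique ID
    "subject": "dog",
    "main_verb": "ate",
    "verb_phrase": None,       # no gerund/infinitive in this sentence
    "adverbs": ["quickly"],
    "adjectives": ["hungry"],
    "direct_object": "T8",     # "a red apple" lives in its own tree
    "indirect_object": None,
    "conditions": [],          # (joining_word, tree_id) pairs
    "referenced_by": [],       # 10) IDs of trees that reference T7
    "probability": 0.90,       # 11) probability that the tree is true
}

# The direct object's tree, built with the dummy verb "to be":
do_tree = {
    "id": "T8",
    "subject": "apple",
    "main_verb": "be",
    "adjectives": ["red"],
    "referenced_by": ["T7"],
    "probability": 0.90,
}
```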
The probability is determined by the rank of the gram that formed the tree, how many grammar rules were used to create it, and how well the structured data the tree contains agrees with existing members of the knowledge base, weighted by the probability that those members are true. I’m still refining this part of the algorithm. Expect to hear more on this later.
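Purely as a placeholder for how those three signals might combine (every weight here is made up; the real scheme is still in flux):

```python
def tree_probability(gram_rank, rules_used, kb_agreement):
    """Hypothetical scoring function.

    gram_rank:    1 = top-ranked gram; larger means a less likely gram
    rules_used:   number of grammar rules applied while building the tree
    kb_agreement: 0..1, agreement with the knowledge base, already weighted
                  by the probabilities of the facts it was checked against
    """
    gram_score = 1.0 / gram_rank
    rule_penalty = 0.95 ** rules_used  # each extra rule adds a little doubt
    return gram_score * rule_penalty * kb_agreement
```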
—
While all of the above steps exist in some form or another, I’m doing a heavy round of revision right now and expect to change things around a bit. Until later in March I won’t have much time to dedicate to this, alas. The plan for this spring/summer is to:
1) Finish the latest round of edits
2) Improve the parser to the point that it can handle complex sentences with a 75% “first try” success rate (lots of training)
3) Work on the logic algorithms that act on the knowledge base in order to derive new facts from already organized facts (parse trees); a generic sketch of that kind of deduction closes this post.
I’m thinking these logic processes will eventually be run while ALEX is not actively being engaged (i.e., not reading the NL lessons I’ve written), sort of like how our brain processes our experiences while we sleep.
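For the deduction piece, a generic forward-chaining loop is the standard starting point; ALEX’s version will operate on parse trees rather than bare tuples, but the shape is the same:

```python
def forward_chain(facts, rules):
    """Apply (premises, conclusion) rules until no new facts appear."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in facts and all(p in facts for p in premises):
                facts.add(conclusion)
                changed = True
    return facts

# forward_chain({("Socrates", "is", "man")},
#               [([("Socrates", "is", "man")], ("Socrates", "is", "mortal"))])
# -> {("Socrates", "is", "man"), ("Socrates", "is", "mortal")}
```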