I’m going to go over some thoughts on bots and AI that might offend some people. None of this is meant to diminish or belittle any of the excellent work that has been done on bots or bot systems by any developers so far. These are just observations, and I hope, a call to action for new paths in the field.
I’ve spent a lot of time over the last several months thinking and coding and trying different things, trying to gain traction on the problem of depth in chat. AIML and other pattern/trigger based bots are nothing more than puppets with very clever strings. There’s no way to incorporate real depth into those conversations without adding more clever strings, and at some point the complexity of the system exceeds the capacity of the computer to process or store it.
Combinatorial explosion happens when you try to automate patterned-response bots. This is the sad but true reality we keep butting our heads against while trying to find some holy grail variation of AIML that will produce a believable, Turing-level bot.
It’s not possible. Here’s why. Let’s take a controlled language, like Basic English, and define some patterns and responses. With a vocabulary of 800 words, a maximum pattern length of around 8 words, and repetition allowed, there are 800^8, or about 167,772,160,000,000,000,000,000 (roughly 168 sextillion, or 1.68 × 10^23), possible 8-word combinations. Obviously natural language isn’t as random as that, but those combinations represent the borders of the first dimension of the problem space.
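Just to sanity-check that arithmetic, here’s the same calculation in Python, assuming the 800-word vocabulary and 8-word patterns with repetition, as above:

```python
# Size of the input space: an 800-word vocabulary, 8-word patterns,
# repetition allowed, so the count is simply 800^8.
vocabulary_size = 800
pattern_length = 8

input_space = vocabulary_size ** pattern_length
print(f"{input_space:,}")    # 167,772,160,000,000,000,000,000
print(f"{input_space:.2e}")  # 1.68e+23
```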
The second dimension of the problem space represents the response to the 8 word input. Now you have a 2 dimensional plane upon which a given point represents an input/response pair.
This is as far as most ELIZA type bots go. They restrict the problem space to a toy domain and perform the functional equivalent of a lookup based on the interpretation of the point in that problem space. The algorithm itself is essentially a 2 dimensional array lookup function.
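To make that concrete, here’s a minimal sketch of an ELIZA-style lookup. The patterns and responses are invented purely for illustration; the shape of the algorithm is the point:

```python
import re

# A toy ELIZA-style bot: each entry maps a regex pattern to a canned
# response template. The whole "mind" is this one lookup table.
RULES = [
    (re.compile(r"\bi feel (.+)", re.I), "Why do you feel {0}?"),
    (re.compile(r"\bmy (\w+) is (.+)", re.I), "Tell me more about your {0}."),
    (re.compile(r".*"), "Please, go on."),
]

def respond(line: str) -> str:
    # Scan the table top to bottom and return the first match:
    # a pure lookup, with no memory of anything said earlier.
    for pattern, template in RULES:
        match = pattern.search(line)
        if match:
            return template.format(*match.groups())
    return "..."

print(respond("I feel lost"))        # Why do you feel lost?
print(respond("My code is broken"))  # Tell me more about your code.
```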
This algorithm has no nuance, and has no concept of higher order relations. No matter how cleverly you manipulate your search function, you’re still only working with 2 dimensions, and this is what I refer to as the absence of depth.
In order to add depth to the algorithm, you need a third dimension. This means that every possible 8 word input (x) maps to an 8 word output (y), and that mapping can itself depend on a third 8 word term (z), such as whatever was said before.
Consider what this means in terms of computing: you have a 3 dimensional problem space restricted to a vocabulary of 800 words in 8 word sets. This brute force methodology brings you face to face with a monstrous (1.68 × 10^23)^3, or roughly 4.7 × 10^69, space. Even a computer that used every atom of the Earth as a storage cell couldn’t hold that table, let alone search it.
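Again, rough numbers; the figure I’m using for the atoms in the Earth (about 1.3 × 10^50) is a commonly cited order-of-magnitude estimate:

```python
# Cube the input space to get the naive (input, response, context) table,
# then compare it with a rough estimate of the number of atoms in the Earth.
input_space = 800 ** 8          # ~1.68e23
depth_space = input_space ** 3  # ~4.7e69
atoms_in_earth = 1.3e50         # order-of-magnitude estimate

print(f"{depth_space:.1e}")                   # 4.7e+69
print(f"{depth_space / atoms_in_earth:.1e}")  # ~3.6e+19 table cells per atom
```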
There is no physically or mathematically feasible way to implement a search function that approximates general intelligence. Chatbots that incorporate search as the engine of intelligence will only ever work on extremely limited domains.
Applied to real language, whose vocabularies can exceed 100,000 words, the numbers become even more ridiculous. ELIZA style bots are dead ends, except for extremely limited domains.
Human beings don’t have septillions of neurons, so how do we manage conversation and intelligence? Well, we don’t use a brute force search function. Our brains do something far more interesting than “if/then.” Neurons communicate with one another through electrochemical signals carried along axons and dendrites. Most of these neurons are arranged in a thin sheet on the surface of the brain, the neocortex. Most neural activity happens at that surface; the vast majority of the brain’s mass consists of the connections between neurons.
The neuroscientist Vernon Mountcastle posited that there is a single cortical algorithm: neurons in any one part of the neocortex do essentially the same thing as neurons in any other part. Differences in function come about from their connections to other organs and how those connections interface with the outside world. The basic unit of computation is the cortical column, a stack of neurons arranged in roughly six layers within the neocortex.
The biological discoveries about brain function were accompanied by discoveries in artificial neural networks, which turned out to be universal function approximators. This means that any problem space can be modeled in a neural network, and any function within the problem space can be approximated by changing the weights of the network. People got really excited in the ’80s when neural networks became more accessible, and many breakthroughs were made in backpropagation training techniques.
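The classic illustration is a tiny network trained by backpropagation to approximate XOR, a function no single linear layer can represent. The layer sizes and learning rate below are arbitrary, and this is only meant as a reminder of what those ’80s breakthroughs bought us:

```python
import numpy as np

# A two-layer network trained with backpropagation to approximate XOR.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 4)); b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1)); b2 = np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(20000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: gradient of squared error through both layers.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(axis=0, keepdims=True)
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())  # should approach [0, 1, 1, 0]
```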
The problem with this, with regard to the design of an intelligent chatbot, is that while a neural network can approximate nonlinear functions, it cannot ignore the dimensionality of the problem space. You would still need to exhaustively model the possible inputs and outputs, and either incorporate a complex feedback system or explicitly add a new set of inputs for every previous percept in the sequence that you want to influence the output. Neural networks are just as susceptible to combinatorial explosion as the simple search function.
The crux of the problem is that while generating a naively plausible response is easy, real depth in a conversation requires knowing what has come before. Not a perfect memory of each word, but the spreading activation of potentially relevant connections within a semantic network.
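As a toy sketch of what I mean by spreading activation (the network and the decay factor here are invented purely for illustration):

```python
from collections import defaultdict

# A toy semantic network: edges carry association strengths, and activation
# from recently mentioned concepts spreads outward with decay, so related
# concepts stay "warm" without storing the conversation verbatim.
EDGES = {
    "dog":  {"pet": 0.9, "bark": 0.8},
    "pet":  {"vet": 0.7, "dog": 0.9},
    "bark": {"tree": 0.3, "dog": 0.8},
}

def spread(seeds, decay=0.5, hops=3):
    activation = defaultdict(float)
    frontier = {concept: 1.0 for concept in seeds}
    for _ in range(hops):
        next_frontier = defaultdict(float)
        for concept, energy in frontier.items():
            activation[concept] = max(activation[concept], energy)
            for neighbor, weight in EDGES.get(concept, {}).items():
                next_frontier[neighbor] += energy * weight * decay
        frontier = next_frontier
    return dict(activation)

# "dog" was mentioned earlier in the conversation; "vet" is now primed too.
print(spread(["dog"]))
```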
The problem space in which a chatbot operates can be reduced fairly easily by dividing it into features. The most common approach to this is to do some sort of POS tagging. Dimensionality reduction is achieved through many statistical means. None of these are necessarily easy, and none of them so far have translated into deep conversation, because no matter how you reduce the problem space, you still have to produce a function that approximates some sort of intelligence.
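One of the crudest reductions, for example, is feature hashing: collapse an unbounded vocabulary into a fixed number of buckets (the bucket count below is arbitrary). It tames the input dimensionality, but it obviously doesn’t decide what to say:

```python
import hashlib

# Feature hashing: map words into a fixed-width count vector regardless of
# how large the vocabulary grows.
def hashed_features(text: str, buckets: int = 64) -> list[int]:
    vector = [0] * buckets
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % buckets] += 1
    return vector

print(sum(hashed_features("the quick brown fox jumps over the lazy dog")))  # 9 words counted
```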
Neural networks fail because they have no memory, or because it’s extremely hard to model the entire problem space.
Two algorithms of which I am aware, one relatively new and the other relatively unknown, have the potential to produce a functional general intelligence that can serve as a chatbot’s mind.
The first is Numenta’s Cortical Learning Algorithm / Hierarchical Temporal Memory, which is slated to be open-sourced sometime soon. The second is the neural model known as Long Short Term Memory, incorporated into a memory-prediction framework.
Both algorithms use the memory-prediction theory to perform learning and processing.
The downfall of CLA/HTM is its complexity and poor scalability. Thus far, the implementations don’t appear very efficient: they only operate on binary inputs, and they take a huge amount of memory and processing power. My experiments with it led me to conclude that while a deep-conversation bot could be built around HTM, the depth would be severely limited unless the algorithm became much more efficient.
This led me to search for a more efficient algorithm, and I revisited one of my earlier favorites, LSTM. An LSTM neuron, instead of simply summing all inputs from the previous layer and applying an activation function, uses weighted inputs from cells in the previous layer to drive gates that tell it to memorize, forget, pass input through, or damp it. The model allows activation to be stored indefinitely, making the neuron inherently temporally aware across an arbitrary number of cycles.
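Here’s a bare-bones sketch of a single LSTM cell step in the standard formulation, with input, forget, and output gates; the weights are random placeholders, so this shows the mechanics rather than anything trained:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden = 8, 16

# One weight matrix per gate, each acting on [input, previous hidden state].
Wi, Wf, Wo, Wc = (rng.normal(scale=0.1, size=(n_in + n_hidden, n_hidden)) for _ in range(4))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    i = sigmoid(z @ Wi)                   # input gate: how much new input to store
    f = sigmoid(z @ Wf)                   # forget gate: how much old memory to keep
    o = sigmoid(z @ Wo)                   # output gate: how much memory to expose
    c = f * c_prev + i * np.tanh(z @ Wc)  # cell state: the persistent memory
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for t in range(5):                        # memory carries across arbitrarily many steps
    h, c = lstm_step(rng.normal(size=n_in), h, c)
print(h.shape, c.shape)                   # (16,) (16,)
```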
The upshot is that contextual information is preserved and used when needed. After determining that LSTM neurons provided the basic features needed for a deep-conversation bot, I set about searching for a way of implementing the memory-prediction theory.
One way would be to simply train a network over time to predict its current inputs from the previous inputs into the system, using backpropagation. This would mean selecting an initial layer structure and arbitrary constraints. It’s easy to see how that would fail.
Another way, which I am currently exploring, is to use autoencoders as a spatial learning technique, and then add a layer that accepts the output of the autoencoder and the contents of each neuron’s memory to predict the next input into the network. This algorithm would dynamically add layers when it encounters patterns it’s not capable of predicting, and the activation of a successful layer would trigger the memory cells to pass their input to a prediction layer.
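As a rough sketch of the first half of that idea, here’s a tiny tied-weight autoencoder learning a compressed spatial code from sparse binary patterns; the prediction layer and the dynamic growth aren’t shown, and the sizes are arbitrary:

```python
import numpy as np

# A tiny tied-weight autoencoder: reconstruct binary input patterns through a
# narrow hidden layer, so the hidden activations become a compact spatial code
# that a prediction layer could consume.
rng = np.random.default_rng(2)
n_visible, n_hidden = 32, 8
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

data = (rng.random((100, n_visible)) < 0.2).astype(float)  # sparse binary patterns

for step in range(2000):
    code = sigmoid(data @ W)     # encode
    recon = sigmoid(code @ W.T)  # decode with the transposed (tied) weights
    err = recon - data
    # Backpropagate the reconstruction error through both passes, accumulating
    # the gradient on the single shared weight matrix.
    d_recon = err * recon * (1 - recon)
    d_code = (d_recon @ W) * code * (1 - code)
    grad = data.T @ d_code + (code.T @ d_recon).T
    W -= 0.01 * grad / len(data)

print(float(np.mean(err ** 2)))  # reconstruction error shrinks as training proceeds
```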
I’m still working on the details, but I want to use backpropagation as the learning method, because it’s well understood and extensible to other optimization techniques.
The actual architecture of the bot would consist of an input space that reads lines of text as pixels, an output space that behaves as a buffer, and one or more neural networks arranged in a hierarchical manner. The bot should have the capacity to recognize its own output and cogitate accordingly.
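One way to picture that input space is to render each line of text as a binary “pixel” matrix; the fixed width and the 8-bit character encoding below are assumptions for illustration, not a settled design:

```python
# Render a line of text as a binary matrix: one row per character, one column
# per bit. The network would see only this pixel-like input; words, grammar,
# and meaning all have to be learned from patterns in it.
def text_to_pixels(line: str, width: int = 40) -> list[list[int]]:
    padded = line[:width].ljust(width)
    return [[(ord(ch) >> bit) & 1 for bit in range(8)] for ch in padded]

pixels = text_to_pixels("Hello, bot.")
print(len(pixels), len(pixels[0]))  # 40 rows (characters), 8 columns (bits)
```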
If HTM proves to be more efficient, I wouldn’t hesitate to attempt the same sort of architecture to produce a deep-conversation bot, but I have a gut feeling that LSTM will prove a more efficient system altogether, because of its relative simplicity.
One neat thing about either approach is that both are extensible. Once you have a foundational network, you can add capacity to the system and it utilizes the new resources intelligently.