Perhaps this is more of a chicken-and-egg problem. I tend to agree that a fully self-learning system (similar to how we learn) needs multiple input channels, so that the data from one channel can be compared against another (or so that the rendering output in one format can be checked against the concept expressed in another). But I think it is very difficult to build a system like that from the start, so I would propose a bootstrapping technique.
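To make the cross-channel idea concrete, here is a minimal sketch of what I mean by one channel checking another. Everything in it is a made-up illustration, not part of any existing system: the toy concept vectors, the cosine measure, and the 0.8 threshold are all assumptions.

```python
# Sketch: two channels each map the same scene to a concept vector,
# and agreement between the vectors acts as the training signal,
# so no external labels are needed. All numbers are placeholders.

def cosine(a, b):
    """Agreement between two concept embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def cross_channel_signal(text_vec, vision_vec, threshold=0.8):
    """If both channels express the same concept, reinforce both;
    otherwise flag the sample: one channel's rendering does not
    match the concept expressed in the other."""
    score = cosine(text_vec, vision_vec)
    return ("reinforce", score) if score >= threshold else ("flag", score)

# Toy example: both channels roughly agree on the same concept.
print(cross_channel_signal([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]))
```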
Here is what I am trying to do: first get the text part working well enough, then finish the visual algorithms and get them working well enough too. Only then do I start designing the details of the algorithm that integrates both into a self-learning system, because only at that point are all the data structures known and available for testing.
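As a rough sketch of that ordering (the stage names, quality scores, and the 0.9 bar below are all placeholder assumptions, not measurements), the control flow would look something like this:

```python
# Sketch of the bootstrapping order: each stage must pass a quality
# bar before the next stage is designed around its (now stable)
# data structures.

STAGES = ["text", "vision", "integration"]

def bootstrap(build, quality, bar=0.9):
    """Build stages strictly in order; stop at the first stage
    that is not yet working well enough."""
    done = {}
    for name in STAGES:
        done[name] = build(name, done)   # later stages see earlier ones
        score = quality(name, done[name])
        if score < bar:
            print(f"stage '{name}' at {score:.2f} -- refine before moving on")
            break
    return done

# Toy run: pretend text is solid but vision still needs work.
fake_scores = {"text": 0.95, "vision": 0.70, "integration": 0.0}
bootstrap(build=lambda name, prev: f"<{name} module>",
          quality=lambda name, art: fake_scores[name])
```

The point of the gate is that the integration stage only gets designed once the data structures produced by the earlier stages have stopped changing.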
In my experience, I generally start something with a clear picture in mind, only to end up with something completely different. That was the case for the text stage; it will probably be the case for the visual algorithms I have on paper; and the final stage I can only dream of at the moment.