I have used ternary search tries (TSTs) for years, and they are integrated into my own code. They are robust for word lookup, but not for pattern matching: if you use them for that, you need to build a new TST for each pattern, which makes the whole thing ugly and slow at runtime. They are also not compact; they waste a lot of space, and access is not logarithmic, it is asymptotically linear.
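For reference, a minimal character-level TST in C++ looks something like this; a sketch with my own names, storing plain words, nothing pattern-related:

```cpp
#include <memory>
#include <string>

// Minimal character-level ternary search trie (TST) node.
struct TstNode {
    char ch;                      // split character at this node
    bool terminal = false;        // true if a stored word ends here
    std::unique_ptr<TstNode> lo, eq, hi;
    explicit TstNode(char c) : ch(c) {}
};

// Insert a word, one character per level of the 'eq' chain.
void insert(std::unique_ptr<TstNode>& node, const std::string& word, size_t i = 0) {
    if (i >= word.size()) return;
    if (!node) node = std::make_unique<TstNode>(word[i]);
    if (word[i] < node->ch)        insert(node->lo, word, i);
    else if (word[i] > node->ch)   insert(node->hi, word, i);
    else if (i + 1 == word.size()) node->terminal = true;
    else                           insert(node->eq, word, i + 1);
}

// Exact lookup: walks lo/hi like a BST and eq like a trie.
bool contains(const TstNode* node, const std::string& word, size_t i = 0) {
    if (!node || word.empty()) return false;
    if (word[i] < node->ch) return contains(node->lo.get(), word, i);
    if (word[i] > node->ch) return contains(node->hi.get(), word, i);
    if (i + 1 == word.size()) return node->terminal;
    return contains(node->eq.get(), word, i + 1);
}
```

Each input character may cost several lo/hi hops before the eq link advances, which is where the space and access-time complaints above come from.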
I used something like a binary tree for AIML pattern matching, so no, you don't need a new TST for each pattern. Building one matcher per pattern is what you would do with regular expressions, trying each expression individually, which is why context becomes so important for reducing the number of expressions to try.
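Roughly, that regex route looks like this sketch, with a context key used to shrink the candidate list; the containers and sample patterns here are only illustrative assumptions:

```cpp
#include <iostream>
#include <map>
#include <regex>
#include <string>
#include <vector>

// Patterns grouped by conversational context, so only a small
// subset of expressions is tried against each input.
std::map<std::string, std::vector<std::regex>> patternsByContext = {
    {"greeting", {std::regex("hello.*"), std::regex("hi there.*")}},
    {"weather",  {std::regex(".*rain.*"), std::regex(".*sunny.*")}},
};

// Try each expression in the current context individually.
bool matchInContext(const std::string& context, const std::string& input) {
    for (const auto& re : patternsByContext[context])
        if (std::regex_match(input, re)) return true;
    return false;
}

int main() {
    std::cout << matchInContext("greeting", "hello world") << '\n';  // prints 1
}
```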
It sounds like you consume a single character at each node when you employ a TST. I am considering consuming a word instead, that is, the actual text of the word and not a symbol derived by using another TST to convert the words into symbols (something like a spell checker, except that a match returns a token). I don't want to deal with a lexicon of all the possible inputs and then have to handle things that may not be in my dictionary, such as a person's name.
With the binary tree I use now for AIML, each node matches the word leading to the next node using a hash table. With a TST it appears I'd be replacing that hash lookup with a binary search. Either way, this is still much faster than your next suggestion; a sketch of what I mean follows.
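Here is a rough sketch of the word-per-node idea from the last two paragraphs, with a hash table (std::unordered_map) choosing the child; a TST would swap that lookup for a binary search among the children. The structure and names are mine, not the actual AIML engine:

```cpp
#include <memory>
#include <sstream>
#include <string>
#include <unordered_map>

// Word-level trie: each edge consumes a whole word of the input, keyed
// by the raw word text, so unknown tokens such as personal names need
// no lexicon entry.
struct WordNode {
    bool terminal = false;
    std::unordered_map<std::string, std::unique_ptr<WordNode>> children;
};

void insertPattern(WordNode& root, const std::string& pattern) {
    WordNode* node = &root;
    std::istringstream words(pattern);
    for (std::string w; words >> w; ) {
        auto& child = node->children[w];
        if (!child) child = std::make_unique<WordNode>();
        node = child.get();
    }
    node->terminal = true;
}

bool matches(const WordNode& root, const std::string& input) {
    const WordNode* node = &root;
    std::istringstream words(input);
    for (std::string w; words >> w; ) {
        auto it = node->children.find(w);
        if (it == node->children.end()) return false;  // unknown word: no match
        node = it->second.get();
    }
    return node->terminal;
}
```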
The best approach is to build an automaton and implement the graph so that one or more symbols can be optional; this works fine and is fast and scalable. You can also compile the whole thing into a transition table: the automaton then only jumps from row to column as the symbols are input, and if you are in a terminal state when the input ends, you've got a match. Voilà!
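A minimal table-driven sketch of that, with made-up symbol ids and states; it recognizes the word sequence "hello (there)? world", where "there" is optional:

```cpp
#include <vector>

// Table-driven automaton: rows are states, columns are symbol ids.
// REJECT marks a missing transition.
const int REJECT = -1;

bool runAutomaton(const std::vector<std::vector<int>>& table,
                  const std::vector<bool>& accepting,
                  const std::vector<int>& symbols) {
    int state = 0;                    // row 0 is the start state
    for (int s : symbols) {
        state = table[state][s];      // jump from row to column
        if (state == REJECT) return false;
    }
    return accepting[state];          // terminal state at end of input?
}

int main() {
    // Assumed symbol ids: 0 = "hello", 1 = "there", 2 = "world".
    // The optional word is just an extra edge in the graph,
    // no backtracking needed.
    std::vector<std::vector<int>> table = {
        {1, REJECT, REJECT},       // state 0: expect "hello"
        {REJECT, 2, 3},            // state 1: optional "there", or "world"
        {REJECT, REJECT, 3},       // state 2: saw "there", expect "world"
        {REJECT, REJECT, REJECT},  // state 3: terminal, no more input
    };
    std::vector<bool> accepting = {false, false, false, true};
    bool ok = runAutomaton(table, accepting, {0, 1, 2})   // hello there world
           && runAutomaton(table, accepting, {0, 2});     // hello world
    return ok ? 0 : 1;
}
```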
A TST is a graph. If it is implemented in C or C++, you follow the nodes through the heap using pointers, so there would be little overhead, probably less than when the graph is compiled into an automaton. If you are using a runtime engine like Java or the .NET CLR, it will be much slower (JIT loading, automatic garbage collection, and a number of other things). Furthermore, the symbols will typically be matched sequentially in the automaton.
The apparent speed probably comes from the input being converted into symbols, which turns each comparison into a register-to-register compare instead of a string-to-string match. But as I pointed out, you lose that speed advantage once you factor in the cost of converting the input into those symbols, and you introduce another set of issues around maintaining the lexicon.
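To make that trade-off concrete, here is a minimal interning sketch (my own names, no particular library):

```cpp
#include <string>
#include <unordered_map>

// Interning: map each distinct word to a small integer id, so later
// comparisons are single register-to-register integer compares instead
// of character-by-character string matches.
class Interner {
    std::unordered_map<std::string, int> ids_;
public:
    int intern(const std::string& word) {
        // A new word's id is the table size before insertion.
        auto it = ids_.try_emplace(word, static_cast<int>(ids_.size())).first;
        return it->second;
    }
};
```

The catch sits in intern() itself: every input word is hashed and looked up before any fast compares happen, and ids_ is exactly the lexicon, with all the maintenance issues that implies.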
Also, an automaton could run into issues if your optional symbols recurse: a finite automaton has no stack to track nesting depth, so recursive patterns push you toward a pushdown automaton or a full parser.
Andy identified that you'd need a transition graph or automaton for each pattern to compare. You might relieve this by representing all the patterns in a single graph (see the sketch below); at that point we are talking about a grammar. Several developers here prefer grammars. They follow the initial run of the automaton over the syntax, the parsing, with additional rule processing to get at the semantics that clarifies the meaning of the input because, as you know, an utterance may have more than one interpretation.
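Under the assumption that patterns are plain word sequences, the single-graph idea might look like this: all patterns live in one structure, terminal nodes carry a pattern id, and one walk over the input reports which pattern, if any, matched:

```cpp
#include <memory>
#include <sstream>
#include <string>
#include <unordered_map>

// One shared graph for all patterns; a terminal node stores the id of
// the pattern that ends there, so a single walk finds the match.
struct GraphNode {
    int patternId = -1;  // -1: no pattern terminates here
    std::unordered_map<std::string, std::unique_ptr<GraphNode>> next;
};

void addPattern(GraphNode& root, const std::string& pattern, int id) {
    GraphNode* n = &root;
    std::istringstream words(pattern);
    for (std::string w; words >> w; ) {
        auto& child = n->next[w];
        if (!child) child = std::make_unique<GraphNode>();
        n = child.get();
    }
    n->patternId = id;
}

int matchPattern(const GraphNode& root, const std::string& input) {
    const GraphNode* n = &root;
    std::istringstream words(input);
    for (std::string w; words >> w; ) {
        auto it = n->next.find(w);
        if (it == n->next.end()) return -1;  // no pattern matches
        n = it->second.get();
    }
    return n->patternId;
}
```

Because the patterns share prefixes in one graph, adding a pattern does not add another full matcher to try; the common front of the graph is walked once per input.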
Pattern matching is what you do when your eyes scan ahead in text to get the gist of what you are reading. Parsing is what you do when reading aloud, or when reading to yourself so that you hear the words in your consciousness, that inner voice. Ultimately your mind must process the material, which is where the real salesmanship begins. An effective sale is typically made in the presentation; pattern matching or parsing only needs to cue the best sales presentation for the client.