Merlin, take a good look at the file that I wrote a few years ago for converting wikitext markup into XML, which I have attached to this post. According to the documentation on the Wikimedia website, there have been over 30 attempts to write a good library to replace the appalling mess of PHP that currently converts Wikipedia pages into HTML. By my reckoning, the program I wrote for that purpose accomplished more than anyone else has managed so far.
But that’s neither here nor there. The program was written in Flex, the fast lexical analyser generator (the free successor to lex), and that is about as sophisticated as RE processing gets. I’ve also written complex and beautiful programs using POSIX REs, Advanced REs, Extended REs and their ilk in PostgreSQL, Perl and JavaScript, and I regularly use Perl Compatible REs in my C programs. However, in the case of the wikitext parser, I gave it up as a bad idea: although it worked well, it was turning into just as big a mess as the unmaintainable Wikimedia code it was intended to replace.
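To make the problem concrete, here is a toy illustration of why plain REs fall over on wikitext (in Python rather than Flex, and nothing like my actual code): templates nest, and a regular expression cannot count braces.

```python
import re

# A naive RE for a wikitext template. It has no notion of nesting,
# so the non-greedy ".*?" stops at the first "}}" it encounters.
template = re.compile(r"\{\{.*?\}\}")

text = "{{outer|arg={{inner|x}}|tail}}"
print(template.search(text).group(0))
# -> "{{outer|arg={{inner|x}}" -- truncated inside the nested template
```

Just about every construct in wikitext nests like this, which is why the patterns multiply until the whole thing collapses under its own weight.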
To your credit, you’ve taken the trouble to do a bit of research to back up your claims. Unfortunately, you have still not come up with anything to show that, outside a very narrow range of applications, the use of regular expressions for complex parsing tasks isn’t doomed to fail unless it is propped up by an ever-growing tangle of bolted-on (and, more often than not, mutually incompatible) kludges.
Most of those kludges “allow” you to hand-code the complex operations that the computer ought to be able to code for you. It’s like having to get out of the car and walk to your destination when you should be able to drive all the way. Or maybe you don’t have a driver’s licence yet and all you’re competent to do is back the car out of the garage. I’m going to go with that until you manage to prove otherwise.
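Recursive patterns are a perfect example of the kind of kludge I mean. PCRE, and the third-party Python regex module I’m using here purely for illustration, let a pattern call itself with (?R), which “fixes” the nesting example above by smuggling a recursive-descent parser in through the back door:

```python
import regex  # third-party module; the standard "re" has no recursion

# "(?R)" recursively matches the entire pattern, so braces balance --
# a context-free mechanism bolted onto an RE engine.
template = regex.compile(r"\{\{(?:[^{}]|(?R))*\}\}")

text = "{{outer|arg={{inner|x}}|tail}}"
print(template.search(text).group(0))
# -> "{{outer|arg={{inner|x}}|tail}}" -- the full balanced match
```

The moment you need (?R) you have left regular languages behind, and you are hand-coding grammar rules in a notation that was never designed to express them.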
Anyway, I haven’t revisited the wikitext problem with my CFG parser yet, though I have already used it to implement a number of fast and elegant parsers for much more complex problems. I even published the full source code for one of them in this forum: the parser for discourse analysis, which you can download for comparison from http://www.chatbots.org/ai_zone/viewreply/7811/
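For contrast, here is the same nested-template case handled the way a CFG handles it, with the recursion in the grammar rather than bolted on. This is only a toy sketch of the idea, not the grammar or the code that my parser actually uses:

```python
# Toy grammar:  template -> "{{" item* "}}"
#               item     -> template | text
# Assumes well-formed input; a real parser would report errors instead.
def parse_template(s, i=0):
    """Parse a template starting at s[i]; return (tree, next index)."""
    assert s[i:i + 2] == "{{"
    i += 2
    children = []
    while s[i:i + 2] != "}}":
        if s[i:i + 2] == "{{":
            node, i = parse_template(s, i)   # nesting is just recursion
            children.append(node)
        else:
            j = i
            while s[j:j + 2] not in ("{{", "}}"):
                j += 1
            children.append(s[i:j])          # a run of plain text
            i = j
    return ("template", children), i + 2

tree, _ = parse_template("{{outer|arg={{inner|x}}|tail}}")
print(tree)
# -> ('template', ['outer|arg=', ('template', ['inner|x']), '|tail'])
```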
The video presentation from Peter Norvig was very interesting, so thanks for that, Merlin. In it he is saying pretty much exactly what Laura and CR have been saying in this thread, but with visible hand-waving. Did you watch the other videos in the series? In the next one he explains that the solution to the problems he has just described is to use Probabilistic Context-Free Grammars. (No mention of Probabilistic Regular Expressions or supporting frameworks anywhere, though maybe Larry hasn’t figured out a way to bolt those on yet. Give him time; he is the king of kludges, after all.)
In fact, Professor Norvig is also wrong about this. Probabilistic “anything” has been very fashionable for more than a decade because of the ready availability of crunchable data from the internet, and because it’s so easy that even MBAs can understand it and open their chequebooks to fund the research. However, James Allen showed very handily as far back as 1996 (chapter 7 of “Natural Language Understanding”) that all it was good for was speeding up parsing algorithms by a small but significant factor, and that it still didn’t solve any of the real problems (e.g. ambiguity resolution) satisfactorily by itself. As Professor Norvig is such a busy man, I guess he can be excused for being a little out of touch.
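Allen’s point is easy to demonstrate with the classic prepositional-phrase attachment ambiguity. Here is a toy PCFG scoring the two parses of “saw the man with the telescope”; the probabilities are invented for illustration, not taken from any treebank:

```python
# Toy PCFG rule probabilities (invented numbers, illustration only).
# Rules shared by both parses (PP -> P NP, and NP -> Det N for
# "the telescope") multiply both scores equally, so they are omitted.
rules = {
    ("VP", ("V", "NP")):       0.5,
    ("VP", ("V", "NP", "PP")): 0.3,  # PP attaches to the verb
    ("NP", ("Det", "N")):      0.6,
    ("NP", ("NP", "PP")):      0.2,  # PP attaches to the noun
}

def parse_prob(rule_seq):
    """A parse's probability is the product of its rules' probabilities."""
    p = 1.0
    for rule in rule_seq:
        p *= rules[rule]
    return p

# "saw [the man] [with the telescope]" -- PP modifies the verb
verb_attach = [("VP", ("V", "NP", "PP")), ("NP", ("Det", "N"))]
# "saw [the man with the telescope]" -- PP modifies the noun
noun_attach = [("VP", ("V", "NP")), ("NP", ("NP", "PP")),
               ("NP", ("Det", "N"))]

print(parse_prob(verb_attach))  # 0.3 * 0.6 = 0.18
print(parse_prob(noun_attach))  # 0.5 * 0.2 * 0.6 = 0.06
```

The verb attachment wins, and it wins by exactly the same margin for “saw the man with the hat”, where it is plainly the wrong reading. The words themselves never enter into the calculation, which is why the probabilities speed up the search without actually resolving the ambiguity.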
While linguists generally are still arguing vehemently about exactly how language works, there is one thing that they all agree on: the use of context-free grammar will be part of the solution. For the very latest theories on the subject, I invite you to do a bit of reading about “The Simpler Syntax Hypothesis”, which was published a couple of years ago; the first chapter is available on the internet for free download. Another good book is the one that Jan pointed out last week, “Basic English Syntax with Exercises” by Mark Newson, which can be downloaded in its entirety (though personally, I think that chapter 3 is rubbish).