

Goodness of Fit metric for matching keywords to text.

A useful formula for ‘goodness of fit’ between keywords and text is this:

  GOF = (u/n) * (r/f)

Where n is the number of keywords, u is the number found in the text, r is the number of words consumed by the matching, and f is the number of words available (between the first and last match). This version of GOF is simply the ratio of consumed keywords multiplied by the ratio of consumed text words. I find that replacing n with max(n, 2) avoids over-weighting single-keyword cases.
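A minimal sketch of the formula in Python (the function name `gof` and the whitespace word-matching are my own choices, not from the thread):

```python
def gof(keywords, text, min_n=2):
    """Goodness of fit between a list of keywords and a text.

    GOF = (u/n) * (r/f), where
      n = number of keywords (floored at min_n to avoid over-weighting
          single-keyword cases),
      u = number of keywords found in the text,
      r = number of text words consumed by the matches,
      f = words available between the first and last match (inclusive).
    """
    words = text.lower().split()
    positions = []   # word indices covered by matches
    matched = 0      # u
    consumed = 0     # r
    for kw in keywords:
        kw_words = kw.lower().split()
        # a "keyword" may be several words long; scan for the whole phrase
        for i in range(len(words) - len(kw_words) + 1):
            if words[i:i + len(kw_words)] == kw_words:
                matched += 1
                consumed += len(kw_words)
                positions.extend(range(i, i + len(kw_words)))
                break
    if not positions:
        return 0.0
    n = max(len(keywords), min_n)
    f = max(positions) - min(positions) + 1  # span between first and last match
    return (matched / n) * (consumed / f)
```

For example, `gof(["quick", "lazy dog"], "the quick brown fox jumps over the lazy dog")` matches both keywords (u = 2, n = 2) and consumes 3 of the 8 words spanned, giving 0.375.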

Since “keywords” may be several words long, the formula will give different answers depending on which words are grouped as single “keywords”. In general for sets of words A and B, this is of the form:

  (|A and B|/|A|) * (|A and B|/|B|)

Although not intended for text-to-text matching, one could use the original text’s words as ‘keywords’ and use the formula to match the new text. Different groupings of words give different results, but this is worth thinking about. I am ignoring the “devil in the details” of inter-word distance, dull words to skip, extra text before and after, etc. If you are interested, let’s discuss.


  [ # 1 ]

Thanks. I’m working on a TrumpBot that scrapes his tweets; then maybe I can compare the input text’s fit to each tweet and respond with the tweet that has the best goodness of fit to the input? Another idea: collect internet posts that solicit the same basic response from me, then use goodness of fit to compare unseen posts to the posts I replied to, so I can reply again. Then I can troll beyond the grave!


  [ # 2 ]

Do it. You may also consider word substitutions to spice things up a bit.


  [ # 3 ]

A couple of details.

Dull words like “a, the, an, did” can be removed from the text to increase the “signal”. Also consider removing conjunctions like “and, also, then, but”, or consider splitting the text at such “control” words. You get a sequence of scores that can be averaged.

For the most part, word-to-word distance can be ignored between control words.
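A sketch of that preprocessing step, using exactly the dull and control words listed above (the per-clause scorer here is a plain keyword-overlap stand-in, not the full GOF formula):

```python
STOP_WORDS = {"a", "the", "an", "did"}          # dull words to drop
CONTROL_WORDS = {"and", "also", "then", "but"}  # split points

def clauses(text):
    """Split text at control words, dropping dull words within each clause."""
    parts, current = [], []
    for w in text.lower().split():
        if w in CONTROL_WORDS:
            if current:
                parts.append(current)
            current = []
        elif w not in STOP_WORDS:
            current.append(w)
    if current:
        parts.append(current)
    return parts

def averaged_score(keywords, text):
    """Score each clause separately and average the scores."""
    kw = {k.lower() for k in keywords}
    scores = []
    for clause in clauses(text):
        hits = sum(1 for w in clause if w in kw)
        scores.append(hits / max(len(kw), 2))
    return sum(scores) / len(scores) if scores else 0.0
```

So `"the dog ran and then the cat slept"` splits into the clauses `["dog", "ran"]` and `["cat", "slept"]`, each scored on its own.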

If you’re in Python, I could write a compare(textA, textB) as part of advertising my GitHub project. (Of course, handling negations is more complicated. :) )


  [ # 4 ]

With real-world data, I’m wondering how much the math transformations add over simple counts of words. I have big plans for synonyms, because my logic agent does synonyms. :)

The thing I’m running into scraping tweets is encoding problems: some characters, like the apostrophe, end up with a strange final encoding that turns them into a string of line noise. Also, I want to update the tweets as the tweeter adds new ones, so I want to automate the scraping. Then there are lots of links in the tweets. How do those affect the percentages in the GOF formula you provided?
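The apostrophe-to-line-noise symptom is the classic sign of UTF-8 text mis-decoded as cp1252, and links inflate the word count f in the GOF formula. A rough sketch of both cleanups (the `ftfy` library handles mojibake far more robustly; this is a minimal stand-in):

```python
import re

URL_RE = re.compile(r"https?://\S+")

def strip_links(text):
    """Drop URLs so they don't count as 'words available' (f) in GOF."""
    return " ".join(URL_RE.sub("", text).split())

def fix_mojibake(s):
    """Heuristic repair for UTF-8 text mis-decoded as cp1252, the usual
    cause of an apostrophe becoming line noise like 'â€™'."""
    try:
        return s.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s  # not that failure mode; leave the text alone
```

For example, `fix_mojibake("donâ€™t")` recovers `"don’t"`, and text that is already clean passes through unchanged.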

Anyway I’m still in the early stages of downloading tweets and figuring out how to update automatically. I’m practicing on Loebner’s tweets. :) Dump all his web writings into a database, then figure out how to trigger appropriate responses to input ... LoebnerBot!

Edit: If you wanted to provide some pseudo-code or Python code for a function that would yield a GOF score for two arbitrary strings of text, that would be great.
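One way to sketch that requested two-string function, treating each word of the first text as a single-word ‘keyword’ as suggested in the opening post (the function name and the set-based matching are my own assumptions):

```python
def gof_compare(text_a, text_b):
    """Text-to-text goodness of fit: each word of text_a acts as a keyword.

    GOF = (u/n) * (r/f): u = distinct words of A found in B,
    n = max(distinct A words, 2), r = B words consumed,
    f = B words spanned between the first and last match.
    """
    a_set = set(text_a.lower().split())
    b_words = text_b.lower().split()
    hit_positions = [i for i, w in enumerate(b_words) if w in a_set]
    if not hit_positions:
        return 0.0
    u = len(a_set & set(b_words))                 # distinct keywords found
    n = max(len(a_set), 2)
    r = len(hit_positions)                        # B words consumed
    f = hit_positions[-1] - hit_positions[0] + 1  # span in B
    return (u / n) * (r / f)
```

So `gof_compare("dog park", "the dog ran to the park")` finds both words (u = 2, n = 2) across a 5-word span (r = 2, f = 5), giving 0.4. Note the score is not symmetric in its arguments.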


  [ # 5 ]

I have run into strange characters when scraping web pages. I tried to treat them as synonyms.

About writing that function: I am hoping to make it a priority.

I would be interested to hear more about your synonym structures. In the GOF formula above “keyword” means something represented by a list of synonyms. So the formula is intended for synonym-text matching rather than text-text matching.


  [ # 6 ]

So I have a bot now that I can add strings to (I imagine internet posts), and label them as something:

  “blah blah blah” = Foo

and then I can have responses:

  Foo provokes “nyah nyah nyah”

I start by downloading internet comment threads with a provocative post that I (say) responded to.

The bot reads the thread and tells itself:

“Provocative post” = Category A

Then it reads the response and tells itself:

Category A provokes “Response post”

When I get some new input, I categorize it, perhaps using a Goodness of Fit algorithm to place it in the same category as the nearest (by Goodness of Fit) post that was already read?

The bot has a dialogue with itself:

“New post” = Category A

What does Category A provoke?

Category A provokes “Response post”

Thus I can recycle my old posts, automatically using them to reply to new posts that say substantially the same thing as posts I already replied to ...

I guess I need a Text Categorizer. I wonder how well Goodness of Fit can categorize text posts ...

(I’m still working on the details of getting the bot to associate strings with categories and what they provoke, but if all goes well I’ll be thinking about how to categorize the strings soon ...)
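The nearest-post classification step described above could be sketched like this (the names and the crude overlap score, which is the set form (|A∩B|/|A|)·(|A∩B|/|B|) from the opening post, are my own assumptions):

```python
def overlap_score(text_a, text_b):
    """Set form of the fit score: (|A and B|/|A|) * (|A and B|/|B|)."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) ** 2 / (len(a) * len(b))

def categorize(new_post, labeled_posts):
    """Assign new_post the category of the nearest labeled post.

    labeled_posts is a list of (text, category) pairs; 'nearest' means
    highest score under the fit metric.
    """
    best = max(labeled_posts, key=lambda tc: overlap_score(new_post, tc[0]))
    return best[1]
```

Once categorized, the bot looks up what that category provokes and replies with the stored response.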

Edit: synonyms come in because I can make whole posts synonymous, or I can have individual words be synonyms: “synonym is a synonym for equivalent word”. Then if “equivalent word” appears in one text and “synonym” in the other, I can count those two occurrences as the same for the purpose of counting words.
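That word-level synonym counting amounts to normalizing each text to canonical forms before counting. A sketch, with a toy synonym table (the table contents and function name are illustrative assumptions):

```python
SYNONYMS = {
    "equivalent word": "synonym",  # map each variant to a canonical form
    "identical": "same",
}

def normalize(text):
    """Replace known synonyms by their canonical form before counting words.

    Longer (multi-word) variants are replaced first, so 'equivalent word'
    collapses to one token rather than two unrelated ones.
    """
    out = text.lower()
    for variant, canonical in sorted(SYNONYMS.items(), key=lambda kv: -len(kv[0])):
        out = out.replace(variant, canonical)
    return out.split()
```

After normalizing both texts, “equivalent word” in one and “synonym” in the other count as the same word.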


  [ # 7 ]

Hi Robert. Glad we are still talking about this. 

I am trying to follow and may need to think about it. If I understand, you have a collection of known examples in Category A (and maybe you also have Category B, C, and others) and want to categorize a new example and use a standard response, one per category.

GOF (goodness of fit) is exactly what you need when you have enough synonyms and want to categorize new input by topic. But building up synonym lists can take time and that is one of the problems to solve.

You might also want to categorize, not by topic, but by style of expression/manner of speaking. For example, Bernie Sanders uses the same phrases over and over with different topics plugged in, like “going forward with X” with X as a variable. [I try not to listen to Trump, but I am sure he uses a very limited set of words and phrases.] I could try writing some code for this, if it is needed. It might be one of the ways to discover synonyms ... by adding X to a list somewhere.
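Harvesting the X values from a repeated phrase frame could look like this (the regex and the one-or-two-word slot are illustrative assumptions, using the “going forward with X” example above):

```python
import re

# Phrase frame with one variable slot X (one or two words, as an assumption).
PATTERN = re.compile(r"going forward with (\w+(?:\s\w+)?)", re.IGNORECASE)

def harvest_slot(texts):
    """Collect the X values plugged into a repeated phrase frame.

    Repeated fillers of the same frame are candidate topics or synonyms
    to add to a list somewhere.
    """
    found = []
    for t in texts:
        found.extend(m.group(1) for m in PATTERN.finditer(t))
    return found
```

Running it over a pile of speeches yields the list of plugged-in topics, which can seed a synonym or topic list.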

I am almost done with a code speedup and then I hope to be able to focus on the details of what you are talking about.


  [ # 8 ]

To address something you said earlier about simple word counting: you are right, and the GOF formula is just a way to combine the counts. On the other hand, there is a little hidden magic in it because of the way it counts word combinations.


  [ # 9 ]

Hey Peter, I am still working on some devilish details involved with inputting text to my bot, and so I haven’t yet gotten to the part where I need the GOF metric. When I do I want to revisit this thread ...


  [ # 10 ]

Me too. I want to try establishing email communication.


  [ # 11 ]

I believe the state of the art in search engine metrics is Okapi BM25 (BM stands for Best Matching).


  [ # 12 ]

Hi Merlin:
Thanks for the link. One thing about my formula is that I understand it. :)

I note that BM25 is organized around large documents that have many irrelevant terms. Then I read (and stop at) this from Wikipedia:

“BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms [keywords] appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity)”

By contrast, the above GOF formula puts strong weight on “relative proximity” within a single short bit of text - like single sentences. And, for what it is worth, GOF is an absolute score not a relative ranking.

I assume the ability to rank large numbers of long texts (“documents”) is advantageous to search engines. Personally, I have trouble with search engines that no longer define “relevance” in terms of word proximity - but that is a different rant.

- Peter W


  [ # 13 ]

I agree that a Bag Of Words (BOW) is limited.
The “dog bites man” vs. “man bites dog” issue: two very different things.

Some people encode multiple words or word positions; of course, this adds complexity.
I tend to use a Bag Of Phrases (BOP) when I need it.
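A Bag Of Phrases can be sketched as counting overlapping n-grams instead of single words (the function name and n = 2 default are my own choices):

```python
def bag_of_phrases(text, n=2):
    """Bag Of Phrases: count overlapping n-grams instead of single words,
    so 'dog bites man' and 'man bites dog' no longer look identical."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = {}
    for g in grams:
        counts[g] = counts.get(g, 0) + 1
    return counts
```

Under this scheme, “dog bites man” and “man bites dog” share every word but not a single bigram, so the word-order distinction survives.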

Another popular search/query metric is term frequency–inverse document frequency (tf–idf).

Often, “stop words” are filtered out before the metric.
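A bare-bones tf–idf with stop-word filtering, to make the combination concrete (the stop-word list and natural-log idf are illustrative assumptions; real systems use richer variants):

```python
import math

STOP_WORDS = {"the", "a", "an", "and", "of", "to"}

def tf_idf(docs):
    """Plain tf-idf over a list of documents, filtering stop words first.

    Returns one {word: weight} dict per document. idf = ln(N / df) gives
    weight 0 to words appearing in every document.
    """
    tokenized = [[w for w in d.lower().split() if w not in STOP_WORDS]
                 for d in docs]
    n_docs = len(tokenized)
    df = {}  # document frequency of each word
    for doc in tokenized:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    weighted = []
    for doc in tokenized:
        scores = {}
        for w in doc:
            tf = doc.count(w) / len(doc)  # term frequency within the document
            scores[w] = tf * math.log(n_docs / df[w])
        weighted.append(scores)
    return weighted
```

Words that appear in every document (like “sat” in the test below) get weight 0, while distinguishing words keep a positive weight.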

