A useful formula for ‘goodness of fit’ between keywords and text is this:

**GOF = (u/n) * (r/f)**

where n is the number of keywords, u is the number of keywords found in the text, r is the number of text words consumed by the matching, and f is the number of text words available (between first and last match). This version of GOF is simply the fraction of keywords matched, multiplied by the fraction of available text words consumed. I find that replacing n by **max(n, 2)** avoids overweighting single-keyword cases.
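To make the definitions concrete, here is a minimal Python sketch of the formula. The function name `gof` and several details are my assumptions for illustration, not part of the formula itself: words are split on whitespace and compared case-insensitively, each keyword takes its first non-overlapping match, and f counts the words from the first matched word to the last matched word inclusive.

```python
def gof(keywords, text):
    """Goodness of fit: (u / max(n, 2)) * (r / f).

    Assumptions (illustrative only): whitespace tokenisation,
    case-insensitive comparison, first non-overlapping match per keyword.
    """
    words = text.lower().split()
    n = len(keywords)
    consumed = set()  # indices of text words used by some match
    u = 0             # number of keywords found in the text

    for keyword in keywords:
        parts = keyword.lower().split()  # a "keyword" may be several words
        if not parts:
            continue
        for i in range(len(words) - len(parts) + 1):
            span = range(i, i + len(parts))
            if words[i:i + len(parts)] == parts and consumed.isdisjoint(span):
                consumed.update(span)
                u += 1
                break

    if u == 0:
        return 0.0

    r = len(consumed)                      # text words consumed
    f = max(consumed) - min(consumed) + 1  # words from first to last match
    return (u / max(n, 2)) * (r / f)


print(gof(["quick", "fox"], "the quick brown fox jumps"))
print(gof(["fox"], "the fox"))
```

In the first call both keywords are found (u/n = 1) but "brown" sits unmatched between them, so r/f = 2/3. In the second call the single keyword matches perfectly, yet max(n, 2) caps the score at 0.5.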

Since “keywords” may be several words long, the formula will give different answers depending on which words are grouped as single “keywords”. In general, for sets of words A and B, the formula has the form:

(|A and B|/|A|) * (|A and B|/|B|)
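The set form is easy to try out directly. The sketch below is only an illustration under my own assumptions: the name `set_gof` is hypothetical, A and B are treated as plain Python sets of single words, and an empty set scores zero.

```python
def set_gof(a, b):
    """(|A and B| / |A|) * (|A and B| / |B|) for two collections of words."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # assumption: an empty set gives no fit
    common = len(a & b)  # size of the intersection
    return (common / len(a)) * (common / len(b))


keywords = {"red", "fox"}
text_words = {"the", "red", "fox", "ran"}
print(set_gof(keywords, text_words))  # (2/2) * (2/4) = 0.5
```

The score reaches 1 only when the two sets are identical, and it drops as either side contains words the other lacks.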

Although the formula is not intended for text-to-text matching, one could use the original text's words as ‘keywords’ and apply the formula to match the new text. Different groupings of words give different results, but this is worth thinking about. I am ignoring the “devil in the details” of inter-word distance, dull words to skip, extra text before and after, etc. If you are interested, let’s discuss.