Steve Worswick - Jul 10, 2011:
Merlin - I would strongly suggest that if that is the case, the method of judging this year is unfair.
You might be right. Since I don’t know the details I can’t say, but psychologists and marketing guys go through a lot of training on how to make survey and research work unbiased. As you know I spent some time over the last few years looking at how bot contests should work. What I found is that if you use the wrong methodology, for gross rankings it usually does not matter. But, for close rankings (like in this case), it could make a world of difference.
Unlike the CBC, scores are not awarded for right or wrong answers but for a subjective judgment: "Most Human-like".
Steve Worswick - Jul 10, 2011:
Take for example, the “what is your name” question:
8pla - My name is 8pla .
Adam-L - Adam Harris.
ALICE - My name is ALICE.
ChipVivant - Chip.
Cleverbot - No, you didn’t ask me, however you have asked me now so I shall tell you. My name is Nameless.
Eugene_Goostman - Call me Eugene. I am glad to talk to you!
Mitsuku - My name is Mitsuku .
Rosette - Chris
SEARS - Tell me more^13
Trane - trane
Tutor - My name is Robert.
Ultrahal - My name is Steve.
Zoe - You know who I am! This is a trick question.
Every bot apart from SEARS and Zoe answered correctly. Are you saying that only 4 would have got points for this? Surely, all 11 who answered correctly should have been awarded a vote? They might as well have pulled the names out of a hat!
This is a great example since a variation of the question shows up in 2 places.
"They were then asked to determine which 4 answers were most human-like and to enter the number of the best entries into their audience participation handsets."
I assume the handsets were somewhat restrictive, since it was mentioned earlier that they couldn't handle more than 10 entries. If they also did not allow ties and you had to pick ranks 1, 2, 3 and 4, then you could get something like this:
Everyone thinks "My name is . . ." is the most human response, but ranks 1 through 4 still have to be assigned. Some judges might have voted like this even if they could do ties.
ALICE - My name is ALICE.
Tutor - My name is Robert.
Ultrahal - My name is Steve.
Mitsuku - My name is Mitsuku .
Cleverbot - No, you didn’t ask me, however you have asked me now so I shall tell you. My name is Nameless.
Then selection would be random or subject to subtle biases.
Cleverbot could be last since it didn’t really give a name.
“My name is Mitsuku .” could be ranked fourth because people may not have ever met a Mitsuku and there is a ‘space’ between the name and the period.
“My name is ALICE.” could be third because the name is in all caps.
That leaves "My name is Steve." or "My name is Robert." for first and second. Given a 50/50 chance at either spot, if Robert lucked out and came in first and Steve came in second, the scores would be:
4pts-Tutor - My name is Robert.
3pts-Ultrahal - My name is Steve.
2pts-ALICE - My name is ALICE.
1pt-Mitsuku - My name is Mitsuku .
That's a 3pt difference on this question between first and last place.
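To make that arithmetic concrete, here is a quick Python sketch of the 4-3-2-1 point spread described above. This is my own illustration of the scheme as I understand it, not the actual contest software.

```python
# Sketch of a 4-3-2-1 ranked ballot: each judge picks 4 bots in order,
# and rank 1 earns 4 points, rank 2 earns 3, rank 3 earns 2, rank 4 earns 1.
# Hypothetical helper, not anything from the Loebner contest itself.

def score_ballot(ranking):
    """ranking: list of up to 4 bot names, best first."""
    points = {}
    for rank, bot in enumerate(ranking):
        points[bot] = 4 - rank  # 1st -> 4pts, 2nd -> 3pts, 3rd -> 2pts, 4th -> 1pt
    return points

# The "Robert lucked out" ordering from above:
ballot = ["Tutor", "Ultrahal", "ALICE", "Mitsuku"]
print(score_ballot(ballot))
# {'Tutor': 4, 'Ultrahal': 3, 'ALICE': 2, 'Mitsuku': 1}
```

Note that any bot not in a judge's top 4 (Cleverbot, here) gets nothing at all from that ballot, which is exactly why a coin-flip ordering can open a 3pt gap.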
Now assume instead that the judges can vote ties, but still only for the top 4. If more people thought giving just a name was the most human response:
ChipVivant - Chip.
Rosette - Chris
(4+3)pts/2 = 3.5pts per bot (this might be likely, since these were two of the top bots)
Next most human response, “My name is. . .”
Tutor - My name is Robert.
Ultrahal - My name is Steve.
ALICE - My name is ALICE.
Mitsuku - My name is Mitsuku .
Cleverbot - No, you didn’t ask me, however you have asked me now so I shall tell you. My name is Nameless.
(2+1)pts/5 = 0.6pts per bot
A 2.9pt difference on this question.
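The tie-splitting arithmetic above can also be sketched in a few lines of Python. Again, this is purely my own illustration of "tied bots share the points their rank positions would have earned", under the assumption of a 4-3-2-1 spread.

```python
# Sketch of tie-aware scoring: bots tied at a rank split the points that
# their rank positions would have earned. Hypothetical helper, not contest code.

def score_with_ties(groups, points=(4, 3, 2, 1, 0, 0, 0, 0, 0, 0)):
    """groups: list of lists; each inner list is a set of tied bots,
    ordered best group first."""
    scores = {}
    pos = 0
    for group in groups:
        share = sum(points[pos:pos + len(group)]) / len(group)
        for bot in group:
            scores[bot] = share
        pos += len(group)
    return scores

# Two bots tied for 1st/2nd, five bots tied behind them:
scores = score_with_ties([
    ["ChipVivant", "Rosette"],
    ["Tutor", "Ultrahal", "ALICE", "Mitsuku", "Cleverbot"],
])
print(scores["ChipVivant"], scores["Tutor"])
# 3.5 0.6
```

The first group splits 4+3 = 7pts two ways (3.5 each) and the second splits the remaining 2+1 = 3pts five ways (0.6 each), matching the figures above.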
Now add in the "Russian Judge Problem". If the first 8 judges voted in some pattern where most of the bots were tied, the last judge alone could decide:
Best- 4pts-
Zoe - You know who I am! This is a trick question.
2nd (2-way tie)- 2.5pts each ((3+2)/2)
ChipVivant - Chip.
Rosette - Chris
4th (5-way tie)- 0.2pts each (1pt/5)
Tutor - My name is Robert.
Ultrahal - My name is Steve.
ALICE - My name is ALICE.
Mitsuku - My name is Mitsuku .
Cleverbot - No, you didn’t ask me, however you have asked me now so I shall tell you. My name is Nameless.
When scoring is tight, outliers have a huge influence. This is why in some sports the high and low scores are discarded. Moral of the story: how you run the contest influences the results.
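The sports practice of discarding the high and low scores is easy to sketch as well. This is just an illustration of the general technique (a trimmed mean), not something the Loebner contest actually does.

```python
# Illustrative sketch of trimmed scoring: drop each bot's single highest and
# lowest judge score before averaging, so one outlier judge (the "Russian
# Judge Problem") can't swing a tight result on their own.

def trimmed_mean(scores):
    """scores: one bot's points from every judge (needs at least 3 judges)."""
    trimmed = sorted(scores)[1:-1]  # discard the lowest and highest score
    return sum(trimmed) / len(trimmed)

# Eight judges give a bot 0 points, the ninth gives it 4:
print(trimmed_mean([0, 0, 0, 0, 0, 0, 0, 0, 4]))  # 0.0 - outlier discarded
print(sum([0, 0, 0, 0, 0, 0, 0, 0, 4]) / 9)       # plain mean keeps the outlier
```

With the plain mean the lone 4pt vote still shifts the result; with the trimmed mean it vanishes entirely, which is the point of the technique.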
In the case of the Loebner contest, I don’t know if you could take an interesting, well performing chatbot (like Mitsuku) and have it do well. It may require a “dumbed down” or custom version to be really successful.