Erwin, Dave, dear all,
I agree we should create a new chatbot challenge; I have some ideas to share, so here is a point-wise list.
Chatbots are difficult to evaluate: merely fooling a human is not a good criterion, and the opposite extreme, making them mimic a Query & Answer Engine with pre-defined questions, is not fair either!
I think the agents deserve better and richer scoring, and the final score should be a balanced mean of all the individual scores (a small sketch of this aggregation follows the platform criteria below).
Here are the points on which, in my opinion, I would evaluate this.
Intelligent Agent Platform
Based on technical specifications such as manuals, tech sheets, public information, etc.
1 - Quality of the platform (how easy it is to code a given behavior, answer, or analysis)
2 - Multilingual capabilities of the platform
3 - Extensibility of the agent's data access (native = built in), i.e. how it interfaces with databases, web services, and other data sources
4 - Flexibility of the pattern-matching mechanisms (whether hard-coded, AI-trainable, plug-in based, etc.)
5 - Natural language capabilities (analysis and generation)
6 - Speed of response, memory footprint, multi-user capability, session memory, number of concurrent users, etc.
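To make the balanced-mean idea concrete, here is a minimal sketch in Python (the criterion names, the 0-10 scale, and the equal weights are only placeholders of mine, open to discussion):

# Each platform criterion (1-6 above) gets a judge score from 0 to 10.
PLATFORM_WEIGHTS = {
    "platform_quality": 1.0,   # criterion 1
    "multilingual": 1.0,       # criterion 2
    "data_access": 1.0,        # criterion 3
    "pattern_matching": 1.0,   # criterion 4
    "natural_language": 1.0,   # criterion 5
    "performance": 1.0,        # criterion 6: speed, memory, concurrency
}

def platform_score(scores, weights=PLATFORM_WEIGHTS):
    """Weighted (balanced) mean of the criterion scores."""
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total

print(platform_score({"platform_quality": 8, "multilingual": 6,
                      "data_access": 7, "pattern_matching": 9,
                      "natural_language": 7, "performance": 8}))  # 7.5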
Then I would evaluate a set of different behaviors of actual implementations of several different types of agents (targets).
A) General chat (entertainment agent)
B) Specific Purpose Agent (targeted agent)
C) Query Answering Agent (based on a specific context, working as a help desk)
D) Artificial Intelligence, Inference and Cognitive Capabilities
Multilingual
English, Spanish, and any other language are welcome, as long as we get at least 3 judges for it.
For all of them, we should not only look at the quality of the responses but rather measure the quality of the conversational behavior: how the bot acts upon mistakes and misunderstandings during the conversation, and how the chat flows.
To achieve this, I suggest a different task/test for each one.
A) General chat (entertainment agent)
Specify a free conversation with no turn limits (only a time limit of a few minutes, e.g. 5).
The judge should talk to each agent freely, targeting only a few pre-stated subjects such as money and finance, work, human and family talk, nature, math and logic, sentiment matters, etc.
The score should be based on the judge's belief about each conversation turn, classified as one of:
1 - The agent understood/recognized the entry, giving a good response or taking a successful initiative.
2 - The agent missed the entry but successfully continued the conversation, holding the theme or context.
3 - The agent successfully rephrased the entry and tried to understand it by asking for clarification or suggesting something.
4 - Total failure (the agent didn't get a clue).
5 - The agent's answer was unexpected; it might be trying to evade the fact that it didn't understand.
6 - The agent got bad entries from the judge and tried to guess what on earth was said to it, or even answered them correctly!
At the final stage, an F-score-like measure would be computed from these per-turn counts (a sketch follows).
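As a sketch of what that F-score-like measure could look like, assuming (my choice, open to discussion) that codes 1, 3, and 6 count as turns the agent truly coped with, and everything except code 4 counts as the conversation still flowing:

from collections import Counter

def f_like_score(turn_outcomes):
    """Harmonic mean of an 'understanding' rate and a 'continuity' rate
    over a list of per-turn outcome codes (1-6, as defined above)."""
    n = Counter(turn_outcomes)
    total = sum(n.values())
    if total == 0:
        return 0.0
    understanding = (n[1] + n[3] + n[6]) / total  # agent truly coped with the entry
    continuity = (total - n[4]) / total           # conversation did not break down
    if understanding + continuity == 0:
        return 0.0
    return 2 * understanding * continuity / (understanding + continuity)

print(f_like_score([1, 1, 2, 3, 1, 5, 4, 1, 6, 2]))  # 0.72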
B) Specific Purpose Agent
There should be a target to meet: the bot would be asked to fulfill something simple, like getting some information from the judge, and the judge should be allowed to make mistakes, mistype, answer badly, or even be rude. The agent should be able to overcome these obstacles in a polite, correct way and get the goal done.
The score should be based on (a scoring sketch follows this list):
a) Number of turns to get the goal done.
b) Quality (subjective) of the way the agent treated the human (0: bad, 1: difficult, 2: normal, 3: good, 4: very good)
c) Number of correct and failed turn-pairs
d) Robustness (number of good interpretations of the judge's mistyped or erroneous entries)
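A possible way to blend (a)-(d) into one number; again only a sketch, where the turn budget max_turns, the normalizations, and the equal weighting are assumptions of mine:

def specific_purpose_score(turns_to_goal, quality,
                           correct_pairs, failed_pairs,
                           robust_recoveries, judge_errors,
                           max_turns=20):
    """Equal-weight mean of the sub-scores (a)-(d), each mapped to 0..1."""
    # (a) fewer turns is better, against an assumed budget of max_turns
    efficiency = max(0.0, 1.0 - turns_to_goal / max_turns)
    # (b) the 0-4 subjective treatment scale, rescaled to 0..1
    treatment = quality / 4.0
    # (c) fraction of turn-pairs that went correctly
    pairs_total = correct_pairs + failed_pairs
    pairs = correct_pairs / pairs_total if pairs_total else 0.0
    # (d) fraction of the judge's bad/mistyped entries interpreted well
    robustness = robust_recoveries / judge_errors if judge_errors else 1.0
    return (efficiency + treatment + pairs + robustness) / 4.0

print(specific_purpose_score(turns_to_goal=8, quality=3,
                             correct_pairs=6, failed_pairs=2,
                             robust_recoveries=2, judge_errors=3))  # ~0.69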
C) Query Answering (specific knowledge)
The bot-makers would get a piece of reference material (text) that contains, or allows deducing, the answers.
The judges would have a number of specific goals, like obtaining certain responses; they can ask in whatever way they like, even over multiple turns, allowing the agent to refine the questions.
The score might be based upon (see the sketch after this list):
a) Number of turns to achieve each goal (or to give up on it)
b) Number of goals achieved successfully
c) Quality (subjective) of the way the agent treated the human (0: bad, 1: difficult, 2: normal, 3: good, 4: very good)
d) Robustness (number of good interpretations of the judge's mistyped or erroneous entries)
e) Whether the processing of the reference material was unattended, semi-supervised, or manually supervised by the botmaster
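Here is a sketch of how (a)-(e) could be aggregated per goal, with criterion (e) as a discount factor; the factors, weights, and field names are my placeholders:

from dataclasses import dataclass

# Assumed discounts for criterion (e): unattended processing of the
# reference material is rewarded over manual curation by the botmaster.
SUPERVISION_FACTOR = {"unattended": 1.0, "semi-supervised": 0.9, "manual": 0.8}

@dataclass
class GoalResult:
    achieved: bool          # criterion (b), per goal
    turns: int              # turns spent on this goal, criterion (a)
    quality: int            # 0-4 treatment scale, criterion (c)
    robust_recoveries: int  # criterion (d) numerator
    judge_errors: int       # criterion (d) denominator

def query_answering_score(goals, supervision, max_turns=10):
    """Mean per-goal score, scaled by the achieved rate (b) and by (e)."""
    if not goals:
        return 0.0
    per_goal = []
    for g in goals:
        efficiency = max(0.0, 1.0 - g.turns / max_turns) if g.achieved else 0.0
        treatment = g.quality / 4.0
        robustness = g.robust_recoveries / g.judge_errors if g.judge_errors else 1.0
        per_goal.append((efficiency + treatment + robustness) / 3.0)
    achieved_rate = sum(g.achieved for g in goals) / len(goals)
    return achieved_rate * (sum(per_goal) / len(per_goal)) * SUPERVISION_FACTOR[supervision]

print(query_answering_score(
    [GoalResult(True, 4, 3, 1, 1), GoalResult(False, 10, 2, 0, 0)],
    supervision="semi-supervised"))  # ~0.29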
D) Artificial Intelligence, Inference and Cognitive Capabilities
This is the most challenging part: the agent should be able to do some reasoning about relations, resolve anaphoric references, have human-like memory (even forgetting things), associate memories, and deduce new facts and relations.
It should be able to ask for missing information to achieve a goal such as an answer, or even find out what is missing or wrong in a statement. For example, the judge might tell the agent a story, prompted by the agent's requests, and the agent should be able to follow the conversation successfully and answer some of the judge's questions, or spontaneously deduce associations or new discoveries (a possible way to record such a session is sketched below). How the score would be computed is complicated, and I have not thought about it; if anyone could help, welcome!
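Since the scoring for this part is still open, here is only a sketch of how such a story session could be recorded for later scoring; the field names and the example are illustrative, not a proposal:

from dataclasses import dataclass, field

@dataclass
class InferenceProbe:
    question: str          # e.g. an anaphora-resolution or deduction question
    expected: str          # an answer deducible from the story
    agent_answer: str = ""

@dataclass
class StorySession:
    story: str                       # the story the judge tells the agent
    probes: list = field(default_factory=list)
    spontaneous_deductions: int = 0  # associations the agent offered unprompted

# Example: a tiny anaphora-resolution probe.
session = StorySession(
    story="Anna gave her brother Tom a book because he loves reading.",
    probes=[InferenceProbe(question="Who loves reading?", expected="Tom")],
)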
I hope this helps us get a better challenge!
P.S.:
In my opinion, the judges should also be botmasters, because they know how difficult it is to achieve each challenge!
Obviously, judges won't judge their own bots (an ethical conflict), but they may serve as judges for the others.
Agents should be anonymous to the judges, and there should be no distinctive question that tells one agent from another.
I am also willing to participate with my English-Spanish agent.