Posted: Jul 28, 2016 [ # 61 ]
Guru
Total posts: 1009
Joined: Jun 13, 2013
I may have an explanation for Denis’ score: only 19 of the 60 pronouns referred to the candidate nearest to the pronoun. So if you chose the grammatically correct default instead of guessing completely at random, you’d get about 30% correct. Is that what happened?
Either way, this means that this test penalises a program that is usually correct in normal everyday practice more than it penalises random guesswork. The abundance of other-than-nearest answers also suggests that you could beat guesswork with reverse psychology: when in doubt, pick one of the answers that is not the nearest. 32 times the answer was “A”.
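To make that arithmetic concrete, here is a toy calculation (with one assumption of mine: exactly two candidates per question, whereas the real test sometimes offered more):

# Expected scores for simple baselines on a 60-question test
# where only 19 answers are the nearest candidate.
TOTAL = 60
NEAREST_CORRECT = 19

def expected_score(picks_nearest):
    # Fraction correct if you always pick the nearest candidate,
    # or always pick the other one.
    if picks_nearest:
        return NEAREST_CORRECT / float(TOTAL)
    return (TOTAL - NEAREST_CORRECT) / float(TOTAL)

print("always nearest: %.1f%%" % (100 * expected_score(True)))   # 31.7%
print("always other:   %.1f%%" % (100 * expected_score(False)))  # 68.3%

So the grammatical default lands near 30%, while coin-flipping over two candidates would sit around 50%.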
I’m still analysing, but it does look like Merlin’s hypothesis is correct: my results show a distinct difference between the solutions to ordinary prose and to Winograd schemas.
Posted: Jul 28, 2016 [ # 62 ]
Administrator
Total posts: 3111
Joined: Jun 14, 2010
Thanks for that, Don. I don’t have a bot in this race (I haven’t been able to work on Morti for quite a while now, I’m afraid), but I’ve been following this thread silently since its beginning, and I’m wondering if this may be a situation where the bulk of the questions represent “edge cases”. If so, I’m further wondering whether there might be a way to detect that a given question falls into that category, and if so, perhaps treat it as such. I’d have to dig into the data a lot before I could make a determination, and I simply don’t have the time, but if someone who has already explored the questions had the time and/or interest in checking, that might be a good thing (nudge nudge, wink wink…)
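Purely as a sketch of what I mean by detecting edge cases (the scoring function here is a stand-in, not anyone’s actual entry): flag a question whenever a resolver’s top two candidates are nearly tied, and route those to a different strategy.

def is_edge_case(scores, margin=0.1):
    # scores: dict mapping candidate -> confidence from some resolver.
    ranked = sorted(scores.values(), reverse=True)
    return len(ranked) > 1 and (ranked[0] - ranked[1]) < margin

def answer(scores, fallback):
    # Use the normal top pick, unless the question looks like an edge case.
    if is_edge_case(scores):
        return fallback(scores)  # e.g. Don's pick-the-non-nearest trick
    return max(scores, key=scores.get)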
Posted: Jul 28, 2016 [ # 63 ]
Guru
Total posts: 1081
Joined: Dec 17, 2010
You are right, Dave. I thought I had a couple of ways to handle the bulk of the questions, but I did not have the time to dig in deeply. Maybe between now and the time the Loebner finalists are announced I will dust off some code and run a test case to see how close I can get.
Posted: Jul 29, 2016 [ # 64 ]
Guru
Total posts: 1009
Joined: Jun 13, 2013
There were only a few questions that I thought of as doubtful edge cases, ones that relied on subtleties or mere assumptions. One could handle such edge cases by means of probability (see the sketch at the end of this post), though in normal practice I would just have my program ask.
The questions now also list the answers and how many humans got them correct (most to all).
2. Always before, Larry had helped Dad with his work. But he could not help him now, for Dad said that his boss at the railroad company would not want anyone but him to work in the office.
(Wouldn’t one expect Dad to return the favour? The second pronoun’s referent doesn’t become clear until we solve the fourth ambiguous pronoun simultaneously.)
18. All the buttons up the back of Dora’s plaid dress were buttoned outside-in. Maude should have thought to button her up; but no, she had left poor little Dora to do the best she could, alone.
(Does “to” mean “in order to do something”?)
34. Alice was dusting the living room and trying to find the button that Mama had hidden. No time today to look at old pictures in her favorite photo album. Today she had to hunt for a button, so she put the album on a chair without even opening it.
(Do children have photo albums? Or does “her” only apply to “favourite”?)
40. Every day after dinner Mr. Schmidt took a long nap. Mark would let him sleep for an hour, then wake him up, scold him, and get him to work. He needed to get him to finish his work, because his work was beautiful.
(Do we assume that Mark is his apprentice or employer?)
If you ask me, the main problem that everyone had is that they prepared for one thing and then were tested on another. I doubt that these people would have entered if they hadn’t already achieved scores over 65% on their own tests.
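As for handling the doubtful ones by probability, I mean something as simple as this (the priors are whatever statistics one trusts, answer-position frequencies or corpus counts; the numbers below are made up):

def resolve_doubtful(candidates, priors):
    # Pick the candidate with the highest prior probability.
    # candidates: list of antecedent strings; priors: string -> float.
    return max(candidates, key=lambda c: priors.get(c, 0.0))

# e.g. resolve_doubtful(["Mark", "Mr. Schmidt"], {"Mark": 0.6, "Mr. Schmidt": 0.4})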
Posted: Jul 29, 2016 [ # 65 ]
Guru
Total posts: 1081
Joined: Dec 17, 2010
It is a shame that something billed as the Winograd Schema Challenge did not test the entrants on Winograd schemas.
http://whatsnext.nuance.com/in-the-labs/winograd-schema-challenge-2016-results/
It would have been interesting to see how they did on the Winograd questions vs. pronoun disambiguation.
Posted: Jul 29, 2016 [ # 66 ]
Guru
Total posts: 1081
Joined: Dec 17, 2010
I took the morning and cleaned up part of the pronoun disambiguation code I was looking at for the challenge.
Although not above the 90% threshold, it does generate better results than any of the participants.
Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ==== RESTART ====
35 / 60 = 0.583333333333
Fri Jul 29 15:31:57 2016
>>>
If I have a chance this weekend, I’ll look into it more.
http://whatsnext.nuance.com/wp-content/uploads/winograd-schema-challenge-participants-results.png
http://whatsnext.nuance.com/in-the-labs/winograd-schema-challenge-2016-results/
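For the curious, the score line above just comes out of a loop of this shape (the question format and resolver are placeholders, not my actual code):

import time

def run_test(questions, resolver):
    # Count how many of the 60 questions the resolver answers correctly
    # and print the same kind of score line as above.
    correct = sum(1 for q in questions if resolver(q) == q["answer"])
    print("%d / %d = %s" % (correct, len(questions), correct / float(len(questions))))
    print(time.ctime())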
Posted: Jul 31, 2016 [ # 67 ]
Guru
Total posts: 1009
Joined: Jun 13, 2013
58.33% equals Quan Liu’s score (we wouldn’t want to discount a program on account of poorly punctuated input), which is still remarkable. I’m curious about your methods.
Here is my detailed account of the event and my program’s performance.
I’ve left out a number of gripes, as well as 7 malfunctions that I had with the XML interface, as they would not have affected the outcome.
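For anyone wondering what the XML interface involved: the questions arrived as XML, and stray punctuation in the feed is exactly what a strict parser chokes on. A defensive reader might look like this (the tag name is invented, since the real schema isn’t shown here):

import xml.etree.ElementTree as ET

def load_questions(path):
    # Parse the test file, falling back to a crudely cleaned copy
    # rather than crashing mid-contest.
    try:
        tree = ET.parse(path)
    except ET.ParseError:
        with open(path) as f:
            text = f.read().replace("&", "&amp;")  # one common culprit
        tree = ET.ElementTree(ET.fromstring(text))
    return tree.getroot().findall("question")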
Posted: Jul 31, 2016 [ # 68 ]
Guru
Total posts: 1081
Joined: Dec 17, 2010
Don,
I took another look at the code today.
Since I don’t usually give the detail behind my methods, I will have to think about it.
I did like your write-up.
But I have also improved to 61+%.
Run time is about 2 seconds.
Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
>>> Start: Sun Jul 31 17:38:23 2016
37 / 60 = 0.616666666667
End: Sun Jul 31 17:38:25 2016
>>>
Posted: Aug 1, 2016 [ # 69 ]
Guru
Total posts: 1009
Joined: Jun 13, 2013
A two-word description of the kind of approach would probably satisfy my curiosity. Just wondering whether you’re using hand-coded rules, an ontology, rote learning of verb combinations, or a neural network. That sort of thing.
Posted: Aug 1, 2016 [ # 70 ]
Guru
Total posts: 1081
Joined: Dec 17, 2010
A two-word description of the kind of approach: “Brain Dead”
No ontology, no rote learning, no dictionaries, no neural networks.
State-of-the-art results with just a few hand-coded rules.
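To give a flavour of what “a few hand-coded rules” can look like, without giving away my actual method, here is a toy resolver in the same brain-dead spirit: two cheap surface heuristics over the candidate antecedents, no parsing, no lexicon.

def resolve(sentence, pronoun, cand_a, cand_b):
    low = sentence.lower()
    words = low.split()
    # Rule 1: if a candidate is mentioned again after the pronoun,
    # prefer it; writers tend to restate the entity still in focus.
    if pronoun.lower() in words:
        tail = " ".join(words[words.index(pronoun.lower()) + 1:])
        for cand in (cand_a, cand_b):
            if cand.lower() in tail:
                return cand
    # Rule 2: otherwise take the first-mentioned candidate, which is
    # usually the grammatical subject of the sentence.
    return cand_a if low.find(cand_a.lower()) <= low.find(cand_b.lower()) else cand_b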
Posted: Aug 1, 2016 [ # 71 ]
Guru
Total posts: 1009
Joined: Jun 13, 2013
Considering which solutions worked best on my end, I find that entirely plausible.
Posted: Aug 2, 2016 [ # 72 ]
Senior member
Total posts: 308
Joined: Mar 31, 2012
“Does your software require the use of any special hardware that you would need to bring to IJCAI to run the tests?
If not, can you provide us with an executable ahead of time so that we can run on at least parts of the test beforehand? What would be your preferred way to get us an executable? (Note that the tests would probably be run on standard laptops: Macs or PCs). We have the room for the challenge only for the morning of the 12th and would probably not have time to test everyone during that period.”
############################################
To me this is unprofessional and unacceptable on several levels.
They want to “Test” your bot beforehand? What kind of questions would be posed and how would any results possibly sway any judges?
They state that not everyone will have time to be tested. What about the others?
What criteria exactly are they looking for during these “Pre-tests” and why can’t everyone else be included?
This venue seems not to have been very well thought out or planned, as the rules of engagement keep changing.
As usual, time will tell. Good Luck!
Posted: Aug 2, 2016 [ # 73 ]
Administrator
Total posts: 3111
Joined: Jun 14, 2010
I’d imagine that the “tests” are more or less just to be assured that the bots will function, as opposed to actually checking for response quality, etc. But I agree that some way needs to be found to test all bots that rely on “executables” (doesn’t that really mean ALL entries? Or are some just operated via magic and phlogiston?).
I see this (probably over-optimistically) as just a working out of the inevitable kinks that crop up, but like you said, Art, time will tell.
Posted: Aug 2, 2016 [ # 74 ]
Guru
Total posts: 1009
Joined: Jun 13, 2013
Shortly after asking us to send the entries in early, they withdrew that request and asked the contestants to bring their laptops to the event instead. It was at that point that I had some stern words with them, as this was unprofessional indeed.
The beforehand testing, as I understood it, would simply have been the same multiple-choice test as they have now run at IJCAI, just not on location. For me this would not have been a problem, but I imagine it would have been impossible for contestants with huge databases or specialised hardware.
Here is a rather respectable interview with one of the organisers.
Posted: Aug 2, 2016 [ # 75 ]
Guru
Total posts: 1081
Joined: Dec 17, 2010
I agree, Art.
There were a number of problems with this contest. Originally, the contest details read:
The test will be administered on a yearly basis by CommonsenseReasoning.org starting in 2015. The first submission deadline will be October 1, 2015. The 2015 Commonsense Reasoning Symposium, to be held at the AAAI Spring Symposium at Stanford from March 23-25, 2015, will include a special session for presentations and discussions on progress and issues related to this Winograd Schema Challenge. Contest details can be found at http://commonsensereasoning.org/winograd.html.
Prizes
The winner that meets the baseline for human performance will receive a grand prize of $25,000. In the case of multiple winners, a panel of judges will base their choice on either further testing or examination of traces of program execution. If no program meets those thresholds, a first prize of $3,000 and a second prize of $2,000 will be awarded to the two highest scoring entries.
The start date for the contest kept moving, the input and output formats changed, and a pronoun disambiguation round was inserted before you even got to test against the Winograd Schema Challenge. It was in flux until the last minute. Even then, there were problems with the XML used. From the organizers: “A problem was discovered at the last minute with unexpected punctuation in the XML input impacting a handful of questions.”
These changes, the new requirements, and the fact that they picked the same deadline as the Loebner Prize caused me to drop out and echo the same feelings Don expressed: Don Patrick - Jun 11, 2016: “Jesus flippin lunatics have just been wasting months of my time by changing all the rules at the last minute. They’re requiring my presence across an ocean, they’ve tripled the difficulty of the questions at the last moment, they’re practically demanding to see my code, and they’re still making up the rules while I have to ship my program by Monday to arrive in time. I am not okay with any of this.
I’m done with this crap organisation.”
I salute Don for having the tenacity to push through.
It would have been nice though if the Winograd Schema Challenge actually tested the ability of each program to do Winograd Schemas. At least then the entrants would have gotten some feedback.
As I understand it, the next contest will be run in 2018.
You need to get over 90% right on the PDP part to get a chance at the Winograd Schema Challenge round.
Prizes: The grand prize of $25,000 will be awarded to the first team to achieve a score of 90% in both rounds of the contest. If more than one team accomplishes this, the prize will be awarded to the team with the higher score in the second round. If tied in the second round, the prize will go to the team with the higher score in the first round. If both rounds are ties, the prize will be split.
At IJCAI-2016 three smaller prizes, of $1000, $750, and $500 will be awarded to the top three programs that score over 65% on the first round of the contest.
Maybe Nuance is trying to save money.