Re: ACE scoring modification -- version 7 slays the Sly Fox


>Regarding point (3), I think it's rather analogous to saying "Al Gore would
>have won the last election if we were counting popular votes rather than
>electoral votes." Perhaps that is superficially true, but ignores the fact
>that both candidates would have waged vastly different campaigns had the
>"evaluation metric" been different.
>
>I'm not sure how much of a difference the different metric would have made
>in system design and output, but changing the rules after the game has been
>played doesn't make it easier to understand the results, even if you do go
>back and recalculate them.
>
>Besides, the change is, in my opinion, ill-motivated because it's essentially
>to prevent someone from doing what nobody would do anyway.
>
>If the pronoun mention problem bothers anyone, I would suggest that the best
>solution is simply to report the number of classification errors for
>mentions of each type. Excluding a few instances of headless nominals and
>one-anaphora, pronouns are extremely easy to classify (it's a closed class,
>after all). Nobody should be making a lot of mistakes there.
>
>Also, I'm not sure I understood whether George meant that the difference in
>scores was 2-6 percent, or 2-6 percentage points of score. If the latter,
>then it has much more than a "minimal effect" on the results.
>
>                                - Doug
>
There are two distinct issues here: (1) Which score does the best job?
(2) How can we make sure the competition is fair? I basically agree with
you that changing the score in the middle is unfair with respect to issue 2.
However, issue 1 is a research issue. In this particular instance, I don't
really care whether we change the score or not if the effect is indeed
minimal, i.e., for the sake of argument, let's assume that it does not
change the ranking of the systems. However, if a new score does affect the
ranking, the strategy for winning the bake-off, etc., then the change
should be justified with respect to issue 1. For example, suppose it turned
out that the most important research issues were being clouded by a score
that overwhelmingly favored named mention recognition, and suppose everyone
agreed that named entity recognition was an old and mostly solved research
question. Suppose also that there was an easy way to neutralize this effect
and everyone agreed that the new score more adequately reflected success at
the "real" task. Then I would say that changing the score would be
completely justified, regardless of the amount of work that everyone spent
on tuning their systems to the old score.
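
To make the "does it change the ranking" test concrete, here is a minimal
sketch in Python. The system names and score values are made up purely for
illustration (they are not ACE results), and the helper names are my own
assumptions, not part of any scorer.

```python
def ranking(scores):
    """Return system names ordered from best (highest) to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def ranking_changed(old_scores, new_scores):
    """True if the two scoring schemes order the systems differently."""
    return ranking(old_scores) != ranking(new_scores)


# Hypothetical scores for three hypothetical systems under each metric.
old = {"sys_a": 62.0, "sys_b": 58.5, "sys_c": 51.0}
new = {"sys_a": 59.0, "sys_b": 60.5, "sys_c": 49.0}

if ranking_changed(old, new):
    print("Ranking differs under the new score:", ranking(new))
else:
    print("Ranking is unchanged under the new score.")
```

If the ranking is unchanged, the rescoring is harmless with respect to
issue 2; if it differs, the new score needs a justification under issue 1.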

Adam





