Re: ACE scoring modification -- version 7 slays the Sly Fox
- To: Adam Meyers <meyers@cs.nyu.edu>
- Subject: Re: ACE scoring modification -- version 7 slays the Sly Fox
- From: "Douglas E. Appelt" <appelt@AI.SRI.COM>
- Date: Thu, 15 Aug 2002 13:52:02 -0700
- CC: Multiple recipients of list <ace_list@nist.gov>
- Content-transfer-encoding: 7bit
- Content-type: text/plain; charset="US-ASCII"
- In-Reply-To: <3D5BF6B9.8010509@cs.nyu.edu>
- User-Agent: Microsoft-Entourage/10.1.0.2006
on 8/15/02 11:45 AM, Adam Meyers at meyers@cs.nyu.edu wrote:
> Douglas E. Appelt wrote:
>
>> George,
>>
>> I'm rather against changing the fundamentals of the scoring algorithm in
>> this respect for the following reasons:
>>
>> (1) I think we can rely on people to not game the scoring system in any
>> unreasonable way such as declaring all mentions to be pronouns.
>>
>> (2) There are a considerable number of mentions for which it is simply
>> unclear as to whether they are names or nominals, and a decision by the
>> annotators based on criteria like capitalization is essentially arbitrary.
>> For example is "US Congress" a name, or a nominal? Should it matter whether
>> "congress" is capitalized in the text? In the ASR transcripts,
>> capitalization is very inconsistent.
>>
>> (3) This muddies up the comparability of results from evaluation to
>> evaluation, which is, in my opinion, the strongest reason for not doing this.
>>
>> - Doug
>>
>> on 8/15/02 8:58 AM, George Doddington at doddington@nist.gov wrote:
>>
>>
>>
> I tend to agree on points (1) and (2). To elaborate on (2), there are
> certain pronouns which annotators sometimes mark as nominal, in
> particular indefinite pronouns. I suppose we could fix these
> automatically. Looking at all of version 6 except EELD as a sample
> corpus (I am currently using this as test data), "anyone" is marked 34
> times as a pronoun and 3 times as a nominal; "someone" is marked 33
> times as a pronoun and 7 times as a nominal; "somebody" - 10 and 2;
> "anybody" - 12 and 1; "one" - 153 and 15. Some other pronoun/nominal
> errors are harder to detect than these.
>
> I am less worried about point (3) because, as long as one can calculate
> previous results using the same scorer, we can make the necessary
> comparisons. Furthermore, part of EDT research is trying to find a
> "good" score that accurately predicts how well a system achieves
> coreference, etc. So I don't think that changing the score in itself is
> a bad thing. However, when we do change the scoring function, it seems
> necessary to answer the question: Why is this a better score to tune
> our systems to? Normally, I would not think that covering up some
> "game" is a sufficient answer to this question, since we can just all
> agree not to play that game. However, if, as George suggests, it has a
> minimal effect on previous scores (and previous rankings?), I don't see
> any harm.
>
> Adam
>
Regarding point (3), I think it's rather analogous to saying "Al Gore would
have won the last election if we were counting popular votes rather than
electoral votes." Perhaps that is superficially true, but it ignores the fact
that both candidates would have waged vastly different campaigns had the
"evaluation metric" been different.

I'm not sure how much of a difference the different metric would have made
in system design and output, but changing the rules after the game has been
played doesn't make it easier to understand the results, even if you do go
back and recalculate them.

Besides, the change is, in my opinion, ill-motivated because it's essentially
to prevent someone from doing what nobody would do anyway.

If the pronoun mention problem bothers anyone, I would suggest that the best
solution is simply to report the number of classification errors for
mentions of each type. Excluding a few instances of headless nominals and
one-anaphora, pronouns are extremely easy to classify (it's a closed class,
after all). Nobody should be making a lot of mistakes there.
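
As a rough illustration of what such a per-word report could look like,
here is a minimal Python sketch that tabulates the pronoun/nominal counts
Adam quotes above. The counts are his; the function and variable names are
made up for this example, and reading every "nominal" marking as an error
is an assumption (some uses of "one" may be legitimately nominal).

```python
# Counts quoted earlier in this thread (version 6 except EELD):
# word -> (times marked as pronoun, times marked as nominal)
counts = {
    "anyone": (34, 3),
    "someone": (33, 7),
    "somebody": (10, 2),
    "anybody": (12, 1),
    "one": (153, 15),
}


def nominal_share(pronoun, nominal):
    """Fraction of all markings of a word that are nominal."""
    return nominal / (pronoun + nominal)


def report(table):
    """Return one (word, pronoun, nominal, share) row per word."""
    return [
        (word, pro, nom, nominal_share(pro, nom))
        for word, (pro, nom) in table.items()
    ]


# Totals across the five indefinite pronouns listed above.
total_pro = sum(pro for pro, _ in counts.values())
total_nom = sum(nom for _, nom in counts.values())

for word, pro, nom, share in report(counts):
    print(f"{word:10s} pronoun={pro:4d} nominal={nom:3d} share={share:.1%}")
print(f"{'total':10s} pronoun={total_pro:4d} nominal={total_nom:3d} "
      f"share={nominal_share(total_pro, total_nom):.1%}")
```

The same layout would extend to names and nominals if the scorer reported
reference and system mention types side by side.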

Also, I'm not sure I understood whether George meant that the difference in
scores was 2-6 percent, or 2-6 percentage points of score. If the latter,
then it has much more than a "minimal effect" on the results.
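
To make the two readings concrete, here is a small Python sketch. The
baseline score of 50.0 is purely hypothetical, chosen only to show the
arithmetic, and the function names are made up for this example.

```python
def apply_relative(score, percent):
    """Shift a score by a relative amount, e.g. 2 means 2% of the score."""
    return score * (1 + percent / 100.0)


def apply_points(score, points):
    """Shift a score by an absolute number of percentage points."""
    return score + points


baseline = 50.0  # hypothetical score, not taken from any evaluation

for delta in (2.0, 6.0):
    relative = apply_relative(baseline, delta)
    absolute = apply_points(baseline, delta)
    print(f"delta={delta:g}: as percent -> {relative:.1f}, "
          f"as points -> {absolute:.1f}")
```

The lower the baseline, the wider the gap between the two readings, since
a fixed number of points is a larger fraction of a smaller score.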
- Doug