Re: interassessor consistency data on TREC 06 Legal track ad hoc topics
- From: cabuckley@sabir.com
- Date: Thu, 1 Feb 2007 08:33:31 -0500
I'm actually not worried by the overall numbers either. But I
have 2 concerns about the pattern of the numbers.
1. There are more topics than I would expect that have
substantial disagreements but are about even in 1N2R and
1R2N numbers, e.g.
18 80 689 20 7 5 18
The more typical pattern in the past has been the lopsided
differences (which we see here also). One assessor is more
lenient than the other on the scope or importance of some major
factor. Those don't bother me (but see below). However, topics
like the one above reflect either actual factual disagreements
(not just scope) or two separate factors in play that the
assessors disagree on. I don't believe the latter. In past TREC
tests, there have been fewer topics that I categorized as
factual disagreements (as I remember; I haven't yet checked the
actual statistics).
2. The number of lopsided topics (1N2R >> 1R2N and the reverse)
doesn't bother me, as I said, but the one-sided direction bothers
me tremendously.
1N2R is 5 or more greater than 1R2N for 1 topic
1R2N is 9 or more greater than 1N2R for 15 topics!
That's an enormous discrepancy, one that is almost impossible to
get by pure chance. The original assessor is much more
lenient than the secondary assessor. I don't know whether
that's due to the original having better key words to look for,
or the secondary having different expectations of relevance, or
the secondary assessor doing a better job given fewer
documents, or what. But I believe the discrepancy calls into
question whether we can trust these numbers at all. There
seems to be some unexplained systematic effect here.
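
A rough way to put a number on "pure chance" is an exact sign test on
the 1 vs 15 split. This is only a sketch under assumptions that are not
in the message: topics are treated as independent, each lopsided topic
is assumed equally likely to tip either way under the null, and the
different cutoffs quoted above (5 or more vs 9 or more) are ignored.
The function name is hypothetical.

```python
from math import comb


def sign_test_two_sided(k: int, n: int) -> float:
    """Exact two-sided sign test: p-value for a split at least as
    uneven as k vs (n - k), assuming each of the n topics tips either
    way with probability 0.5 (an assumption, not a measured fact)."""
    tail = min(k, n - k)
    # Probability of the smaller count being `tail` or less.
    one_sided = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    # Double for the two-sided test, capped at 1.
    return min(1.0, 2 * one_sided)


# Counts quoted in the message: 1 topic leaning 1N2R, 15 leaning 1R2N.
lean_1n2r = 1
lean_1r2n = 15
p_value = sign_test_two_sided(lean_1n2r, lean_1n2r + lean_1r2n)
print(f"two-sided p = {p_value:.6f}")
```

Under those assumptions the p-value is 34/65536, about 0.0005, which
fits the reading that the direction of the disagreement is systematic
rather than noise.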
Chris