Re: interassessor consistency data on TREC 06 Legal track ad hoc topics



I'm actually not worried by the overall numbers either.  But I
have 2 concerns about the pattern of the numbers.

1. More topics than I would expect have substantial
   disagreements but roughly even 1N2R and 1R2N numbers,
   e.g.
    18        80      689      20      7       5      18

  The more typical pattern in the past has been the lopsided
  differences (which we see here also).  One assessor is more
  lenient than the other on the scope or importance of some major
  factor.  Those don't bother me (but see below).  However,
  topics like the one above reflect either actual factual
  disagreements (not just scope) or two factors in play on which
  the assessors disagree.  I don't believe the latter.  In past
  TREC tests, there have been fewer topics I categorized as
  factual disagreements (as I remember; I haven't checked yet for
  the actual statistics).

2. The number of lopsided topics (1N2R >> 1R2N and the reverse)
  doesn't bother me, as I said, but the one-sided direction
  bothers me tremendously.
  1N2R is 5 or more greater than 1R2N for 1 topic
  1R2N is 9 or more greater than 1N2R for 15 topics!

  That's an enormous discrepancy, one that is almost impossible
  to produce by pure chance.  The original assessor is much more
  lenient than the secondary assessor.  I don't know whether
  that's due to the original having better key words to look for,
  or the secondary having different expectations of relevance, or
  the secondary assessor doing a better job given fewer
  documents, or what.  But I believe the discrepancy calls into
  question whether we can trust these numbers at all.  There
  seems to be some unexplained systematic effect here.
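
  As a rough check on the "pure chance" claim, the counts above
  can be run through an exact two-sided sign test.  This is only
  a sketch under assumptions of mine: that the 16 lopsided topics
  (1 + 15) are independent, and that absent a systematic effect
  each is equally likely to lean either way.  It also ignores
  that the two counts above use different thresholds (5 vs. 9).
  The function name is hypothetical, not from any TREC tooling.

```python
from math import comb


def sign_test_two_sided(k, n):
    """Exact two-sided sign test p-value.

    k: number of topics leaning one way (assumed independent)
    n: total number of lopsided topics
    Null hypothesis (an assumption): each topic leans either way
    with probability 1/2.
    """
    tail = min(k, n - k)
    # Probability of a split at least this extreme in one direction.
    p_one_sided = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    # Double for the two-sided test, capped at 1.
    return min(1.0, 2 * p_one_sided)


# 1 topic leans toward 1N2R, 15 lean toward 1R2N.
p = sign_test_two_sided(1, 16)
print(f"{p:.6f}")  # 0.000519
```

  Under those assumptions a 1-versus-15 split would come up by
  chance roughly once in two thousand tries, which is in line
  with the reading that some systematic effect is at work.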

Chris



