Re: interassessor consistency data on TREC 06 Legal track ad hoc topics
Quoting "Dave Lewis (address for public mailing lists)"
<misclists1@daviddlewis.com>:
> Anticipating this concern, during the 2006 Legal Track ad hoc
> evaluation we had a sample of the pool for each topic assessed by two
> assessors. The sample consisted of 25 documents judged relevant by
> the first assessor (or all such documents if fewer than 25), and
> enough nonrelevant to bring the sample to 50 documents (49 in one
> case due to a glitch).
I realize this is difficult when the sample is drawn this way, but have you
tried scoring the runs against each assessor's judgments on this sample and
seeing whether the runs rank differently? (I'm not sure something like infAP
or bpref is a good fit when the samples are so small, but maybe just set
precision over the sampled documents... Just a before-the-coffee thought.)
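
To make that concrete, here is a minimal Python sketch of the comparison. It is
only an illustration under stated assumptions, not anything from the track's
tooling: the document ids, run names, and judgments are made-up toy data, and
set_precision and kendall_tau are hypothetical helper names. Set precision is
computed only over retrieved documents that fall in the doubly-judged sample,
and no correction is made for the way the sample was drawn (up to 25 relevant
plus nonrelevant fill), so the absolute numbers would be biased; only the
agreement between the two rankings is of interest.

```python
from itertools import combinations


def set_precision(retrieved, judgments):
    """Precision over the retrieved documents that appear in the judged sample.

    `judgments` maps doc id -> 1 (relevant) or 0 (nonrelevant).
    Retrieved documents outside the sample are ignored.
    """
    judged = [d for d in retrieved if d in judgments]
    if not judged:
        return 0.0
    return sum(judgments[d] for d in judged) / len(judged)


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Hypothetical toy data: two assessors' judgments on the same sampled documents.
assessor_a = {"d1": 1, "d2": 1, "d3": 0, "d4": 0}
assessor_b = {"d1": 1, "d2": 0, "d3": 0, "d4": 1}

# Hypothetical runs, each reduced to the set of documents it retrieved.
runs = {
    "runX": ["d1", "d2", "d9"],  # d9 is not in the sample, so it is ignored
    "runY": ["d3", "d4"],
    "runZ": ["d1", "d3", "d4"],
}

names = sorted(runs)
scores_a = [set_precision(runs[r], assessor_a) for r in names]
scores_b = [set_precision(runs[r], assessor_b) for r in names]
tau = kendall_tau(scores_a, scores_b)

for name, a, b in zip(names, scores_a, scores_b):
    print(f"{name}: assessor A {a:.3f}, assessor B {b:.3f}")
print(f"Kendall tau between the two run orderings: {tau:.3f}")
```

A tau near 1 would suggest the run ordering is stable across assessors; a
value near 0 or below would suggest it is not. With samples of about 50
documents per topic, many runs may tie, so averaging per-topic scores over
topics before ranking is probably the more sensible way to use it.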
Ian