Re: questions about 2007 Legal Track routing & interassessor (in)consistency



Hi -

Maybe another approach to inconsistency would be to work with graded 
relevance? 1R2R could mean "strongly relevant" and 1R2N / 1N2R "weakly 
relevant". One would not need to dramatically change the evaluation metrics.
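
As a rough sketch of what I mean (the "R"/"N" labels, the numeric grades 
2/1/0, and the function name are my own assumptions for illustration, and 
treating 1N2N as grade 0 is implied rather than something we have agreed on):

```python
# Hypothetical sketch: turn two assessors' binary judgments into graded relevance.
# Label strings and grade values are assumptions, not an agreed Legal Track format.

GRADES = {
    ("R", "R"): 2,  # 1R2R: both assessors say relevant -> "strongly" relevant
    ("R", "N"): 1,  # 1R2N: assessors disagree -> "weakly" relevant
    ("N", "R"): 1,  # 1N2R: assessors disagree -> "weakly" relevant
    ("N", "N"): 0,  # 1N2N: both say nonrelevant (assumed grade 0)
}


def graded_qrels(assessor1, assessor2):
    """Combine judgments for the documents that both assessors judged."""
    shared = assessor1.keys() & assessor2.keys()
    return {doc: GRADES[(assessor1[doc], assessor2[doc])] for doc in shared}


# Made-up judgments for four documents on one topic.
a1 = {"d1": "R", "d2": "R", "d3": "N", "d4": "N"}
a2 = {"d1": "R", "d2": "N", "d3": "R", "d4": "N"}
grades = graded_qrels(a1, a2)
```

Documents judged by only one assessor are left out here; how to grade those 
would be a separate decision.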

This, of course, does not solve the issue of the second assessor seeing a 
specially selected document set.

- Kal


Dave Lewis (address for public mailing lists) wrote:
> 
> As we've been discussing, interassessor consistency is bad and/or 
> strange for many 2006 Legal Track ad hoc topics.   That raises several 
> questions:
> 
> 1.  Should we omit some of these topics from the 2007 Legal Track 
> routing evaluation because we already have evidence that they are 
> susceptible to high interassessor inconsistency?  This would probably 
> mean the same topics should be dropped from the test collection 
> entirely, since the 2006 ad hoc pool is of questionable quality (only 6 
> participants, many technical problems).  If so, what should the test be 
> for dropping a topic?
> 
> 
> 2.  I had envisioned that, independent of the style of routing 
> evaluation (e.g. residual collection), the union of 2006 ad hoc 
> qrels and the 2007 routing qrels would be used as the qrels for these 
> topics going forward.  But maybe this is unrealistic given the high 
> levels of interassessor inconsistency.  So how should the final qrels 
> for the collection be produced:
> 
>     2a. Go ahead and take the union?
> 
>     2b.  Distribute two alternate sets of qrels with the collection, one 
> based on 2006 ad hoc and one based on 2007 routing?
> 
>     2c. Distribute only the 2007 routing qrels?
> 
>     2d. Take the union of the 2006 ad hoc relevant with the 2007 routing 
> relevant and nonrelevant (a kind of maximally broad definition of 
> relevance)?
> 
> 
> 3. If the answer to 2 is 2a, 2b, or 2d, should qrels from 2006 ad hoc 
> Assessor 2 be thrown in as well?
> 
> 
> 4. If the answer to 2 is 2b or 2c, should we have the relevant from 2006 
> ad hoc thrown into the 2007 routing pools to be reassessed, as a 
> particularly rich source of relevant documents?
> 
> 
> 5. Do the answers to the above questions change the best strategy for 
> evaluating routing?  In particular, if we adopt 2c (with either answer to 
> 4), is residual collection evaluation still necessary?
> 
> 
> 6. Should additional studies of interassessor consistency be built into 
> the routing evaluation?  If so, what?   Should we keep the option open 
> of omitting some 2006 routing topics from the final collection?
> 
> Dave
> 
> 



