Re: questions about 2007 Legal Track routing & interassessor (in)consistency
Hi -
Maybe another approach to inconsistency would be to work with graded
relevance? 1R2R (both assessors judge the document relevant) could mean
"strongly relevant", and 1R2N / 1N2R (the assessors disagree) "weakly
relevant". One would not need to dramatically change the evaluation metrics.
This of course does not solve the problem of the second assessor seeing a
specially selected document set.
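
As a rough sketch of that mapping (the function name and the 2/1/0 grade
values are illustrative assumptions only, not anything the track has
defined; documents judged by just one assessor are not handled here):

```python
# Hypothetical grade scale (an assumption, not a Legal Track convention):
#   2 = both assessors say relevant (1R2R)   -> "strongly relevant"
#   1 = the assessors disagree (1R2N / 1N2R) -> "weakly relevant"
#   0 = both assessors say nonrelevant (1N2N)
def graded_relevance(assessor1: str, assessor2: str) -> int:
    """Map two binary judgments ('R' or 'N') to a relevance grade."""
    for judgment in (assessor1, assessor2):
        if judgment not in ("R", "N"):
            raise ValueError(f"expected 'R' or 'N', got {judgment!r}")
    # Each 'R' vote contributes one point to the grade.
    return int(assessor1 == "R") + int(assessor2 == "R")


def grade_topic(judgments: dict[str, tuple[str, str]]) -> dict[str, int]:
    """Grade every doubly judged document for one topic.

    `judgments` maps a document id to (assessor 1, assessor 2) labels.
    """
    return {doc: graded_relevance(a1, a2) for doc, (a1, a2) in judgments.items()}
```

Any metric that accepts graded judgments could then use these grades
directly, and a binary metric could still be run by thresholding at
grade 1 or at grade 2.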
- Kal
Dave Lewis (address for public mailing lists) wrote:
>
> As we've been discussing, interassessor consistency is bad and/or
> strange for many 2006 Legal Track ad hoc topics. That raises several
> questions:
>
> 1. Should we omit some of these topics from the 2007 Legal Track
> routing evaluation because we already have evidence that they are
> susceptible to high interassessor inconsistency? This would probably
> mean the same topics should be dropped from the test collection
> entirely, since the 2006 ad hoc pool is of questionable quality (only 6
> participants, many technical problems). If so, what should the test be
> for dropping a topic?
>
>
> 2. I had envisioned that, independent of the style of routing
> evaluation (e.g. residual collection), the union of 2006 ad hoc
> qrels and the 2007 routing qrels would be used as the qrels for these
> topics going forward. But maybe this is unrealistic given the high
> levels of interassessor inconsistency. So how should the final qrels
> for the collection be produced:
>
> 2a. Go ahead and take the union?
>
> 2b. Distribute two alternate sets of qrels with the collection, one
> based on 2006 ad hoc and one based on 2007 routing?
>
> 2c. Distribute only the 2007 routing qrels?
>
> 2d. Take the union of the 2006 ad hoc relevant with the 2007 routing
> relevant and nonrelevant (a kind of maximally broad definition of
> relevance)?
>
>
> 3. If the answer to 2 is 2a, 2b, or 2d, should qrels from 2006 ad hoc
> Assessor 2 be thrown in as well?
>
>
> 4. If the answer to 2 is 2b or 2c, should we have the relevant from 2006
> ad hoc thrown into the 2007 routing pools to be reassessed, as a
> particularly rich source of relevant documents?
>
>
> 5. Do the answers to the above questions change the best strategy for
> evaluating routing? In particular, if we adopt 2c (with either answer to
> 4), is residual collection evaluation still necessary?
>
>
> 6. Should additional studies of interassessor consistency be built into
> the routing evaluation? If so, what? Should we keep the option open
> of omitting some 2006 routing topics from the final collection?
>
> Dave
>
>