Re: questions about 2007 Legal Track routing & interassessor (in)consistency



> From: Dave Lewis (address for public mailing lists) <misclists1@daviddlewis.com>
>
> As we've been discussing, interassessor consistency is bad and/or  
> strange for many 2006 Legal Track ad hoc topics.   That raises  
> several questions:
>
> 1.  Should we omit some of these topics from the 2007 Legal Track  
> routing evaluation because we already have evidence that they are  
> susceptible to high interassessor inconsistency?  This would probably  
> mean the same topics should be dropped from the test collection  
> entirely, since the 2006 ad hoc pool is of questionable quality (only  
> 6 participants, many technical problems).  If so, what should the  
> test be for dropping a topic.

My view is that the only reasons to drop topics are:
1. the very unlikely event of an assessor giving us random
answers;
2. we don't have the assessing budget to judge them all.

While the inconsistencies have important consequences for the end
user task and have to be addressed because of that, we don't have any
information on how it affects system evaluation. In the past, it has
had very little effect on ad hoc evaluation; I see no reason to expect
a big change here.  Systems may not do as well on some topics
where the training assessments don't match the testing
assessments, and I can envision scenarios where that may be
important, but in practice I wouldn't expect those scenarios to
matter at the current level of system performance.
I think that reducing the number of topics will have a bigger
negative impact on the reliability of the results.

> 2.  I had envisioned that, independent of the style of routing  
> evaluation (e.g. residual collection), that the union of 2006 ad hoc  
> qrels and the 2007 routing qrels would be used as the qrels for these  
> topics going forward.  But maybe this is unrealistic given the high  
> levels of interassessor inconsistency.  So how should the final qrels  
> for the collection be produced:
>
>      2a. Go ahead and take the union?
>
>      2b.  Distribute two alternate sets of qrels with the collection,  
> one based on 2006 ad hoc and one based on 2007 routing?
>
>      2c. Distribute only the 2007 routing qrels?
>
>      2d. Take the union of the 2006 ad hoc relevant with the 2007  
> routing relevant and nonrelevant (a kind of maximally broad  
> definition of relevance).

My preference is 2b, absent some concrete reason to do 2a.  A
possible concrete reason might be that we just don't have enough
judgements in 2006 to evaluate systems.  I don't believe this to
be the case.  Depending on resources, the amount of judging we
can do on the 2007 routing runs might be quite limited, and might
not stand up on its own.  In that case we might want to merge.

Relevance really is a function of document, query, and user. I
think we should respect this as much as we can.  Keeping the two
sets of qrels separate allows us to do so.

In future use of the collection, I view the knowledge that
different "users" were involved to be important.  A retrieval
system should act differently given some factor that some users
find critical and some don't, as opposed to a factor that all
users find somewhat helpful.  Having separate qrels would be a
very helpful investigatory tool when systems advance to the point
of wanting to explore this.

If a researcher has some concrete reason they need more relevance
judgements for an experiment, they can merge the qrels
themselves.  I would prefer for most uses in papers that
the two qrels be regarded as separate, related collections.
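
To make the "merge the qrels themselves" point concrete, here is a
minimal sketch of how a researcher might do it.  Everything in it is
an assumption for illustration: the in-memory layout
((topic, docno) -> 0/1), the function names, the document ids, and
above all the conflict rule (a document counts as relevant if either
assessor said so), which is only one possible reading of "union".

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory qrels: (topic, docno) -> 1 (relevant) or 0 (nonrelevant).
Qrels = Dict[Tuple[str, str], int]


def merge_union(a: Qrels, b: Qrels) -> Qrels:
    """Merge two qrels sets; on conflict, keep the 'relevant' label.

    The conflict rule is an assumption, not something the track has decided.
    """
    merged = dict(a)
    for key, rel in b.items():
        merged[key] = max(merged.get(key, 0), rel)
    return merged


def disagreements(a: Qrels, b: Qrels) -> List[Tuple[str, str]]:
    """Return (topic, docno) pairs judged in both sets with different labels."""
    return sorted(k for k in a.keys() & b.keys() if a[k] != b[k])


# Made-up judgements for a single made-up topic.
qrels_2006 = {("7", "docA"): 1, ("7", "docB"): 0}
qrels_2007 = {("7", "docB"): 1, ("7", "docC"): 0}

merged = merge_union(qrels_2006, qrels_2007)
conflicts = disagreements(qrels_2006, qrels_2007)
```

Listing the disagreements separately is deliberate: those pairs are
exactly the information that is lost once the two sets are merged.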


> 3. If the answer to 2 is 2a, 2b, or 2d, should qrels from 2006 ad hoc  
> Assessor 2 be thrown in as well?

Strongly against this.  They're non-random partial judgements.

> 4. If the answer to 2 is 2b or 2c, should we have the relevant from  
> 2006 ad hoc thrown into the 2007 routing pools to be reassessed, as a  
> particularly rich source of relevant documents.

Not because they're a rich source of relevant documents, no.  I
see no major reason why we should regard the 2006 judgements as
"bad" judgements. The "lawyers being legalistic" argument is a
reason; I don't think it's major, but I don't think we
know. However, that's a reason to include the NON-relevant
documents to be reassessed, not the relevant documents.

Given infinite resources, I have no objection to replacing the
2006 judgements.  I'm sure we can improve them.  But we don't
have infinite resources.


> 5. Do the answers to the above questions change the best strategy for  
> evaluating routing.  In particular if we adopt 2c (with either answer  
> to 4), is residual collection evaluation still necessary?

Yes, it is still necessary; I'm not sure what has changed.  The issue
is artificial evaluation effects on the routing task.  Those still
exist whether or not the 2006 judgements are released.  E.g., the
duplicate document problem remains a problem.
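
For anyone less familiar with the term, a rough sketch of residual
collection evaluation as I understand it: documents already judged
for a topic in training are removed from a run's ranking before it is
scored, so a system gets no credit for simply returning documents it
was told about.  The function names, document ids, and the choice of
precision at k as the measure are all assumptions for illustration,
not the track's actual procedure.

```python
from typing import List, Set


def residual_ranking(ranking: List[str], training_judged: Set[str]) -> List[str]:
    """Drop every document that was already judged in training, keeping rank order."""
    return [doc for doc in ranking if doc not in training_judged]


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


# Made-up run and judgements for one topic.
ranking = ["d1", "d2", "d3", "d4", "d5"]
training_judged = {"d1", "d3"}      # hypothetical: seen during training
relevant_test = {"d1", "d2", "d5"}  # hypothetical: relevant per test qrels

residual = residual_ranking(ranking, training_judged)
raw_p2 = precision_at_k(ranking, relevant_test, 2)
residual_p2 = precision_at_k(residual, relevant_test, 2)
```

In this toy case the raw score is inflated by d1, which the system
already knew about from training; the residual score is not.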


> 6. Should additional studies of interassessor consistency be built  
> into the routing evaluation?  If so, what?   Should we keep the  
> option open of omitting some 2006 routing topics from the final  
> collection?

Additional studies are a function of resources available.  If
we're going to change the topics to help consistency in the
2007 ad hoc task, I think we need most of the resources there.


Thanks for bringing up all these points so thoroughly, Dave!

Chris


