RE: evaluating routing when the document collection is reused



Some thoughts on routing evaluation, mainly on Chris's comments on Dave's mail.


> -----Original Message-----
> From: ireval@nist.gov [mailto:ireval@nist.gov] On Behalf Of
> cabuckley@sabir.com

> While doing extensive relevance feedback experiments, I gave up
> on doing anything other than "residual collection".  There were
> just too many evaluation artifacts when doing anything else

> So I would be against A, C, and D on those grounds.

I'm inclined to agree with Chris on A, C, D.

> (including split collection, but that was due to size of
> collection and number of relevant documents, which are not a
> problem here).

I think this is important -- I agree that with few relevant documents splitting the collection is not so good, but if there are enough of them I think it gives a much cleaner evaluation.

> I would probably be slightly against A and E on the grounds of
> assessor efficiency and consistency.  Either you have the
> inefficiency of assessors rejudging documents judged the previous
> year (and need to deal with what happens when they disagree), or
> you have two assessors judging one topic, and you'll be
> evaluating over both including their possible disagreements in
> interpretation of what's relevant.

In terms of Dave's desire for collection building, I think this problem is just as relevant to B as it is to E.  That is, with both we would end up with two sets of assessments (different assessors) for each topic; with B they would be disjoint, but that doesn't make them any more compatible.  So it would be just as difficult to know how to deal with them for future evaluations.

> B will have the two assessor problem, but the evaluation will all
> be the second assessor; that's cleaner in my opinion.

Although it might not be the most efficient way to do things, there are at least some potential gains from having the same documents rejudged.  It would be possible to agree that the evaluation for this particular task is going to be _only_ on the new judgements.  Afterwards there would be some interesting studies to be done on disagreements (in fact I would guess that the lawyers might be interested in this as well).
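
As a rough illustration of the kind of disagreement study meant here, the sketch below compares two sets of binary judgements on the documents that both assessors judged. The function name `agreement` and the dict-of-booleans representation are assumptions made for the example, not anything the track has defined.

```python
def agreement(old, new):
    """Compare two assessors' binary judgements on the documents both judged.

    old, new: dict mapping document id -> bool (True means relevant).
    Returns (n_common, raw_agreement, overlap), where overlap is
    |rel_old & rel_new| / |rel_old | rel_new| over the common documents.
    The representation is hypothetical, chosen only for this sketch.
    """
    common = old.keys() & new.keys()
    if not common:
        return 0, None, None

    # Fraction of commonly judged documents given the same label.
    same = sum(1 for d in common if old[d] == new[d])

    # Overlap of the two relevant sets, restricted to common documents.
    rel_old = {d for d in common if old[d]}
    rel_new = {d for d in common if new[d]}
    union = rel_old | rel_new
    overlap = len(rel_old & rel_new) / len(union) if union else None

    return len(common), same / len(common), overlap
```

Evaluating only on the new judgements would leave the old ones untouched, so a comparison like this could be run afterwards without affecting the task results.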

> We also have the problem of duplicate (and near duplicate)
> documents. For most (but not all) of last year's judgements, if
> documents were close to each other in similarity, they would both
> be judged or not judged.  So this problem will have a bigger
> impact on E than B (residual collection) since B will not
> evaluate on either of a duplicate pair if they both were judged,
> but E might use one for training and one for testing. Thus you
> have an artificial effect with E.

I agree that this is an issue.  Do we know how serious a problem it is in this collection?
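
One way to put a number on it, sketched under assumptions: given a list of known duplicate pairs and a proposed training half, count the pairs where a previously judged document would be used for training while its twin sits in the test half. How the pairs are detected is left open, and the names `straddling_pairs`, `dup_pairs`, `train_ids` and `judged_ids` are hypothetical.

```python
def straddling_pairs(dup_pairs, train_ids, judged_ids):
    """Count duplicate pairs split across training and test under option E.

    dup_pairs:  iterable of (doc_a, doc_b) duplicate or near-duplicate pairs.
    train_ids:  set of document ids in the training half; all others are test.
    judged_ids: set of document ids with judgements from the previous year.
    A pair counts only if its training-side member was judged, because only
    judged training documents are actually used for training.
    """
    count = 0
    for a, b in dup_pairs:
        a_train = a in train_ids
        b_train = b in train_ids
        if a_train == b_train:
            continue  # both on the same side: no leakage from this pair
        trained = a if a_train else b
        if trained in judged_ids:
            count += 1
    return count
```

A small count relative to the number of judged documents would suggest the artificial effect is minor; a large one would argue for keeping duplicate pairs on the same side of the split.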

> > Date: Fri, 26 Jan 2007 12:50:23 -0500 (EST)
> > From: "Dave Lewis (address for public mailing lists)"

> >        E. "Test and Control": Splitting entire document set into
> > training and test halves.  Only assessed documents that fall into the
> > training set of documents would be used for training, and profiles
> > would only be used to rank the test documents.  Some of the
> > previously judged documents would be re-retrieved from the test
> > document set, but that's OK because they wouldn't have been used for
> > training.  This would have the disadvantage of only improving
> > judgments on the test half of the collection, though one could do
> > this twice, exchanging the roles of the training and test halves.

I like the idea of doing it twice.  But one possible issue:  do you imagine that any of the participants will try manual query formulation for routing?  The two-fold replication might be difficult in that case -- we couldn't really have the same user doing the formulation both ways round.
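
For the automatic case, the two-fold version of E can be sketched roughly as follows. This is only an illustration under assumptions: a random split of document ids with a fixed seed, judgements held as a dict, and a hypothetical function name `two_fold`. It is not a proposal for how the actual split should be drawn.

```python
import random


def two_fold(doc_ids, judged, seed=0):
    """Build the two train/test configurations for option E.

    doc_ids: iterable of all document ids in the collection.
    judged:  dict mapping document id -> bool for previously judged documents.
    Returns a list of two (train_judgements, test_docs) tuples, the second
    with the roles of the halves exchanged.
    """
    ids = sorted(doc_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    half = len(ids) // 2
    halves = (set(ids[:half]), set(ids[half:]))

    folds = []
    for train, test in (halves, halves[::-1]):
        # Only judged documents that fall in the training half are usable.
        train_judged = {d: rel for d, rel in judged.items() if d in train}
        folds.append((train_judged, test))
    return folds
```

Each profile would be trained on `train_judgements` and used to rank `test_docs` only, so previously judged documents that turn up in the test half were never seen in training for that fold.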

> > The other tricky point is how any of these methods might interact
> > with new wave, sampling based assessment approaches for large
> > collections...

If rejudging is acceptable, the entire new evaluation could be done on the basis of one of these methods.  But it's not obvious to me how to make sense of the two sets of assessments afterwards.

Steve




