evaluating routing when the document collection is reused



The TREC 2007 Legal Track is planning to include a routing  
evaluation.  (I'm not an organizer of the Legal track this year, but  
am helping out on a few issues.)  The usual TREC-style routing  
evaluation works like this:

         1. The topics are ad hoc topics from a previous year of  
TREC.  Participants use both the topic descriptions, and previously  
assessed documents, to build a routing profile.  The previously  
assessed documents allow relevance feedback and machine learning  
techniques to be used, typically leading to very high quality profiles.

         2. The routing profile is used to rank a new document  
collection, with participants submitting the top K documents for each  
topic.  Relevance judgments are created for the new collection (using  
pooling or other methods), and the top K document sets are assessed  
by the usual measures (e.g. R-precision).
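
As a concrete reference point for step 2, here is a minimal sketch of R-precision for one topic. The function name and the document IDs are made up for illustration; it assumes the submitted set is an ordered list and the judgments are a set of relevant IDs.

```python
def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of known relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    # Count how many of the top R submitted documents are relevant.
    hits = sum(1 for doc in ranking[:r] if doc in relevant)
    return hits / r


# Hypothetical submitted ranking and relevance judgments for one topic.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}
score = r_precision(ranking, relevant)
```

With three relevant documents, only the top three ranks count, and two of them (d3 and d1) are relevant.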

One way the 2007 Legal routing subtask differs from this model is  
that instead of a new collection, we will reuse the old one.  This is  
both because we have no other similar collection, and because we want  
to improve the quality of the set of judged documents for the  
existing topics.

Reusing the collection means that the training documents are part of  
the collection being ranked, and relevant training documents will  
show up at unnaturally high positions in the ranking.  This is called  
"testing on the training data", and is in most cases a big no-no in  
machine learning research.

It is less clear that testing on the training data is a bad thing in  
routing (see A. below).  But it is the case that many standard  
measures of ranking effectiveness are disproportionately affected by  
relevant documents at the very top of the ranking, and if these are
just the training documents, that doesn't tell us much about how the
ranking methods differ.
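
To make the top-of-ranking effect concrete, here is a small sketch using average precision as an example of a top-heavy measure (my choice for illustration, not something the track has settled on). The topic, document IDs, and runs are all hypothetical.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical topic: t1 and t2 are relevant training documents,
# n1 and n2 are relevant documents that were not in the training data.
relevant = {"t1", "t2", "n1", "n2"}

# Both runs rank the training documents first. The first finds nothing new;
# the second finds both new relevant documents immediately.
run_finds_nothing = ["t1", "t2", "x1", "x2", "x3", "x4"]
run_finds_all = ["t1", "t2", "n1", "n2", "x1", "x2"]

ap_nothing = average_precision(run_finds_nothing, relevant)
ap_all = average_precision(run_finds_all, relevant)
```

In this toy case the run that finds nothing new still scores 0.5 against 1.0 for the perfect run, so half of the measure's range is taken up by re-ranking the training documents.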

This issue has been faced since the early days of research on
relevance feedback.  Here are some techniques that have been used, with
the names they were given in the SMART book:

       A. Do nothing.  If the goal in legal discovery is to find,
say, 70000 documents to be manually reviewed from a collection of 7
million, it's quite possible that manually assessing 1000 of those
documents for machine learning purposes would be appropriate, and
would not excuse those documents from the later complete manual
review.  Thus reranking those 1000 documents would not be a problem.

       B. "Residual Collection Evaluation": Omit the documents that  
have been judged for a particular topic from all submitted sets for  
that topic.  If, for instance, the assessed sets for all topics
contain fewer than 1000 documents, participants would be asked to
submit the top K+1000 documents.  The assessed documents would be
removed from all submitted sets, and effectiveness measures would be
computed on the top K of what remains.

       C. "Rank Freezing": There are several variations on this, but
all assume the judged documents come from the top of a single  
ranking. This isn't true for us.

       D. Ask participants to put all the known relevant documents at  
the very top of their submitted sets, and to omit all known  
nonrelevant documents. This would be a kind of optimistic rank  
freezing, intended to ensure that all runs get equal benefit from
the re-ranking of known documents.  Then use an effectiveness measure  
(R-precision?) that is not too influenced by the top of the rankings.

       E. "Test and Control": Split the entire document set into
training and test halves.  Only assessed documents that fall into the  
training set of documents would be used for training, and profiles  
would only be used to rank the test documents.  Some of the  
previously judged documents would be re-retrieved from the test  
document set, but that's OK because they wouldn't have been used for  
training.  This would have the disadvantage of only improving  
judgments on the test half of the collection, though one could do  
this twice, exchanging the roles of the training and test halves.

I'd be interested in people's thoughts on this.  In some sense I'm  
putting the cart before the horse here, since the first question  
should be what the goal of the routing evaluation is, and then what  
evaluation approach will best meet that goal.  But in truth my goal  
is simply to improve the quality of assessments for the collection,  
so I'm not much help!

The other tricky point is how any of these methods might interact  
with new-wave, sampling-based assessment approaches for large
collections...

Dave









