summary of discussion on possible 2007 TREC Legal routing task



I wanted to summarize where we had and hadn't reached consensus (in  
my opinion) on the possible 2007 TREC Legal routing task.  As  
mentioned, I'm not acting as track coordinator this year, and will  
not have the time to participate in discussions of the routing track  
in the future.   The decision about whether or not the routing task  
is run will be made by the track coordinators.  Thanks again to  
everyone for your comments!  Dave

-------------------------------------

Here's my sense of what we had converged to re the TREC 2007 Legal  
track routing task:

1. Use all 40 assessed TREC 2006 Legal ad hoc topics as 2007 routing  
topics.

     NOTE: If insufficient assessment resources are available, drop  
topics with low interassessor agreement.  See the thread  
"interassessor consistency data on TREC 06 Legal track ad hoc topics"  
for data on interassessor agreement.


2. The assessed 2006 Legal ad hoc pools will be used as training data  
for routing runs.


3. The low interassessor agreement, and the fact that assessors for  
most topics will be different in 2007 vs. 2006, mean we cannot  
combine 2006 and 2007 assessments for the purpose of evaluating 2007  
runs.  Instead 2007 runs will be assessed using 2007 assessments only.


4. Both the 2006 and 2007 assessments will be distributed with the  
test collection for future use.  Future researchers could use one or  
the other or both.


5. While 2006 assessments will be ignored for the purpose of  
*assessing* 2007 routing runs, the fact that they will be used in  
*training* 2007 routing systems means that relevant documents from  
2006 will be ranked unrealistically high in 2007.  Therefore, those  
documents should be excluded in evaluating 2007 runs, i.e. residual  
collection evaluation should be used.

     NOTE: We still will want to re-assess some documents that were  
assessed in 2006, both due to some overlap of Legal routing 2007  
topics with Legal interactive 2007 topics, and to produce 2007  
assessments of those documents for use by future users of the test  
collection.
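
As an illustration of item 5, here is a minimal Python sketch of residual collection evaluation. The function names, the document ids, and the use of precision at k are all assumptions made for the example, not part of what was agreed. Whether the excluded set should hold only the 2006 relevant documents or every document assessed in 2006 is left to the caller.

```python
def residual_ranking(ranking, training_docs):
    """Return the ranking with every training document removed, order preserved."""
    excluded = set(training_docs)
    return [doc for doc in ranking if doc not in excluded]


def precision_at_k(ranking, relevant, k):
    """Fraction of the top k positions that hold a relevant document."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


# Hypothetical data for a single topic.
run = ["d1", "d2", "d3", "d4", "d5", "d6"]
assessed_2006 = {"d1", "d3"}   # seen in training, so excluded from scoring
relevant_2007 = {"d2", "d5"}   # only 2007 assessments are used for scoring

residual = residual_ranking(run, assessed_2006)
score = precision_at_k(residual, relevant_2007, k=2)
```

The run is scored only on documents it could not have seen during training, so a system gets no credit for simply returning its training examples at the top.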


OPEN QUESTIONS:

The size of the collection, high known number of relevant documents  
for some topics, and poor effectiveness of runs contributing to the  
pool, mean that the pooling style evaluation in 2006 did not give a  
good measure of actual effectiveness of runs.    This almost  
certainly means that we want to choose the 2007 documents to be  
assessed not from the top of submitted runs, but using sampling-based  
approaches that allow expected values to be computed (e.g. the  
inferred average precision approach).    The following questions arise:

Question A. What sampling strategy should be used to choose documents  
to assess?

Question B. What effectiveness measure(s) will be used to evaluate  
2007 runs?

Question C. Can knowledge of which documents were assessed as  
relevant in 2006 help with sampling?  Can knowledge of the expert  
manual "run" in 2007 help as well?

Question D. How does residual collection evaluation affect this?

Question E. Can a sampling approach be developed which not only  
allows expectation-based effectiveness measures to be computed for  
2007 runs, but allows them to be computed for runs on the test  
collection in the future?  (Note that such runs might or might not  
use residual collection evaluation.)  What effectiveness measures  
could be supported?

Question F. How does the interactive task affect all the above?
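
To make the idea of an expectation-based measure concrete, here is a minimal Python sketch of one generic option, a Horvitz-Thompson style estimate of the number of relevant documents in the top k of a run. It is not the inferred average precision method and is not a proposal for what the track should adopt. The function name, the document ids, and the inclusion probabilities are hypothetical; the only assumption about the sampling strategy is that each assessed document's probability of being selected is known.

```python
def estimated_relevant_in_top_k(ranking, k, inclusion_prob, judgments):
    """Estimate how many of the top k documents are relevant.

    inclusion_prob: doc -> probability that the doc was selected for assessment
    judgments:      doc -> True/False, present for sampled documents only
    Each sampled relevant document counts 1/p, standing in for the
    unsampled documents it represents.
    """
    total = 0.0
    for doc in ranking[:k]:
        if judgments.get(doc, False):
            total += 1.0 / inclusion_prob[doc]
    return total


# Hypothetical data: three of six documents were sampled and assessed.
run = ["d1", "d2", "d3", "d4", "d5", "d6"]
inclusion_prob = {"d1": 1.0, "d3": 0.5, "d6": 0.25}
judgments = {"d1": True, "d3": True, "d6": False}

est_relevant = estimated_relevant_in_top_k(run, 4, inclusion_prob, judgments)
est_precision = est_relevant / 4
```

An estimator of this kind needs only the judgments and the recorded inclusion probabilities, so in principle it can be applied to a run that did not exist when the sample was drawn. Documents with zero inclusion probability can never contribute, which is one of the difficulties behind Questions D and E.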


And a minor unrelated question:

Question G. For each topic, a sample of 50 documents got a second  
assessment in 2006 (for the purpose of computing interassessor  
agreement).  Should participants be allowed to use this second set of  
50 assessments for training, in addition to the 500-1000 original  
assessments for each topic?  I propose they should.  This is no more  
unrealistic than routing in general, and for a few topics this will  
vastly improve effectiveness.
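
For readers who want to see how the doubly assessed sample relates to interassessor agreement, here is a minimal Python sketch that compares two sets of binary judgments on the same documents. The function name and the data are hypothetical, and the two statistics shown (raw agreement, and overlap of the relevant sets) are common choices rather than the measures actually reported in the agreement thread.

```python
def agreement_stats(first, second):
    """Compare two assessors' binary judgments on the documents both assessed.

    first, second: doc -> True (relevant) / False (not relevant)
    Returns (raw agreement, overlap of the two relevant sets).
    """
    shared = first.keys() & second.keys()
    if not shared:
        raise ValueError("no doubly assessed documents")
    agree = sum(1 for doc in shared if first[doc] == second[doc])
    rel_first = {doc for doc in shared if first[doc]}
    rel_second = {doc for doc in shared if second[doc]}
    union = rel_first | rel_second
    # Convention chosen here: if neither assessor marked anything relevant, overlap is 1.0.
    overlap = len(rel_first & rel_second) / len(union) if union else 1.0
    return agree / len(shared), overlap


# Hypothetical judgments on four doubly assessed documents.
original = {"d1": True, "d2": True, "d3": False, "d4": False}
second_pass = {"d1": True, "d2": False, "d3": False, "d4": True}

raw, overlap = agreement_stats(original, second_pass)
```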



