summary of discussion on possible 2007 TREC Legal routing task
- From: "Dave Lewis (address for public mailing lists)" <misclists1@daviddlewis.com>
- Date: Mon, 26 Mar 2007 14:53:45 -0500
I wanted to summarize where we had and hadn't reached consensus (in
my opinion) on the possible 2007 TREC Legal routing task. As
mentioned, I'm not acting as track coordinator this year, and will
not have the time to participate in discussions of the routing track
in the future. The decision about whether or not the routing task
is run will be made by the track coordinators. Thanks again to
everyone for your comments!
Dave
-------------------------------------
Here's my sense of what we had converged to re the TREC 2007 Legal
track routing task:
1. Use all 40 assessed TREC 2006 Legal ad hoc topics as 2007 routing
topics.
NOTE: If insufficient assessment resources are available, drop
topics with low interassessor agreement. See the thread
"interassessor consistency data on TREC 06 Legal track ad hoc topics"
for data on interassessor agreement.
2. The assessed 2006 Legal ad hoc pools will be used as training data
for routing runs.
3. The low interassessor agreement, and the fact that assessors for
most topics will be different in 2007 vs. 2006, means we cannot
combine 2006 and 2007 assessments for the purpose of evaluating 2007
runs. Instead 2007 runs will be assessed using 2007 assessments only.
4. Both the 2006 and 2007 assessments will be distributed with the
test collection for future use. Future researchers could use one or
the other or both.
5. While 2006 assessments will be ignored for the purpose of
*assessing* 2007 routing runs, the fact that they will be used in
*training* 2007 routing systems means that relevant documents from
2006 will be ranked unrealistically high in 2007. Therefore, those
documents should be excluded in evaluating 2007 runs, i.e. residual
collection evaluation should be used.
NOTE: We will still want to re-assess some documents that were
assessed in 2006, both due to some overlap of Legal routing 2007
topics with Legal interactive 2007 topics, and to produce 2007
assessments of those documents for use by future users of the test
collection.
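To make point 5 concrete, here is a minimal sketch of residual
collection evaluation. The function names and the toy document IDs
are made up for illustration, and it assumes that every document seen
in training is removed; whether to remove all 2006-assessed documents
or only the 2006-relevant ones is one of the details that would still
need to be settled.

```python
def residual_ranking(ranking, training_docs):
    """Drop every document seen in training, preserving rank order."""
    seen = set(training_docs)
    return [doc for doc in ranking if doc not in seen]


def residual_precision_at_k(ranking, training_docs, relevant_2007, k):
    """Precision at k computed on the residual collection.

    ranking        -- document IDs in the order the run returned them
    training_docs  -- IDs used for training (hypothetical: the 2006 pool)
    relevant_2007  -- IDs judged relevant by the 2007 assessors
    """
    if k <= 0:
        raise ValueError("k must be positive")
    residual = residual_ranking(ranking, training_docs)
    # Training documents also cannot count as relevant in the residual set.
    residual_relevant = set(relevant_2007) - set(training_docs)
    hits = sum(1 for doc in residual[:k] if doc in residual_relevant)
    return hits / k


# Toy example: d1 and d3 were used for training, so they are skipped.
run = ["d1", "d2", "d3", "d4", "d5"]
score = residual_precision_at_k(run, {"d1", "d3"}, {"d1", "d2", "d5"}, k=2)
print(score)  # 0.5: the residual top 2 is d2 (relevant) and d4 (not)
```

Note that d1 does not help the run even though it is relevant in 2007,
which is exactly the point of evaluating on the residual collection.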
OPEN QUESTIONS:
The size of the collection, the large known number of relevant
documents for some runs, and the poor effectiveness of runs
contributing to the pool mean that the pooling-style evaluation in
2006 did not give a good measure of actual effectiveness of runs. This almost
certainly means that we want to choose the 2007 documents to be
assessed not from the top of submitted runs, but using sampling-based
approaches that allow expected values to be computed (e.g. the
inferred average precision approach). The following questions arise:
Question A. What sampling strategy should be used to choose documents
to assess?
Question B. What effectiveness measure(s) will be used to evaluate
2007 runs?
Question C. Can knowledge of which documents were assessed as
relevant in 2006 help with sampling? Can knowledge of the expert
manual "run" in 2007?
Question D. How does residual collection evaluation affect this?
Question E. Can a sampling approach be developed which not only
allows expectation based effectiveness measures to be computed for
2006 runs, but allows them to be computed for runs on the test
collection in the future? (Note that such runs might or might not
use residual collection evaluation.) What effectiveness measures
could be supported?
Question F. How does the interactive task affect all the above?
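As a rough illustration of what "sampling-based approaches that allow
expected values to be computed" could look like, here is a sketch of a
simple Horvitz-Thompson style estimate. This is not the inferred
average precision method, just the most basic estimator of this kind.
It assumes each assessed document was sampled with a known inclusion
probability; the function names and the example numbers are invented.

```python
def estimated_relevant_count(judgments):
    """Horvitz-Thompson estimate of the number of relevant documents.

    judgments -- iterable of (is_relevant, inclusion_probability) pairs,
                 one per sampled and assessed document.
    Each relevant sampled document stands in for 1/p documents.
    """
    total = 0.0
    for is_relevant, prob in judgments:
        if not 0.0 < prob <= 1.0:
            raise ValueError("inclusion probability must be in (0, 1]")
        if is_relevant:
            total += 1.0 / prob
    return total


def estimated_precision_at_k(judgments_in_top_k, k):
    """Estimated precision at depth k from sampled judgments in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    return estimated_relevant_count(judgments_in_top_k) / k


# Toy example: four assessed documents drawn from a run's top 10.
sample = [(True, 0.5), (False, 0.5), (True, 0.25), (True, 1.0)]
print(estimated_relevant_count(sample))      # 7.0 = 2 + 4 + 1
print(estimated_precision_at_k(sample, 10))  # 0.7
```

Questions A through E are largely about how the inclusion
probabilities should be chosen, and which measures remain estimable
once they are fixed.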
And a minor unrelated question:
Question G. For each topic, a sample of 50 documents got a second
assessment in 2006 (for the purpose of computing interassessor
agreement). Should participants be allowed to use this second set of
50 assessments for training, in addition to the 500-1000 original
assessments for each topic? I propose they should. This is no more
unrealistic than routing in general, and for a few topics this will
vastly improve effectiveness.
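For readers who have not looked at the agreement data, here is a
sketch of one common way to measure agreement on a doubly assessed
sample. It computes Cohen's kappa for binary relevance judgments.
That kappa is the right measure here is my assumption for the sake of
illustration, and the function name and toy labels are made up.

```python
def cohen_kappa(first, second):
    """Cohen's kappa for two lists of binary relevance judgments.

    first, second -- equal-length sequences of booleans, one entry per
                     doubly assessed document (True = relevant).
    """
    if len(first) != len(second) or len(first) == 0:
        raise ValueError("need two non-empty lists of equal length")
    n = len(first)
    observed = sum(1 for a, b in zip(first, second) if a == b) / n
    p1 = sum(1 for a in first if a) / n   # fraction relevant, assessor 1
    p2 = sum(1 for b in second if b) / n  # fraction relevant, assessor 2
    # Agreement expected by chance if the two assessors were independent.
    expected = p1 * p2 + (1 - p1) * (1 - p2)
    if expected == 1.0:
        # Both assessors gave every document the same single label;
        # kappa is undefined, so return 1.0 by convention (an assumption).
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy example with four doubly assessed documents.
original = [True, True, False, False]
second_pass = [True, False, False, False]
print(cohen_kappa(original, second_pass))  # 0.5
```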