evaluating routing when the document collection is reused
- Subject: evaluating routing when the document collection is reused
- From: "Dave Lewis (address for public mailing lists)" <misclists1@daviddlewis.com>
- Date: Fri, 26 Jan 2007 11:39:40 -0600
The TREC 2007 Legal Track is planning to include a routing
evaluation. (I'm not an organizer of the Legal track this year, but
am helping out on a few issues.) The usual TREC-style routing
evaluation works like this:
1. The topics are ad hoc topics from a previous year of
TREC. Participants use both the topic descriptions and previously
assessed documents to build a routing profile. The previously
assessed documents allow relevance feedback and machine learning
techniques to be used, typically leading to very high quality profiles.
2. The routing profile is used to rank a new document
collection, with participants submitting the top K documents for each
topic. Relevance judgments are created for the new collection (using
pooling or other methods), and the top K document sets are assessed
by the usual measures (e.g. R-precision).
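For concreteness, here is a minimal sketch of R-precision as mentioned in step 2: precision at rank R, where R is the number of documents judged relevant for the topic. The document IDs and judgments are made up for illustration, and the function name is my own, not anything from the TREC tools.

```python
def r_precision(ranking, relevant):
    """Precision at rank R, where R = number of judged-relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    # Count how many of the top R ranked documents are relevant.
    hits = sum(1 for doc in ranking[:r] if doc in relevant)
    return hits / r


# Hypothetical submitted ranking and relevance judgments for one topic.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}

score = r_precision(ranking, relevant)  # 2 of the top 3 are relevant
```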
One way the 2007 Legal routing subtask differs from this model is
that instead of a new collection, we will reuse the old one. This is
both because we have no other similar collection, and because we want
to improve the quality of the set of judged documents for the
existing topics.
Reusing the collection means that the training documents are part of
the collection being ranked, and relevant training documents will
show up at unnaturally high positions in the ranking. This is called
"testing on the training data", and is in most cases a big no-no in
machine learning research.
It is less clear that testing on the training data is a bad thing in
routing (see A. below). But it is the case that many standard
measures of ranking effectiveness are disproportionately affected by
relevant documents at the very top of the ranking, and if these are
just the training documents, that doesn't tell us much about how
different ranking methods compare.
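A toy illustration of the effect, assuming average precision as the top-heavy measure (my choice for the sketch, not something the track has settled on). The documents t1 and t2 stand for relevant training documents, n1 and n2 for relevant documents the system has not seen, and x1 and x2 for nonrelevant ones. All of the data is invented.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


training = {"t1", "t2"}             # relevant documents used for training
relevant = {"t1", "t2", "n1", "n2"}  # all relevant documents for the topic

# Both runs put the training documents first; run_b is better on new documents.
run_a = ["t1", "t2", "x1", "x2", "n1", "n2"]
run_b = ["t1", "t2", "n1", "n2", "x1", "x2"]

# Scores with the training documents left in the rankings.
gap_with = average_precision(run_b, relevant) - average_precision(run_a, relevant)

# Scores after dropping the training documents from rankings and judgments.
new_relevant = relevant - training
rest_a = [d for d in run_a if d not in training]
rest_b = [d for d in run_b if d not in training]
gap_without = (average_precision(rest_b, new_relevant)
               - average_precision(rest_a, new_relevant))
```

In this made-up case the gap between the two runs is much smaller when the training documents are counted than when they are left out, which is the sense in which the top of the ranking swamps the comparison.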
This issue has been faced since the early days of research on
relevance feedback. Here are some techniques that have been used, with
the names they were given in the SMART book:
A. Do nothing. If the goal in legal discovery is to find,
say, 70000 documents to be manually reviewed from a collection of 7
million, it's quite possible that manually assessing 1000 of those
documents for machine learning purposes would be both appropriate,
and would not excuse those documents from the later complete manual
review. Thus reranking those 1000 documents would not be a problem.
B. "Residual Collection Evaluation": Omit the documents that
have been judged for a particular topic from all submitted sets for
that topic. If, for instance, the assessed sets for all topics
contain fewer than 1000 documents, participants would be asked to submit the
top K+1000 documents. The assessed documents would be removed from
all submitted sets, and effectiveness measures based on top K would
be computed.
C. "Rank Freezing": There are several variations on this, but
all assume the judged documents come from the top of a single
ranking. This isn't true for us.
D. Ask participants to put all the known relevant documents at
the very top of their submitted sets, and to omit all known
nonrelevant documents. This would be a kind of optimistic rank
freezing, intended to ensure that all runs get equal benefit from
the re-ranking of known documents. Then use an effectiveness measure
(R-precision?) that is not too influenced by the top of the rankings.
E. "Test and Control": Split the entire document set into
training and test halves. Only assessed documents that fall into the
training set of documents would be used for training, and profiles
would only be used to rank the test documents. Some of the
previously judged documents would be re-retrieved from the test
document set, but that's OK because they wouldn't have been used for
training. This would have the disadvantage of only improving
judgments on the test half of the collection, though one could do
this twice, exchanging the roles of the training and test halves.
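To make options B and E a bit more concrete, here is a rough sketch of both. It assumes a submitted run is just a ranked list of document IDs and that previous judgments are a dict from document ID to a relevant/nonrelevant flag. All names and data are hypothetical, and the random split is only one possible way to form the two halves.

```python
import random


def residual_top_k(submitted, judged, k):
    """Option B: drop previously judged documents, then keep the top k."""
    return [doc for doc in submitted if doc not in judged][:k]


def precision_at_k(ranking, relevant, k):
    """Fraction of the top k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def split_collection(doc_ids, seed=0):
    """Option E: randomly split the collection into training and test halves."""
    shuffled = sorted(doc_ids)  # sort first so the split is reproducible
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])


# Option B: participants submit more than K documents; judged ones are removed.
judged = {"j1": True, "j2": False}          # prior judgments for the topic
submitted = ["j1", "a", "j2", "b", "c", "d"]
residual = residual_top_k(submitted, judged, k=3)
new_relevant = {"a", "c"}                   # hypothetical new judgments
score = precision_at_k(residual, new_relevant, k=3)

# Option E: only judgments that fall in the training half may be used.
collection = [f"d{i}" for i in range(10)]
train_half, test_half = split_collection(collection)
prior = {"d1": True, "d4": False, "d8": True}
usable_for_training = {d: rel for d, rel in prior.items() if d in train_half}
```

Swapping the roles of train_half and test_half gives the second pass mentioned under E.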
I'd be interested in people's thoughts on this. In some sense I'm
putting the cart before the horse here, since the first question
should be what the goal of the routing evaluation is, and then what
evaluation approach will best meet that goal. But in truth my goal
is simply to improve the quality of assessments for the collection,
so I'm not much help!
The other tricky point is how any of these methods might interact
with new-wave, sampling-based assessment approaches for large
collections...
Dave