Re: evaluating routing when the document collection is reused
- Subject: Re: evaluating routing when the document collection is reused
- From: email@example.com
- Date: Fri, 26 Jan 2007 15:20:06 -0500
While doing extensive relevance feedback experiments, I gave up
on doing anything other than "residual collection". There were
just too many evaluation artifacts when doing anything else
(including split collection, but that was due to the size of the
collection and the number of relevant documents, which are not a
problem here).
So I would be against A, C, and D on those grounds.
I would probably be slightly against A and E on the grounds of
assessor efficiency and consistency. Either you have the
inefficiency of assessors rejudging documents that were judged the
previous year (and you need to deal with what happens when they
disagree), or you have two assessors judging one topic, and you'll
be evaluating over both, including their possible disagreements in
interpreting what's relevant.
We also have the problem of duplicate (and near-duplicate)
documents. For most (but not all) of last year's judgments, if two
documents were very similar to each other, then either both were
judged or neither was. So this problem will have a bigger impact on
E than on B (residual collection), since B will not evaluate on
either of a duplicate pair if both were judged, but E might use one
for training and one for testing. Thus you have an artificial
effect with E.
B will have the two-assessor problem, but the evaluation will be
based entirely on the second assessor's judgments; that's cleaner in
my opinion.
So my preferences are:
B > E >> A, C, D
> Date: Fri, 26 Jan 2007 12:50:23 -0500 (EST)
> From: "Dave Lewis (address for public mailing lists)" <firstname.lastname@example.org>
> Subject: evaluating routing when the document collection is reused
> The TREC 2007 Legal Track is planning to include a routing
> evaluation. (I'm not an organizer of the Legal track this year, but
> am helping out on a few issues.) The usual TREC-style routing
> evaluation works like this:
> 1. The topics are ad hoc topics from a previous year of
> TREC. Participants use both the topic descriptions, and previously
> assessed documents, to build a routing profile. The previously
> assessed documents allow relevance feedback and machine learning
> techniques to be used, typically leading to very high quality profiles.
> 2. The routing profile is used to rank a new document
> collection, with participants submitting the top K documents for each
> topic. Relevance judgments are created for the new collection (using
> pooling or other methods), and the top K document sets are assessed
> by the usual measures (e.g. R-precision).
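As a side note on the measure: R-precision is just precision at rank R,
where R is the number of documents judged relevant for the topic. A
rough Python sketch, with made-up document ids, just to be concrete:

    def r_precision(ranking, relevant):
        # ranking:  list of document ids, best first
        # relevant: set of ids judged relevant for the topic
        r = len(relevant)
        if r == 0:
            return 0.0
        return sum(1 for doc in ranking[:r] if doc in relevant) / float(r)

    # e.g. 3 relevant documents, 2 of them in the top 3 of the run:
    print(r_precision(["d7", "d2", "d9", "d4", "d1"], {"d2", "d7", "d4"}))  # ~0.67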
> One way the 2007 Legal routing subtask differs from this model is
> that instead of a new collection, we will reuse the old one. This is
> both because we have no other similar collection, and because we want
> to improve the quality of the set of judged documents for the
> existing topics.
> Reusing the collection means that the training documents are part of
> the collection being ranked, and relevant training documents will
> show up at unnaturally high positions in the ranking. This is called
> "testing on the training data", and is in most cases a big no-no in
> machine learning research.
> It is less clear that testing on the training data is a bad thing in
> routing (see A. below). But it is the case that many standard
> measures of ranking effectiveness are disproportionately affected by
> relevant documents at the very top of the ranking, and if these are
> just the training documents, that doesn't tell us much about different
> ranking methods.
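A toy example of that last point (made-up ids, nothing to do with the
real collection): if the training relevants occupy the top ranks of
every run, a shallow precision cutoff can't separate a run that finds
new relevant documents from one that finds none.

    training_rels = ["t1", "t2", "t3", "t4", "t5"]          # judged relevant last year
    run_x = training_rels + ["a1", "a2", "a3", "a4", "a5"]  # finds two new relevants
    run_y = training_rels + ["b1", "b2", "b3", "b4", "b5"]  # finds nothing new

    all_rels = set(training_rels) | {"a2", "a3"}

    def precision_at(k, run, rels):
        return sum(1 for d in run[:k] if d in rels) / float(k)

    print(precision_at(5, run_x, all_rels))  # 1.0
    print(precision_at(5, run_y, all_rels))  # 1.0 -- indistinguishable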
> This issue has been faced since the early days of research on
> relevance feedback. Here are some techniques that have been used,
> with the names they were given in the SMART book:
> A. Do nothing. If the goal in legal discovery is to find,
> say, 70000 documents to be manually reviewed from a collection of 7
> million, it's quite possible that manually assessing 1000 of those
> documents for machine learning purposes would be both appropriate,
> and would not excuse those documents from the later complete manual
> review. Thus reranking those 1000 documents would not be a problem.
> B. "Residual Collection Evaluation": Omit the documents that
> have been judged for a particular topic from all submitted sets for
> that topic. If, for instance, the assessed sets for all topics each
> contain fewer than 1000 documents, participants would be asked to submit the
> top K+1000 documents. The assessed documents would be removed from
> all submitted sets, and effectiveness measures based on top K would
> be computed.
> C. "Rank Freezing" : There are several variations on this, but
> all assume the judged documents come from the top of a single
> ranking. This isn't true for us.
> D. Ask participants to put all the known relevant documents at
> the very top of their submitted sets, and to omit all known
> nonrelevant documents. This would be a kind of optimistic rank
> freezing, intended to ensure that all runs get equal benefit from
> the re-ranking of known documents. Then use an effectiveness measure
> (R-precision?) that is not too influenced by the top of the rankings.
>       E. "Test and Control": Split the entire document set into
> training and test halves. Only assessed documents that fall into the
> training set of documents would be used for training, and profiles
> would only be used to rank the test documents. Some of the
> previously judged documents would be re-retrieved from the test
> document set, but that's OK because they wouldn't have been used for
> training. This would have the disadvantage of only improving
> judgments on the test half of the collection, though one could do
> this twice, exchanging the roles of the training and test halves.
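To make B, D, and E a bit more concrete, here are rough sketches of the
bookkeeping each would involve. The function names, and the hashing rule
used for the split in E, are just illustrative guesses, not proposals
for the actual guidelines.

    import hashlib

    # B. Residual collection: drop every document already judged for the
    #    topic (relevant or not), then evaluate over the top K that remain.
    def residual_top_k(submitted, previously_judged, k):
        return [d for d in submitted if d not in previously_judged][:k]

    # D. Optimistic rank freezing: known relevant documents go to the very
    #    top (in some fixed order), known nonrelevant ones are dropped, and
    #    the rest of the run keeps its original order.
    def freeze_optimistically(submitted, known_rel, known_nonrel):
        rest = [d for d in submitted
                if d not in known_rel and d not in known_nonrel]
        return sorted(known_rel) + rest

    # E. Test and control: split the collection deterministically, e.g. by
    #    hashing the document id, so every site gets the same halves and
    #    the roles can be swapped for a second round.
    def in_test_half(doc_id, swap=False):
        odd = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16) % 2 == 1
        return (not odd) if swap else odd

Under E, only judged documents with in_test_half(id) == False would be
used for training, and the profile would be scored only over documents
with in_test_half(id) == True.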
> I'd be interested in people's thoughts on this. In some sense I'm
> putting the cart before the horse here, since the first question
> should be what the goal of the routing evaluation is, and then what
> evaluation approach will best meet that goal. But in truth my goal
> is simply to improve the quality of assessments for the collection,
> so I'm not much help!
> The other tricky point is how any of these methods might interact
> with new-wave, sampling-based assessment approaches for large
> collections.