comments from Stephen Tomlinson and my replies


Stephen sent this just to the track organizers, but said I could forward the relevant parts: 

From: Stephen Tomlinson <stephent@magma.ca>
Date: January 26, 2007 9:59:43 PM CST
Subject: Re: 2007 Legal Track routing subtask, 26jan07 draft description

I basically agree with Chris that for evaluating runs trained on
existing assessments, it would make sense to omit those items 
for the evaluation.

That's starting to look like a consensus. 

To calculate inferred precision on the subsequent re-usable 
test collection, (i.e. for future users of the collection 
using it for adhoc experiments), I think the original judged items 
could be included, though just used to represent themselves,

Agreed.  There would be an estimated term with some variance, as well as a known term. 
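
To make the known-plus-estimated split concrete, here is a minimal sketch. It assumes the newly judged items are a uniform random sample, drawn without replacement, from the items that had no earlier judgment; the function and parameter names are hypothetical, and the variance is the usual finite-population formula for an estimated total, not anything the track has settled on.

```python
def relevant_count_estimate(known_relevant, n_unjudged, sample_judgments):
    """Return (estimate, variance) for the number of relevant items.

    known_relevant   -- relevant items among the previously judged items;
                        each represents only itself, so it adds no variance
    n_unjudged       -- size of the population the new sample was drawn from
    sample_judgments -- 0/1 judgments for a uniform random sample (assumed
                        drawn without replacement) from that population
    """
    n = len(sample_judgments)
    if n < 2 or n > n_unjudged:
        raise ValueError("need 2 <= sample size <= population size")
    p_hat = sum(sample_judgments) / n
    estimated = n_unjudged * p_hat  # the estimated term
    # Sample variance of a 0/1 variable, then the variance of the scaled-up
    # total with a finite population correction.
    s2 = p_hat * (1.0 - p_hat) * n / (n - 1)
    variance = n_unjudged ** 2 * (1.0 - n / n_unjudged) * s2 / n
    return known_relevant + estimated, variance
```

Only the second term carries sampling error, so a run whose retrieved items were mostly judged in earlier years gets a tighter estimate than one that retrieves mostly unjudged items.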

(I've actually been thinking of a generalization to inferred
precision, where the sampling is done based on a probability
distribution favoring top-retrieved items rather than selecting
uniformly from the pool.  (The probability would be k/r, where...
r is the earliest rank any system retrieved the item, and k depends
on how much assessing we can do.  The weight of each judged

I'd certainly be in favor of oversampling the higher ranked items, and it would be worth looking carefully at the tradeoffs of different stratification methods.  See my next comment: 
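
Before getting to that, here is a minimal sketch of the kind of rank-biased sampling under discussion. The inclusion probability k/r is Stephen's; capping it at 1 and weighting each judged item by the inverse of its inclusion probability (a Horvitz-Thompson style estimate) are my assumptions, since the weighting is not spelled out above. All names are hypothetical.

```python
import random


def inclusion_probability(earliest_rank, k):
    """Probability of judging an item whose best (1-based) rank in any run
    is earliest_rank. Capped at 1.0, which is an assumption."""
    return min(1.0, k / earliest_rank)


def sample_pool(earliest_ranks, k, rng):
    """Independently sample items for judging.

    earliest_ranks -- dict mapping doc id -> earliest rank in any run
    Returns a dict mapping each sampled doc id -> its weight (1 / probability).
    """
    sampled = {}
    for doc, rank in earliest_ranks.items():
        p = inclusion_probability(rank, k)
        if rng.random() < p:
            sampled[doc] = 1.0 / p
    return sampled


def estimate_relevant(sampled_weights, judgments):
    """Inverse-probability-weighted estimate of the number of relevant items
    in the pool, given relevance judgments for the sampled items."""
    return sum(w for doc, w in sampled_weights.items() if judgments[doc])
```

For example, with k = 5 every item some system ranked in its top 5 is judged with certainty, while an item first retrieved at rank 50 is judged with probability 0.1 and, if judged, stands in for 10 items. Larger k means more assessing and lower variance.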

(As an aside, I think the track primary measure should be
'inferred Recall@B' rather than 'inferred Precision@B',
even though I know it's hard to estimate R accurately.)

For the ad hoc task or the routing task? 

The problem is that getting a statistically unbiased estimate of R@B (or R @ anything) requires spending a big hunk of the assessment budget on documents that no system retrieves. Almost all of those documents will turn out to be nonrelevant, so even then the estimate of R is going to have a huge confidence interval. But I'd be interested in hearing concrete proposals for doing the best possible given those caveats.
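
To put rough numbers on how wide that interval gets, here is a minimal sketch that scales a Wilson score interval for a sampled proportion up to a count of relevant documents among the unretrieved ones. The choice of the Wilson interval, the function names, and the figures in the usage note are all illustrative assumptions, not track data.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def unretrieved_relevant_bounds(n_unretrieved, sample_size, relevant_in_sample):
    """Scale the proportion interval up to a count of relevant documents.

    Assumes a uniform random sample of the unretrieved documents and ignores
    the finite population correction (negligible for small sampling rates).
    """
    lo, hi = wilson_interval(relevant_in_sample, sample_size)
    return lo * n_unretrieved, hi * n_unretrieved
```

With a hypothetical 1,000,000 unretrieved documents and 1 relevant document found in a sample of 1,000, the point estimate is 1,000 relevant, but the bounds come out more than a factor of ten apart. Any recall figure with that count in its denominator inherits the same uncertainty.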

One big question for both ad hoc and routing is whether we can aspire to produce a usable test collection of the traditional sort.  (That is, where one assumes you've found all, or at least a fairly random sample of, the relevant.)  Or whether we should go into this assuming that all future test collection users are going to have to use measures like inferred precision, and we should build the pool under that assumption.

(I'm assuming, though, that the new judgements are consistent
with the old ones.  If they are not, well, I suppose future
users have the option of omitting the old judged items if
they prefer.)

That's an issue I'll take up in a separate message.

Dave


