Stephen sent this just to the track organizers, but said I could forward the relevant parts:
Agreed. There would be an estimated term with some variance, as well as a known term.
I'd certainly be in favor of oversampling the higher-ranked items, and it would be worth looking carefully at the tradeoffs of different stratification methods. See my next comment:
For the ad hoc task or the routing task? The problem is that getting a statistically unbiased estimate of R@B (or R @ anything) requires spending a big hunk of the assessment budget on documents that no system retrieves. Almost all of those documents will turn out to be nonrelevant, so even then the estimate of R is going to have a huge confidence interval. But I'd be interested in hearing concrete proposals for doing the best possible given those caveats. One big question for both ad hoc and routing is whether we can aspire to produce a usable test collection of the traditional sort. (That is, where one assumes you've found all, or at least a fairly random sample of, the relevant.) Or whether we should go into this assuming that all future test collection users are going to have to use measures like inferred precision, and we should build the pool under that assumption.
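To make the "known term plus estimated term" point concrete, here is a rough Python sketch of a stratified estimate of the number of relevant documents. It is only an illustration under assumptions that are not part of any track design: the function name, the strata, and every number are made up, and each stratum is assumed to be a simple random sample without replacement.

```python
import math


def estimate_relevant(known_relevant, strata):
    """Return (estimate, standard_error) for the total number of relevant docs.

    known_relevant: relevant count from exhaustively judged documents
                    (the known term, which contributes no variance).
    strata: list of (population_size, sample_size, relevant_in_sample)
            tuples, one per sampled stratum (the estimated term).
    """
    total = float(known_relevant)
    variance = 0.0
    for pop_size, sample_size, rel_in_sample in strata:
        p = rel_in_sample / sample_size
        # Scale the sample proportion up to the whole stratum.
        total += pop_size * p
        if sample_size > 1:
            # Variance estimate for a proportion under sampling without
            # replacement, with the finite population correction.
            fpc = 1 - sample_size / pop_size
            variance += pop_size ** 2 * fpc * p * (1 - p) / (sample_size - 1)
    return total, math.sqrt(variance)


# Hypothetical numbers, chosen only for illustration.
known = 40  # relevant documents found in a fully judged top stratum
strata = [
    (1000, 100, 5),    # lower-ranked pooled documents, 10% sampled
    (100000, 200, 1),  # documents no system retrieved, 0.2% sampled
]

estimate, stderr = estimate_relevant(known, strata)
print(f"estimated relevant: {estimate:.0f} +/- {stderr:.0f} (1 s.e.)")
```

With these invented numbers the estimate comes out at about 590 relevant documents with a standard error of roughly 500, and nearly all of that variance comes from the single relevant document found in the unretrieved stratum. That is the confidence interval problem in miniature.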
Dave