Re: interassessor consistency data on TREC 06 Legal track ad hoc topics
- Subject: Re: interassessor consistency data on TREC 06 Legal track ad hoc topics
- From: cabuckley@sabir.com
- Date: Thu, 1 Feb 2007 15:51:18 -0500
> From: "Dave Lewis (address for public mailing lists)" <misclists1@daviddlewis.com>
>
> > 1. There are more topics than I would expect that have
> > substantial disagreements but are about even in 1N2R and
> > 1R2N numbers, eg
> > 18 80 689 20 7 5 18
> :
> > topics like above are either actual factual disagreements (not
> > just scope) or two factors are in play that the assessors
> > disagree on. I don't believe the latter. In the past TREC
>
> Chris - I think this may be an effect of lawyers/interns/paralegals/
> law students as assessors, and maybe also that most of them are
> relatively early in their legal training. My sense is that Jason
> expected (correct me if I'm wrong) that the assessors would take a
> relatively broad interpretation of relevance. This would usually be
> the right thing to do, since the penalties can be severe for not
> turning over a responsive document. Instead, a number of the
> assessors seemed to make rather picky (legalistic if I may say)
> distinctions, which led to results very different than Jason and I
> expected.
>
> What we can do about this is another matter.
>
> > 2. The number of lopsided topics (1N2R >> 1R2N and the reverse)
> > doesn't bother me, as I said, but the one-sided direction bothers
> > me tremendously.
> > 1N2R is 5 or more greater than 1R2N for 1 topic
> > 1R2N is 9 or more greater than 1N2R for 15 topics!
> >
> > That's an enormous discrepancy which is almost impossible to
> > occur by pure chance. The original assessor is much more
> > lenient than the secondary assessor. I don't know whether
> > that's due to the original having better key words to look for,
> > or the secondary having different expectations of relevance, or
> > the secondary assessor doing a better job given fewer
> > documents, or what. But I believe the discrepancy calls into
> > question whether we can trust these numbers at all. There
> > seems to be some unexplained systematic effect here.
>
> I'm not sure this is surprising. Remember that Assessor 2 saw 50
> documents, usually consisting of 25 that Assessor 1 rated relevant,
> and 25 that Assessor 1 rated nonrelevant. The 25 nonrelevant almost
> certainly are "easy" nonrelevant, just because most documents are
> easy nonrelevant (effectiveness of submitted runs was poor overall).
> So we expect 1N2R to be very low in general. The 25 relevant will
> have the usual amount of disagreement, so 1R2N will be relatively
> high in comparison on this sample of 50.
>
> But if we treat the 50 documents as a stratified sample from the
> pool, we can work backwards to compute expected values for what the
> disagreements would have been on the whole pool:
>
> Topic  #Pool  A=1R2R  B=1N2R  C=1R2N  D=1N2N  Larger
>     6    840     0.0     0.0   125.0   715.0  C
>     7    854   138.6   110.2    26.4   578.8  B
>     8    857   145.9    53.2    46.1   611.8  B
>     9    849    46.8     0.0    83.2   719.0  C
>    10    858     2.0     0.0     3.0   853.0  C
>    13    837    71.3     0.0    90.7   675.0  C
>    14    716    15.8   108.8    20.2   571.2  B
>    17    767     4.0     0.0     0.0   763.0  =
>    18    769    64.0   192.9    16.0   496.1  B
>    19    919   161.6     0.0   343.4   414.0  C
>    20    938     9.8    36.1    25.2   866.9  B
>    21    893    58.2    96.3   232.8   505.7  C
>    22    853    13.8     0.0    55.2   784.0  C
>    23    832   173.2    56.3   307.8   295.7  C
>    24    924     0.0    22.3     9.0   892.7  B
>    25    961    11.1   148.7     7.9   793.3  B
>    26    935   297.4   116.2    56.6   464.8  B
>    27    916   165.4    58.2    22.6   669.8  B
>    28    910    38.6    34.6     7.4   829.4  B
>    29    875    16.0    26.0     1.0   833.0  B
>    30    781    85.4    54.7    11.6   629.3  B
>    31    707   294.4    92.9    25.6   294.1  B
>    32    770    51.2   197.7    12.8   508.3  B
>    33    570     0.0     0.0    37.0   533.0  C
>    34    810   196.0    90.4    49.0   474.6  B
>    35    542    19.0    20.3    15.0   487.7  B
>    36    872     9.0    69.7     4.0   790.3  B
>    37    863    43.7    62.8    34.3   722.2  B
>    38    741    93.2   289.9    43.8   314.1  B
>    39    887    15.0   108.6     3.0   760.4  B
>    40    832     1.0    33.9     0.0   797.1  B
>    41    876     1.0     0.0     0.0   875.0  =
>    43    820    64.8   105.3    97.2   552.7  B
>    44    821    13.4     0.0    14.6   793.0  C
>    45    755   120.1     0.0    37.9   597.0  C
>    46    627    16.0     0.0    34.0   577.0  C
>    47    733     4.0    49.6     2.0   677.4  B
>    49    983     0.0   629.1     0.0   353.9  B
>    50    756    47.1    83.3    14.9   610.7  B
>    51    936    31.7    36.2     1.3   867.8  B
>
> The confidence intervals would be very large on these figures, so
> don't take the numbers too seriously. But you can see that actually
> 1N2R is greater than 1R2N in the majority of cases. The magnitude of
> the disagreements, blown up to the size of a typical TREC pool, is
> worrisome, of course.
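
The extrapolation quoted above can be sketched in a few lines of Python. This is only a sketch under assumptions: the function and argument names are hypothetical, and it reads the example row "18 80 689 20 7 5 18" as topic, pool documents Assessor 1 judged relevant, pool documents Assessor 1 judged nonrelevant, then the sample counts 1R2R, 1N2R, 1R2N, 1N2N. Under that reading, scaling each stratum by pool size over sample size reproduces the topic 18 row of the table.

```python
def extrapolate(pool_rel, pool_nonrel, s_rr, s_nr, s_rn, s_nn):
    """Scale dual-assessed sample counts up to the whole pool, one stratum at a time.

    pool_rel / pool_nonrel: pool documents Assessor 1 judged relevant / nonrelevant.
    s_rr, s_nr, s_rn, s_nn: sample counts for 1R2R, 1N2R, 1R2N, 1N2N.
    Returns estimated pool-wide (A=1R2R, B=1N2R, C=1R2N, D=1N2N).
    """
    n_rel = s_rr + s_rn  # sampled documents from the Assessor 1 relevant stratum
    n_non = s_nr + s_nn  # sampled documents from the Assessor 1 nonrelevant stratum
    w_rel = pool_rel / n_rel if n_rel else 0.0     # documents represented per sampled doc
    w_non = pool_nonrel / n_non if n_non else 0.0
    return (s_rr * w_rel, s_nr * w_non, s_rn * w_rel, s_nn * w_non)


# Topic 18, using the assumed reading of the row "18 80 689 20 7 5 18".
est = extrapolate(80, 689, 20, 7, 5, 18)
print([round(x, 1) for x in est])  # [64.0, 192.9, 16.0, 496.1]
```

Each estimate rests on at most 25 sampled documents per stratum, which is why the confidence intervals on the extrapolated figures are so wide.
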
Yes, I agree with your analysis (well, I also agree with your
statement about confidence intervals for several reasons, but
your analysis should certainly be accurate enough for a first
approximation!). I had indeed forgotten that this was not 25
random relevant, but 25 relevant documents of Assessor 1 (as it
had to be).

That does make me worry a bit more about the magnitude of the
final error numbers. I had thought overall things were not all
that bad; worse than TREC but not terrible. Disagreement about
scope of relevance on 16 out of the 40 topics is not
unreasonable. But in some sense we're only fully seeing half the
errors, the errors where A1 is more lenient than A2. When A2 is
more lenient, we're not getting a fair measure - though it is
undoubtedly part of the reason for the size of the first category
I was concerned about (medium values of both 1R2N and 1N2R).

I think we have evidence that A1 was more lenient on those 15
topics of high 1R2N. If the cases are symmetrical then we'd
expect 15 topics where A2 is more lenient, if they had to judge
the full pool. So that's 30 out of the 40 topics that we expect
substantial disagreement on. That is a lot!
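
As a rough check on the "almost impossible to occur by pure chance" remark about the 15-versus-1 split quoted earlier, an exact sign test can be sketched as below. The function name is hypothetical, and the test assumes that each lopsided topic is equally likely to lean either way. That assumption is questionable here: the two counts use different thresholds (5 or more versus 9 or more), and the 25/25 sampling design itself pushes 1N2R down. The result quantifies the remark, not the underlying assessor behaviour.

```python
from math import comb


def sign_test_two_sided(k, n):
    """Exact two-sided sign test: chance of a split at least as uneven as
    k versus n - k when each of n topics leans either way with probability 0.5."""
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# 16 lopsided topics, 15 of them leaning toward 1R2N.
p = sign_test_two_sided(15, 16)
print(f"{p:.6f}")  # 0.000519
```
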
> There is one systematic factor that's less benign that I worried a
> little about. Most of the people who played the role of Assessor 2
> on one or more topics had also played the role of Assessor 1 on some
> other topic. Since their experience as Assessor 1 was usually that
> the proportion of relevant was very small, I wonder if they carried
> that over to their judgments in the role of Assessor 2. We could
> have avoided this by taking a random sample from the pool, instead of
> 25 relevant and 25 nonrelevant, but then for most topics we'd then
> end up with most dual assessments having been on "easy" nonrelevant
> documents.
I agree again; assessor expectation is always present and is an
insidious thing to try to avoid. Your figures above though
indicate it's probably not nearly as bad as I had feared. But we
don't know how much of the large differences we see is due to
experimental bias.

Are there half-a-dozen people out there who are willing to look
at the 50 documents for, say, 3 queries each? It won't conclusively
prove anything at all, but might give insights as to what's
happening.

Overall, I'm not sure the averages you get across topics or even
within topics are comparable to anything done previously. But
the number of topics on which we find or expect substantial
assessor disagreement is very high, both in your original
analysis and in my qualitative conclusion above.

Do we need to go to a more TREC-style topic, with a narrative
that gives advice on what makes a document relevant or not?

Chris