Re: interassessor consistency data on TREC 06 Legal track ad hoc topics
- From: "Dave Lewis (address for public mailing lists)" <misclists1@daviddlewis.com>
- Date: Thu, 1 Feb 2007 13:12:32 -0600
> 1. There are more topics than I would expect that have
> substantial disagreements but are about even in 1N2R and
> 1R2N numbers, eg
> 18 80 689 20 7 5 18
:
> topics like above are either actual factual disagreements (not
> just scope) or two factors are in play that the assessors
> disagree on. I don't believe the latter. In the past TREC
Chris - I think this may be an effect of using lawyers, interns,
paralegals, and law students as assessors, and perhaps also of most of
them being relatively early in their legal training. My sense is that
Jason expected (correct me if I'm wrong) that the assessors would take
a relatively broad interpretation of relevance. That would usually be
the right thing to do, since the penalties for not turning over a
responsive document can be severe. Instead, a number of the assessors
seemed to make rather picky (legalistic, if I may say) distinctions,
which led to results very different from what Jason and I expected.

What we can do about this is another matter.
> 2. The number of lopsided topics (1N2R >> 1R2N and the reverse)
> doesn't bother me, as I said, but the one-sided direction bothers
> me tremendously.
> 1N2R is 5 or more greater than 1R2N for 1 topic
> 1R2N is 9 or more greater than 1N2R for 15 topics!
>
> That's an enormous discrepancy which is almost impossible to
> occur by pure chance. The original assessor is much more
> lenient than the secondary assessor. I don't know whether
> that's due to the original having better key words to look for,
> or the secondary having different expectations of relevance, or
> the secondary assessor doing a better job given fewer
> documents, or what. But I believe the discrepancy calls into
> question whether we can trust these numbers at all. There
> seems to be some unexplained systematic effect here.
I'm not sure this is surprising. Remember that Assessor 2 saw 50
documents, usually consisting of 25 that Assessor 1 rated relevant,
and 25 that Assessor 1 rated nonrelevant. The 25 nonrelevant almost
certainly are "easy" nonrelevant, just because most documents are
easy nonrelevant (effectiveness of submitted runs was poor overall).
So we expect 1N2R to be very low in general. The 25 relevant will
have the usual amount of disagreement, so 1R2N will be relatively
high in comparison on this sample of 50.
But if we treat the 50 documents as a stratified sample from the
pool, we can work backwards to compute expected values for what the
disagreements would have been on the whole pool:
Topic  #Pool  A=1R2R  B=1N2R  C=1R2N  D=1N2N  Larger (B/C)
    6    840     0.0     0.0   125.0   715.0  C
    7    854   138.6   110.2    26.4   578.8  B
    8    857   145.9    53.2    46.1   611.8  B
    9    849    46.8     0.0    83.2   719.0  C
   10    858     2.0     0.0     3.0   853.0  C
   13    837    71.3     0.0    90.7   675.0  C
   14    716    15.8   108.8    20.2   571.2  B
   17    767     4.0     0.0     0.0   763.0  =
   18    769    64.0   192.9    16.0   496.1  B
   19    919   161.6     0.0   343.4   414.0  C
   20    938     9.8    36.1    25.2   866.9  B
   21    893    58.2    96.3   232.8   505.7  C
   22    853    13.8     0.0    55.2   784.0  C
   23    832   173.2    56.3   307.8   295.7  C
   24    924     0.0    22.3     9.0   892.7  B
   25    961    11.1   148.7     7.9   793.3  B
   26    935   297.4   116.2    56.6   464.8  B
   27    916   165.4    58.2    22.6   669.8  B
   28    910    38.6    34.6     7.4   829.4  B
   29    875    16.0    26.0     1.0   833.0  B
   30    781    85.4    54.7    11.6   629.3  B
   31    707   294.4    92.9    25.6   294.1  B
   32    770    51.2   197.7    12.8   508.3  B
   33    570     0.0     0.0    37.0   533.0  C
   34    810   196.0    90.4    49.0   474.6  B
   35    542    19.0    20.3    15.0   487.7  B
   36    872     9.0    69.7     4.0   790.3  B
   37    863    43.7    62.8    34.3   722.2  B
   38    741    93.2   289.9    43.8   314.1  B
   39    887    15.0   108.6     3.0   760.4  B
   40    832     1.0    33.9     0.0   797.1  B
   41    876     1.0     0.0     0.0   875.0  =
   43    820    64.8   105.3    97.2   552.7  B
   44    821    13.4     0.0    14.6   793.0  C
   45    755   120.1     0.0    37.9   597.0  C
   46    627    16.0     0.0    34.0   577.0  C
   47    733     4.0    49.6     2.0   677.4  B
   49    983     0.0   629.1     0.0   353.9  B
   50    756    47.1    83.3    14.9   610.7  B
   51    936    31.7    36.2     1.3   867.8  B
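
As a rough illustration of the stratified extrapolation behind these
figures, here is a minimal Python sketch. It assumes each sample cell is
scaled by the ratio of its stratum's size in the pool to the number of
documents sampled from that stratum. The function names are
hypothetical, and the reading of the quoted line "18 80 689 20 7 5 18"
as topic, pool R, pool N, 1R2R, 1N2R, 1R2N, 1N2N is an assumption, not
something stated in the thread.

```python
def scale(stratum_size, count, sample_n):
    """Scale a sample count up to its stratum; an empty sample contributes 0."""
    return stratum_size * count / sample_n if sample_n else 0.0


def extrapolate_to_pool(pool_rel, pool_nonrel, s_1r2r, s_1n2r, s_1r2n, s_1n2n):
    """Estimate whole-pool agreement cells from a two-stratum sample.

    pool_rel / pool_nonrel: pool documents Assessor 1 judged R / N.
    s_*: counts observed in the dual-assessed sample.
    """
    rel_sample = s_1r2r + s_1r2n      # sampled docs Assessor 1 rated relevant
    nonrel_sample = s_1n2r + s_1n2n   # sampled docs Assessor 1 rated nonrelevant
    return {
        "A=1R2R": scale(pool_rel, s_1r2r, rel_sample),
        "B=1N2R": scale(pool_nonrel, s_1n2r, nonrel_sample),
        "C=1R2N": scale(pool_rel, s_1r2n, rel_sample),
        "D=1N2N": scale(pool_nonrel, s_1n2n, nonrel_sample),
    }


# Topic 18, using the assumed column order described above.
topic_18 = extrapolate_to_pool(80, 689, 20, 7, 5, 18)
print({cell: round(value, 1) for cell, value in topic_18.items()})
```

Under that assumed reading, the rounded values match the topic 18 row
(64.0, 192.9, 16.0, 496.1), and the four cells sum to the pool size of
769.
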
The confidence intervals on these figures would be very large, so
don't take the numbers too seriously. But you can see that 1N2R is
actually greater than 1R2N in the majority of cases. The magnitude of
the disagreements, once blown up to the size of a typical TREC pool,
is worrisome, of course.
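
To give a feel for how wide such intervals can be, here is a hedged
Python sketch for a single cell. It uses a binomial standard error with
a finite population correction and a normal approximation. That is one
common textbook choice, not necessarily the calculation the figures
above would call for. The function name is hypothetical, and the inputs
for topic 18 (689 pool documents rated nonrelevant by Assessor 1, 25 of
them sampled, 7 rated relevant by Assessor 2) are assumed from the
quoted line earlier in the thread.

```python
import math


def stratum_total_with_se(stratum_size, hits, sample_n):
    """Estimated stratum total and its approximate standard error."""
    p = hits / sample_n
    total = stratum_size * p
    # Finite population correction: the sample is drawn without replacement.
    fpc = (stratum_size - sample_n) / (stratum_size - 1) if stratum_size > 1 else 0.0
    se = stratum_size * math.sqrt(p * (1 - p) / sample_n * fpc)
    return total, se


# Assumed inputs for topic 18's B=1N2R cell (see the note above).
total, se = stratum_total_with_se(689, 7, 25)
low, high = total - 1.96 * se, total + 1.96 * se
print(f"B estimate {total:.1f}, approx. 95% interval {low:.0f} to {high:.0f}")
```

With these inputs the interval spans well over a hundred documents on
either side of the estimate. The approximation is crude for small
counts: a cell with zero hits in the sample gets a standard error of
zero, which understates the real uncertainty.
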
There is one systematic factor that's less benign that I worried a
little about. Most of the people who played the role of Assessor 2
on one or more topics had also played the role of Assessor 1 on some
other topic. Since their experience as Assessor 1 was usually that
the proportion of relevant was very small, I wonder if they carried
that over to their judgments in the role of Assessor 2. We could
have avoided this by taking a random sample from the pool, instead of
25 relevant and 25 nonrelevant, but then for most topics we'd end up
with most dual assessments having been on "easy" nonrelevant
documents.
Dave