interassessor consistency data on TREC 06 Legal track ad hoc topics
- Subject: interassessor consistency data on TREC 06 Legal track ad hoc topics
- From: "Dave Lewis (address for public mailing lists)" <misclists1@daviddlewis.com>
- Date: Wed, 31 Jan 2007 16:12:10 -0600
The 2007 Legal Track routing evaluation will reuse topics from the
TREC 06 ad hoc evaluation, but in most cases will not be able to use
the same assessors. One theme of NIST work on TREC is that the
particular interpretation an assessor makes of a topic is not so
important. What's critical is that the same interpretation is used
for all the assessments on that topic, and several people have raised
concerns about mixing assessments from two assessors.

Anticipating this concern, during the 2006 Legal Track ad hoc
evaluation we had a sample of the pool for each topic assessed by two
assessors. The sample consisted of 25 documents judged relevant by
the primary assessor (or all such documents if there were fewer than
25), and enough documents judged nonrelevant to bring the sample to
50 documents (49 in one case due to a glitch).

The results showed that consistency is a serious problem for a number
of the topics. This table shows
topic ID,
number relevant in pool (by primary assessor),
number nonrelevant in pool (by primary assessor),
and contingency table entries for the sample:
A = primary assessor Rel, second assessor Rel
B = primary assessor NonRel, second assessor Rel
C = primary assessor Rel, second assessor NonRel
D = primary assessor NonRel, second assessor NonRel
Topic  Pool_1R  Pool_1N  A=1R2R  B=1N2R  C=1R2N  D=1N2N
6          125      715       0       0      25      25
7          165      689      21       4       4      21
8          192      665      19       2       6      23
9          130      719       9       0      16      24
10           5      853       2       0       3      45
13         162      675      11       0      14      25
14          36      680      11       4      14      21
17           4      763       4       0       0      46
18          80      689      20       7       5      18
19         505      414       8       0      17      25
20          35      903       7       1      18      24
21         291      602       5       4      20      21
22          69      784       5       0      20      25
23         481      352       9       4      16      21
24           9      915       0       1       9      40
25          19      942       7       6       5      32
26         354      581      21       5       4      20
27         188      728      22       2       3      23
28          46      864      21       1       4      24
29          17      859      16       1       1      32
30          97      684      22       2       3      23
31         320      387      23       6       2      19
32          64      706      20       7       5      18
33          37      533       0       0      25      25
34         245      565      20       4       5      21
35          34      508      14       1      11      24
36          13      860       9       3       4      34
37          78      785      14       2      11      23
38         137      604      17      12       8      13
39          18      869      15       4       3      28
40           1      831       1       2       0      47
41           1      875       1       0       0      49
43         162      658      10       4      15      21
44          28      793      12       0      13      25
45         158      597      19       0       6      25
46          50      577       8       0      17      25
47           6      727       4       3       2      41
49           0      983       0      32       0      18
50          62      694      19       3       6      22
51          33      904      24       1       1      24
There's a variety of statistics one can compute from this. Here are
overall agreement, (A+D)/(A+B+C+D), agreement on positives,
2A/(2A+B+C), and agreement on negatives, 2D/(2D+B+C), as estimated
for the full pool from the stratified sample (sorted in descending
order of agreement on positives):
Topic  Agree  AgreePos  AgreeNeg
41      1.00      1.00      1.00
17      1.00      1.00      1.00
45      0.95      0.86      0.97
31      0.83      0.83      0.83
27      0.91      0.80      0.94
26      0.81      0.78      0.84
8       0.88      0.75      0.93
34      0.83      0.74      0.87
30      0.92      0.72      0.95
7       0.84      0.67      0.89
44      0.98      0.65      0.99
28      0.95      0.65      0.97
51      0.96      0.63      0.98
13      0.89      0.61      0.94
10      1.00      0.57      1.00
29      0.97      0.54      0.98
9       0.90      0.53      0.94
35      0.94      0.52      0.96
50      0.87      0.49      0.93
23      0.56      0.49      0.62
46      0.95      0.48      0.97
19      0.63      0.48      0.71
37      0.89      0.47      0.94
43      0.75      0.39      0.84
18      0.73      0.38      0.83
38      0.55      0.36      0.65
32      0.73      0.33      0.83
22      0.94      0.33      0.97
21      0.63      0.26      0.75
20      0.94      0.24      0.97
39      0.87      0.21      0.93
36      0.92      0.20      0.95
14      0.82      0.20      0.90
47      0.93      0.13      0.96
25      0.84      0.12      0.91
40      0.96      0.06      0.98
6       0.85      0.00      0.92
49      0.36      0.00      0.53
33      0.94      0.00      0.97
24      0.97      0.00      0.98
Only 9 of the 40 topics have an expected agreement on positives of
0.70 or better, which is pretty worrisome from the standpoint of
combining relevance assessments from the TREC 06 and TREC 07 assessors.
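
For anyone who wants to check the pool-level estimates, here is a
minimal Python sketch of one way to get from the sample counts to the
three agreement figures. It assumes each stratum's sample counts are
scaled up by (pool size / sample size) for that stratum. That scaling
rule and the function name pool_agreement are assumptions made for
illustration, not something spelled out in this message, although the
scaling does reproduce the tabulated values for topics 45, 31 and 49.

```python
def pool_agreement(pool_rel, pool_nonrel, a, b, c, d):
    """Estimate pool-level agreement from the double-assessed sample.

    Assumption (not stated in the message): the primary-relevant and
    primary-nonrelevant strata are each a simple random sample of their
    part of the pool, so sample counts are scaled by
    pool size / sample size within each stratum.
    """
    # a and c come from the primary-relevant stratum, b and d from the
    # primary-nonrelevant stratum.
    rel_scale = pool_rel / (a + c) if (a + c) else 0.0
    non_scale = pool_nonrel / (b + d) if (b + d) else 0.0
    A, C = a * rel_scale, c * rel_scale
    B, D = b * non_scale, d * non_scale

    def ratio(num, den):
        # Undefined when the denominator is zero; report NaN in that case.
        return num / den if den else float("nan")

    agree = ratio(A + D, A + B + C + D)
    agree_pos = ratio(2 * A, 2 * A + B + C)
    agree_neg = ratio(2 * D, 2 * D + B + C)
    return agree, agree_pos, agree_neg


# Topic 45: 158 relevant and 597 nonrelevant in the pool,
# sample counts A=19, B=0, C=6, D=25.
print([round(x, 2) for x in pool_agreement(158, 597, 19, 0, 6, 25)])
# prints [0.95, 0.86, 0.97]
```

Scaling by stratum matters because the sample deliberately
over-represents documents the primary assessor judged relevant, so
the raw sample proportions would overstate how much of the pool is
in cells A and C.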
In the next message I'll lay out some possibilities for how to deal
with this for the 2007 Legal Track routing task.
Dave