interassessor consistency data on TREC 06 Legal track ad hoc topics



The 2007 Legal Track routing evaluation will reuse topics from the  
TREC 06 ad hoc evaluation, but in most cases will not be able to use  
the same assessors.  One theme of NIST work on TREC is that the  
particular interpretation an assessor makes of a topic is not so  
important.  What's critical is that the same interpretation is used  
for all the assessments, and several people have raised concerns  
about mixing assessments from two assessors.

Anticipating this concern, during the 2006 Legal Track ad hoc  
evaluation we had a sample of the pool for each topic assessed by two  
assessors. The sample consisted of 25 documents judged relevant by
the primary assessor (or all such documents if fewer than 25), plus
enough documents judged nonrelevant by the primary assessor to bring
the sample to 50 documents (49 in one case due to a glitch).
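
As a rough sketch of the sample construction just described: the
function name, the use of uniform random selection within each group,
and the fixed seed are all assumptions for illustration. The procedure
above does not say how the documents within each group were chosen.

```python
import random

def draw_dual_assessment_sample(relevant_ids, nonrelevant_ids,
                                max_relevant=25, sample_size=50, seed=0):
    """Pick one topic's documents for a second assessment (hypothetical helper)."""
    rng = random.Random(seed)
    # Up to 25 documents the primary assessor judged relevant (all if fewer).
    n_rel = min(max_relevant, len(relevant_ids))
    rel_sample = rng.sample(list(relevant_ids), n_rel)
    # Fill the remainder of the sample with nonrelevant documents.
    # Random choice within each group is an assumption, not a documented fact.
    n_non = min(sample_size - n_rel, len(nonrelevant_ids))
    non_sample = rng.sample(list(nonrelevant_ids), n_non)
    return rel_sample, non_sample
```

For a topic with only 5 relevant documents in the pool, this would
return all 5 of them plus 45 nonrelevant documents.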

The results showed that consistency is a serious problem for a number
of the topics.  The table below shows, for each topic,

    topic number,
    number relevant in pool (by primary assessor),
    number nonrelevant in pool (by primary assessor),

and contingency table entries for the sample:

    A = primary assessor Rel, second assessor Rel
    B = primary assessor NonRel, second assessor Rel
    C = primary assessor Rel, second assessor NonRel
    D = primary assessor NonRel, second assessor NonRel


Topic  Pool_1R  Pool_1N  A=1R2R B=1N2R  C=1R2N  D=1N2N
6        125      715       0      0      25      25
7        165      689      21      4       4      21
8        192      665      19      2       6      23
9        130      719       9      0      16      24
10         5      853       2      0       3      45
13       162      675      11      0      14      25
14        36      680      11      4      14      21
17         4      763       4      0       0      46
18        80      689      20      7       5      18
19       505      414       8      0      17      25
20        35      903       7      1      18      24
21       291      602       5      4      20      21
22        69      784       5      0      20      25
23       481      352       9      4      16      21
24         9      915       0      1       9      40
25        19      942       7      6       5      32
26       354      581      21      5       4      20
27       188      728      22      2       3      23
28        46      864      21      1       4      24
29        17      859      16      1       1      32
30        97      684      22      2       3      23
31       320      387      23      6       2      19
32        64      706      20      7       5      18
33        37      533       0      0      25      25
34       245      565      20      4       5      21
35        34      508      14      1      11      24
36        13      860       9      3       4      34
37        78      785      14      2      11      23
38       137      604      17     12       8      13
39        18      869      15      4       3      28
40         1      831       1      2       0      47
41         1      875       1      0       0      49
43       162      658      10      4      15      21
44        28      793      12      0      13      25
45       158      597      19      0       6      25
46        50      577       8      0      17      25
47         6      727       4      3       2      41
49         0      983       0     32       0      18
50        62      694      19      3       6      22
51        33      904      24      1       1      24
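
For anyone recomputing the four contingency entries from raw paired
judgments, a minimal sketch follows. The input format (a list of
boolean pairs) and the function name are assumptions; the actual
assessment files may be laid out differently.

```python
from collections import Counter

def contingency_counts(paired_judgments):
    """Tally (primary_is_rel, second_is_rel) boolean pairs for one topic."""
    tally = Counter(paired_judgments)
    return {
        "A": tally[(True, True)],    # primary Rel,    second Rel
        "B": tally[(False, True)],   # primary NonRel, second Rel
        "C": tally[(True, False)],   # primary Rel,    second NonRel
        "D": tally[(False, False)],  # primary NonRel, second NonRel
    }
```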

A variety of statistics can be computed from these counts.  Here are
agreement, (A+D)/(A+B+C+D), agreement on positives, 2A/(2A+B+C), and
agreement on negatives, 2D/(2D+B+C), as estimated for the full pool
from the stratified sample (sorted by agreement on positives):

Topic Agree AgreePos AgreeNeg
41     1.00   1.00     1.00
17     1.00   1.00     1.00
45     0.95   0.86     0.97
31     0.83   0.83     0.83
27     0.91   0.80     0.94
26     0.81   0.78     0.84
8      0.88   0.75     0.93
34     0.83   0.74     0.87
30     0.92   0.72     0.95
7      0.84   0.67     0.89
44     0.98   0.65     0.99
28     0.95   0.65     0.97
51     0.96   0.63     0.98
13     0.89   0.61     0.94
10     1.00   0.57     1.00
29     0.97   0.54     0.98
9      0.90   0.53     0.94
35     0.94   0.52     0.96
50     0.87   0.49     0.93
23     0.56   0.49     0.62
46     0.95   0.48     0.97
19     0.63   0.48     0.71
37     0.89   0.47     0.94
43     0.75   0.39     0.84
18     0.73   0.38     0.83
38     0.55   0.36     0.65
32     0.73   0.33     0.83
22     0.94   0.33     0.97
21     0.63   0.26     0.75
20     0.94   0.24     0.97
39     0.87   0.21     0.93
36     0.92   0.20     0.95
14     0.82   0.20     0.90
47     0.93   0.13     0.96
25     0.84   0.12     0.91
40     0.96   0.06     0.98
6      0.85   0.00     0.92
49     0.36   0.00     0.53
33     0.94   0.00     0.97
24     0.97   0.00     0.98
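
A minimal sketch of how the full-pool estimates can be computed from
the sample.  The scaling step is an assumption on my part (the exact
estimator is not spelled out above): each sample cell is scaled by the
ratio of its stratum's pool size to its stratum's sample size.  Here
pool_rel and pool_nonrel are the counts of pool documents the primary
assessor judged relevant and nonrelevant, and the function name is
made up for illustration.

```python
def pool_agreement(pool_rel, pool_nonrel, a, b, c, d):
    """Return (agree, agree_pos, agree_neg) estimated for the full pool."""
    # Scale each stratum of the sample up to its size in the full pool.
    # Assumed estimator: the primary-relevant stratum is (a, c) and the
    # primary-nonrelevant stratum is (b, d).
    rel_scale = pool_rel / (a + c) if (a + c) else 0.0
    non_scale = pool_nonrel / (b + d) if (b + d) else 0.0
    A, C = a * rel_scale, c * rel_scale
    B, D = b * non_scale, d * non_scale

    def ratio(num, den):
        # A statistic is undefined (NaN) when its denominator is zero.
        return num / den if den else float("nan")

    agree = ratio(A + D, A + B + C + D)
    agree_pos = ratio(2 * A, 2 * A + B + C)
    agree_neg = ratio(2 * D, 2 * D + B + C)
    return agree, agree_pos, agree_neg

# Topic 45: 158 relevant and 597 nonrelevant in the pool; sample 19, 0, 6, 25.
print([round(x, 2) for x in pool_agreement(158, 597, 19, 0, 6, 25)])
# prints [0.95, 0.86, 0.97]
```

With this scaling the sketch reproduces the table rows for topics 45,
31 and 49, which suggests it is at least close to the estimator used.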

Only 9 of the topics have an expected agreement on positives of 0.70  
or better, which is pretty worrisome from the standpoint of combining  
relevance assessments from the TREC 06 and TREC 07 assessors.

In the next message I'll lay out some possibilities for how to deal  
with this for the 2007 Legal Track routing task.

Dave


