Re: interassessor consistency data on TREC 06 Legal track ad hoc topics



Dear all

In the context of INEX (XML retrieval evaluation), we have had a major 
problem with the consistency of assessments (INEX used to define 
relevance along two dimensions, each on a four-point graded scale). I 
was very worried (one topic had 0% agreement).
Without going into details, we decided to simplify the definition of 
relevance, and found that consistency increased. INEX 2006, after 
lots of statistical testing, adopted a much simpler definition of 
relevance (one dimension and a continuous scale), and we are currently 
looking at consistency, and at the effect that the definition of 
relevance and the way relevance is assessed have on consistency/agreement.

Mounia



Dave Lewis (address for public mailing lists) wrote:
>
> The 2007 Legal Track routing evaluation will reuse topics from the 
> TREC 06 ad hoc evaluation, but in most cases will not be able to use 
> the same assessors.  One theme of NIST work on TREC is that the 
> particular interpretation an assessor makes of a topic is not so 
> important.  What's critical is that the same interpretation is used 
> for all the assessments, and several people have raised concerns about 
> mixing assessments from two assessors.
>
> Anticipating this concern, during the 2006 Legal Track ad hoc 
> evaluation we had a sample of the pool for each topic assessed by two 
> assessors. The sample consisted of 25 documents judged relevant by the 
> first assessor (or all such documents if fewer than 25), and enough 
> nonrelevant to bring the sample to 50 documents (49 in one case due to 
> a glitch).
>
> The results showed that consistency is a serious problem for a number 
> of the topics.  This table shows
>
>    query ID,
>    number relevant in pool (by primary assessor),
>    number nonrelevant in pool (by primary assessor),
>
> and contingency table entries for the sample:
>
>    a = primary assessor Rel, second assessor Rel
>    b = primary assessor NonRel, second assessor Rel
>    c = primary assessor Rel, second assessor NonRel
>    d = primary assessor NonRel, second assessor NonRel
>
>
> Topic  Pool_1R  Pool_1N  A=1R2R B=1N2R  C=1R2N  D=1N2N
> 6        125      715       0      0      25      25
> 7        165      689      21      4       4      21
> 8        192      665      19      2       6      23
> 9        130      719       9      0      16      24
> 10         5      853       2      0       3      45
> 13       162      675      11      0      14      25
> 14        36      680      11      4      14      21
> 17         4      763       4      0       0      46
> 18        80      689      20      7       5      18
> 19       505      414       8      0      17      25
> 20        35      903       7      1      18      24
> 21       291      602       5      4      20      21
> 22        69      784       5      0      20      25
> 23       481      352       9      4      16      21
> 24         9      915       0      1       9      40
> 25        19      942       7      6       5      32
> 26       354      581      21      5       4      20
> 27       188      728      22      2       3      23
> 28        46      864      21      1       4      24
> 29        17      859      16      1       1      32
> 30        97      684      22      2       3      23
> 31       320      387      23      6       2      19
> 32        64      706      20      7       5      18
> 33        37      533       0      0      25      25
> 34       245      565      20      4       5      21
> 35        34      508      14      1      11      24
> 36        13      860       9      3       4      34
> 37        78      785      14      2      11      23
> 38       137      604      17     12       8      13
> 39        18      869      15      4       3      28
> 40         1      831       1      2       0      47
> 41         1      875       1      0       0      49
> 43       162      658      10      4      15      21
> 44        28      793      12      0      13      25
> 45       158      597      19      0       6      25
> 46        50      577       8      0      17      25
> 47         6      727       4      3       2      41
> 49         0      983       0     32       0      18
> 50        62      694      19      3       6      22
> 51        33      904      24      1       1      24
>
> There's a variety of statistics one can compute from this.  Here's 
> agreement, (A+D)/(A+B+C+D), agreement on positives, 2A/(2A+B+C), and 
> agreement on negatives, 2D/(2D+B+C), as estimated for the full pool 
> from the stratified sample (sorted by agreement on positives):
>
> Topic Agree AgreePos AgreeNeg
> 41     1.00   1.00     1.00
> 17     1.00   1.00     1.00
> 45     0.95   0.86     0.97
> 31     0.83   0.83     0.83
> 27     0.91   0.80     0.94
> 26     0.81   0.78     0.84
> 8      0.88   0.75     0.93
> 34     0.83   0.74     0.87
> 30     0.92   0.72     0.95
> 7      0.84   0.67     0.89
> 44     0.98   0.65     0.99
> 28     0.95   0.65     0.97
> 51     0.96   0.63     0.98
> 13     0.89   0.61     0.94
> 10     1.00   0.57     1.00
> 29     0.97   0.54     0.98
> 9      0.90   0.53     0.94
> 35     0.94   0.52     0.96
> 50     0.87   0.49     0.93
> 23     0.56   0.49     0.62
> 46     0.95   0.48     0.97
> 19     0.63   0.48     0.71
> 37     0.89   0.47     0.94
> 43     0.75   0.39     0.84
> 18     0.73   0.38     0.83
> 38     0.55   0.36     0.65
> 32     0.73   0.33     0.83
> 22     0.94   0.33     0.97
> 21     0.63   0.26     0.75
> 20     0.94   0.24     0.97
> 39     0.87   0.21     0.93
> 36     0.92   0.20     0.95
> 14     0.82   0.20     0.90
> 47     0.93   0.13     0.96
> 25     0.84   0.12     0.91
> 40     0.96   0.06     0.98
> 6      0.85   0.00     0.92
> 49     0.36   0.00     0.53
> 33     0.94   0.00     0.97
> 24     0.97   0.00     0.98
>
> Only 9 of the topics have an expected agreement on positives of 0.70 
> or better, which is pretty worrisome from the standpoint of combining 
> relevance assessments from the TREC 06 and TREC 07 assessors.
>
> In the next message I'll lay out some possibilities for how to deal 
> with this for the 2007 Legal Track routing task.
>
> Dave
>
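
For readers who want to recompute the figures in the quoted message, here is a
minimal sketch of how the pool-level estimates could be derived from the
stratified sample. It assumes (the message does not spell this out) that each
sample count is scaled by the ratio of pool stratum size to sampled stratum
size, with the strata defined by the primary assessor's judgment. The function
name and argument names are hypothetical. Under that assumption the sketch
gives the tabulated values for topics 31, 45 and 49.

```python
def stratified_agreement(pool_rel, pool_nonrel, a, b, c, d):
    """Estimate pool-level agreement statistics from a stratified sample.

    pool_rel, pool_nonrel: pool sizes by primary assessor's judgment
    a = primary Rel,    second Rel      b = primary NonRel, second Rel
    c = primary Rel,    second NonRel   d = primary NonRel, second NonRel
    Assumption: each stratum is scaled up by pool size / sample size.
    """
    n_rel, n_non = a + c, b + d            # sampled documents per stratum
    w_rel = pool_rel / n_rel if n_rel else 0.0
    w_non = pool_nonrel / n_non if n_non else 0.0

    # Scaled-up contingency table for the full pool.
    A, C = a * w_rel, c * w_rel
    B, D = b * w_non, d * w_non

    def ratio(num, den):
        # Undefined ratios (empty denominators) are reported as NaN.
        return num / den if den else float("nan")

    agree = ratio(A + D, A + B + C + D)
    agree_pos = ratio(2 * A, 2 * A + B + C)
    agree_neg = ratio(2 * D, 2 * D + B + C)
    return agree, agree_pos, agree_neg


# Topic 31: 320 relevant and 387 nonrelevant in the pool; sample 23/6/2/19.
print([round(x, 2) for x in stratified_agreement(320, 387, 23, 6, 2, 19)])
# -> [0.83, 0.83, 0.83]
```

Note that the first count column of the contingency table is read here as the
number judged relevant by the primary assessor, which is what the column
description in the message states.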

-- 
------------------------------------------
Prof. Mounia Lalmas
Department of Computer Science
Queen Mary University of London
London E1 4NS
phone: (+44|0)20 7882 5200
fax: (+44|0)20 8980 6533
email: mounia@dcs.qmul.ac.uk
www: http://www.dcs.qmul.ac.uk/~mounia
------------------------------------------



