Re: interassessor consistency data on TREC 06 Legal track ad hoc topics
Dear all,
In the context of INEX (XML retrieval evaluation), we have had a major
problem with inconsistent assessments (INEX used to define
relevance along two dimensions, each on a four-point graded scale). I
was very worried (one topic had 0% agreement).
Without going into details, we decided to simplify the definition of
relevance, and found that consistency increased. INEX 2006, after
lots of statistical testing, adopted a much simpler definition of
relevance (one dimension, on a continuous scale). We are currently
looking at consistency, and at how the definition of relevance and the
way relevance is assessed affect consistency/agreement.
Mounia
Dave Lewis (address for public mailing lists) wrote:
>
> The 2007 Legal Track routing evaluation will reuse topics from the
> TREC 06 ad hoc evaluation, but in most cases will not be able to use
> the same assessors. One theme of NIST work on TREC is that the
> particular interpretation an assessor makes of a topic is not so
> important. What's critical is that the same interpretation is used
> for all the assessments, and several people have raised concerns about
> mixing assessments from two assessors.
>
> Anticipating this concern, during the 2006 Legal Track ad hoc
> evaluation we had a sample of the pool for each topic assessed by two
> assessors. The sample consisted of 25 documents judged relevant by the
> first assessor (or all such documents if fewer than 25), and enough
> nonrelevant to bring the sample to 50 documents (49 in one case due to
> a glitch).
>
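The sampling rule described above can be sketched in a few lines of Python.
This is only an illustration: the message does not say how documents were
picked within each group, so uniform random sampling is an assumption here,
and the function name, parameters, and seed are hypothetical.

```python
import random

def draw_dual_assessment_sample(relevant_ids, nonrelevant_ids,
                                max_relevant=25, sample_size=50, seed=0):
    """Pick documents for a second assessor to judge.

    Takes up to `max_relevant` documents the primary assessor judged
    relevant (all of them if there are fewer), then fills the sample up
    to `sample_size` with documents judged nonrelevant.
    Assumption: selection within each group is uniform at random.
    """
    rng = random.Random(seed)
    n_rel = min(max_relevant, len(relevant_ids))
    rel_sample = rng.sample(list(relevant_ids), n_rel)
    # Fill the remainder from the nonrelevant group, as far as it allows.
    n_non = min(sample_size - n_rel, len(nonrelevant_ids))
    non_sample = rng.sample(list(nonrelevant_ids), n_non)
    return rel_sample, non_sample
```

For a topic with only 5 relevant documents in the pool, this yields 5 relevant
and 45 nonrelevant documents; for a topic with 125 relevant, it yields 25 and 25.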
> The results showed that consistency is a serious problem for a number
> of the topics. This table shows
>
> query ID,
> number relevant in pool (by primary assessor),
> number nonrelevant in pool (by primary assessor),
>
> and contingency table entries for the sample:
>
> a = primary assessor Rel, second assessor Rel
> b = primary assessor NonRel, second assessor Rel
> c = primary assessor Rel, second assessor NonRel
> d = primary assessor NonRel, second assessor NonRel
>
>
> Topic Pool_1R Pool_1N A=1R2R B=1N2R C=1R2N D=1N2N
> 6 125 715 0 0 25 25
> 7 165 689 21 4 4 21
> 8 192 665 19 2 6 23
> 9 130 719 9 0 16 24
> 10 5 853 2 0 3 45
> 13 162 675 11 0 14 25
> 14 36 680 11 4 14 21
> 17 4 763 4 0 0 46
> 18 80 689 20 7 5 18
> 19 505 414 8 0 17 25
> 20 35 903 7 1 18 24
> 21 291 602 5 4 20 21
> 22 69 784 5 0 20 25
> 23 481 352 9 4 16 21
> 24 9 915 0 1 9 40
> 25 19 942 7 6 5 32
> 26 354 581 21 5 4 20
> 27 188 728 22 2 3 23
> 28 46 864 21 1 4 24
> 29 17 859 16 1 1 32
> 30 97 684 22 2 3 23
> 31 320 387 23 6 2 19
> 32 64 706 20 7 5 18
> 33 37 533 0 0 25 25
> 34 245 565 20 4 5 21
> 35 34 508 14 1 11 24
> 36 13 860 9 3 4 34
> 37 78 785 14 2 11 23
> 38 137 604 17 12 8 13
> 39 18 869 15 4 3 28
> 40 1 831 1 2 0 47
> 41 1 875 1 0 0 49
> 43 162 658 10 4 15 21
> 44 28 793 12 0 13 25
> 45 158 597 19 0 6 25
> 46 50 577 8 0 17 25
> 47 6 727 4 3 2 41
> 49 0 983 0 32 0 18
> 50 62 694 19 3 6 22
> 51 33 904 24 1 1 24
>
> There's a variety of statistics one can compute from this. Here are
> agreement, (A+D)/(A+B+C+D), agreement on positives, 2A/(2A+B+C), and
> agreement on negatives, 2D/(2D+B+C), as estimated for the full pool
> from the stratified sample (sorted by agreement on positives):
>
> Topic Agree AgreePos AgreeNeg
> 41 1.00 1.00 1.00
> 17 1.00 1.00 1.00
> 45 0.95 0.86 0.97
> 31 0.83 0.83 0.83
> 27 0.91 0.80 0.94
> 26 0.81 0.78 0.84
> 8 0.88 0.75 0.93
> 34 0.83 0.74 0.87
> 30 0.92 0.72 0.95
> 7 0.84 0.67 0.89
> 44 0.98 0.65 0.99
> 28 0.95 0.65 0.97
> 51 0.96 0.63 0.98
> 13 0.89 0.61 0.94
> 10 1.00 0.57 1.00
> 29 0.97 0.54 0.98
> 9 0.90 0.53 0.94
> 35 0.94 0.52 0.96
> 50 0.87 0.49 0.93
> 23 0.56 0.49 0.62
> 46 0.95 0.48 0.97
> 19 0.63 0.48 0.71
> 37 0.89 0.47 0.94
> 43 0.75 0.39 0.84
> 18 0.73 0.38 0.83
> 38 0.55 0.36 0.65
> 32 0.73 0.33 0.83
> 22 0.94 0.33 0.97
> 21 0.63 0.26 0.75
> 20 0.94 0.24 0.97
> 39 0.87 0.21 0.93
> 36 0.92 0.20 0.95
> 14 0.82 0.20 0.90
> 47 0.93 0.13 0.96
> 25 0.84 0.12 0.91
> 40 0.96 0.06 0.98
> 6 0.85 0.00 0.92
> 49 0.36 0.00 0.53
> 33 0.94 0.00 0.97
> 24 0.97 0.00 0.98
>
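For anyone wanting to recompute the three statistics, here is a minimal
Python sketch. The message does not spell out the estimator, so the scaling
step is an assumption: each sampled cell is weighted by (pool size of its
stratum) / (number sampled from that stratum), where the two strata are the
documents the primary assessor judged relevant and nonrelevant. The function
name is hypothetical. Under that assumption the code matches the table values
for topics 45, 31 and 49 to two decimals.

```python
def estimate_pool_agreement(pool_rel, pool_nonrel, a, b, c, d):
    """Estimate agreement statistics for the full pool.

    pool_rel / pool_nonrel: pool counts judged relevant / nonrelevant
    by the primary assessor.
    a, b, c, d: sample contingency cells (a = both Rel, b = primary
    NonRel and second Rel, c = primary Rel and second NonRel,
    d = both NonRel).
    Assumption: cells are scaled up stratum by stratum.
    """
    n_rel_sampled = a + c   # sampled from the primary-relevant stratum
    n_non_sampled = b + d   # sampled from the primary-nonrelevant stratum
    w_rel = pool_rel / n_rel_sampled if n_rel_sampled else 0.0
    w_non = pool_nonrel / n_non_sampled if n_non_sampled else 0.0

    # Estimated full-pool contingency cells.
    A, C = a * w_rel, c * w_rel
    B, D = b * w_non, d * w_non

    def ratio(num, den):
        # Return 0.0 when the denominator is empty rather than failing.
        return num / den if den else 0.0

    agree = ratio(A + D, A + B + C + D)
    agree_pos = ratio(2 * A, 2 * A + B + C)
    agree_neg = ratio(2 * D, 2 * D + B + C)
    return agree, agree_pos, agree_neg
```

For example, topic 45 (158 relevant and 597 nonrelevant in the pool, sample
cells 19, 0, 6, 25) gives roughly 0.95, 0.86 and 0.97.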
> Only 9 of the topics have an expected agreement on positives of 0.70
> or better, which is pretty worrisome from the standpoint of combining
> relevance assessments from the TREC 06 and TREC 07 assessors.
>
> In the next message I'll lay out some possibilities for how to deal
> with this for the 2007 Legal Track routing task.
>
> Dave
>
--
------------------------------------------
Prof. Mounia Lalmas
Department of Computer Science
Queen Mary University of London
London E1 4NS
phone: (+44|0)20 7882 5200
fax: (+44|0)20 8980 6533
email: mounia@dcs.qmul.ac.uk
www: http://www.dcs.qmul.ac.uk/~mounia
------------------------------------------