Re: interassessor consistency data on TREC 06 Legal track ad hoc topics



> 1. There are more topics than I would expect that have
>    substantial disagreements but are about even in 1N2R and
>    1R2N numbers, eg
>     18        80      689      20      7       5      18
:
>   topics like above are either actual factual disagreements (not
>   just scope) or two factors are in play that the assessors
>   disagree on.  I don't believe the latter.  In the past TREC

Chris - I think this may be an effect of lawyers/interns/paralegals/ 
law students as assessors, and maybe also that most of them are  
relatively early in their legal training.  My sense is that Jason  
expected (correct me if I'm wrong) that the assessors would take a  
relatively broad interpretation of relevance.  This would usually be  
the right thing to do, since the penalties can be severe for not  
turning over a responsive document.  Instead, a number of the  
assessors seemed to make rather picky (legalistic if I may say)  
distinctions, which led to results very different than Jason and I  
expected.

What we can do about this is another matter.

> 2. The number of lopsided topics (1N2R >> 1R2N and the reverse)
>   doesn't bother me, as I said, but the one-sided direction bothers
>   me tremendously.
>   1N2R is 5 or more greater than 1R2N for 1 topic
>   1R2N is 9 or more greater than 1N2R for 15 topics!
>
>   That's an enormous discrepancy which is almost impossible to
>   occur by pure chance.  The original assessor is much more
>   lenient than the secondary assessor.  I don't know whether
>   that's due to the original having better key words to look for,
>   or the secondary having different expectations of relevance, or
>   the secondary assessor doing a better job given fewer
>   documents, or what.  But I believe the discrepancy calls into
>   question whether we can trust these numbers at all.  There
>   seems to be some unexplained systematic effect here.

I'm not sure this is surprising.  Remember that Assessor 2 saw 50  
documents, usually consisting of 25 that Assessor 1 rated relevant,  
and 25 that Assessor 1 rated nonrelevant.  The 25 nonrelevant almost  
certainly are "easy" nonrelevant, just because most documents are  
easy nonrelevant (effectiveness of submitted runs was poor overall).   
So we expect 1N2R to be very low in general.  The 25 relevant will  
have the usual amount of disagreement, so 1R2N will be relatively  
high in comparison on this sample of 50.

But if we treat the 50 documents as a stratified sample from the  
pool, we can work backwards to compute expected values for what the  
disagreements would have been on the whole pool:

Topic  #Pool  A=1R2R  B=1N2R  C=1R2N  D=1N2N  Larger (B or C)
6       840      0.0     0.0   125.0   715.0  C
7       854    138.6   110.2    26.4   578.8  B
8       857    145.9    53.2    46.1   611.8  B
9       849     46.8     0.0    83.2   719.0  C
10      858      2.0     0.0     3.0   853.0  C
13      837     71.3     0.0    90.7   675.0  C
14      716     15.8   108.8    20.2   571.2  B
17      767      4.0     0.0     0.0   763.0  =
18      769     64.0   192.9    16.0   496.1  B
19      919    161.6     0.0   343.4   414.0  C
20      938      9.8    36.1    25.2   866.9  B
21      893     58.2    96.3   232.8   505.7  C
22      853     13.8     0.0    55.2   784.0  C
23      832    173.2    56.3   307.8   295.7  C
24      924      0.0    22.3     9.0   892.7  B
25      961     11.1   148.7     7.9   793.3  B
26      935    297.4   116.2    56.6   464.8  B
27      916    165.4    58.2    22.6   669.8  B
28      910     38.6    34.6     7.4   829.4  B
29      875     16.0    26.0     1.0   833.0  B
30      781     85.4    54.7    11.6   629.3  B
31      707    294.4    92.9    25.6   294.1  B
32      770     51.2   197.7    12.8   508.3  B
33      570      0.0     0.0    37.0   533.0  C
34      810    196.0    90.4    49.0   474.6  B
35      542     19.0    20.3    15.0   487.7  B
36      872      9.0    69.7     4.0   790.3  B
37      863     43.7    62.8    34.3   722.2  B
38      741     93.2   289.9    43.8   314.1  B
39      887     15.0   108.6     3.0   760.4  B
40      832      1.0    33.9     0.0   797.1  B
41      876      1.0     0.0     0.0   875.0  =
43      820     64.8   105.3    97.2   552.7  B
44      821     13.4     0.0    14.6   793.0  C
45      755    120.1     0.0    37.9   597.0  C
46      627     16.0     0.0    34.0   577.0  C
47      733      4.0    49.6     2.0   677.4  B
49      983      0.0   629.1     0.0   353.9  B
50      756     47.1    83.3    14.9   610.7  B
51      936     31.7    36.2     1.3   867.8  B
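
For anyone who wants to check the arithmetic, here is a minimal sketch of the scaling I mean, assuming each stratum (Assessor 1 relevant, Assessor 1 nonrelevant) is simply weighted by pool size over sample size. The function and field names are made up for illustration. The example input assumes the quoted topic 18 line reads: topic, #1R in pool, #1N in pool, then sample counts 1R2R, 1N2R, 1R2N, 1N2N. That reading is my assumption, but it does reproduce the topic 18 row above.

```python
from typing import NamedTuple


class PoolEstimate(NamedTuple):
    """Expected counts on the whole pool (hypothetical field names)."""
    a_1r2r: float
    b_1n2r: float
    c_1r2n: float
    d_1n2n: float


def extrapolate(pool_rel, pool_nonrel, s_1r2r, s_1n2r, s_1r2n, s_1n2n):
    """Scale dual-assessed sample counts up to the full pool.

    pool_rel / pool_nonrel: documents Assessor 1 judged relevant /
    nonrelevant in the pool. The s_* arguments are counts within the
    dual-assessed sample (normally 25 + 25 documents).
    """
    n_rel_sampled = s_1r2r + s_1r2n        # sampled from Assessor 1's relevant
    n_nonrel_sampled = s_1n2r + s_1n2n     # sampled from Assessor 1's nonrelevant
    # Each sampled document stands in for (stratum size / sample size) documents.
    w_rel = pool_rel / n_rel_sampled if n_rel_sampled else 0.0
    w_nonrel = pool_nonrel / n_nonrel_sampled if n_nonrel_sampled else 0.0
    return PoolEstimate(
        a_1r2r=s_1r2r * w_rel,
        b_1n2r=s_1n2r * w_nonrel,
        c_1r2n=s_1r2n * w_rel,
        d_1n2n=s_1n2n * w_nonrel,
    )


# Topic 18, under the assumed column reading of the quoted line.
est = extrapolate(80, 689, 20, 7, 5, 18)
print(" ".join(f"{x:.1f}" for x in est))  # 64.0 192.9 16.0 496.1
```

The point of the weighting is that each sampled nonrelevant document stands in for far more pool documents than each sampled relevant one (689/25 versus 80/25 here), which is why a handful of 1N2R judgments turns into a large B.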

The confidence intervals would be very large on these figures, so  
don't take the numbers too seriously.  But you can see that actually  
1N2R is greater than 1R2N in the majority of cases.  The magnitude of  
the disagreements, once blown up to the size of a typical TREC pool,  
is worrisome, of course.
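
To give a rough feel for how wide those intervals are, here is a back-of-the-envelope sketch. It assumes the 25 documents in a stratum were a simple random sample from that stratum, and it uses a plain normal approximation with a finite population correction, which is crude at n=25. The function name and arguments are hypothetical, and the topic 18 inputs (689 documents judged nonrelevant by Assessor 1, 7 of 25 sampled ones judged relevant by Assessor 2) rest on my reading of the quoted line.

```python
import math


def estimate_with_interval(stratum_size, n_sampled, n_hits, z=1.96):
    """Point estimate and approximate interval for a count in one stratum.

    Normal (Wald) approximation with finite population correction;
    a rough guide only, since n_sampled is small.
    """
    p = n_hits / n_sampled
    # Finite population correction, floored at zero for safety.
    fpc = max(0.0, 1.0 - n_sampled / stratum_size)
    std_err = stratum_size * math.sqrt(p * (1.0 - p) / n_sampled * fpc)
    estimate = stratum_size * p
    low = max(0.0, estimate - z * std_err)
    high = min(float(stratum_size), estimate + z * std_err)
    return estimate, low, high


# Topic 18, 1N2R: 7 of the 25 sampled Assessor-1-nonrelevant documents.
est, low, high = estimate_with_interval(689, 25, 7)
print(f"{est:.0f} (about {low:.0f} to {high:.0f})")
```

Under those assumptions the topic 18 estimate of about 193 comes with a 95% interval running from roughly 74 to 312, so the individual figures really are only indicative.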

There is one systematic factor that's less benign that I worried a  
little about.  Most of the people who played the role of Assessor 2  
on one or more topics had also played the role of Assessor 1 on some  
other topic.  Since their experience as Assessor 1 was usually that  
the proportion of relevant was very small, I wonder if they carried  
that over to their judgments in the role of Assessor 2.  We could  
have avoided this by taking a random sample from the pool, instead of  
25 relevant and 25 nonrelevant, but for most topics we'd then  
end up with most dual assessments having been on "easy" nonrelevant  
documents.

Dave







