Re: interassessor consistency data on TREC 06 Legal track ad hoc topics



> From: "Dave Lewis (address for public mailing lists)" <misclists1@daviddlewis.com>
>
> > 1. There are more topics than I would expect that have
> >    substantial disagreements but are about even in 1N2R and
> >    1R2N numbers, eg
> >     18        80      689      20      7       5      18
> :
> >   topics like above are either actual factual disagreements (not
> >   just scope) or two factors are in play that the assessors
> >   disagree on.  I don't believe the latter.  In the past TREC
>
> Chris - I think this may be an effect of lawyers/interns/paralegals/ 
> law students as assessors, and maybe also that most of them are  
> relatively early in their legal training.  My sense is that Jason  
> expected (correct me if I'm wrong) that the assessors would take a  
> relatively broad interpretation of relevance.  This would usually be  
> the right thing to do, since the penalties can be severe for not  
> turning over a responsive document.  Instead, a number of the  
> assessors seemed to make rather picky (legalistic if I may say)  
> distinctions, which led to results very different than Jason and I  
> expected.
>
> What we can do about this is another matter.
>
> > 2. The number of lopsided topics (1N2R >> 1R2N and the reverse)
> >   doesn't bother me, as I said, but the one-sided direction bothers
> >   me tremendously.
> >   1N2R is 5 or more greater than 1R2N for 1 topic
> >   1R2N is 9 or more greater than 1N2R for 15 topics!
> >
> >   That's an enormous discrepancy which is almost impossible to
> >   occur by pure chance.  The original assessor is much more
> >   lenient than the secondary assessor.  I don't know whether
> >   that's due to the original having better key words to look for,
> >   or the secondary having different expectations of relevance, or
> >   the secondary assessor doing a better job given fewer
> >   documents, or what.  But I believe the discrepancy calls into
> >   question whether we can trust these numbers at all.  There
> >   seems to be some unexplained systematic effect here.
>
> I'm not sure this is surprising.  Remember that Assessor 2 saw 50  
> documents, usually consisting of 25 that Assessor 1 rated relevant,  
> and 25 that Assessor 1 rated nonrelevant.  The 25 nonrelevant almost  
> certainly are "easy" nonrelevant, just because most documents are  
> easy nonrelevant (effectiveness of submitted runs was poor overall).   
> So we expect 1N2R to be very low in general.  The 25 relevant will  
> have the usual amount of disagreement, so 1R2N will be relatively  
> high in comparison on this sample of 50.
>
> But if we treat the 50 documents as a stratified sample from the  
> pool, we can work backwards to compute expected values for what the  
> disagreements would have been on the whole pool:
>
> Topic  #Pool  A=1R2R  B=1N2R  C=1R2N  D=1N2N
> 6       840      0.0     0.0   125.0   715.0  C
> 7       854    138.6   110.2    26.4   578.8  B
> 8       857    145.9    53.2    46.1   611.8  B
> 9       849     46.8     0.0    83.2   719.0  C
> 10      858      2.0     0.0     3.0   853.0  C
> 13      837     71.3     0.0    90.7   675.0  C
> 14      716     15.8   108.8    20.2   571.2  B
> 17      767      4.0     0.0     0.0   763.0  =
> 18      769     64.0   192.9    16.0   496.1  B
> 19      919    161.6     0.0   343.4   414.0  C
> 20      938      9.8    36.1    25.2   866.9  B
> 21      893     58.2    96.3   232.8   505.7  C
> 22      853     13.8     0.0    55.2   784.0  C
> 23      832    173.2    56.3   307.8   295.7  C
> 24      924      0.0    22.3     9.0   892.7  B
> 25      961     11.1   148.7     7.9   793.3  B
> 26      935    297.4   116.2    56.6   464.8  B
> 27      916    165.4    58.2    22.6   669.8  B
> 28      910     38.6    34.6     7.4   829.4  B
> 29      875     16.0    26.0     1.0   833.0  B
> 30      781     85.4    54.7    11.6   629.3  B
> 31      707    294.4    92.9    25.6   294.1  B
> 32      770     51.2   197.7    12.8   508.3  B
> 33      570      0.0     0.0    37.0   533.0  C
> 34      810    196.0    90.4    49.0   474.6  B
> 35      542     19.0    20.3    15.0   487.7  B
> 36      872      9.0    69.7     4.0   790.3  B
> 37      863     43.7    62.8    34.3   722.2  B
> 38      741     93.2   289.9    43.8   314.1  B
> 39      887     15.0   108.6     3.0   760.4  B
> 40      832      1.0    33.9     0.0   797.1  B
> 41      876      1.0     0.0     0.0   875.0  =
> 43      820     64.8   105.3    97.2   552.7  B
> 44      821     13.4     0.0    14.6   793.0  C
> 45      755    120.1     0.0    37.9   597.0  C
> 46      627     16.0     0.0    34.0   577.0  C
> 47      733      4.0    49.6     2.0   677.4  B
> 49      983      0.0   629.1     0.0   353.9  B
> 50      756     47.1    83.3    14.9   610.7  B
> 51      936     31.7    36.2     1.3   867.8  B
>
> The confidence intervals would be very large on these figures, so  
> don't take the numbers too seriously.  But you can see that actually  
> 1N2R is greater than 1R2N in the majority of cases.  The magnitude of  
> the disagreements, blown up to the size of a typical TREC pool, is  
> worrisome, of course.

Yes, I agree with your analysis (well, I also agree with your
statement about confidence intervals for several reasons, but
your analysis should certainly be accurate enough for a first
approximation!).  I had indeed forgotten that this was not 25
random relevant, but 25 relevant documents of Assessor 1 (as it
had to be).
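
For concreteness, here is a minimal sketch of that back-calculation as
I understand it: scale each stratum's sample counts up to the stratum's
size in the pool.  I am assuming the columns of the topic 18 row quoted
at the top are topic, A1-relevant count, A1-nonrelevant count, then the
sampled 1R2R, 1N2R, 1R2N and 1N2N counts; that reading reproduces the
topic 18 row of the table.  The function names are my own, and the
Wilson score interval is my choice for a rough illustration of the
interval width, not something taken from the analysis above.

```python
import math


def extrapolate(n_rel, n_nonrel, s_rr, s_nr, s_rn, s_nn):
    """Scale per-stratum sample counts up to the whole pool.

    n_rel, n_nonrel: pool documents Assessor 1 judged relevant / nonrelevant.
    s_rr, s_rn: sampled A1-relevant docs that A2 judged relevant / nonrelevant.
    s_nr, s_nn: sampled A1-nonrelevant docs that A2 judged relevant / nonrelevant.
    Returns estimated pool counts (A=1R2R, B=1N2R, C=1R2N, D=1N2N).
    """
    rel_sampled = s_rr + s_rn        # normally 25
    nonrel_sampled = s_nr + s_nn     # normally 25
    a = n_rel * s_rr / rel_sampled if rel_sampled else 0.0
    c = n_rel * s_rn / rel_sampled if rel_sampled else 0.0
    b = n_nonrel * s_nr / nonrel_sampled if nonrel_sampled else 0.0
    d = n_nonrel * s_nn / nonrel_sampled if nonrel_sampled else 0.0
    return a, b, c, d


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a proportion k/n (about 95% for z=1.96).

    Ignores the finite-population correction, so treat it as a rough guide.
    """
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Topic 18, under the column reading assumed above.
a, b, c, d = extrapolate(80, 689, 20, 7, 5, 18)
print(*(round(x, 1) for x in (a, b, c, d)))   # 64.0 192.9 16.0 496.1

# Rough interval for B (1N2R): 7 of the 25 sampled A1-nonrelevant docs.
lo, hi = wilson_interval(7, 25)
print(round(689 * lo), round(689 * hi))
```

With only 25 documents per stratum, the interval for the topic 18 1N2R
estimate comes out more than 200 documents wide, which is consistent
with not taking the individual figures too seriously.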

That does make me worry a bit more about the magnitude of the
final error numbers.  I had thought overall things were not all
that bad; worse than TREC but not terrible.  Disagreement about
scope of relevance on 16 out of the 40 topics is not
unreasonable.  But in some sense we're only fully seeing half the
errors, the errors where A1 is more lenient than A2.  When A2 is
more lenient, we're not getting a fair measure - though it is
undoubtedly part of the reason for the size of the first category
I was concerned about (medium values of both 1R2N and 1N2R).

I think we have evidence that A1 was more lenient on those 15
topics of high 1R2N.  If the cases are symmetrical then we'd
expect 15 topics where A2 is more lenient, if they had to judge
the full pool.  So that's 30 out of the 40 topics that we expect
substantial disagreement on.  That is a lot!

> There is one systematic factor that's less benign that I worried a  
> little about.  Most of the people who played the role of Assessor 2  
> on one or more topics had also played the role of Assessor 1 on some  
> other topic.  Since their experience as Assessor 1 was usually that  
> the proportion of relevant was very small, I wonder if they carried  
> that over to their judgments in the role of Assessor 2.  We could  
> have avoided this by taking a random sample from the pool, instead of  
> 25 relevant and 25 nonrelevant, but for most topics we'd then  
> end up with most dual assessments having been on "easy" nonrelevant  
> documents.

I agree again; assessor expectation is always present and is an
insidious thing to try to avoid. Your figures above though
indicate it's probably not nearly as bad as I had feared.  But we
don't know how much of the large differences we see is due to
experimental bias.


Are there half-a-dozen people out there who are willing to look
at the 50 documents for, say, 3 queries each?  It won't conclusively
prove anything at all, but might give insights as to what's
happening. 


Overall, I'm not sure the averages you get across topics or even
within topics are comparable to anything done previously.  But
the number of topics on which we find or expect substantial
assessor disagreement is very high, both in your original
analysis and in my qualitative conclusion above.  

Do we need to go to a more TREC-style topic, with a narrative
that gives advice on what makes a document relevant or not?

Chris


