Re: NDCG



Hi All -

The following is a clip from a doc on (n)(D)CG focusing on some of the 
issues in our discussion:

- Kal


-------------------CLIP--------------------


The following issues concerning the (n)(D)CG measure have been discussed 
in the literature:
1. Averaging over topics is not statistically reliable because the 
number of relevant documents (recall base size) varies by topics.
2. DCG does not discount the gain when the rank is smaller than the 
logarithm base.
3. Interpretation of non-normalized versions of the measure (CG, DCG) is 
difficult.
4. The measures should be robust when relevance judgements are incomplete.
These issues are discussed briefly below.

(1) Averaging. This issue is discussed by, e.g., by Hull [2] and Sakai 
[9]. Averaging over topics has been considered problematic when a fixed 
length for the result list (i.e. a fixed cut-off value, DCV) is used. 
Say, one evaluates effectiveness at DCV 5. One topic has five relevant 
documents with relevance score 3 retrieved at positions 1 to 5; another 
topic has 100 relevant documents with relevance score 3 and five of them 
are retrieved at positions 1 to 5. The (n)DCG values for the topics will 
be the same at DCV 5. From the point of view of searcher’s effort this 
is correct; however, if the system-oriented effectiveness as an equal 
share of the relevant documents is aimed at, evaluation with fixed DCV 
should be avoided. With high DCVs, e.g. 500 or 1000, the problem is less 
acute but the user-orientation is also lost.

A related problem is caused by the variation in the number of documents 
of different relevance level: As the number of most relevant documents 
tends to be low, weighting them strongly might lead to instability in 
evaluation (a loss of one highly relevant document in the top ranks 
hurts effectiveness badly) [11]. As a remedy large topic pools with 
varying recall bases could be used in evaluation.

One should also bear in mind the difference between (n)DCG curves and 
averaging the scores over topics at a given rank. The curves visualize 
the performance over a range of ranks and are useful to reveal 
crossovers concealed in averages. The single figure averages are needed 
to give concise information and for statistical testing.

(2) Discounting. The original formulation of the (n)DCG does not perform 
discounting for ranks less than the base of the discount logarithm. When 
the base is relatively large, say 10, modeling a patient searcher, the 
10 first ranks avoid discounting. Some scholars have been concerned 
about this feature [10]. It can be solved by modifying the discounting 
factor log_b(i) to the form (1 + log_b(i)) [8]. By doing so the first 
case (with the condition i < b) of the DCG formula can be omitted: all 
ranks are consistently discounted.

(3) Interpretation of the scores. The CG and DCG measures are comparable 
only within one test setting and relevance weighting scheme. The 
relevance grades and their weights should be reported in order to make 
CG and DCG curves interpretable. Further, topics with large recall bases 
yield high CG and DCG figures and affect averages over all topics. The 
normalized versions are recommendable when averaging over topics or 
comparing over test settings is needed.

(4) Robustness. An evaluation measure should be robust with relation to 
missing relevance judgements since, in practice, the whole test 
collection is seldom assessed for relevance. The pooling method for 
gathering documents for relevance judgement, and the later reuse of the 
test collection with new systems or methods leads unavoidably to a 
situation where all documents in the result lists are not assessed for 
relevance. The (n)DCG measure has been found robust in this respect, see 
[1] [10].


REFs

[1] Bompada, T., Chang, C., Chen, J., Kumar, R., and Shenoy, R. (2007). 
On the robustness of relevance measures with incomplete judgments. In 
Proceedings of the 30th Annual international ACM SIGIR Conference on 
Research and Development in Information Retrieval. ACM Press, New York, 
NY, 359-366.

[2] Hull, D. (1993). Using statistical testing in the evaluation of 
retrieval experiments. In Proceedings of the 16th Annual international 
ACM SIGIR Conference on Research and Development in Information 
Retrieval. ACM Press, New York, NY, 329-338.

[8] A suggestion by Susan Price (Portland State University, Portland, 
OR, USA), May 2007 (private communication).

[9] Sakai, T. (2003). Average gain ratio: A simple retrieval performance 
measure for evaluation with multiple relevance levels. In Proceedings of 
the 26th Annual international ACM SIGIR Conference on Research and 
Development in Information Retrieval. ACM Press, New York, NY, 417-418.

[10] Sakai, T. (2007). On the reliability of information retrieval 
metrics based on graded relevance. Information Processing & Management, 
43(2): 531-548.

[11] Voorhees, E. (2001). Evaluation by highly relevant documents. In 
Proceedings of the 24th Annual international ACM SIGIR Conference on 
Research and Development in Information Retrieval. ACM Press, New York, 
NY, 74-82.




Date Index | Thread Index | Problems or questions? Contact list-master@nist.gov