Re: NDCG
Hi All -
The following is a clip from a doc on (n)(D)CG focusing on some of the
issues in our discussion:
- Kal
-------------------CLIP--------------------
The following issues concerning the (n)(D)CG measure have been discussed
in the literature:
1. Averaging over topics is not statistically reliable because the
number of relevant documents (recall base size) varies by topics.
2. DCG does not discount the gain when the rank is smaller than the
logarithm base.
3. Interpretation of non-normalized versions of the measure (CG, DCG) is
difficult.
4. The measures should be robust when relevance judgements are incomplete.
These issues are discussed briefly below.
(1) Averaging. This issue is discussed by, e.g., by Hull [2] and Sakai
[9]. Averaging over topics has been considered problematic when a fixed
length for the result list (i.e. a fixed cut-off value, DCV) is used.
Say, one evaluates effectiveness at DCV 5. One topic has five relevant
documents with relevance score 3 retrieved at positions 1 to 5; another
topic has 100 relevant documents with relevance score 3 and five of them
are retrieved at positions 1 to 5. The (n)DCG values for the topics will
be the same at DCV 5. From the point of view of searcher’s effort this
is correct; however, if the system-oriented effectiveness as an equal
share of the relevant documents is aimed at, evaluation with fixed DCV
should be avoided. With high DCVs, e.g. 500 or 1000, the problem is less
acute but the user-orientation is also lost.
A related problem is caused by the variation in the number of documents
of different relevance level: As the number of most relevant documents
tends to be low, weighting them strongly might lead to instability in
evaluation (a loss of one highly relevant document in the top ranks
hurts effectiveness badly) [11]. As a remedy large topic pools with
varying recall bases could be used in evaluation.
One should also bear in mind the difference between (n)DCG curves and
averaging the scores over topics at a given rank. The curves visualize
the performance over a range of ranks and are useful to reveal
crossovers concealed in averages. The single figure averages are needed
to give concise information and for statistical testing.
(2) Discounting. The original formulation of the (n)DCG does not perform
discounting for ranks less than the base of the discount logarithm. When
the base is relatively large, say 10, modeling a patient searcher, the
10 first ranks avoid discounting. Some scholars have been concerned
about this feature [10]. It can be solved by modifying the discounting
factor log_b(i) to the form (1 + log_b(i)) [8]. By doing so the first
case (with the condition i < b) of the DCG formula can be omitted: all
ranks are consistently discounted.
(3) Interpretation of the scores. The CG and DCG measures are comparable
only within one test setting and relevance weighting scheme. The
relevance grades and their weights should be reported in order to make
CG and DCG curves interpretable. Further, topics with large recall bases
yield high CG and DCG figures and affect averages over all topics. The
normalized versions are recommendable when averaging over topics or
comparing over test settings is needed.
(4) Robustness. An evaluation measure should be robust with relation to
missing relevance judgements since, in practice, the whole test
collection is seldom assessed for relevance. The pooling method for
gathering documents for relevance judgement, and the later reuse of the
test collection with new systems or methods leads unavoidably to a
situation where all documents in the result lists are not assessed for
relevance. The (n)DCG measure has been found robust in this respect, see
[1] [10].
REFs
[1] Bompada, T., Chang, C., Chen, J., Kumar, R., and Shenoy, R. (2007).
On the robustness of relevance measures with incomplete judgments. In
Proceedings of the 30th Annual international ACM SIGIR Conference on
Research and Development in Information Retrieval. ACM Press, New York,
NY, 359-366.
[2] Hull, D. (1993). Using statistical testing in the evaluation of
retrieval experiments. In Proceedings of the 16th Annual international
ACM SIGIR Conference on Research and Development in Information
Retrieval. ACM Press, New York, NY, 329-338.
[8] A suggestion by Susan Price (Portland State University, Portland,
OR, USA), May 2007 (private communication).
[9] Sakai, T. (2003). Average gain ratio: A simple retrieval performance
measure for evaluation with multiple relevance levels. In Proceedings of
the 26th Annual international ACM SIGIR Conference on Research and
Development in Information Retrieval. ACM Press, New York, NY, 417-418.
[10] Sakai, T. (2007). On the reliability of information retrieval
metrics based on graded relevance. Information Processing & Management,
43(2): 531-548.
[11] Voorhees, E. (2001). Evaluation by highly relevant documents. In
Proceedings of the 24th Annual international ACM SIGIR Conference on
Research and Development in Information Retrieval. ACM Press, New York,
NY, 74-82.
- References:
- NDCG
- From: Ian Soboroff <ian.soboroff@nist.gov>
- RE: NDCG
- From: Stephen Robertson <ser@microsoft.com>
- Re: NDCG
- From: Ian Soboroff <ian.soboroff@nist.gov>
- RE: NDCG
- From: Nick Craswell <nickcr@microsoft.com>
- Re: NDCG
- From: "Tetsuya Sakai (office)" <sakai@newswatch.co.jp>
Date Index |
Thread Index |
Problems or questions? Contact list-master@nist.gov