Desirable Properties for Diversity and Truncated Effectiveness Metrics


A wide range of evaluation metrics has been proposed to measure the quality of search results, including in the presence of diversification. Some of these metrics have been adapted for search tasks of differing complexity, such as tasks where the search system returns lists of different lengths. Given this range of requirements, it can be difficult to compare the behavior of these metrics. In this work, we examine effectiveness metrics using a simple property-based approach. In particular, we present a case-analysis framework to define and study fundamental properties that seem integral to any evaluation metric. An example of a simple property is that a ranking with only one non-relevant document should never score lower than a ranking with two non-relevant documents. The framework makes it possible to quantify the ability of metrics to satisfy properties, both separately and simultaneously, and to identify the cases where properties are violated. Our analysis shows that the Average Cube Test and Intent-Aware Average Precision are two metrics that fail to satisfy the desirable properties, and hence should be used with caution.
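The example property above can be illustrated with a minimal sketch. It is not the framework from the paper: it assumes binary relevance labels and reads the property as comparing a ranking that consists of a single non-relevant document with one that consists of two. The names `precision`, `length_rewarding`, and `satisfies_property` are hypothetical, and `length_rewarding` is a contrived metric included only to show a violation.

```python
from typing import Callable, List

# A ranking is a list of binary relevance labels: 1 = relevant, 0 = non-relevant.
Ranking = List[int]
Metric = Callable[[Ranking], float]


def precision(ranking: Ranking) -> float:
    """Fraction of returned documents that are relevant (0.0 for an empty list)."""
    return sum(ranking) / len(ranking) if ranking else 0.0


def length_rewarding(ranking: Ranking) -> float:
    """Contrived (hypothetical) metric that gives credit for every returned document."""
    return sum(ranking) + 0.1 * len(ranking)


def satisfies_property(metric: Metric) -> bool:
    """Check the assumed reading of the property on a single pair of cases:
    one non-relevant document must not score lower than two non-relevant documents."""
    one_non_relevant: Ranking = [0]
    two_non_relevant: Ranking = [0, 0]
    return metric(one_non_relevant) >= metric(two_non_relevant)


if __name__ == "__main__":
    for name, metric in [("precision", precision), ("length_rewarding", length_rewarding)]:
        print(name, satisfies_property(metric))
```

Under these assumptions, `precision` passes the check because both rankings score zero, while `length_rewarding` fails because the longer non-relevant ranking scores higher. A fuller case analysis would enumerate many such pairs of rankings rather than a single one.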

Proceedings of the 23rd Australasian Document Computing Symposium