Information Retrieval Evaluation
Donna Harman
National Institute of Standards and Technology

Abstract: Evaluation has always played a major role in information retrieval, with early pioneers such as Cyril Cleverdon and Gerard Salton laying the foundations for most of the evaluation methodologies in use today. The retrieval community has been extremely fortunate to have such a well-grounded evaluation paradigm during a period when most human language technologies were just developing. This lecture explains where these evaluation methodologies came from and how they have continued to adapt to the vastly changed environment of today's search engine world.

The lecture starts with a discussion of the early evaluation of information retrieval systems: the Cranfield testing of the early 1960s, the Lancaster "user" study for MEDLARS, and the various test collection investigations by the SMART project and by groups in Britain. The emphasis in this chapter is on the how and the why of the methodologies developed. The second chapter covers the more recent "batch" evaluations, examining the methodologies used in the open evaluation campaigns such as TREC, NTCIR (emphasizing Asian languages), CLEF (emphasizing European languages), and INEX (emphasizing semi-structured data). Here again the focus is on the how and why, and in particular on the evolution of the older evaluation methodologies to handle new information access techniques, including how test collection techniques were modified and how the metrics were changed to better reflect operational environments. The final chapters look at evaluation issues in user studies, the interactive part of information retrieval, including a look at the search log studies done mainly by the commercial search engines. Here the goal is to show, via case studies, how high-level issues of experimental design affect the final evaluations.
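The batch evaluation paradigm the abstract describes scores a system's ranked output for each topic against a fixed set of human relevance judgments. As a minimal sketch (the document IDs and judgments below are invented for illustration, not drawn from any real test collection), two of the classic metrics can be computed as follows:

```python
# Cranfield-style batch evaluation sketch: score one topic's ranked
# retrieval output against a set of relevance judgments ("qrels").

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant):
    """Precision averaged over the ranks of the relevant documents,
    divided by the total number of relevant documents (unretrieved
    relevant documents count against the score)."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

# Hypothetical system output and assessor judgments for one topic.
ranking = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}

print(precision_at_k(ranking, relevant, 5))              # 0.4
print(round(average_precision(ranking, relevant), 4))    # 0.3333
```

Averaging the per-topic average precision over all topics in a collection gives mean average precision (MAP), the headline metric of the early TREC ad hoc tracks.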
Table of Contents: Introduction and Early History / "Batch" Evaluation Since 1992 / Interactive Evaluation / Conclusion