- Evaluatology: The Science and Engineering of Evaluation
We propose a universal framework for evaluation, encompassing concepts, terminologies, theories, and methodologies that can be applied across various disciplines.
- LLM-based NLG Evaluation: Current Status and Challenges
Various automatic evaluation methods based on LLMs have been proposed, including metrics derived from LLMs, prompting LLMs, and fine-tuning LLMs with labeled evaluation data. In this survey, we first give a taxonomy of LLM-based NLG evaluation methods and discuss their respective pros and cons (a minimal prompting sketch follows below).
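
To make the "prompting LLMs" category concrete, here is a minimal sketch of LLM-as-judge evaluation. The `call_llm` function and the 1-5 rubric are illustrative placeholders, not part of the surveyed methods; any chat-completion client could be wired in.

```python
# Minimal sketch of LLM-as-judge evaluation via prompting (illustrative only).
JUDGE_TEMPLATE = """You are an evaluator of text quality.
Source text:
{source}

Generated summary:
{candidate}

Rate the summary's faithfulness to the source on a 1-5 scale.
Answer with a single integer."""


def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM API of your choice and return its reply."""
    raise NotImplementedError("wire up your own LLM client here")


def judge_faithfulness(source: str, candidate: str) -> int:
    """Prompt an LLM to score one candidate; returns the parsed 1-5 rating."""
    reply = call_llm(JUDGE_TEMPLATE.format(source=source, candidate=candidate))
    return int(reply.strip().split()[0])
```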
- [2503.16416] Survey on Evaluation of LLM-based Agents
We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) …
- Evaluation: from precision, recall and F-measure to ROC, informedness …
Commonly used evaluation measures, including Recall, Precision, F-Measure and Rand Accuracy, are biased and should not be used without a clear understanding of the biases and a corresponding identification of the chance or base-case level of the statistic (see the sketch below).
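
To illustrate the bias argument, the sketch below computes Recall, Precision, F1 and Accuracy from a 2x2 confusion matrix alongside Informedness (Youden's J = TPR + TNR - 1), which subtracts the chance level that the other measures ignore. The counts in the example are illustrative, assuming a 90/10 class split.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common classification measures from a 2x2 confusion matrix."""
    recall = tp / (tp + fn)                 # true positive rate
    precision = tp / (tp + fp)              # positive predictive value
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    inverse_recall = tn / (tn + fp)         # true negative rate
    # Informedness (Youden's J) corrects for chance: it is zero for a
    # classifier that guesses according to class prevalence.
    informedness = recall + inverse_recall - 1
    return {
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
        "informedness": informedness,
    }


# Illustrative example: an "always positive" predictor on a 90/10 split
# scores well on recall, precision, F1 and accuracy but has zero informedness.
print(binary_metrics(tp=90, fp=10, fn=0, tn=0))
```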
- [2305.12421] Evaluating Open-QA Evaluation - arXiv.org
We introduce a new task, Evaluating QA Evaluation (QA-Eval), and the corresponding dataset EVOUNA, designed to assess the accuracy of AI-generated answers in relation to standard answers within Open-QA. Our evaluation of these methods uses human-annotated results to measure their performance (a sketch of typical lexical matching follows below).
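
The snippet below sketches the kind of lexical matching (normalized exact match and token-level F1) that automatic Open-QA evaluation commonly relies on and that human-annotated studies like this one audit. The normalization rules follow the widespread SQuAD-style convention and are an assumption here, not necessarily EVOUNA's exact protocol.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """Strict equality after normalization."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 overlap between a predicted and a gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))   # True
print(round(token_f1("Paris, France", "Paris"), 2))      # 0.67
```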
- Evaluating Large Language Models: A Comprehensive Survey
This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment evaluation, and safety evaluation.
- [2504.16074] PHYBench: Holistic Evaluation of Physical Perception and …
Current benchmarks for evaluating the reasoning capabilities of Large Language Models (LLMs) face significant limitations: task oversimplification, data contamination, and flawed evaluation items.
- Evaluation of Retrieval-Augmented Generation: A Survey
To better understand these challenges, we conduct A Unified Evaluation Process of RAG (Auepora) and aim to provide a comprehensive overview of the evaluation and benchmarks of RAG systems.