Arman Cohan - Evaluating and Understanding LLMs: From Scientific Reasoning to Alignment as Judges
Author: uclanlp-plus
Uploaded: 2025-12-18
Views: 4
Description:
Talk Title: Evaluating and Understanding LLMs: From Scientific Reasoning to Alignment as Judges
Abstract: We present our recent work on evaluating and understanding large language models in scientific contexts, and on the relationship between their evaluation and generation capabilities. First, we'll introduce SciArena, an open evaluation platform for literature-grounded scientific tasks that ranks models on long-form responses using expert preferences. The platform currently supports a broad set of open and proprietary models and has already accumulated a large pool of high-quality preference data. Using these data, we release SciArena-Eval, a meta-evaluation benchmark for training and stress-testing automated judges on science tasks. We will then turn to scientific problem solving, discussing a holistic suite of scientific reasoning tasks and a new framework for studying the role of knowledge in scientific problem solving and its interaction with reasoning. Our analysis shows that retrieving task-relevant knowledge from model parameters is the primary bottleneck for scientific reasoning; that in-context external knowledge systematically helps even strong reasoning models; and that improved verbalized reasoning increases a model's ability to surface the right knowledge. Finally, time permitting, we will present work on generation–evaluation consistency, showing that models that judge well also tend to generate outputs that align with human preferences. This enables alignment benchmarking that evaluates models in their role as judges, without scoring their generations directly.
To check out other talks in our full NLP Seminar Series, please visit: UCLA NLP Seminar Series